SystemML Algorithms Reference

6. Survival Analysis

6.1. Kaplan-Meier Survival Analysis

Description

Survival analysis examines the time needed for a particular event of interest to occur. In medical research, for example, the prototypical such event is the death of a patient but the methodology can be applied to other application areas, e.g., completing a task by an individual in a psychological experiment or the failure of electrical components in engineering. Kaplan-Meier or (product limit) method is a simple non-parametric approach for estimating survival probabilities from both censored and uncensored survival times.

Usage

hadoop jar SystemML.jar -f KM.dml
                        -nvargs X=<file>
                                TE=<file>
                                GI=<file>
                                SI=<file>
                                O=<file>
                                M=<file>
                                T=<file>
                                alpha=[double]
                                etype=[greenwood|peto]
                                ctype=[plain|log|log-log]
                                ttype=[none|log-rank|wilcoxon]
                                fmt=[format]
$SPARK_HOME/bin/spark-submit --master yarn
                             --deploy-mode cluster
                             --conf spark.driver.maxResultSize=0
                             SystemML.jar
                             -f KM.dml
                             -config SystemML-config.xml
                             -exec hybrid_spark
                             -nvargs X=<file>
                                     TE=<file>
                                     GI=<file>
                                     SI=<file>
                                     O=<file>
                                     M=<file>
                                     T=<file>
                                     alpha=[double]
                                     etype=[greenwood|peto]
                                     ctype=[plain|log|log-log]
                                     ttype=[none|log-rank|wilcoxon]
                                     fmt=[format]

Arguments

X: Location (on HDFS) to read the input matrix of the survival data containing:

TE: Location (on HDFS) to read the 1-column matrix $TE$ that contains the column indices of the input matrix $X$ corresponding to timestamps (first entry) and event information (second entry)

GI: Location (on HDFS) to read the 1-column matrix $GI$ that contains the column indices of the input matrix $X$ corresponding to the factors (i.e., categorical features) to be used for grouping

SI: Location (on HDFS) to read the 1-column matrix $SI$ that contains the column indices of the input matrix $X$ corresponding to the factors (i.e., categorical features) to be used for grouping

O: Location (on HDFS) to write the matrix containing the results of the Kaplan-Meier analysis $KM$

M: Location (on HDFS) to write Matrix $M$ containing the following statistics: total number of events, median and its confidence intervals; if survival data for multiple groups and strata are provided each row of $M$ contains the above statistics per group and stratum.

T: If survival data from multiple groups is available and ttype=log-rank or ttype=wilcoxon, location (on HDFS) to write the two matrices that contains the result of the (stratified) test for comparing these groups; see below for details.

alpha: (default: 0.05) Parameter to compute $100(1-\alpha)\%$ confidence intervals for the survivor function and its median

etype: (default: "greenwood") Parameter to specify the error type according to greenwood or peto

ctype: (default: "log") Parameter to modify the confidence interval; plain keeps the lower and upper bound of the confidence interval unmodified, log corresponds to logistic transformation and log-log corresponds to the complementary log-log transformation

ttype: (default: "none") If survival data for multiple groups is available specifies which test to perform for comparing survival data across multiple groups: none, log-rank or wilcoxon test

fmt: (default:"text") Matrix file output format, such as text, mm, or csv; see read/write functions in SystemML Language Reference for details.

Examples

hadoop jar SystemML.jar -f KM.dml
                        -nvargs X=/user/ml/X.mtx
                                TE=/user/ml/TE
                                GI=/user/ml/GI
                                SI=/user/ml/SI
                                O=/user/ml/kaplan-meier.csv
                                M=/user/ml/model.csv
                                alpha=0.01
                                etype=greenwood
                                ctype=plain
                                fmt=csv
$SPARK_HOME/bin/spark-submit --master yarn
                             --deploy-mode cluster
                             --conf spark.driver.maxResultSize=0
                             SystemML.jar
                             -f KM.dml
                             -config SystemML-config.xml
                             -exec hybrid_spark
                             -nvargs X=/user/ml/X.mtx
                                     TE=/user/ml/TE
                                     GI=/user/ml/GI
                                     SI=/user/ml/SI
                                     O=/user/ml/kaplan-meier.csv
                                     M=/user/ml/model.csv
                                     alpha=0.01
                                     etype=greenwood
                                     ctype=plain
                                     fmt=csv
hadoop jar SystemML.jar -f KM.dml
                        -nvargs X=/user/ml/X.mtx
                                TE=/user/ml/TE
                                GI=/user/ml/GI
                                SI=/user/ml/SI
                                O=/user/ml/kaplan-meier.csv
                                M=/user/ml/model.csv
                                T=/user/ml/test.csv
                                alpha=0.01
                                etype=peto
                                ctype=log
                                ttype=log-rank
                                fmt=csv
$SPARK_HOME/bin/spark-submit --master yarn
                             --deploy-mode cluster
                             --conf spark.driver.maxResultSize=0
                             SystemML.jar
                             -f KM.dml
                             -config SystemML-config.xml
                             -exec hybrid_spark
                             -nvargs X=/user/ml/X.mtx
                                     TE=/user/ml/TE
                                     GI=/user/ml/GI
                                     SI=/user/ml/SI
                                     O=/user/ml/kaplan-meier.csv
                                     M=/user/ml/model.csv
                                     T=/user/ml/test.csv
                                     alpha=0.01
                                     etype=peto
                                     ctype=log
                                     ttype=log-rank
                                     fmt=csv

Details

The Kaplan-Meier estimate is a non-parametric maximum likelihood estimate (MLE) of the survival function $S(t)$, i.e., the probability of survival from the time origin to a given future time. As an illustration suppose that there are $n$ individuals with observed survival times $t_1,t_2,\ldots t_n$ out of which there are $r\leq n$ distinct death times —since some of the observations may be censored, in the sense that the end-point of interest has not been observed for those individuals, and there may be more than one individual with the same survival time. Let $S(t_j)$ denote the probability of survival until time $t_j$, $d_j$ be the number of events at time $t_j$, and $n_j$ denote the number of individual at risk (i.e., those who die at time $t_j$ or later). Assuming that the events occur independently, in Kaplan-Meier method the probability of surviving from $t_j$ to is estimated from $S(t_j)$ and given by

for , , for , and . Note that the value of $\hat{S}(t)$ is constant between times of event and therefore the estimate is a step function with jumps at observed event times. If there are no censored data this estimator would simply reduce to the empirical survivor function defined as $\frac{n_j}{n}$. Thus, the Kaplan-Meier estimate can be seen as the generalization of the empirical survivor function that handles censored observations.

The methodology used in our KM.dml script closely follows Section 2 of [Collett2003]. For completeness we briefly discuss the equations used in our implementation.

Standard error of the survivor function. The standard error of the estimated survivor function (controlled by parameter etype) can be calculated as

for . This equation is known as the Greenwood’s formula. An alternative approach is to apply the Petos’s expression

for . Once the standard error of $\hat{S}$ has been found we compute the following types of confidence intervals (controlled by parameter cctype): The plain $100(1-\alpha)\%$ confidence interval for $S(t)$ is computed using

where $z_{\alpha/2}$ is the upper $\alpha/2$-point of the standard normal distribution. Alternatively, we can apply the log transformation using

or the log-log transformation using

Median, its standard error and confidence interval. Denote by the estimated median of , i.e., , where is the observed survival time for individual . The standard error of is given by

where can be found from

Above, $\hat{u}(50)$ is the largest survival time for which $\hat{S}$ exceeds $0.5+\epsilon$, i.e., , and $\hat{l}(50)$ is the smallest survivor time for which $\hat{S}$ is less than $0.5-\epsilon$, i.e., , for small $\epsilon$.

Log-rank test and Wilcoxon test. Our implementation supports comparison of survival data from several groups using two non-parametric procedures (controlled by parameter ttype): the log-rank test and the Wilcoxon test (also known as the Breslow test). Assume that the survival times in $g\geq 2$ groups of survival data are to be compared. Consider the null hypothesis that there is no difference in the survival times of the individuals in different groups. One way to examine the null hypothesis is to consider the difference between the observed number of deaths with the numbers expected under the null hypothesis. In both tests we define the $U$-statistics ( for the log-rank test and for the Wilcoxon test) to compare the observed and the expected number of deaths in $1,2,\ldots,g-1$ groups as follows:

where is the of number deaths at time in group $k$, is the number of individuals at risk at time in group $k$, and $k=1,2,\ldots,g-1$ to form the vectors $U_L$ and $U_W$ with $(g-1)$ components. The covariance (variance) between and (when $k=k’$) is computed as

for $k,k’=1,2,\ldots,g-1$, with

These terms are combined in a variance-covariance matrix $V_L$ (referred to as the $V$-statistic). Similarly, the variance-covariance matrix for the Wilcoxon test $V_W$ is a matrix where the entry at position $(k,k’)$ is given by

Under the null hypothesis of no group differences, the test statistics $U_L^\top V_L^{-1} U_L$ for the log-rank test and $U_W^\top V_W^{-1} U_W$ for the Wilcoxon test have a Chi-squared distribution on $(g-1)$ degrees of freedom. Our KM.dml script also provides a stratified version of the log-rank or Wilcoxon test if requested. In this case, the values of the $U$- and $V$- statistics are computed for each stratum and then combined over all strata.

Returns

Below we list the results of the survival analysis computed by KM.dml. The calculated statistics are stored in matrix $KM$ with the following schema:

Note that if survival data for multiple groups and/or strata is available, each collection of 7 columns in $KM$ stores the results per group and/or per stratum. In this case $KM$ has $7g+7s$ columns, where $g\geq 1$ and $s\geq 1$ denote the number of groups and strata, respectively.

Additionally, KM.dml stores the following statistics in the 1-row matrix $M$ whose number of columns depends on the number of groups ($g$) and strata ($s$) in the data. Below $k$ denotes the number of factors used for grouping and $l$ denotes the number of factors used for stratifying.

If there is only 1 group and 1 stratum available $M$ will be a 1-row matrix with 5 columns where

If a comparison of the survival data across multiple groups needs to be performed, KM.dml computes two matrices $T$ and that contain a summary of the test. The 1-row matrix $T$ stores the following statistics:

Matrix contains the following statistics for each of $g$ groups:


6.2. Cox Proportional Hazard Regression Model

Description

The Cox (proportional hazard or PH) is a semi-parametric statistical approach commonly used for analyzing survival data. Unlike non-parametric approaches, e.g., the Kaplan-Meier estimates, which can be used to analyze single sample of survival data or to compare between groups of survival times, the Cox PH models the dependency of the survival times on the values of explanatory variables (i.e., covariates) recorded for each individual at the time origin. Our focus is on covariates that do not change value over time, i.e., time-independent covariates, and that may be categorical (ordinal or nominal) as well as continuous-valued.

Usage

Cox:

hadoop jar SystemML.jar -f Cox.dml
                        -nvargs X=<file>
                                TE=<file>
                                F=<file>
                                R=[file]
                                M=<file>
                                S=[file]
                                T=[file]
                                COV=<file>
                                RT=<file>
                                XO=<file>
                                MF=<file>
                                alpha=[double]
                                tol=[double]
                                moi=[int]
                                mii=[int]
                                fmt=[format]
$SPARK_HOME/bin/spark-submit --master yarn
                             --deploy-mode cluster
                             --conf spark.driver.maxResultSize=0
                             SystemML.jar
                             -f Cox.dml
                             -config SystemML-config.xml
                             -exec hybrid_spark
                             -nvargs X=<file>
                                     TE=<file>
                                     F=<file>
                                     R=[file]
                                     M=<file>
                                     S=[file]
                                     T=[file]
                                     COV=<file>
                                     RT=<file>
                                     XO=<file>
                                     MF=<file>
                                     alpha=[double]
                                     tol=[double]
                                     moi=[int]
                                     mii=[int]
                                     fmt=[format]

Cox Prediction:

hadoop jar SystemML.jar -f Cox-predict.dml
                        -nvargs X=<file>
                                RT=<file>
                                M=<file>
                                Y=<file>
                                COV=<file>
                                MF=<file>
                                P=<file>
                                fmt=[format]
$SPARK_HOME/bin/spark-submit --master yarn
                             --deploy-mode cluster
                             --conf spark.driver.maxResultSize=0
                             SystemML.jar
                             -f Cox-predict.dml
                             -config SystemML-config.xml
                             -exec hybrid_spark
                             -nvargs X=<file>
                                     RT=<file>
                                     M=<file>
                                     Y=<file>
                                     COV=<file>
                                     MF=<file>
                                     P=<file>
                                     fmt=[format]

Arguments - Cox Model Fitting/Prediction

X: Location (on HDFS) to read the input matrix of the survival data containing:

Y: Location (on HDFS) to the read matrix used for prediction

TE: Location (on HDFS) to read the 1-column matrix $TE$ that contains the column indices of the input matrix $X$ corresponding to timestamps (first entry) and event information (second entry)

F: Location (on HDFS) to read the 1-column matrix $F$ that contains the column indices of the input matrix $X$ corresponding to the features to be used for fitting the Cox model

R: (default: " ") If factors (i.e., categorical features) are available in the input matrix $X$, location (on HDFS) to read matrix $R$ containing the start (first column) and end (second column) indices of each factor in $X$; alternatively, user can specify the indices of the baseline level of each factor which needs to be removed from $X$. If $R$ is not provided by default all variables are considered to be continuous-valued.

M: Location (on HDFS) to store the results of Cox regression analysis including regression coefficients $\beta_j$s, their standard errors, confidence intervals, and $P$-values

S: (default: " ") Location (on HDFS) to store a summary of some statistics of the fitted model including number of records, number of events, log-likelihood, AIC, Rsquare (Cox & Snell), and maximum possible Rsquare

T: (default: " ") Location (on HDFS) to store the results of Likelihood ratio test, Wald test, and Score (log-rank) test of the fitted model

COV: Location (on HDFS) to store the variance-covariance matrix of $\beta_j$s; note that parameter COV needs to be provided as input to prediction.

RT: Location (on HDFS) to store matrix $RT$ containing the order-preserving recoded timestamps from $X$; note that parameter RT needs to be provided as input for prediction.

XO: Location (on HDFS) to store the input matrix $X$ ordered by the timestamps; note that parameter XO needs to be provided as input for prediction.

MF: Location (on HDFS) to store column indices of $X$ excluding the baseline factors if available; note that parameter MF needs to be provided as input for prediction.

P: Location (on HDFS) to store matrix $P$ containing the results of prediction

alpha: (default: 0.05) Parameter to compute a $100(1-\alpha)\%$ confidence interval for $\beta_j$s

tol: (default: 0.000001) Tolerance ($\epsilon$) used in the convergence criterion

moi: (default: 100) Maximum number of outer (Fisher scoring) iterations

mii: (default: 0) Maximum number of inner (conjugate gradient) iterations, or 0 if no maximum limit provided

fmt: (default: "text") Matrix file output format, such as text, mm, or csv; see read/write functions in SystemML Language Reference for details.

Examples

Cox:

hadoop jar SystemML.jar -f Cox.dml
                        -nvargs X=/user/ml/X.mtx
                                TE=/user/ml/TE
                                F=/user/ml/F
                                R=/user/ml/R
                                M=/user/ml/model.csv
                                T=/user/ml/test.csv
                                COV=/user/ml/var-covar.csv
                                XO=/user/ml/X-sorted.mtx
                                fmt=csv
$SPARK_HOME/bin/spark-submit --master yarn
                             --deploy-mode cluster
                             --conf spark.driver.maxResultSize=0
                             SystemML.jar
                             -f Cox.dml
                             -config SystemML-config.xml
                             -exec hybrid_spark
                             -nvargs X=/user/ml/X.mtx
                                     TE=/user/ml/TE
                                     F=/user/ml/F
                                     R=/user/ml/R
                                     M=/user/ml/model.csv
                                     T=/user/ml/test.csv
                                     COV=/user/ml/var-covar.csv
                                     XO=/user/ml/X-sorted.mtx
                                     fmt=csv
hadoop jar SystemML.jar -f Cox.dml
                        -nvargs X=/user/ml/X.mtx
                                TE=/user/ml/TE
                                F=/user/ml/F
                                R=/user/ml/R
                                M=/user/ml/model.csv
                                T=/user/ml/test.csv
                                COV=/user/ml/var-covar.csv
                                RT=/user/ml/recoded-timestamps.csv
                                XO=/user/ml/X-sorted.csv
                                MF=/user/ml/baseline.csv
                                alpha=0.01
                                tol=0.000001
                                moi=100
                                mii=20
                                fmt=csv
$SPARK_HOME/bin/spark-submit --master yarn
                             --deploy-mode cluster
                             --conf spark.driver.maxResultSize=0
                             SystemML.jar
                             -f Cox.dml
                             -config SystemML-config.xml
                             -exec hybrid_spark
                             -nvargs X=/user/ml/X.mtx
                                     TE=/user/ml/TE
                                     F=/user/ml/F
                                     R=/user/ml/R
                                     M=/user/ml/model.csv
                                     T=/user/ml/test.csv
                                     COV=/user/ml/var-covar.csv
                                     RT=/user/ml/recoded-timestamps.csv
                                     XO=/user/ml/X-sorted.csv
                                     MF=/user/ml/baseline.csv
                                     alpha=0.01
                                     tol=0.000001
                                     moi=100
                                     mii=20
                                     fmt=csv

Cox Prediction:

hadoop jar SystemML.jar -f Cox-predict.dml
                        -nvargs X=/user/ml/X-sorted.mtx
                                RT=/user/ml/recoded-timestamps.csv
                                M=/user/ml/model.csv
                                Y=/user/ml/Y.mtx
                                COV=/user/ml/var-covar.csv
                                MF=/user/ml/baseline.csv
                                P=/user/ml/predictions.csv
                                fmt=csv
$SPARK_HOME/bin/spark-submit --master yarn
                             --deploy-mode cluster
                             --conf spark.driver.maxResultSize=0
                             SystemML.jar
                             -f Cox-predict.dml
                             -config SystemML-config.xml
                             -exec hybrid_spark
                             -nvargs X=/user/ml/X-sorted.mtx
                                     RT=/user/ml/recoded-timestamps.csv
                                     M=/user/ml/model.csv
                                     Y=/user/ml/Y.mtx
                                     COV=/user/ml/var-covar.csv
                                     MF=/user/ml/baseline.csv
                                     P=/user/ml/predictions.csv
                                     fmt=csv

Details

In the Cox PH regression model, the relationship between the hazard function — i.e., the probability of event occurrence at a given time — and the covariates is described as

where the hazard function for the $i$th individual () depends on a set of $p$ covariates , whose importance is measured by the magnitude of the corresponding coefficients . The term is the baseline hazard and is related to a hazard value if all covariates equal

  1. In the Cox PH model the hazard function for the individuals may vary over time, however the baseline hazard is estimated non-parametrically and can take any form. Note that re-writing (1) we have

Thus, the Cox PH model is essentially a linear model for the logarithm of the hazard ratio and the hazard of event for any individual is a constant multiple of the hazard of any other. We follow similar notation and methodology as in Section 3 of [Collett2003]. For completeness we briefly discuss the equations used in our implementation.

Factors in the model. Note that if some of the feature variables are factors they need to dummy code as follows. Let $\alpha$ be such a variable (i.e., a factor) with $a$ levels. We introduce $a-1$ indicator (or dummy coded) variables $X_2,X_3\ldots,X_a$ with $X_j=1$ if $\alpha=j$ and 0 otherwise, for . In particular, one of $a$ levels of $\alpha$ will be considered as the baseline and is not included in the model. In our implementation, user can specify a baseline level for each of the factor (as selecting the baseline level for each factor is arbitrary). On the other hand, if for a given factor $\alpha$ no baseline is specified by the user, the most frequent level of $\alpha$ will be considered as the baseline.

Fitting the model. We estimate the coefficients of the Cox model via negative log-likelihood method. In particular the Cox PH model is fitted by using trust region Newton method with conjugate gradient [Nocedal2006]. Define the risk set $R(t_j)$ at time $t_j$ to be the set of individuals who die at time $t_i$ or later. The PH model assumes that survival times are distinct. In order to handle tied observations we use the Breslow approximation of the likelihood function

where $d_j$ is number individuals who die at time $t_j$ and $s_j$ denotes the element-wise sum of the covariates for those individuals who die at time $t_j$, $j=1,2,\ldots,r$, i.e., the $h$th element of $s_j$ is given by , where $x_{hjk}$ is the value of $h$th variable () for the $k$th of the $d_j$ individuals () who die at the $j$th death time ().

Standard error and confidence interval for coefficients. Note that the variance-covariance matrix of the estimated coefficients $\hat{\beta}$ can be approximated by the inverse of the Hessian evaluated at $\hat{\beta}$. The square root of the diagonal elements of this matrix are the standard errors of estimated coefficients. Once the standard errors of the coefficients $se(\hat{\beta})$ is obtained we can compute a $100(1-\alpha)\%$ confidence interval using , where $z_{\alpha/2}$ is the upper $\alpha/2$-point of the standard normal distribution. In Cox.dml, we utilize the built-in function inv() to compute the inverse of the Hessian. Note that this build-in function can be used only if the Hessian fits in the main memory of a single machine.

Wald test, likelihood ratio test, and log-rank test. In order to test the null hypothesis that all of the coefficients $\beta_j$s are 0, our implementation provides three statistical test: Wald test, likelihood ratio test, the log-rank test (also known as the score test). Let $p$ be the number of coefficients. The Wald test is based on the test statistic ${\hat{\beta}}^2/{se(\hat{\beta})}^2$, which is compared to percentage points of the Chi-squared distribution to obtain the $P$-value. The likelihood ratio test relies on the test statistic ($\textbf{0}$ denotes a zero vector of size $p$ ) which has an approximate Chi-squared distribution with $p$ degrees of freedom under the null hypothesis that all $\beta_j$s are 0. The Log-rank test is based on the test statistic $l=\nabla^\top L(\textbf{0}) {\mathcal{H}}^{-1}(\textbf{0}) \nabla L(\textbf{0})$, where $\nabla L(\textbf{0})$ is the gradient of $L$ and $\mathcal{H}(\textbf{0})$ is the Hessian of $L$ evaluated at 0. Under the null hypothesis that $\beta=\textbf{0}$, $l$ has a Chi-squared distribution on $p$ degrees of freedom.

Prediction. Once the parameters of the model are fitted, we compute the following predictions together with their standard errors

Given feature vector for individual , we obtain the above predictions at time as follows. The linear predictors (denoted as ) as well as the risk (denoted as $\mathcal{R}$) are computed relative to a baseline whose feature values are the mean of the values in the corresponding features. Let , where is a row vector that contains the mean values for each feature. We have and . The standard errors of the linear predictors are computed as the square root of and the standard error of the risk are given by the square root of , where is the variance-covariance matrix of the coefficients and is the element-wise multiplication.

We estimate the cumulative hazard function for individual $i$ by

where is the Breslow estimate of the cumulative baseline hazard given by

In the equation above, as before, $d_j$ is the number of deaths, and is the risk set at time , for , . The standard error of is obtained using the estimation

where

for .

Returns

Below we list the results of fitting a Cox regression model stored in matrix $M$ with the following schema:

Note that above $Z$ is the Wald test statistic which is asymptotically standard normal under the hypothesis that $\beta=\textbf{0}$.

Moreover, Cox.dml outputs two log files S and T containing a summary statistics of the fitted model as follows. File S stores the following information

Above, the AIC is computed as in Stepwise Linear Regression, the Cox & Snell Rsquare is equal to , where $l$ is the log-rank test statistic as discussed above and $n$ is total number of observations, and the maximum possible Rsquare computed as , where $L(\textbf{0})$ denotes the initial likelihood.

File T contains the following information

Additionally, the following matrices will be stored. Note that these matrices are required for prediction.

Prediction. Finally, the results of prediction is stored in Matrix $P$ with the following schema