Implementing the Cox model in R

Bridging the gap between math and code

Akshay Swaminathan
Analytics Vidhya



Motivation

There are many pedagogical resources that explain the theory behind the Cox model. There are also many resources that provide guidance on applying the Cox model using existing statistical software packages for survival analysis. However, there are few sources that bridge the gap between theory and practice by explaining the programmatic implementation of the Cox model. In this article, I 1) explain the key mathematical results that are necessary to implement the Cox model, 2) illustrate sample code that can be used to implement the Cox model, and 3) compare the output of the manually implemented function to that of survival::coxph().

Introduction

Estimating almost any (semi) parametric regression model follows a similar procedure:

  1. Specify a model with parameters β that we want to estimate
  2. Determine either a) a loss function to minimize, or b) a likelihood to maximize
  3. Find the maximum (for likelihood) or minimum (for loss function) of the expression above by taking the derivative with respect to the parameters of interest and setting it to zero (using calculus)
  4. Find the values of the parameters of interest (β̂) that yield a derivative of zero (using either algebra or numerical optimization)

Every regression model has its own quirks that may make any one of these steps more or less difficult. For instance, the loss function for linear regression (squared-error) has a “clean” derivative, which means step (4) is just a matter of solving for the optimal parameter values through algebra. Logistic regression, on the other hand, has a loss function with a “messy” derivative, which means step (4) requires the use of a numerical optimization algorithm as opposed to simple algebra.
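To make step (4) concrete for the linear regression case, here is a minimal illustration (using simulated data, not anything from this article) of solving the normal equations directly and checking the result against lm():

# Squared-error loss has a closed-form minimizer: the normal equations
# beta_hat = (X'X)^(-1) X'y
set.seed(1)
n <- 100
X <- cbind(1, rnorm(n))                      # design matrix with an intercept
y <- as.vector(X %*% c(2, 3) + rnorm(n))     # true coefficients are 2 and 3

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y # step (4) via algebra, no iteration needed
cbind(beta_hat, coef(lm(y ~ X[, 2])))        # the two columns should match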

Here is an outline of the above four steps, specific to the Cox proportional hazards model.

  1. Our model is: λ(t|X) = λ₀(t)eˣᵝ
  2. We will maximize the likelihood (instead of minimizing a loss function). One quirk of the Cox model is that its estimation uses the so-called “partial likelihood” instead of the full likelihood.
  3. We will maximize the partial likelihood by setting the derivative to zero.
  4. The derivative is “messy”, so we will use the Newton-Raphson algorithm to computationally arrive at parameter estimates that yield a derivative of zero (instead of solving for them algebraically).

1. Model specification

The Cox model is λ(t|X) = λ₀(t)eˣᵝ, where X is the covariate matrix, β is the vector of coefficients we want to estimate, and λ₀ is the so-called baseline hazard function. The hazard function can be interpreted as a patient’s “risk” of failure at time t. Mathematically, it is defined as follows:

λ(t) = lim_{Δt → 0} P(t ≤ T < t + Δt | T ≥ t) / Δt = f(t) / S(t)

where f(t) is the PDF of T, the survival times for all patients, and S(t) = P(T > t) is the survival function.

When the covariates in the Cox model are zero, the patient’s hazard function λ(t|X) is equal to the baseline hazard function λ₀(t). Non-zero covariates affect the patient’s hazard by multiplying the baseline hazard by the eˣᵝ term.
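As a quick numerical illustration (the baseline hazard and coefficients below are made up, not estimated from any data), covariates act multiplicatively on the baseline hazard:

# Hypothetical values for illustration only
baseline_hazard <- 0.05               # lambda_0(t), assumed constant here
beta <- c(0.7, -0.2)                  # hypothetical coefficients
x_patient <- c(1, 3)                  # one patient's covariates

# lambda(t | X) = lambda_0(t) * exp(X * beta)
baseline_hazard * exp(sum(x_patient * beta))   # 0.05 * exp(0.1) ≈ 0.055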

For our purposes, this explanation will suffice. Many other sources have thoroughly explained the Cox model specification in more detail (see Therneau and Grambsch 2000 for example).

2. Specifying the likelihood

A key characteristic of the Cox model is that it can be expressed as follows:

λ(t|X) = g(t)h(X), where g(t) = λ₀(t) and h(X) = eˣᵝ

where g is the baseline hazard function and h is the exponentiated covariates and betas. Importantly, g does not depend on X and h does not depend on t. This allows us to compare the hazard functions of two different patients without having to estimate g(t):

λ(t|X₁) / λ(t|X₂) = h(X₁) / h(X₂) = e^((X₁ − X₂)β)

This allows for so-called semi-parametric inference, wherein we parameterize h(X) but allow g(t) to be more free-form.

We see this idea in play when we specify the likelihood L(λ₀, β). Cox showed that L(λ₀, β) can be expressed as L₁(λ₀, β) * L₂(β), where the β̂ that maximizes L₂ is an asymptotically normal, unbiased estimator of β, whereas L₁ contains relatively little information about β. Cox essentially proposed ignoring L₁ and basing all inference on maximizing L₂. Thus, L₂ can be called the partial likelihood. See here for a good description of this.

For simplicity, we will refer to L₂ simply as L, but with the understanding that L represents the partial likelihood. Cox specified the contribution of patient j to the partial likelihood as follows:

Lⱼ(β) = P(patient j fails at Tⱼ | j ∈ R(Tⱼ)) / Σ_{k ∈ R(Tⱼ)} P(patient k fails at Tⱼ | k ∈ R(Tⱼ))

The numerator is the probability that patient j fails at time Tⱼ (patient j’s failure time), given that patient j was at risk for death at time Tⱼ. Being in the risk set R(Tⱼ) essentially means that the patient hasn’t failed yet and hasn’t yet been censored. The denominator is the sum, over all patients k in R(Tⱼ), of the probability that patient k fails at Tⱼ given that k is in the risk set R(Tⱼ).

Mathematically, this can be represented as follows:

Lⱼ(β) = λ(Tⱼ|xⱼ) / Σ_{k ∈ R(Tⱼ)} λ(Tⱼ|xₖ) = e^(xⱼβ) / Σ_{k ∈ R(Tⱼ)} e^(xₖβ)

(the baseline hazard λ₀(Tⱼ) appears in both the numerator and the denominator and cancels).

The entire partial likelihood can be written as:

L(β) = Πⱼ e^(xⱼβ) / Σ_{k ∈ R(Tⱼ)} e^(xₖβ)

If the dataset has censoring, the likelihood expression can be modified:

L(β) = Πⱼ [ e^(xⱼβ) / Σ_{k ∈ R(Tⱼ)} e^(xₖβ) ]^δⱼ

where δⱼ is 0 if the patient is censored and 1 otherwise. For simplicity, we will not consider censoring in the rest of the mathematical results or code.

3. Maximizing the likelihood

A note about notation before proceeding:
Since we will eventually be implementing the math that follows in code, it’s important to be clear about notation, particularly matrix/vector algebra:

1. X is an n × p matrix representing p covariates for n patients
2. xⱼ is a 1 × p vector representing the p covariates for patient j
3. β is a p × 1 vector representing the coefficients for the p covariates
4. Based on the above notation, Xβ yields an n × 1 vector, and xⱼβ yields a vector of length 1 (a scalar).

As in many cases of MLE, maximizing the log-likelihood is easier than, and equivalent to, maximizing the likelihood, since f(x) = log(x) is monotonically increasing. The partial log-likelihood for the Cox model is:

ℓ(β) = Σⱼ [ xⱼβ − log( Σ_{k ∈ R(Tⱼ)} e^(xₖβ) ) ]
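To connect this formula to code early, here is a minimal sketch of the partial log-likelihood, assuming no censoring and no delayed entry so that the risk set at Tⱼ is simply everyone whose observed time is at least Tⱼ (illustrative only, not part of the final implementation below):

# Partial log-likelihood: sum over patients j of x_j * beta minus the log of
# the sum of exp(x_k * beta) over the risk set at T_j
CoxLogLik <- function(beta, Xs, Ts) {
  sum(sapply(seq_along(Ts), function(j) {
    at_risk <- which(Ts >= Ts[j])    # risk set R(T_j), ignoring censoring and entry times
    Xs[j, ] %*% beta - log(sum(exp(Xs[at_risk, , drop = FALSE] %*% beta)))
  }))
}

A larger value of CoxLogLik() means a better-fitting β, which is exactly what the Newton-Raphson iterations below will climb toward.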

Taking the derivative with respect to β:

∂ℓ/∂β = Σⱼ [ xⱼ − ( Σ_{k ∈ R(Tⱼ)} xₖe^(xₖβ) ) / ( Σ_{k ∈ R(Tⱼ)} e^(xₖβ) ) ]

Note that since we are taking the derivative with respect to β, which is a p × 1 vector, we would expect a p × 1 dimensional derivative, or gradient. Treating xⱼ and xₖ as p × 1 column vectors here, xₖe^(xₖβ) has dimensions p × 1 and e^(xₖβ) has dimensions 1 × 1, so the overall gradient has dimensions p × 1.

At this point, we have technically done all the math required to implement the Cox model. However, the above gradient formula would only allow us to estimate β̂, not its variance. To estimate the variance of β̂, we can use the negative inverse of the second derivative of the log-likelihood:

Vâr(β̂) ≈ −[∂²ℓ/∂β²]⁻¹

Deriving the second derivative involves the quotient rule (and some chain rule).

The second derivative of the log-likelihood with respect to β is:

∂²ℓ/∂β² = −Σⱼ [ ( Σ_{k ∈ R(Tⱼ)} e^(xₖβ) )( Σ_{k ∈ R(Tⱼ)} xₖ²e^(xₖβ) ) − ( Σ_{k ∈ R(Tⱼ)} xₖe^(xₖβ) )² ] / ( Σ_{k ∈ R(Tⱼ)} e^(xₖβ) )²

where the squares are taken element-wise, as explained below.

Taking this derivative with respect to β requires some care with matrix dimensions. Computing ∂/∂β [xₖe^(xₖβ)] requires multiplying xₖe^(xₖβ) by xₖ. Since the second derivative must have dimensions p × 1, ∂/∂β Σ_{k ∈ R(Tⱼ)} xₖe^(xₖβ) must also have dimensions p × 1. Keeping this in mind, we can see that ∂/∂β [xₖe^(xₖβ)] must be xₖ²e^(xₖβ), where xₖ² is not xₖᵀxₖ or xₖxₖᵀ, but rather the element-wise square of xₖ. This is the only way we can preserve the desired dimensions.

We can plug this second derivative into the variance formula to estimate the variance. Now we have all the mathematical results needed to implement our model!

4. Find the parameter values that zero the gradient (and maximize the likelihood)

Our equation for setting the derivative of the partial log-likelihood to zero is:

Σⱼ [ xⱼ − ( Σ_{k ∈ R(Tⱼ)} xₖe^(xₖβ) ) / ( Σ_{k ∈ R(Tⱼ)} e^(xₖβ) ) ] = 0

It is impossible to solve for β analytically, so we will employ the Newton-Raphson algorithm to find the optimal values for β.

The Newton-Raphson algorithm is a variant of the standard gradient descent procedure (here used to ascend, since we are maximizing):

1. Start with some initial guess β̂₀
2. Compute the gradient of the likelihood at β = β̂₀
3. Update your initial guess β̂₁ = β̂₀ + θ * U(β̂₀) where U is the gradient and θ is the learning rate
4. Repeat steps 2–3 until U(β̂ₙ) converges (hopefully to zero!)

The intuition behind gradient descent is as follows: we can visualize the likelihood function as a curve with some maximum. Your initial guess β̂₀ will put you somewhere on that curve, but you likely won’t be at the maximum (unless you get really lucky with your initial guess). To figure out whether you’re on an upward sloping part or downward sloping part of the curve, take the gradient. If it’s positive, you’re on an upward sloping part and you should continue moving in the positive direction to reach the maximum. If it’s negative, you’re on a downward sloping part and you should move in the negative direction to reach the maximum. Step 3 updates the initial guess by adding the scaled gradient to the initial guess. The scaling parameter θ is the learning rate, and is usually a small number between 0 and 1.

The only difference in Newton-Raphson is that θ = Vâr(β̂ₙ₋₁). This is computationally more expensive, because you have to compute the variance at each step, but leads to faster convergence.
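To see the update rule in isolation, here is a toy one-dimensional example (maximizing the log-likelihood of a normal mean with known variance, not the Cox model), where the step size is the variance, i.e. the inverse of the negative second derivative:

# Toy Newton-Raphson: maximize l(mu) = -sum((y - mu)^2) / 2
set.seed(42)
y <- rnorm(50, mean = 3)

U <- function(mu) sum(y - mu)     # gradient (first derivative of l)
V <- function(mu) 1 / length(y)   # variance = inverse of the negative second derivative

mu_hat <- 0                       # initial guess
for (i in 1:10) {
  mu_hat <- mu_hat + V(mu_hat) * U(mu_hat)   # theta = Var(mu_hat)
}
c(newton_raphson = mu_hat, closed_form = mean(y))   # the two should agree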

Implementation

Setting Parameters

library(survival)
library(ggplot2)
library(dplyr)

Newton-Raphson converges quickly, so we can set the number of iterations to 10. Therneau and Grambsch recommend setting β̂₀ = 0.

ITERS <- 10
INITIAL_BETA <- 0

Helper Functions

We can write helper functions for:

  1. Finding the risk set R(t) given some time t
GetRiskSet <- function(time_of_interest,
                       entry_vector,
                       time_vector,
                       event_vector) {
  # A patient is at risk at time t if they have entered the study by t and
  # either fail exactly at t or are still being followed after t
  return(which((time_of_interest >= entry_vector) &
                 ((time_vector == time_of_interest & event_vector == 1) |
                    (time_vector > time_of_interest))))
}
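As a quick check with made-up numbers, the risk set at time 3 should contain the patient who fails at exactly 3 plus everyone with a later time:

# Toy data: four patients, no delayed entry, no censoring
GetRiskSet(time_of_interest = 3,
           entry_vector = c(0, 0, 0, 0),
           time_vector  = c(2, 3, 4, 5),
           event_vector = c(1, 1, 1, 1))
#> [1] 2 3 4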

2. Calculating the gradient of the log-likelihood

CoxGradient <- function(beta,
                        Xs,
                        entry,
                        Ts,
                        event) {
  p <- ncol(Xs)

  # For each patient j: x_j minus (sum of x_k * exp(x_k beta) over the risk set)
  # divided by (sum of exp(x_k beta) over the risk set); then sum over j
  gradient <- apply(cbind(Ts, Xs), 1,
                    function(df) {

                      df <- matrix(df, nrow = 1)
                      ts <- df[, 1]
                      xs <- df[, 2:ncol(df)]
                      X_risk_set <- Xs[GetRiskSet(ts, entry, Ts, event), ] %>%
                        matrix(ncol = ncol(Xs))

                      t1 <- t(X_risk_set) %*% exp(X_risk_set %*% beta)
                      t2 <- sum(exp(X_risk_set %*% beta))

                      return(xs - t1 / t2)

                    }) %>%
    matrix(nrow = p) %>%
    rowSums()

  return(gradient)

}
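To see this helper in action, here is a hypothetical call on the veteran data (loaded here just for illustration; the formal comparison comes later), evaluating the gradient at the starting value β̂₀ = 0:

data(veteran)
X <- as.matrix(veteran[, c("trt", "karno", "age")])
CoxGradient(beta  = matrix(0, nrow = ncol(X)),   # beta_0 = 0 for all three covariates
            Xs    = X,
            entry = rep(0, nrow(X)),             # everyone enters at time 0
            Ts    = veteran$time,
            event = rep(1, nrow(X)))             # censoring ignored, as in the math above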

3. Calculating the variance of β̂

CoxVariance <- function(beta,
                        Xs,
                        entry,
                        Ts,
                        event) {

  p <- ncol(Xs)

  # Element-wise (diagonal) second derivative of the log-likelihood,
  # computed patient by patient and then summed over patients
  variance <- apply(cbind(Ts, Xs), 1,
                    function(df) {

                      df <- matrix(df, nrow = 1)
                      ts <- df[, 1]
                      xs <- df[, 2:ncol(df)]
                      X_risk_set <- Xs[GetRiskSet(ts, entry, Ts, event), ] %>%
                        matrix(ncol = ncol(Xs))

                      sum1 <- sum(exp(X_risk_set %*% beta))
                      sum2 <- rowSums(t(X_risk_set^2) %*% exp(X_risk_set %*% beta))
                      sum3 <- rowSums(t(X_risk_set) %*% exp(X_risk_set %*% beta))^2

                      return(- (sum1 * sum2 - sum3) / sum1^2)

                    }) %>%
    matrix(nrow = p) %>%
    rowSums()

  # The variance is the negative inverse of the second derivative
  return(-1 / variance)

}

4. Plotting our gradient across iterations to view convergence

GradientPlot <- function(gradients) {

  # Reshape the iterations x parameters matrix of gradients into long format
  gradient_track <- reshape2::melt(gradients) %>%
    `names<-`(c("iteration", "beta", "gradient"))

  p <- ggplot(gradient_track, aes(y = gradient, x = iteration)) +
    geom_line() +
    facet_wrap(~factor(beta))

  return(p)

}

5. Plotting our estimated β̂s compared to those from the survival::coxph() function

BetaPlot <- function(store_betas, store_variance, coxph_output) {

  # Long-format variance estimates by iteration and variable
  variance_track <- reshape2::melt(store_variance) %>%
    `names<-`(c("iteration", "variable", "coxph_AS_variance"))

  # Combine our beta estimates with the coxph() estimates and build 95% CIs
  plot_df <- reshape2::melt(store_betas) %>%
    `names<-`(c("iteration", "variable", "coxph_AS_beta")) %>%
    left_join(coxph_output, by = "variable") %>%
    left_join(variance_track, by = c("variable", "iteration")) %>%
    mutate(coxph_AS_lci = coxph_AS_beta - 1.96 * sqrt(coxph_AS_variance),
           coxph_AS_uci = coxph_AS_beta + 1.96 * sqrt(coxph_AS_variance))

  ggplot(plot_df) +
    geom_line(aes(y = coxph_AS_beta, x = iteration), size = .8) +
    geom_line(aes(y = coxph_AS_lci, x = iteration), linetype = "dashed", size = .8) +
    geom_line(aes(y = coxph_AS_uci, x = iteration), linetype = "dashed", size = .8) +
    geom_line(aes(y = coxph_beta, x = iteration, color = "red")) +
    geom_line(aes(y = coxph_lci, x = iteration), linetype = "dashed", color = "red") +
    geom_line(aes(y = coxph_uci, x = iteration), linetype = "dashed", color = "red") +
    facet_wrap(~factor(variable_name)) +
    ylab("beta estimate") +
    theme(legend.position = "none")

}

6. A wrapper function to put it all together

coxph_AS <- function(formula, data){

  # Fit using coxph() for comparison
  model <- coxph(formula, data)
  cox_beta <- data.frame(variable = 1:length(coef(model)),
                         coxph_beta = coef(model))
  cox_ci <- cbind(1:length(coef(model)), confint(model)) %>%
    `colnames<-`(c("variable", "coxph_lci", "coxph_uci")) %>%
    data.frame()
  coxph_output <- left_join(cox_beta, cox_ci, by = "variable") %>%
    mutate(variable_name = names(coef(model)))

  # Extract the covariate matrix and the survival outcome from the formula
  x <- data %>% select(all.vars(formula[-1])) %>% as.matrix
  lhs <- setdiff(all.vars(formula), all.vars(formula[-1]))

  if (length(lhs) == 1) {        # no censoring
    times <- data %>% select(all.vars(formula)[1]) %>% as.matrix
    event <- rep(1, dim(times)[1]) %>% as.matrix
    entry <- rep(0, dim(times)[1]) %>% as.matrix
  } else if (length(lhs) == 2) { # censoring but no entry time specified
    times <- data %>% select(all.vars(formula)[1]) %>% as.matrix
    event <- data %>% select(all.vars(formula)[2]) %>% as.matrix
    entry <- rep(0, dim(times)[1]) %>% as.matrix
  } else if (length(lhs) == 3) { # censoring with an entry time specified
    times <- data %>% select(all.vars(formula)[2]) %>% as.matrix
    event <- data %>% select(all.vars(formula)[3]) %>% as.matrix
    entry <- data %>% select(all.vars(formula)[1]) %>% as.matrix
  }

  # Initialize matrices to track the gradient, betas, and variances
  store_gradient <- matrix(NA, nrow = ITERS, ncol = ncol(x))
  store_betas    <- matrix(NA, nrow = ITERS, ncol = ncol(x))
  store_variance <- matrix(NA, nrow = ITERS, ncol = ncol(x))
  beta <- matrix(rep(INITIAL_BETA, ncol(x)), nrow = ncol(x))

  # Newton-Raphson iterations: beta_new = beta_old + gradient * variance
  for (i in 1:ITERS) {
    store_gradient[i, ] <- CoxGradient(beta, x, entry, times, event)
    store_variance[i, ] <- CoxVariance(beta, x, entry, times, event)
    store_betas[i, ] <- beta <- beta + store_gradient[i, ] * store_variance[i, ]
  }

  # Plots
  beta_plot <- BetaPlot(store_betas, store_variance, coxph_output)
  gradient_plot <- GradientPlot(store_gradient)

  # Final output data.frame comparing our estimates to coxph()
  coxph_output$coxph_AS_beta <- store_betas[i, ]
  coxph_output$coxph_AS_lci <- store_betas[i, ] - 1.96 * sqrt(store_variance[i, ])
  coxph_output$coxph_AS_uci <- store_betas[i, ] + 1.96 * sqrt(store_variance[i, ])

  coxph_output <- coxph_output %>%
    transmute(variable = variable_name,
              beta = coxph_beta,
              beta_AS = coxph_AS_beta,
              LCI = coxph_lci,
              LCI_AS = coxph_AS_lci,
              UCI = coxph_uci,
              UCI_AS = coxph_AS_uci)

  return(list(beta_plot = beta_plot,
              gradient_plot = gradient_plot,
              model_output = coxph_output))
}

Testing it out

data(veteran)
set.seed(12345)
sim <- coxph_AS(Surv(time) ~ trt + karno + age, data = veteran)
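The wrapper returns a list, so we can pull out the comparison table and the two diagnostic plots directly:

sim$model_output    # data.frame comparing our estimates (suffix "_AS") to coxph()
sim$beta_plot       # beta estimates and CIs across iterations, overlaid on coxph() (in red)
sim$gradient_plot   # gradient across iterations, which should approach zero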

Comparing the numerical outputs from the two models shows that our estimates (with suffix “_AS”) match survival::coxph() pretty closely.

We can also see how our β̂ estimates and corresponding confidence intervals compare to survival::coxph() (in red).

Finally, we can see how (quickly!) our gradient converges to zero.

References

Therneau, T. M., & Grambsch, P. M. (2000). Modeling Survival Data: Extending the Cox Model. Springer.
