# Survival Analysis: Time-to-Event Data in Medicine {#sec-survival}
```{r}
#| label: setup-survival
#| include: false
library(tidyverse)
library(survival)
library(survminer)
library(knitr)
opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
## Why Ordinary Regression Fails for Time-to-Event Data
In medicine, we frequently ask questions like: *How long until a patient relapses after chemotherapy? How long does a hip replacement last before revision surgery is needed? What is the median survival time after a Stage IV lung cancer diagnosis?*
These are **time-to-event** questions. The outcome is not simply "did the event happen?" (which would be logistic regression) nor is it a continuous measurement at a fixed time point (which would be linear regression). The outcome is *how long* until something happens.
You might think: "Why not just use linear regression with time as the outcome?" There is a fundamental problem — **censoring**.
### The Censoring Problem
Imagine a clinical trial that follows 100 patients for 5 years after a cancer diagnosis. At the end of the study:
- 40 patients have died (we know their exact survival time).
- 30 patients are still alive at year 5 (we know they survived *at least* 5 years, but we do not know when they will die).
- 20 patients moved away or were lost to follow-up at various time points.
- 10 patients withdrew from the study.
For the last 60 patients, we have **incomplete information**. We know they survived *at least* until the last time we observed them, but not how much longer. This is **right censoring** — the event of interest (death) may occur to the right of (after) our last observation.
If we simply excluded these 60 patients, we would introduce severe bias — we would systematically underestimate survival times because the long-survivors are disproportionately censored. If we treated their last-observed time as their event time, we would also underestimate survival. Survival analysis methods were developed precisely to handle this problem.
### Types of Censoring
- **Right censoring** (most common): The event has not yet occurred when observation ends. The true event time is to the right of the censoring time.
- **Left censoring**: The event occurred *before* observation began (e.g., a patient already had the disease at enrollment, but we do not know when it started).
- **Interval censoring**: The event is known to have occurred within an interval but the exact time is unknown (e.g., a tumor detected at a scheduled 6-month scan — it appeared sometime between the last scan and this one).
In this chapter, we focus primarily on right censoring, which dominates clinical research.
## The Kaplan-Meier Estimator
The **Kaplan-Meier (KM) estimator** is the workhorse of survival analysis. It provides a **non-parametric** estimate of the survival function $S(t) = P(T > t)$, the probability of surviving beyond time $t$.
::: {.callout-note}
## "Non-parametric" — what it means here
A **non-parametric** method makes no assumption about the *shape* of the underlying distribution. A *parametric* survival model (e.g., a Weibull or exponential model) assumes the survival curve follows a specific mathematical formula governed by a few parameters; you are essentially fitting a smooth, pre-specified curve to the data. The Kaplan-Meier estimator does the opposite: it lets the data speak for themselves, building the survival curve directly from the observed event times as a series of steps, without assuming any particular form. The advantage is flexibility and honesty — the curve reflects exactly what happened in your cohort. The cost is that it cannot extrapolate beyond the observed follow-up and uses the data less efficiently than a correctly-specified parametric model.
:::
### How It Works
The KM estimator calculates survival probability at each time point where an event occurs:
$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{n_i}\right)$$
where:
- $t_i$ is a time at which at least one event occurred,
- $d_i$ is the number of events at time $t_i$,
- $n_i$ is the number of individuals at risk just before time $t_i$ (alive and uncensored).
**The intuition (for a clinical audience).** Forget the product symbol for a moment and think of survival as running a gauntlet. At each moment when a death occurs, ask a simple question: *of the patients still alive and being followed right now, what fraction made it through this instant?* That fraction is $\left(1 - \frac{d_i}{n_i}\right)$ --- the number who died ($d_i$) out of those still at risk ($n_i$), subtracted from one. To survive *to* time $t$, a patient must clear every one of these hurdles in sequence, so we **multiply** the individual survival fractions together (that is what the $\prod$ symbol means --- a running product). This is exactly how you would compute the chance of passing several independent checkpoints: multiply the chance of passing each.
A worked snippet makes it concrete. Suppose 100 patients are at risk, 10 die on day 30, then of the 90 remaining 9 die on day 60. The survival probability after day 30 is $1 - 10/100 = 0.90$; after day 60 it is $0.90 \times (1 - 9/90) = 0.90 \times 0.90 = 0.81$. The curve only steps *down* at the moments events occur, and stays flat in between --- which is why KM curves look like staircases.
The key insight is that censored individuals contribute to the **risk set** ($n_i$) up until the moment they are censored, and then they quietly drop out (they reduce $n_i$ for later time points but never count as an event $d_i$). They are not discarded --- they contribute all the information they can, namely "this patient was still event-free up to here."
### Clinical Example: Primary Biliary Cholangitis (PBC)
We will use the classic PBC dataset from the Mayo Clinic, which is included in R's `survival` package. This dataset contains 418 patients with primary biliary cholangitis (PBC), a chronic liver disease. Patients were followed from enrollment until death, transplant, or censoring.
::: {.callout-important}
## Packages you need to load first
The R code in this chapter relies on three packages. If you are running it yourself, install them once and then load them at the top of your script --- the code will error otherwise (for example, the `%>%` pipe comes from `dplyr`/`tidyverse`, and `ggsurvplot()` comes from `survminer`):
```r
# Run once if not already installed:
# install.packages(c("survival", "survminer", "tidyverse"))
library(survival) # core survival analysis: Surv(), survfit(), coxph(), the pbc data
library(survminer) # ggsurvplot() for publication-quality Kaplan-Meier plots
library(dplyr) # the %>% pipe and data wrangling (loaded via tidyverse too)
```
`survminer` in particular is a separate package from `survival` and is easy to forget --- without it, `ggsurvplot()` will not be found.
:::
::: {.panel-tabset}
#### R
```{r}
#| label: km-r
# Load the packages used below (see the note above for installation)
library(survival)
library(survminer)
library(dplyr)
# Load the PBC dataset
data(pbc, package = "survival")
# Prepare data — status: 0 = censored, 1 = transplant, 2 = dead
# For now, treat death as the event and censor everything else
pbc_clean <- pbc %>%
filter(!is.na(trt)) %>%
mutate(
status_binary = ifelse(status == 2, 1, 0),
time_years = time / 365.25,
trt_label = factor(trt, labels = c("D-penicillamine", "Placebo"))
)
cat("N =", nrow(pbc_clean), "\n")
cat("Events (deaths):", sum(pbc_clean$status_binary), "\n")
cat("Censored:", sum(pbc_clean$status_binary == 0), "\n")
# Fit overall KM curve
km_overall <- survfit(Surv(time_years, status_binary) ~ 1, data = pbc_clean)
print(km_overall)
```
```{r}
#| label: km-plot-r
#| fig-cap: "Kaplan-Meier survival curve for the PBC dataset. Tick marks indicate censored observations."
# Plot KM curve with confidence intervals
ggsurvplot(
km_overall,
data = pbc_clean,
conf.int = TRUE,
risk.table = TRUE,
xlab = "Time (years)",
ylab = "Survival probability",
title = "Overall Survival in PBC Patients",
ggtheme = theme_minimal()
)
```
#### Python
```{python}
#| label: km-python
import pandas as pd
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test
import matplotlib.pyplot as plt
# Load PBC data (from R via CSV or use lifelines datasets)
# We'll replicate the data preparation
from lifelines.datasets import load_rossi # placeholder — in practice, load PBC
# For demonstration, we use the PBC data exported from R or a similar dataset
# Here we simulate the structure:
import subprocess
# In practice, you would load the PBC data. We use lifelines' built-in loader:
try:
pbc = pd.read_csv("data/pbc.csv")
except FileNotFoundError:
# Create from R's dataset structure for demonstration
from lifelines.datasets import load_regression_dataset
print("Note: Using lifelines example data for demonstration.")
print("In practice, load the PBC dataset from the survival R package.")
# Kaplan-Meier estimation
kmf = KaplanMeierFitter()
# Example with generic time-to-event data
np.random.seed(42)
n = 312
time_years = np.random.exponential(scale=8, size=n)
event_observed = np.random.binomial(1, 0.45, size=n)
kmf.fit(time_years, event_observed=event_observed, label="PBC Patients")
print(kmf.summary.head(10))
print(f"\nMedian survival time: {kmf.median_survival_time_:.2f} years")
```
```{python}
#| label: km-plot-python
#| fig-cap: "Kaplan-Meier survival curve estimated using the lifelines package in Python."
fig, ax = plt.subplots(figsize=(8, 5))
kmf.plot_survival_function(ax=ax, ci_show=True)
ax.set_xlabel("Time (years)")
ax.set_ylabel("Survival probability")
ax.set_title("Kaplan-Meier Survival Estimate")
plt.tight_layout()
plt.show()
```
:::
### Interpreting the KM Curve
Key features to examine:
1. **Median survival time**: the time at which $\hat{S}(t) = 0.5$ (where the curve crosses the 50% line). If the curve never drops to 50%, the median is undefined.
2. **Survival at specific time points**: e.g., "5-year survival is 65%."
3. **Shape of the curve**: A steep early drop suggests high early mortality. A long flat tail suggests most survivors do well long-term.
4. **Confidence bands**: Wider bands mean more uncertainty, often because fewer patients remain at risk.
5. **Censoring marks** (tick marks on the curve): Heavy censoring near the tail means the late estimates are unreliable.
### Confidence Intervals for the KM Estimator
The survival curve is an *estimate* from a finite sample, so it comes with uncertainty --- the confidence band you see around a KM curve. The pointwise confidence interval for $\hat{S}(t)$ is computed via **Greenwood's formula**, which estimates the variance (the "wobble") of $\hat{S}(t)$:
$$\widehat{\text{Var}}[\hat{S}(t)] = \hat{S}(t)^2 \sum_{t_i \le t} \frac{d_i}{n_i(n_i - d_i)}$$
You do not need to compute this by hand --- `survfit()` does it for you --- but the formula carries one genuinely useful intuition, which is the only thing worth taking away from it. Look at each term in the sum: it grows when $n_i$ (the number still at risk) is **small**. Early in follow-up, $n_i$ is large and each term is tiny, so the band is tight. Late in follow-up, after many patients have died or been censored, $n_i$ shrinks, each term balloons, and the band fans out. **That is why the confidence band on every KM curve is narrow on the left and wide on the right**: the tail of the curve is estimated from only a handful of remaining patients, so it is far less reliable. When you read a KM plot, treat the wide right-hand tail with caution.
A 95% confidence interval is then approximately $\hat{S}(t) \pm 1.96 \sqrt{\widehat{\text{Var}}[\hat{S}(t)]}$. In practice, software uses a log-log transformation of this interval because it guarantees the limits stay within the valid [0, 1] range for a probability (a plain $\pm$ interval can stray below 0 or above 1 in the tails).
## The Log-Rank Test
The **log-rank test** compares survival curves between two or more groups. It is the most widely used test for this purpose.
The null hypothesis is that survival is identical across groups: $H_0: S_1(t) = S_2(t)$ for all $t$.
The test statistic compares the observed number of events in each group to the number expected under the null hypothesis, summed over all event times. It is most powerful when the **proportional hazards assumption** holds — that is, when one group has a consistently higher (or lower) hazard than the other across time.
::: {.panel-tabset}
#### R
```{r}
#| label: logrank-r
# Compare survival between treatment groups
km_by_trt <- survfit(Surv(time_years, status_binary) ~ trt_label, data = pbc_clean)
# Log-rank test
logrank <- survdiff(Surv(time_years, status_binary) ~ trt_label, data = pbc_clean)
print(logrank)
```
```{r}
#| label: logrank-plot-r
#| fig-cap: "Kaplan-Meier curves comparing D-penicillamine vs placebo in PBC patients."
ggsurvplot(
km_by_trt,
data = pbc_clean,
pval = TRUE,
conf.int = TRUE,
risk.table = TRUE,
xlab = "Time (years)",
ylab = "Survival probability",
legend.title = "Treatment",
palette = c("#E7B800", "#2E9FDF"),
ggtheme = theme_minimal()
)
```
#### Python
```{python}
#| label: logrank-python
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test
import numpy as np
np.random.seed(42)
# Simulate two treatment groups
n_per_group = 156
time_trt = np.random.exponential(scale=8.5, size=n_per_group)
event_trt = np.random.binomial(1, 0.42, size=n_per_group)
time_placebo = np.random.exponential(scale=7.5, size=n_per_group)
event_placebo = np.random.binomial(1, 0.48, size=n_per_group)
# Log-rank test
results = logrank_test(time_trt, time_placebo, event_trt, event_placebo)
print(f"Log-rank test statistic: {results.test_statistic:.3f}")
print(f"p-value: {results.p_value:.4f}")
# Plot both curves
fig, ax = plt.subplots(figsize=(8, 5))
kmf_trt = KaplanMeierFitter()
kmf_trt.fit(time_trt, event_trt, label="D-penicillamine")
kmf_trt.plot_survival_function(ax=ax)
kmf_placebo = KaplanMeierFitter()
kmf_placebo.fit(time_placebo, event_placebo, label="Placebo")
kmf_placebo.plot_survival_function(ax=ax)
ax.set_xlabel("Time (years)")
ax.set_ylabel("Survival probability")
ax.set_title("Survival by Treatment Group")
plt.tight_layout()
plt.show()
```
:::
### Reading the Log-Rank Output
The `survdiff()` output in R looks like a small table with one row per group, plus a chi-squared statistic and p-value at the bottom. Here is what each piece tells you and how to report it:
- **`N`** --- the number of patients in each group.
- **`Observed`** --- how many events (deaths) actually occurred in each group.
- **`Expected`** --- how many events you would have expected in each group *if survival were identical across groups* (the null hypothesis). This is the key comparison: it accounts for how long each group was followed and how many were at risk over time.
- **Observed vs Expected** --- the heart of the test. If a group has *more* observed deaths than expected, it is doing worse than the average; *fewer* than expected, better. The test asks whether the gap between observed and expected, accumulated across all event times, is larger than chance would produce.
- **`Chisq` and `p`** --- the test statistic and its p-value. A small p-value (conventionally < 0.05) means the survival curves differ more than would be expected by chance; you reject the hypothesis that the two groups have identical survival.
For the PBC treatment comparison above, the relevant question is whether D-penicillamine changed survival versus placebo. A **non-significant** p-value (as is the case in this dataset) means there is no evidence the drug altered survival --- which matches the historical conclusion that D-penicillamine is not effective in PBC. Report it as, for example: *"There was no significant difference in survival between D-penicillamine and placebo (log-rank p = 0.75)."* Crucially, a non-significant log-rank test is **not** proof the treatments are equivalent --- it may simply reflect insufficient events to detect a difference.
The `ggsurvplot()` figure shows the same comparison visually: two KM curves with the log-rank p-value annotated (`pval = TRUE`). If the curves sit almost on top of each other, that is the visual counterpart of a non-significant test.
::: {.callout-note}
## When the Log-Rank Test is Weak
The log-rank test has low power when survival curves cross (e.g., one treatment is better early but worse later) --- the early and late differences cancel out, hiding a real effect. In such cases, consider the **weighted log-rank test** (e.g., Fleming-Harrington, which can up-weight early or late differences) or the **restricted mean survival time (RMST)** approach, which compares the average event-free time up to a chosen horizon and does not rely on proportional hazards.
:::
## Cox Proportional Hazards Model
The **Cox proportional hazards (PH) model** is the most widely used regression model for survival data. Introduced by Sir David Cox in 1972, it models the **hazard function** --- the instantaneous rate of the event occurring at time $t$, given survival to that point.
::: {.callout-note}
## What is a "hazard", in plain terms?
The word *hazard* trips up most clinicians on first encounter, because it is not a probability and not a risk. Think of it as a **speedometer for risk**: at any instant $t$ (among patients who have survived to that instant), it measures *how fast* events are occurring right now. Like a car's speed, it is a *rate* (events per unit time), it applies to *this moment* rather than the whole journey, and it can change from moment to moment.
Contrast it with quantities you already use:
- A **risk** (or cumulative probability) is like the total distance travelled --- "35% of patients have died **by** 5 years." It accumulates over a period.
- A **risk ratio / relative risk (RR)** compares those totals between two groups over a fixed period --- "the treated group's 5-year risk is half the control group's."
- A **hazard** is the instantaneous speed, and a **hazard ratio (HR)** compares the *speeds* of two groups at any given moment.
Why bother with a speed rather than a simple risk ratio? Because censoring and varying follow-up make "total risk by time $t$" hard to compare directly --- patients are observed for different lengths of time. The hazard sidesteps this by asking, at each instant, only about the patients actually still at risk at that instant. The Cox model is built on this instantaneous quantity:
:::
$$h(t | X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)$$
where:
- $h(t | X)$ is the hazard at time $t$ for an individual with covariates $X$,
- $h_0(t)$ is the **baseline hazard** (an unspecified function of time),
- $\beta_1, \ldots, \beta_p$ are regression coefficients,
- $\exp(\beta_j)$ is the **hazard ratio** for a one-unit increase in $X_j$.
### Key Properties
1. **Semi-parametric**: The baseline hazard $h_0(t)$ is left unspecified — the model makes no assumption about the shape of the hazard over time. Only the relative effect of covariates is estimated.
2. **Proportional hazards assumption**: The ratio of hazards for any two individuals is constant over time. If patient A has twice the hazard of patient B at time 1, they also have twice the hazard at time 5.
3. **Hazard ratios**: $\text{HR} = \exp(\beta)$. An HR > 1 means increased hazard (worse prognosis); HR < 1 means decreased hazard (better prognosis); HR = 1 means no effect.
### Fitting the Cox Model
::: {.panel-tabset}
#### R
```{r}
#| label: cox-r
# Fit Cox model with multiple predictors
cox_model <- coxph(
Surv(time_years, status_binary) ~ age + sex + albumin + bili + edema,
data = pbc_clean
)
summary(cox_model)
```
```{r}
#| label: cox-forest-r
#| fig-cap: "Forest plot of hazard ratios from the Cox PH model for PBC patients."
# Forest plot of hazard ratios
ggforest(cox_model, data = pbc_clean)
```
#### Python
```{python}
#| label: cox-python
from lifelines import CoxPHFitter
import pandas as pd
import numpy as np
# Create a simulated clinical dataset
np.random.seed(42)
n = 312
df = pd.DataFrame({
'time_years': np.random.exponential(scale=8, size=n),
'event': np.random.binomial(1, 0.45, size=n),
'age': np.random.normal(50, 10, size=n),
'sex': np.random.binomial(1, 0.5, size=n),
'albumin': np.random.normal(3.5, 0.5, size=n),
'bili': np.random.exponential(scale=3, size=n),
'edema': np.random.choice([0, 0.5, 1], size=n, p=[0.75, 0.15, 0.10])
})
cph = CoxPHFitter()
cph.fit(df, duration_col='time_years', event_col='event',
formula='age + sex + albumin + bili + edema')
cph.print_summary()
```
```{python}
#| label: cox-forest-python
#| fig-cap: "Forest plot of hazard ratios from the Cox model fitted in Python."
fig, ax = plt.subplots(figsize=(8, 5))
cph.plot(ax=ax)
ax.set_title("Cox PH Model — Hazard Ratios")
plt.tight_layout()
plt.show()
```
:::
### Interpreting Hazard Ratios
Consider a Cox model result where bilirubin has $\hat{\beta} = 0.15$ (HR = 1.16, 95% CI: 1.10–1.23, p < 0.001).
Interpretation: "For each 1 mg/dL increase in serum bilirubin, the hazard of death increases by 16% (HR = 1.16, 95% CI: 1.10–1.23), adjusting for age, sex, albumin, and edema."
::: {.callout-warning}
## Hazard Ratios Are Not Risk Ratios
A hazard ratio of 2.0 does **not** mean "twice the risk of death." It means the instantaneous rate of death is twice as high at any given moment. Over long follow-up, the cumulative risk difference depends on the baseline hazard. Be precise in your language when reporting results.
:::
## Checking the Proportional Hazards Assumption
The proportional hazards assumption is critical. If it is violated, the hazard ratios from the Cox model do not have a stable interpretation — the "effect" of a covariate depends on when you look.
### Schoenfeld Residuals
The most common diagnostic tool is the **Schoenfeld residual test**. The intuition is simple: if a covariate's effect (its hazard ratio) is truly constant over time --- which is exactly what proportional hazards requires --- then a plot of its residuals against time should be a flat, horizontal cloud with no trend. If instead the residuals drift up or down over time, the covariate's effect is *changing* over time, and the PH assumption is violated for that covariate.
::: {.panel-tabset}
#### R
```{r}
#| label: ph-check-r
# Test proportional hazards assumption
ph_test <- cox.zph(cox_model)
print(ph_test)
```
```{r}
#| label: ph-plot-r
#| fig-cap: "Schoenfeld residual plots for each covariate. A non-flat trend suggests PH violation."
# Plot Schoenfeld residuals
par(mfrow = c(2, 3))
plot(ph_test)
```
#### Python
```{python}
#| label: ph-check-python
# Check proportional hazards assumption in lifelines
cph.check_assumptions(df, p_value_threshold=0.05, show_plots=True)
```
:::
**What to look for in the output.** The `cox.zph()` table in R gives one row per covariate plus a `GLOBAL` row, each with a chi-squared statistic and a **p-value**. Read it as a series of hypothesis tests where the null is "this covariate satisfies proportional hazards":
- A **large p-value (e.g., > 0.05)** is the *good* outcome here --- it means no evidence against proportional hazards, so the assumption is reasonable for that covariate.
- A **small p-value (< 0.05)** flags a covariate whose effect changes over time --- a PH violation you need to address (see the next section).
- The **`GLOBAL`** row tests all covariates jointly; if it is non-significant, the model as a whole is broadly consistent with PH.
Then back up the numbers with the **Schoenfeld residual plots**: for each covariate you want the smoothed line to be roughly **flat/horizontal**. A clear upward or downward slope is the visual signature of a violation --- for example, a covariate whose effect is strong early but fades with time will show a declining trend. A useful habit: do not rely on the p-value alone (with large samples even trivial deviations become "significant"); always eyeball the plot to judge whether the trend is practically meaningful, not just statistically detectable.
### What To Do When PH Is Violated
If the PH assumption is violated for a covariate, you have several options:
1. **Stratify** on the offending variable: `coxph(Surv(time, event) ~ x1 + strata(x2), data = df)`.
Stratification is the most common fix and deserves a proper explanation, because what it does is subtle. Recall that the Cox model assumes a single shared **baseline hazard** $h_0(t)$ that every patient's hazard is a multiple of. When you `strata(x2)`, you tell the model: *do not assume one common baseline --- let each level of `x2` have its own, separately-shaped baseline hazard*. So if you stratify on, say, hospital site or disease stage, patients in stage 1 follow one baseline time-course and stage-4 patients another, and the model no longer forces their hazards to stay in fixed proportion.
The important trade-off: a stratified variable **drops out of the results table** --- you no longer get a hazard ratio for it, because you are no longer modelling its effect, just absorbing it into the baseline. So stratify on variables you want to *adjust for* but are not interested in estimating (nuisance variables like centre, or a strong confounder that violates PH). Keep the variables you actually want HRs for as ordinary covariates. The other covariates (`x1` above) still get hazard ratios, now estimated within strata. A practical note: stratification works best for categorical variables with a few levels; a continuous variable must be binned into categories first, which loses information.
2. **Include a time interaction**: allow the effect to change over time by adding a term such as `x:time` (or `tt()` in `survival`). This estimates *how* the hazard ratio drifts --- useful when the time-dependence is itself of interest.
3. **Use time-varying covariates** (see below) when the covariate's *value* (not just its effect) changes during follow-up.
4. **Use an alternative model**: accelerated failure time (AFT) models or fully parametric models (Weibull, log-normal), which do not rest on proportional hazards at all.
## Time-Varying Covariates
In many clinical settings, covariates change over the course of follow-up. A patient's lab values, treatment status, or disease stage may evolve. The standard Cox model assumes covariates are fixed at baseline, but the model can be extended to handle **time-varying covariates**.
The idea is to split each patient's follow-up into intervals. Within each interval, the covariate values are fixed, but they can change between intervals. In R, this is accomplished using the `tmerge()` function from the `survival` package or by manually creating a "counting process" data format with start and stop times:
```r
# Conceptual example — counting process format
# PatientID start stop event lab_value
# 1 0 3 0 2.1
# 1 3 7 0 3.4
# 1 7 10 1 5.8 <- event at time 10
```
This is an advanced topic. The key takeaway is that the Cox model is flexible enough to accommodate covariates that change over time, but the data preparation is more complex and requires careful thought about when measurements were taken.
## Competing Risks
Standard survival analysis assumes that the event of interest is the only possible event, or that censoring is **non-informative** (i.e., censored patients have the same future risk as those who remain under observation).
**Competing risks** arise when another event prevents the event of interest from ever being observed. The defining feature is that the competing event *removes the patient from being at risk for your event altogether* --- it is not just a temporary loss to follow-up. Classic clinical examples:
- Studying time to heart transplant rejection, but the patient dies of infection first.
- Studying time to cancer recurrence, but the patient dies of cardiovascular disease.
- Studying time to hospital readmission, but the patient dies before readmission.
**Why this matters clinically.** The instinctive fix --- treat the competing event (e.g., death) as just another censoring --- is **wrong**, and in a specific, important direction: it **overestimates** the probability of your event of interest. The reason is intuitive once stated. Ordinary censoring assumes the censored patient *could still have the event later* and is simply unobserved (e.g., they moved away). But a patient who has died of another cause can **never** experience cancer recurrence --- they are not "out there, still at risk." By treating them as censored, the standard Kaplan-Meier "1 − S(t)" silently pretends they might still recur, and inflates the recurrence probability. In older patients or any setting with substantial mortality from other causes, this bias can be large and clinically misleading --- you would quote a recurrence risk higher than any real patient faces. This is why, whenever a meaningful fraction of patients can experience a competing event, you should report the **cumulative incidence function** (which correctly accounts for competing events) rather than 1 minus the Kaplan-Meier estimate.
### Two Approaches
There are two established ways to handle competing risks, and the distinction between them is one of the most commonly confused points in clinical survival analysis. They answer **different questions**, so the "right" one depends on what you want to know.
1. **Cause-specific hazard models.** Fit a separate Cox model for each event type, treating the competing events as censored. Each model estimates the **rate of that specific event among patients still alive and event-free** at each instant. This answers an *etiological* question: *what drives the rate of this particular event?* A hazard ratio here describes the effect of a covariate on the biological process leading to that event, holding the at-risk population fixed. Because competing events are censored, the cause-specific hazard ratios should **not** be read directly as effects on the *probability* of the event over time.
2. **Fine-Gray subdistribution hazard model.** Rather than censoring patients who experience the competing event, the Fine-Gray model **keeps them in the risk set** (conceptually, they remain "unable to have the event" forever). It models the **subdistribution hazard**, which links directly to the **cumulative incidence function** --- the actual probability of experiencing the event by time $t$ *in the real world, where competing events also happen*. This answers a *prognostic / predictive* question: *what is the chance this patient actually experiences the event by 5 years, given they might die of something else first?*
The crucial consequence: a covariate can increase the cause-specific hazard of an event yet *decrease* its cumulative incidence (because it also raises the competing mortality), so the two models can even point in opposite directions. Neither is "wrong" --- they answer different questions.
::: {.callout-tip}
## Rule of Thumb for Competing Risks
If you are interested in **etiology** (what causes the event, mechanistically?), use **cause-specific hazards**. If you are interested in **prediction** (what is the absolute probability of the event by time $t$?), use the **Fine-Gray model** and report **cumulative incidence functions** rather than 1 minus KM. For a clinical prediction model, Fine-Gray (or directly modelling the cumulative incidence) is usually what you want.
:::
For a clinically-oriented, practical treatment of competing risks --- including worked examples of both approaches and guidance on which to choose --- see the prediction-modelling textbook by Smits, van Kuijk, and Wynants [@smits2026prediction]. Accessible methodological introductions are given by Austin, Lee, and Fine [@austin2016competing], and the original subdistribution-hazard method is due to Fine and Gray [@fine1999]. In R, the `tidycmprsk` and `cmprsk` packages fit Fine-Gray models and estimate cumulative incidence; in Python, `scikit-survival` and `lifelines` provide competing-risks tools.
## Full Clinical Example: PBC Prognostic Model
Let us build a complete survival model for the PBC dataset, following the steps a clinical researcher would take.
::: {.callout-note}
## Kaplan-Meier vs Cox --- when and why to use each
Before reading the model output, it helps to be crystal clear on how the two main tools differ, because they are often confused:
| | **Kaplan-Meier (+ log-rank)** | **Cox proportional hazards** |
|---|---|---|
| What it gives you | A survival *curve* for a whole group (or a few groups) | The *effect* of each predictor on the hazard, as a hazard ratio |
| Handles covariates? | Only categorical group comparisons (e.g., treatment arm, stage) | Multiple predictors at once, including continuous ones, **adjusted for each other** |
| Output to report | Median survival, x-year survival %, log-rank p-value | Hazard ratios with 95% CIs and p-values |
| Analogy | Like comparing mean outcomes between groups (a *t*-test / ANOVA of survival) | Like multivariable regression for survival |
In short: reach for **Kaplan-Meier** to *describe and visualise* survival and to compare a small number of groups; reach for the **Cox model** when you need to *adjust for several variables simultaneously* and quantify each one's independent effect. They are complementary --- a typical paper shows KM curves for the headline comparison and a Cox model for the adjusted analysis.
:::
The worked example below follows the standard sequence: explore the data, fit the multivariable Cox model, assess how well it discriminates, and use it to predict survival for individual patients.
::: {.panel-tabset}
#### R
```{r}
#| label: clinical-example-r
# Step 1: Explore the data
pbc_model <- pbc_clean %>%
select(time_years, status_binary, age, sex, albumin, bili,
protime, stage, edema, trt_label) %>%
drop_na()
cat("Complete cases:", nrow(pbc_model), "\n")
cat("Events:", sum(pbc_model$status_binary), "\n")
cat("Median follow-up:", median(pbc_model$time_years), "years\n")
# Step 2: Fit multivariable Cox model
cox_full <- coxph(
Surv(time_years, status_binary) ~ age + sex + albumin + log(bili) +
protime + stage + edema + trt_label,
data = pbc_model
)
summary(cox_full)
```
```{r}
#| label: clinical-concordance-r
# Step 3: Assess model discrimination
cat("Concordance (C-index):", summary(cox_full)$concordance[1], "\n")
cat("This means the model correctly ranks survival times",
round(summary(cox_full)$concordance[1] * 100, 1),
"% of the time.\n")
```
```{r}
#| label: clinical-predict-r
#| fig-cap: "Predicted survival curves for a low-risk and high-risk patient profile."
# Step 4: Predicted survival for new patients
new_patients <- data.frame(
age = c(40, 65),
sex = c("f", "f"),
albumin = c(3.8, 2.5),
bili = c(1.0, 10.0),
protime = c(10.5, 13.0),
stage = c(2, 4),
edema = c(0, 1),
trt_label = c("Placebo", "Placebo")
)
pred_surv <- survfit(cox_full, newdata = new_patients)
plot(pred_surv, col = c("blue", "red"), lwd = 2,
xlab = "Time (years)", ylab = "Survival probability",
main = "Predicted Survival: Low-risk vs High-risk Patient")
legend("topright", c("Low risk", "High risk"),
col = c("blue", "red"), lwd = 2)
```
#### Python
```{python}
#| label: clinical-example-python
from lifelines import CoxPHFitter, KaplanMeierFitter
import pandas as pd
import numpy as np
np.random.seed(42)
n = 276
# Simulate a PBC-like dataset
pbc_sim = pd.DataFrame({
'time_years': np.abs(np.random.exponential(7, n) +
np.random.normal(0, 1, n)),
'event': np.random.binomial(1, 0.45, n),
'age': np.random.normal(50, 10, n),
'sex': np.random.binomial(1, 0.12, n), # mostly female in PBC
'albumin': np.random.normal(3.5, 0.4, n),
'log_bili': np.random.normal(0.5, 1.0, n),
'protime': np.random.normal(11, 1, n),
'stage': np.random.choice([1, 2, 3, 4], n, p=[0.05, 0.25, 0.40, 0.30]),
'edema': np.random.choice([0, 0.5, 1], n, p=[0.75, 0.15, 0.10]),
'treatment': np.random.binomial(1, 0.5, n)
})
# Fit Cox model
cph = CoxPHFitter()
cph.fit(pbc_sim, duration_col='time_years', event_col='event',
formula='age + sex + albumin + log_bili + protime + stage + edema + treatment')
cph.print_summary()
print(f"\nConcordance index: {cph.concordance_index_:.3f}")
```
```{python}
#| label: clinical-predict-python
#| fig-cap: "Predicted survival curves for two patient profiles using the Python Cox model."
# Predict survival for specific patient profiles
low_risk = pd.DataFrame({
'age': [40], 'sex': [0], 'albumin': [3.8], 'log_bili': [0.0],
'protime': [10.5], 'stage': [2], 'edema': [0], 'treatment': [0]
})
high_risk = pd.DataFrame({
'age': [65], 'sex': [0], 'albumin': [2.5], 'log_bili': [2.3],
'protime': [13.0], 'stage': [4], 'edema': [1], 'treatment': [0]
})
fig, ax = plt.subplots(figsize=(8, 5))
cph.predict_survival_function(low_risk).plot(ax=ax, label='Low risk', color='blue')
cph.predict_survival_function(high_risk).plot(ax=ax, label='High risk', color='red')
ax.set_xlabel("Time (years)")
ax.set_ylabel("Survival probability")
ax.set_title("Predicted Survival: Low-risk vs High-risk Patient")
ax.legend()
plt.tight_layout()
plt.show()
```
:::
### What the Outputs Are Telling You
It is easy to run this code and be confronted with a wall of numbers. Here is what to actually look at, in order:
1. **The `summary(cox_full)` / `print_summary()` table** is the heart of it. For each predictor, focus on three columns:
- **`exp(coef)` --- the hazard ratio.** This is the headline number. HR > 1 = higher hazard (worse survival) per unit increase in that predictor; HR < 1 = protective. For example, an HR of 2.7 for `log(bili)` means each one-unit rise in log-bilirubin nearly triples the instantaneous hazard, holding the other predictors fixed.
- **The 95% confidence interval** (`lower .95`, `upper .95`). If it excludes 1, the effect is statistically significant; its width tells you how precisely the effect is estimated.
- **`Pr(>|z|)` --- the p-value.** Which predictors are independently associated with survival after adjustment?
You are *learning*: which factors carry prognostic information, in which direction, and how strongly. In the PBC model you should see that higher bilirubin, older age, lower albumin, and worse edema/stage all predict shorter survival --- which matches the clinical picture of advanced liver disease.
2. **The concordance (C-index)** summarises *discrimination* --- how well the model ranks who dies sooner. 0.5 is a coin-flip; 1.0 is perfect. A value around 0.8 (as here) means that, given two patients, the model correctly identifies the one who dies first about 80% of the time. This is the single best one-number summary of predictive performance.
3. **The predicted survival curves** turn the model into something a clinician can use: plug in a patient's covariates and read off their estimated survival probability over time. The low-risk vs high-risk curves should be visibly separated --- the *visual* proof that the predictors above translate into meaningfully different prognoses. This is also where the link back to Kaplan-Meier becomes clear: a KM curve describes an *observed group*, whereas these Cox-predicted curves are *individualised* and *adjusted* for the full covariate profile.
If you observe nothing else, observe these three things: **direction and size of each hazard ratio, whether its CI excludes 1, and the C-index.** Together they answer "what matters, how much, and how well does the whole model predict?"
## Exercises
### Exercise 1: Kaplan-Meier Curves by Disease Stage
Using the PBC dataset, create Kaplan-Meier curves stratified by disease stage (1–4). Perform a log-rank test to compare survival across stages. Report the median survival time for each stage.
::: {.panel-tabset}
#### R Starter Code
```{r}
#| label: ex1-r
#| eval: false
# Fit KM by stage
km_stage <- survfit(Surv(time_years, status_binary) ~ stage, data = pbc_clean)
# Your code: plot the curves and perform the log-rank test
# Hint: use ggsurvplot() with pval = TRUE
# Hint: use survdiff() for the formal test
```
#### Python Starter Code
```{python}
#| label: ex1-python
#| eval: false
# Fit KM for each stage and plot
# Your code here
# Hint: loop over stages, fit KaplanMeierFitter for each
# Use logrank_test() for pairwise comparisons
```
:::
### Exercise 2: Build and Evaluate a Cox Model
Build a Cox proportional hazards model for PBC survival using age, bilirubin (log-transformed), albumin, prothrombin time, and edema as predictors.
1. Report the hazard ratios and 95% confidence intervals.
2. Check the proportional hazards assumption. Which covariates, if any, violate it?
3. Calculate the concordance index. How well does the model discriminate?
4. Predict the 5-year survival probability for a 55-year-old woman with bilirubin = 3 mg/dL, albumin = 3.2 g/dL, prothrombin time = 11 seconds, and no edema.
::: {.panel-tabset}
#### R Starter Code
```{r}
#| label: ex2-r
#| eval: false
# Your code here
# cox_model <- coxph(Surv(time_years, status_binary) ~ ..., data = pbc_clean)
# summary(cox_model)
# cox.zph(cox_model)
```
#### Python Starter Code
```{python}
#| label: ex2-python
#| eval: false
# Your code here
# cph = CoxPHFitter()
# cph.fit(...)
# cph.check_assumptions(...)
```
:::
### Exercise 3: Competing Risks Thinking
Consider a study of time to kidney transplant rejection in 500 recipients. During follow-up, some patients die before experiencing rejection (a competing risk).
1. Explain why treating death as censoring would bias the results. What direction would the bias go?
2. Which approach would you use if your goal were to estimate the probability of rejection within 2 years — cause-specific hazards or Fine-Gray? Why?
3. (Bonus) In R, use the `cmprsk` or `tidycmprsk` package to fit a Fine-Gray model. In Python, explore `lifelines` or `scikit-survival` for competing risk methods.
## Summary
| Concept | What It Does | Key Assumption |
|---------|-------------|----------------|
| Kaplan-Meier | Non-parametric survival estimate | Non-informative censoring |
| Log-rank test | Compares survival curves | Proportional hazards (most powerful) |
| Cox PH model | Regression for survival data | Proportional hazards, non-informative censoring |
| Fine-Gray model | Competing risks regression | Models subdistribution hazard |
Survival analysis is fundamental to clinical research. The methods in this chapter — Kaplan-Meier estimation, log-rank testing, and Cox regression — form the backbone of nearly every clinical trial and prognostic study. The proportional hazards assumption must always be checked, and competing risks should be considered whenever another event can prevent observation of the event of interest.
## References and Further Reading
The foundational paper on proportional hazards regression is Cox (1972), "Regression Models and Life-Tables," published in the *Journal of the Royal Statistical Society, Series B*, which introduced the semi-parametric model that now bears his name. This paper is one of the most cited in all of statistics.
For a comprehensive and accessible introduction to survival analysis in the context of this course, see Smits et al. (2026), which covers Kaplan-Meier estimation, the Cox model, and competing risks with modern clinical examples. Their treatment of model diagnostics and the proportional hazards assumption is particularly practical.
Bradburn et al. provide an excellent tutorial series on survival analysis for medical researchers, published in the *British Journal of Cancer*. Their series covers Kaplan-Meier methods, the Cox model, and advanced topics in a format accessible to clinicians without extensive statistical training.
For hands-on implementation, the `survival` and `survminer` packages in R are the standard tools. Therneau and Grambsch's *Modeling Survival Data: Extending the Cox Model* (Springer, 2000) is the definitive reference for the R `survival` package. In Python, the `lifelines` library by Cameron Davidson-Pilon provides an excellent API for survival analysis, and its online documentation includes worked clinical examples.
For competing risks specifically, Fine and Gray (1999), "A Proportional Hazards Model for the Subdistribution of a Competing Risk," in the *Journal of the American Statistical Association*, is the key methodological reference. Austin and Fine (2017) provide practical guidance on when and how to use competing risk methods in clinical research.