Appendix D — Exercise Solutions

Exercise Solutions

This appendix collects solutions for all exercises in the course. Each solution is provided in both R and Python. Use them to check your work after attempting the exercises yourself.

Try First

These solutions are most valuable after you have made a genuine attempt at each exercise. Resist the temptation to read the solution before trying.

The solution files are also available in the course repository under solutions/R/ and solutions/python/.

D.1 Chapter 04: Continuous Variables and Splines

D.1.1 Exercise 1

Show R solution

# =============================================================================
# Chapter 4, Exercise 1: Recognising the Problem
# Categorisation of systolic blood pressure and stroke risk
# =============================================================================
# This exercise is primarily conceptual. Answers are provided as comments,
# with supporting code to illustrate the points.

library(ggplot2)

# --- Question (a) ---
# What assumptions does categorising SBP into Normal (<120), Elevated (120-129),
# Stage 1 HTN (130-139), Stage 2 HTN (>=140) impose?
#
# ANSWER:
# 1. Within each category, the relationship between SBP and stroke risk is FLAT.
#    A patient with SBP=121 is treated identically to SBP=129.
# 2. At category boundaries, there is a SUDDEN JUMP in risk.
#    SBP=129 and SBP=130 are treated as fundamentally different despite only
#    1 mmHg difference.
# 3. The cut-points (120, 130, 140) are assumed to be biologically meaningful
#    boundaries, which may not reflect the true continuous biology.

# --- Question (b) ---
# A patient with SBP 131 is classified the same as SBP 139. Why is this problematic?
#
# ANSWER:
# The difference of 8 mmHg (131 vs 139) is clinically substantial. In reality,
# stroke risk increases with higher SBP. By lumping these patients together,
# we lose the ability to distinguish their different risk levels. The patient
# at 139 mmHg (close to Stage 2) likely has meaningfully higher stroke risk
# than the patient at 131 mmHg, but the categorised model assigns them
# identical predicted risk. This information loss reduces statistical power
# and can bias effect estimates.

# --- Question (c) ---
# Suggest a better modelling approach.
#
# ANSWER:
# Use a restricted cubic spline (RCS) with 3-5 knots to model SBP as a
# continuous predictor. This:
#   - Preserves the full information in SBP
#   - Allows the relationship to be non-linear (capturing any steepening
#     at higher pressures)
#   - Constrains the curve to be linear in the tails (sensible extrapolation)
#   - Uses only k-1 degrees of freedom for k knots (parsimonious)

# --- Supporting demonstration ---
# Simulate data to visually compare categorised vs spline approaches

library(rms)

set.seed(2025)
n <- 1000
sbp <- rnorm(n, mean = 130, sd = 15)
sbp <- pmax(sbp, 90)
sbp <- pmin(sbp, 200)

# True relationship: risk accelerates at higher SBP
logit_p <- -6 + 0.02 * sbp + 0.0002 * (sbp - 130)^2
y <- rbinom(n, 1, plogis(logit_p))

sim_data <- data.frame(sbp = sbp, stroke = y)

# Categorise as described in the exercise
sim_data$sbp_cat <- cut(sim_data$sbp,
                         breaks = c(-Inf, 120, 130, 140, Inf),
                         labels = c("Normal", "Elevated", "Stage1", "Stage2"))

# Set up datadist for rms
dd <- datadist(sim_data)
options(datadist = "dd")

# Fit three models
fit_cat <- lrm(stroke ~ sbp_cat, data = sim_data)
fit_linear <- lrm(stroke ~ sbp, data = sim_data)
fit_rcs <- lrm(stroke ~ rcs(sbp, 4), data = sim_data)

# Compare AIC
cat("AIC Comparison:\n")
cat("  Categorised:", AIC(fit_cat), "\n")
cat("  Linear:     ", AIC(fit_linear), "\n")
cat("  RCS (4 knots):", AIC(fit_rcs), "\n")
cat("\nLower AIC = better fit. The RCS model captures the true non-linear\n")
cat("relationship without imposing arbitrary cut-points.\n")

# Plot comparison
sbp_seq <- seq(min(sim_data$sbp), max(sim_data$sbp), length.out = 200)

pred_rcs <- Predict(fit_rcs, sbp = sbp_seq)
pred_linear_vals <- predict(fit_linear,
                             newdata = data.frame(sbp = sbp_seq),
                             type = "fitted")

# True probability
true_logit <- -6 + 0.02 * sbp_seq + 0.0002 * (sbp_seq - 130)^2
true_prob <- plogis(true_logit)

plot_df <- data.frame(
  sbp = rep(sbp_seq, 3),
  probability = c(true_prob, pred_linear_vals,
                  plogis(as.data.frame(pred_rcs)$yhat)),
  Model = rep(c("True relationship", "Linear", "RCS (4 knots)"),
              each = length(sbp_seq))
)

p <- ggplot(plot_df, aes(x = sbp, y = probability, colour = Model)) +
  geom_line(linewidth = 1.1) +
  scale_colour_manual(values = c("True relationship" = "black",
                                  "Linear" = "#3498db",
                                  "RCS (4 knots)" = "#e74c3c")) +
  labs(x = "Systolic Blood Pressure (mmHg)",
       y = "Predicted Probability of Stroke",
       title = "Comparing modelling approaches for SBP-stroke relationship") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "bottom")

print(p)

Show Python solution

# =============================================================================
# Chapter 4, Exercise 1: Recognising the Problem
# Categorisation of systolic blood pressure and stroke risk
# =============================================================================
# This exercise is primarily conceptual. Answers are provided as comments,
# with supporting code to illustrate the points.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from patsy import dmatrix
from scipy.special import expit

# --- Question (a) ---
# What assumptions does categorising SBP into Normal (<120), Elevated (120-129),
# Stage 1 HTN (130-139), Stage 2 HTN (>=140) impose?
#
# ANSWER:
# 1. Within each category, the relationship between SBP and stroke risk is FLAT.
#    A patient with SBP=121 is treated identically to SBP=129.
# 2. At category boundaries, there is a SUDDEN JUMP in risk.
#    SBP=129 and SBP=130 are treated as fundamentally different despite only
#    1 mmHg difference.
# 3. The cut-points (120, 130, 140) are assumed to be biologically meaningful
#    boundaries, which may not reflect the true continuous biology.

# --- Question (b) ---
# A patient with SBP 131 is classified the same as SBP 139. Why is this problematic?
#
# ANSWER:
# The difference of 8 mmHg (131 vs 139) is clinically substantial. In reality,
# stroke risk increases with higher SBP. By lumping these patients together,
# we lose the ability to distinguish their different risk levels. The patient
# at 139 mmHg (close to Stage 2) likely has meaningfully higher stroke risk
# than the patient at 131 mmHg, but the categorised model assigns them
# identical predicted risk. This information loss reduces statistical power
# and can bias effect estimates.

# --- Question (c) ---
# Suggest a better modelling approach.
#
# ANSWER:
# Use a restricted cubic spline (RCS) with 3-5 knots to model SBP as a
# continuous predictor. This:
#   - Preserves the full information in SBP
#   - Allows the relationship to be non-linear (capturing any steepening
#     at higher pressures)
#   - Constrains the curve to be linear in the tails (sensible extrapolation)
#   - Uses only k-1 degrees of freedom for k knots (parsimonious)

# --- Supporting demonstration ---
# Simulate data to visually compare categorised vs spline approaches

np.random.seed(2025)
n = 1000
sbp = np.random.normal(130, 15, size=n)
sbp = np.clip(sbp, 90, 200)

# True relationship: risk accelerates at higher SBP
logit_p = -6 + 0.02 * sbp + 0.0002 * (sbp - 130)**2
y = np.random.binomial(1, expit(logit_p))

df = pd.DataFrame({'sbp': sbp, 'stroke': y})

# --- Model 1: Categorised ---
df['sbp_cat'] = pd.cut(df['sbp'],
                         bins=[-np.inf, 120, 130, 140, np.inf],
                         labels=['Normal', 'Elevated', 'Stage1', 'Stage2'])
X_cat = pd.get_dummies(df['sbp_cat'], drop_first=True).astype(float)
X_cat = sm.add_constant(X_cat)
fit_cat = sm.GLM(df['stroke'], X_cat, family=sm.families.Binomial()).fit()

# --- Model 2: Linear ---
X_lin = sm.add_constant(df[['sbp']])
fit_lin = sm.GLM(df['stroke'], X_lin, family=sm.families.Binomial()).fit()

# --- Model 3: RCS (natural cubic spline with df=3, i.e., 4 knots) ---
X_rcs = dmatrix("cr(sbp, df=3)", data=df, return_type='dataframe')
fit_rcs = sm.GLM(df['stroke'], X_rcs, family=sm.families.Binomial()).fit()

# Compare AIC
print("AIC Comparison:")
print(f"  Categorised: {fit_cat.aic:.1f}")
print(f"  Linear:      {fit_lin.aic:.1f}")
print(f"  RCS (4 knots): {fit_rcs.aic:.1f}")
print("\nLower AIC = better fit. The RCS model captures the true non-linear")
print("relationship without imposing arbitrary cut-points.")

# --- Plot comparison ---
sbp_range = np.linspace(df['sbp'].min(), df['sbp'].max(), 200)

# True probability
true_logit = -6 + 0.02 * sbp_range + 0.0002 * (sbp_range - 130)**2
true_prob = expit(true_logit)

# Linear predictions
X_lin_pred = sm.add_constant(pd.DataFrame({'sbp': sbp_range}))
lin_prob = fit_lin.predict(X_lin_pred)

# RCS predictions
X_rcs_pred = dmatrix("cr(sbp, df=3)",
                      data=pd.DataFrame({'sbp': sbp_range}),
                      return_type='dataframe')
rcs_prob = fit_rcs.predict(X_rcs_pred)

plt.figure(figsize=(10, 6))
plt.plot(sbp_range, true_prob, 'k--', linewidth=2, label='True relationship')
plt.plot(sbp_range, lin_prob, color='#3498db', linewidth=1.5, label='Linear')
plt.plot(sbp_range, rcs_prob, color='#e74c3c', linewidth=1.5, label='RCS (4 knots)')
plt.xlabel('Systolic Blood Pressure (mmHg)')
plt.ylabel('Predicted Probability of Stroke')
plt.title('Comparing modelling approaches for SBP-stroke relationship')
plt.legend()
plt.tight_layout()
plt.show()

D.1.2 Exercise 2

Show R solution

# =============================================================================
# Chapter 4, Exercise 2: Fitting and Interpreting Spline Models
# PBC dataset: bilirubin and transplant status
# =============================================================================

library(rms)
library(ggplot2)

# --- Load the PBC data ---
data(pbc, package = "survival")
pbc <- pbc[!is.na(pbc$trt), ]  # Remove incomplete cases

# Create a binary outcome: transplanted (status == 1) vs not
pbc$transplant <- as.numeric(pbc$status == 1)

# Set up datadist (required by rms)
dd <- datadist(pbc)
options(datadist = "dd")

# --- Task 1: Fit logistic regression with bilirubin as a linear term ---
fit_linear <- lrm(transplant ~ bili, data = pbc)
cat("=== Linear Model ===\n")
print(fit_linear)

# --- Task 2: Fit logistic regression with bilirubin using RCS (4 knots) ---
fit_rcs <- lrm(transplant ~ rcs(bili, 4), data = pbc)
cat("\n=== RCS Model (4 knots) ===\n")
print(fit_rcs)

# --- Task 3: Test for non-linearity ---
cat("\n=== ANOVA for RCS model (test of non-linearity) ===\n")
print(anova(fit_rcs))
# The ANOVA output shows:
#   - Total effect of bili (tests whether bili has ANY association)
#   - Nonlinear component (tests whether the relationship departs from linear)
# If the nonlinear p-value is significant, a linear term is insufficient.

# --- Task 4: Compare AIC ---
cat("\n=== AIC Comparison ===\n")
cat("Linear model AIC:", AIC(fit_linear), "\n")
cat("RCS model AIC:   ", AIC(fit_rcs), "\n")
cat("Lower AIC = better fit (penalised for complexity)\n")

# --- Task 5: Plot the relationship ---
p <- Predict(fit_rcs, bili)
pred_df <- as.data.frame(p)

plot_rcs <- ggplot(pred_df, aes(x = bili, y = yhat)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2, fill = "#3498db") +
  geom_line(linewidth = 1.2, colour = "#2c3e50") +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey50") +
  labs(x = "Serum Bilirubin (mg/dL)",
       y = "Log-Odds of Transplant",
       title = "Non-linear effect of bilirubin on transplant (RCS, 4 knots)") +
  theme_minimal(base_size = 14)

print(plot_rcs)

# --- Task 6: Interpretation ---
cat("\n=== Interpretation ===\n")
cat("The plot shows the estimated log-odds of transplant as a function of\n")
cat("serum bilirubin. If the curve is not a straight line, the relationship\n")
cat("is non-linear, and a linear term would be insufficient.\n")
cat("\nThe ANOVA non-linearity test (above) formally tests this. If the\n")
cat("non-linear p-value is < 0.05, there is evidence against linearity.\n")
cat("\nCompare the AIC values: a lower AIC for the RCS model confirms\n")
cat("that the additional flexibility improves fit beyond what a linear\n")
cat("term provides.\n")

Show Python solution

# =============================================================================
# Chapter 4, Exercise 2: Fitting and Interpreting Spline Models
# PBC dataset: bilirubin and transplant status
# =============================================================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from patsy import dmatrix

# --- Simulate data similar to PBC bilirubin-transplant relationship ---
# (Following the chapter's Python approach since PBC is an R-native dataset)
np.random.seed(42)
n = 300
bili = np.random.exponential(scale=2, size=n)
bili = np.clip(bili, 0.3, 25)
logit_p = -3 + 0.1 * bili + 0.005 * bili**2  # Non-linear true relationship
prob = 1 / (1 + np.exp(-logit_p))
transplant = np.random.binomial(1, prob)

df = pd.DataFrame({'bili': bili, 'transplant': transplant})

# --- Task 1: Fit logistic regression with linear bilirubin ---
X_linear = sm.add_constant(df[['bili']])
fit_linear = sm.GLM(df['transplant'], X_linear,
                     family=sm.families.Binomial()).fit()
print("=== Linear Model ===")
print(fit_linear.summary())

# --- Task 2: Create spline basis and fit logistic regression with splines ---
# cr() creates a natural cubic spline (restricted cubic spline)
# df=3 means 4 knots (df = number of knots - 1)
spline_basis = dmatrix("cr(bili, df=3)", data=df, return_type='dataframe')
fit_spline = sm.GLM(df['transplant'], spline_basis,
                      family=sm.families.Binomial()).fit()
print("\n=== Spline Model (RCS, 4 knots) ===")
print(fit_spline.summary())

# --- Task 3: Compare AIC ---
print(f"\n=== AIC Comparison ===")
print(f"Linear model AIC: {fit_linear.aic:.1f}")
print(f"Spline model AIC: {fit_spline.aic:.1f}")
print("Lower AIC = better fit (penalised for complexity)")

# --- Task 4: Test for non-linearity ---
# Compare the linear model to the spline model using a likelihood ratio test.
# The difference in deviance follows a chi-squared distribution with
# df = (spline params - linear params) degrees of freedom.
from scipy import stats

lr_stat = fit_linear.deviance - fit_spline.deviance
df_diff = fit_linear.df_model - fit_spline.df_model  # difference in parameters
# Note: df_model counts differently in statsmodels; compute manually
n_params_linear = len(fit_linear.params)
n_params_spline = len(fit_spline.params)
df_diff = n_params_spline - n_params_linear

p_value = 1 - stats.chi2.cdf(lr_stat, df_diff)
print(f"\n=== Likelihood Ratio Test for Non-linearity ===")
print(f"LR statistic: {lr_stat:.3f}")
print(f"Degrees of freedom: {df_diff}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Result: Significant non-linearity detected. The spline model fits better.")
else:
    print("Result: No significant non-linearity. A linear term may suffice.")

# --- Task 5: Plot the predicted probabilities across bilirubin range ---
bili_range = np.linspace(df['bili'].min(), df['bili'].max(), 200)

# Linear predictions
X_lin_pred = sm.add_constant(pd.DataFrame({'bili': bili_range}))
lin_prob = fit_linear.predict(X_lin_pred)

# Spline predictions
spline_pred = dmatrix("cr(bili, df=3)",
                       data=pd.DataFrame({'bili': bili_range}),
                       return_type='dataframe')
rcs_prob = fit_spline.predict(spline_pred)

plt.figure(figsize=(10, 6))
plt.plot(bili_range, lin_prob, color='#3498db', linewidth=1.5,
         label='Linear model')
plt.plot(bili_range, rcs_prob, color='#2c3e50', linewidth=2,
         label='RCS model (4 knots)')
plt.scatter(df['bili'], df['transplant'], alpha=0.15, color='grey', s=20)
plt.xlabel('Serum Bilirubin (mg/dL)')
plt.ylabel('Predicted Probability of Transplant')
plt.title('Non-linear effect of bilirubin on transplant (RCS)')
plt.legend()
plt.tight_layout()
plt.show()

# --- Task 6: Interpretation ---
print("\n=== Interpretation ===")
print("The plot shows the predicted probability of transplant as a function of")
print("serum bilirubin, comparing the linear and RCS models.")
print("If the RCS curve departs notably from a straight line, the relationship")
print("is non-linear and the linear model is insufficient.")
print("The AIC comparison and likelihood ratio test formally assess this.")
print("The RCS model captures the accelerating risk at higher bilirubin levels")
print("that the linear model cannot represent.")

D.1.3 Exercise 3

Show R solution

# =============================================================================
# Chapter 4, Exercise 3: Categorisation vs Splines Head-to-Head
# Simulate U-shaped relationship and compare approaches
# =============================================================================

library(rms)
library(ggplot2)

set.seed(2025)
n <- 1000
x <- rnorm(n, mean = 50, sd = 10)

# True U-shaped relationship
logit_p <- -5 + 0.004 * (x - 50)^2
y <- rbinom(n, 1, plogis(logit_p))

sim_data <- data.frame(x = x, y = y)
dd <- datadist(sim_data)
options(datadist = "dd")

# --- Model 1: Categorise into quartiles ---
sim_data$x_cat <- cut(sim_data$x,
                       breaks = quantile(sim_data$x, c(0, 0.25, 0.5, 0.75, 1)),
                       include.lowest = TRUE)
fit_cat <- lrm(y ~ x_cat, data = sim_data)

# --- Model 2: Linear term ---
fit_lin <- lrm(y ~ x, data = sim_data)

# --- Model 3: RCS with 5 knots ---
fit_rcs <- lrm(y ~ rcs(x, 5), data = sim_data)

# --- Question (a): Compare AIC ---
cat("=== Question (a): AIC Comparison ===\n")
cat("Categorised (quartiles):", AIC(fit_cat), "\n")
cat("Linear:                 ", AIC(fit_lin), "\n")
cat("RCS (5 knots):          ", AIC(fit_rcs), "\n")
cat("\nThe RCS model should have the lowest AIC, indicating best fit.\n")
cat("The linear model has the worst fit because it cannot capture\n")
cat("the U-shape at all.\n")

# --- Question (b): Plot all three fitted curves with the true function ---
x_seq <- seq(min(sim_data$x), max(sim_data$x), length.out = 200)

# True curve
true_logit <- -5 + 0.004 * (x_seq - 50)^2
true_prob <- plogis(true_logit)

# RCS predictions
pred_rcs <- Predict(fit_rcs, x = x_seq)
rcs_prob <- plogis(as.data.frame(pred_rcs)$yhat)

# Linear predictions
pred_linear <- predict(fit_linear <- lrm(y ~ x, data = sim_data),
                        newdata = data.frame(x = x_seq), type = "fitted")

# Categorised predictions (need to assign each x_seq value to a category)
x_seq_cat <- cut(x_seq,
                  breaks = quantile(sim_data$x, c(0, 0.25, 0.5, 0.75, 1)),
                  include.lowest = TRUE)
pred_cat <- predict(fit_cat,
                     newdata = data.frame(x_cat = x_seq_cat),
                     type = "fitted")

plot_df <- data.frame(
  x = rep(x_seq, 4),
  probability = c(true_prob, pred_linear, pred_cat, rcs_prob),
  Model = rep(c("True relationship", "Linear", "Categorised (quartiles)",
                "RCS (5 knots)"), each = length(x_seq))
)

p <- ggplot(plot_df, aes(x = x, y = probability, colour = Model,
                          linetype = Model)) +
  geom_line(linewidth = 1.1) +
  scale_colour_manual(values = c("True relationship" = "black",
                                  "Linear" = "#3498db",
                                  "Categorised (quartiles)" = "#e74c3c",
                                  "RCS (5 knots)" = "#27ae60")) +
  scale_linetype_manual(values = c("True relationship" = "dashed",
                                    "Linear" = "solid",
                                    "Categorised (quartiles)" = "solid",
                                    "RCS (5 knots)" = "solid")) +
  labs(x = "Predictor (x)", y = "Predicted Probability",
       title = "Comparing modelling strategies for a U-shaped relationship") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "bottom")

print(p)

cat("\n=== Question (b): Interpretation ===\n")
cat("The RCS model best recovers the true U-shaped relationship.\n")
cat("The linear model completely misses the U-shape (it fits a flat or\n")
cat("slightly sloped line). The categorised model captures some of the\n")
cat("pattern but in crude steps, losing precision.\n")

# --- Question (c): Change quartiles to tertiles ---
cat("\n=== Question (c): Tertile categorisation ===\n")
sim_data$x_tert <- cut(sim_data$x,
                         breaks = quantile(sim_data$x, c(0, 1/3, 2/3, 1)),
                         include.lowest = TRUE)
fit_tert <- lrm(y ~ x_tert, data = sim_data)
cat("AIC with tertiles: ", AIC(fit_tert), "\n")
cat("AIC with quartiles:", AIC(fit_cat), "\n")
cat("AIC with RCS:      ", AIC(fit_rcs), "\n")

cat("\nChanging from quartiles to tertiles changes the categorised model's\n")
cat("estimates. The AIC may differ, and the estimated effects will change\n")
cat("because different patients are grouped together. This demonstrates\n")
cat("that categorisation results are ARBITRARY and depend on cut-point\n")
cat("choice. The RCS model does not have this problem because it models\n")
cat("the continuous relationship directly.\n")

# Test for non-linearity in the RCS model
cat("\n=== Non-linearity test for RCS model ===\n")
print(anova(fit_rcs))

Show Python solution

# =============================================================================
# Chapter 4, Exercise 3: Categorisation vs Splines Head-to-Head
# Simulate U-shaped relationship and compare approaches
# =============================================================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from patsy import dmatrix
from scipy.special import expit

np.random.seed(2025)
n = 1000
x = np.random.normal(50, 10, size=n)

# True U-shaped relationship
logit_p = -5 + 0.004 * (x - 50)**2
y = np.random.binomial(1, expit(logit_p))

df = pd.DataFrame({'x': x, 'y': y})

# --- Model 1: Categorise into quartiles ---
df['x_cat'] = pd.qcut(df['x'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
X_cat = pd.get_dummies(df['x_cat'], drop_first=True).astype(float)
X_cat = sm.add_constant(X_cat)
fit_cat = sm.GLM(df['y'], X_cat, family=sm.families.Binomial()).fit()

# --- Model 2: Linear ---
X_lin = sm.add_constant(df[['x']])
fit_lin = sm.GLM(df['y'], X_lin, family=sm.families.Binomial()).fit()

# --- Model 3: RCS with df=4 (5 knots) ---
X_rcs = dmatrix("cr(x, df=4)", data=df, return_type='dataframe')
fit_rcs = sm.GLM(df['y'], X_rcs, family=sm.families.Binomial()).fit()

# --- Question (a): Compare AIC ---
print("=== Question (a): AIC Comparison ===")
print(f"Categorised (quartiles): {fit_cat.aic:.1f}")
print(f"Linear:                  {fit_lin.aic:.1f}")
print(f"RCS (5 knots):           {fit_rcs.aic:.1f}")
print("\nThe RCS model should have the lowest AIC, indicating best fit.")
print("The linear model has the worst fit because it cannot capture")
print("the U-shape at all.\n")

# --- Question (b): Plot all three fitted curves with the true function ---
x_range = np.linspace(df['x'].min(), df['x'].max(), 200)

# True curve
true_logit = -5 + 0.004 * (x_range - 50)**2
true_prob = expit(true_logit)

# RCS predictions
X_rcs_pred = dmatrix("cr(x, df=4)",
                      data=pd.DataFrame({'x': x_range}),
                      return_type='dataframe')
rcs_prob = fit_rcs.predict(X_rcs_pred)

# Linear predictions
X_lin_pred = sm.add_constant(pd.DataFrame({'x': x_range}))
lin_prob = fit_lin.predict(X_lin_pred)

# Categorised predictions
# Assign each x_range value to its quartile, then predict
quartile_edges = df['x'].quantile([0, 0.25, 0.5, 0.75, 1.0]).values
quartile_edges[0] = -np.inf
quartile_edges[-1] = np.inf
x_cat_pred = pd.cut(x_range, bins=quartile_edges,
                      labels=['Q1', 'Q2', 'Q3', 'Q4'])
X_cat_pred = pd.get_dummies(x_cat_pred, drop_first=True).astype(float)
X_cat_pred = sm.add_constant(X_cat_pred)
# Ensure columns match
for col in X_cat.columns:
    if col not in X_cat_pred.columns:
        X_cat_pred[col] = 0.0
X_cat_pred = X_cat_pred[X_cat.columns]
cat_prob = fit_cat.predict(X_cat_pred)

plt.figure(figsize=(10, 6))
plt.plot(x_range, true_prob, 'k--', linewidth=2, label='True relationship')
plt.plot(x_range, rcs_prob, color='#27ae60', linewidth=2,
         label='RCS (5 knots)')
plt.plot(x_range, cat_prob, color='#e74c3c', linewidth=1.5,
         label='Categorised (quartiles)')
plt.plot(x_range, lin_prob, color='#3498db', linewidth=1.5, label='Linear')
plt.xlabel('Predictor (x)')
plt.ylabel('Predicted Probability')
plt.title('Comparing modelling strategies for a U-shaped relationship')
plt.legend()
plt.tight_layout()
plt.show()

print("=== Question (b): Interpretation ===")
print("The RCS model best recovers the true U-shaped relationship.")
print("The linear model completely misses the U-shape.")
print("The categorised model captures some pattern but in crude steps.\n")

# --- Question (c): Change quartiles to tertiles ---
print("=== Question (c): Tertile categorisation ===")
df['x_tert'] = pd.qcut(df['x'], q=3, labels=['T1', 'T2', 'T3'])
X_tert = pd.get_dummies(df['x_tert'], drop_first=True).astype(float)
X_tert = sm.add_constant(X_tert)
fit_tert = sm.GLM(df['y'], X_tert, family=sm.families.Binomial()).fit()

print(f"AIC with tertiles:  {fit_tert.aic:.1f}")
print(f"AIC with quartiles: {fit_cat.aic:.1f}")
print(f"AIC with RCS:       {fit_rcs.aic:.1f}")
print("\nChanging from quartiles to tertiles changes the categorised model's")
print("estimates. This demonstrates that categorisation results are ARBITRARY")
print("and depend on cut-point choice. The RCS model does not have this")
print("problem because it models the continuous relationship directly.")

# --- Likelihood ratio test for non-linearity ---
from scipy import stats

lr_stat = fit_lin.deviance - fit_rcs.deviance
df_diff = len(fit_rcs.params) - len(fit_lin.params)
p_value = 1 - stats.chi2.cdf(lr_stat, df_diff)
print(f"\n=== Non-linearity test ===")
print(f"LR statistic: {lr_stat:.3f}, df: {df_diff}, p-value: {p_value:.6f}")
if p_value < 0.05:
    print("Significant non-linearity: the U-shaped relationship is confirmed.")

D.2 Chapter 05: Penalised Regression

D.2.1 Exercise 1

Show R solution

# =============================================================================
# Chapter 5, Exercise 1: Conceptual Questions
# Penalised Regression: Ridge, LASSO, Elastic Net
# =============================================================================
# This exercise is conceptual. Answers are provided as comments.

# --- Question (a) ---
# A colleague says: "I used LASSO and it removed age from my model predicting
# mortality. This means age is not a risk factor for death."
# Explain why this interpretation is incorrect.
#
# ANSWER:
# The LASSO removing age does NOT mean age is unimportant. It means that,
# given the OTHER predictors in the model, age does not provide enough
# ADDITIONAL predictive information to justify its inclusion under the
# penalty. Possible reasons why LASSO dropped age:
#
# 1. Age may be highly CORRELATED with another retained variable (e.g.,
#    number of comorbidities, renal function). The LASSO tends to pick
#    one of a group of correlated predictors and drop the rest arbitrarily.
#
# 2. The sample size may be too small to detect age's effect given the
#    penalty strength.
#
# 3. The LASSO performs variable selection based on PREDICTION, not
#    CAUSAL importance. A variable can be a genuine risk factor but
#    redundant for prediction when other variables are present.
#
# The correct interpretation is: "Age was not selected by the LASSO model,
# possibly because its predictive information overlaps with other retained
# variables. This does not imply age is not a risk factor for death."

# --- Question (b) ---
# Ridge CV-RMSE = 2.1 days. Standard linear regression RMSE = 1.8 days
# (on the same training data). Colleague says standard model is better.
# What is wrong?
#
# ANSWER:
# The comparison is unfair because:
#
# 1. The Ridge RMSE of 2.1 is CROSS-VALIDATED -- it estimates performance
#    on UNSEEN data by training and testing on different folds.
#
# 2. The standard regression RMSE of 1.8 is likely computed on the SAME
#    training data used to fit the model (apparent performance). This is
#    OVERLY OPTIMISTIC because the model has memorised the training data's
#    noise.
#
# 3. If you computed a cross-validated RMSE for the standard regression,
#    it would likely be HIGHER than 2.1 (worse than Ridge), especially
#    with 20 predictors.
#
# The correct comparison requires evaluating both models using the SAME
# methodology -- either both cross-validated, or both on a held-out test
# set. Never compare a cross-validated metric to a training metric.

# --- Question (c) ---
# Explain in one sentence why the LASSO produces exact zeros but Ridge does not.
#
# ANSWER:
# The L1 penalty (LASSO) constrains coefficients to a diamond-shaped region
# whose corners lie on the axes, so the loss function contours are much more
# likely to first touch the constraint region at a corner (where one or more
# coefficients are exactly zero) than the smooth circular boundary of the L2
# penalty (Ridge), which has no corners.

# --- Supporting code to illustrate the geometry ---
library(ggplot2)

theta <- seq(0, 2 * pi, length.out = 200)

# L2 constraint (circle)
l2_x <- cos(theta)
l2_y <- sin(theta)

# L1 constraint (diamond)
l1_x <- c(1, 0, -1, 0, 1)
l1_y <- c(0, 1, 0, -1, 0)

p <- ggplot() +
  # L2 constraint region
  geom_polygon(data = data.frame(x = l2_x, y = l2_y),
               aes(x, y), fill = "#3498db", alpha = 0.15) +
  geom_path(data = data.frame(x = l2_x, y = l2_y),
            aes(x, y), colour = "#3498db", linewidth = 1.2) +
  # L1 constraint region
  geom_polygon(data = data.frame(x = l1_x, y = l1_y),
               aes(x, y), fill = "#e74c3c", alpha = 0.15) +
  geom_path(data = data.frame(x = l1_x, y = l1_y),
            aes(x, y), colour = "#e74c3c", linewidth = 1.2) +
  # Labels
  annotate("text", x = 0.7, y = 0.7, label = "L2 (Ridge)",
           colour = "#3498db", size = 4) +
  annotate("text", x = -0.7, y = -0.3, label = "L1 (LASSO)",
           colour = "#e74c3c", size = 4) +
  annotate("text", x = 0, y = -1.3,
           label = "The diamond has corners on the axes\nwhere coefficients = 0",
           size = 3.5) +
  coord_fixed() +
  labs(x = expression(beta[1]), y = expression(beta[2]),
       title = "L1 vs L2 constraint geometry") +
  theme_minimal(base_size = 14) +
  xlim(-1.5, 1.5) + ylim(-1.5, 1.0)

print(p)

cat("\nThe L1 diamond has corners at the axes, making it more likely that\n")
cat("the optimal solution lies at a corner (coefficient = 0), producing\n")
cat("exact sparsity. The L2 circle has no corners, so coefficients are\n")
cat("shrunk toward zero but never reach it exactly.\n")

Show Python solution

# =============================================================================
# Chapter 5, Exercise 1: Conceptual Questions
# Penalised Regression: Ridge, LASSO, Elastic Net
# =============================================================================
# This exercise is conceptual. Answers are provided as comments.

import numpy as np
import matplotlib.pyplot as plt

# --- Question (a) ---
# A colleague says: "I used LASSO and it removed age from my model predicting
# mortality. This means age is not a risk factor for death."
# Explain why this interpretation is incorrect.
#
# ANSWER:
# The LASSO removing age does NOT mean age is unimportant. It means that,
# given the OTHER predictors in the model, age does not provide enough
# ADDITIONAL predictive information to justify its inclusion under the
# penalty. Possible reasons why LASSO dropped age:
#
# 1. Age may be highly CORRELATED with another retained variable (e.g.,
#    number of comorbidities, renal function). The LASSO tends to pick
#    one of a group of correlated predictors and drop the rest arbitrarily.
#
# 2. The sample size may be too small to detect age's effect given the
#    penalty strength.
#
# 3. The LASSO performs variable selection based on PREDICTION, not
#    CAUSAL importance. A variable can be a genuine risk factor but
#    redundant for prediction when other variables are present.
#
# The correct interpretation is: "Age was not selected by the LASSO model,
# possibly because its predictive information overlaps with other retained
# variables. This does not imply age is not a risk factor for death."

# --- Question (b) ---
# Ridge CV-RMSE = 2.1 days. Standard linear regression RMSE = 1.8 days
# (on the same training data). Colleague says standard model is better.
# What is wrong?
#
# ANSWER:
# The comparison is unfair because:
#
# 1. The Ridge RMSE of 2.1 is CROSS-VALIDATED -- it estimates performance
#    on UNSEEN data by training and testing on different folds.
#
# 2. The standard regression RMSE of 1.8 is likely computed on the SAME
#    training data used to fit the model (apparent performance). This is
#    OVERLY OPTIMISTIC because the model has memorised the training data's
#    noise.
#
# 3. If you computed a cross-validated RMSE for the standard regression,
#    it would likely be HIGHER than 2.1 (worse than Ridge), especially
#    with 20 predictors.
#
# The correct comparison requires evaluating both models using the SAME
# methodology -- either both cross-validated, or both on a held-out test
# set. Never compare a cross-validated metric to a training metric.

# --- Question (c) ---
# Explain in one sentence why the LASSO produces exact zeros but Ridge does not.
#
# ANSWER:
# The L1 penalty (LASSO) constrains coefficients to a diamond-shaped region
# whose corners lie on the axes, so the loss function contours are much more
# likely to first touch the constraint region at a corner (where one or more
# coefficients are exactly zero) than the smooth circular boundary of the L2
# penalty (Ridge), which has no corners.

# --- Supporting code to illustrate the geometry ---

theta = np.linspace(0, 2 * np.pi, 200)

# L2 constraint (circle)
l2_x = np.cos(theta)
l2_y = np.sin(theta)

# L1 constraint (diamond)
l1_x = [1, 0, -1, 0, 1]
l1_y = [0, 1, 0, -1, 0]

fig, ax = plt.subplots(figsize=(7, 7))

# L2 region
ax.fill(l2_x, l2_y, alpha=0.15, color='#3498db')
ax.plot(l2_x, l2_y, color='#3498db', linewidth=1.5, label='L2 (Ridge)')

# L1 region
ax.fill(l1_x, l1_y, alpha=0.15, color='#e74c3c')
ax.plot(l1_x, l1_y, color='#e74c3c', linewidth=1.5, label='L1 (LASSO)')

# Mark the corners
ax.plot([1, -1, 0, 0], [0, 0, 1, -1], 'ro', markersize=6)
ax.annotate('Corner (exact zero for $\\beta_2$)', xy=(1, 0), fontsize=9,
            xytext=(1.1, 0.3), arrowprops=dict(arrowstyle='->', color='red'))

ax.set_xlabel(r'$\beta_1$', fontsize=14)
ax.set_ylabel(r'$\beta_2$', fontsize=14)
ax.set_title('L1 vs L2 constraint geometry')
ax.set_aspect('equal')
ax.legend(fontsize=11)
ax.set_xlim(-1.5, 1.8)
ax.set_ylim(-1.5, 1.5)
ax.axhline(0, color='grey', linewidth=0.5)
ax.axvline(0, color='grey', linewidth=0.5)
plt.tight_layout()
plt.show()

print("The L1 diamond has corners at the axes, making it more likely that")
print("the optimal solution lies at a corner (coefficient = 0), producing")
print("exact sparsity. The L2 circle has no corners, so coefficients are")
print("shrunk toward zero but never reach it exactly.")

D.2.2 Exercise 2

Show R solution

# =============================================================================
# Chapter 5, Exercise 2: Predicting Diabetes Onset
# Pima Indians Diabetes dataset
# =============================================================================

library(glmnet)
library(mlbench)
library(ggplot2)

# --- Load data ---
data(PimaIndiansDiabetes2, package = "mlbench")
pima <- na.omit(PimaIndiansDiabetes2)
cat("Complete cases:", nrow(pima), "\n\n")

# Prepare data
X <- as.matrix(pima[, 1:8])
y <- ifelse(pima$diabetes == "pos", 1, 0)

# --- Task 1: Fit a LASSO model with 10-fold cross-validation ---
set.seed(123)
cv_lasso <- cv.glmnet(X, y, family = "binomial", alpha = 1, nfolds = 10)

cat("=== LASSO Cross-Validation ===\n")
cat("Lambda.min:", round(cv_lasso$lambda.min, 4), "\n")
cat("Lambda.1se:", round(cv_lasso$lambda.1se, 4), "\n\n")

# --- Task 2: Plot the cross-validation curve ---
plot(cv_lasso, main = "LASSO Cross-Validation Curve")

# --- Task 3: Which variables are selected at lambda.1se? ---
cat("=== Variables selected at lambda.1se ===\n")
coef_1se <- coef(cv_lasso, s = "lambda.1se")
print(coef_1se)

# Count non-zero coefficients (excluding intercept)
n_selected_lasso <- sum(coef_1se[-1, ] != 0)
cat("\nNumber of variables selected by LASSO at lambda.1se:", n_selected_lasso, "\n")

# --- Task 4: Fit Ridge and Elastic Net (alpha = 0.5) ---
set.seed(123)
cv_ridge <- cv.glmnet(X, y, family = "binomial", alpha = 0, nfolds = 10)
set.seed(123)
cv_enet <- cv.glmnet(X, y, family = "binomial", alpha = 0.5, nfolds = 10)

# --- Task 5: Compare the three methods ---

# (a) How many variables does each select at lambda.1se?
coef_ridge_1se <- coef(cv_ridge, s = "lambda.1se")
coef_enet_1se <- coef(cv_enet, s = "lambda.1se")

n_selected_ridge <- sum(abs(coef_ridge_1se[-1, ]) > 1e-6)
n_selected_enet <- sum(abs(coef_enet_1se[-1, ]) > 1e-6)

cat("\n=== Number of variables selected (lambda.1se) ===\n")
cat("LASSO:       ", n_selected_lasso, "of 8\n")
cat("Elastic Net: ", n_selected_enet, "of 8\n")
cat("Ridge:       ", n_selected_ridge, "of 8 (Ridge never sets to exactly 0)\n")

# (b) Cross-validated deviance for each
cat("\n=== Cross-validated deviance (at lambda.min) ===\n")
cat("LASSO:       ", round(min(cv_lasso$cvm), 4), "\n")
cat("Elastic Net: ", round(min(cv_enet$cvm), 4), "\n")
cat("Ridge:       ", round(min(cv_ridge$cvm), 4), "\n")

# (c) Plot the regularisation path for the LASSO model
fit_lasso <- glmnet(X, y, family = "binomial", alpha = 1)
plot(fit_lasso, xvar = "lambda", label = TRUE,
     main = "LASSO Regularisation Path")
abline(v = log(cv_lasso$lambda.min), lty = 2, col = "blue")
abline(v = log(cv_lasso$lambda.1se), lty = 2, col = "red")
legend("topright", legend = c("lambda.min", "lambda.1se"),
       col = c("blue", "red"), lty = 2, cex = 0.8)

# --- Task 6: Recommendation ---
cat("\n=== Recommendation ===\n")
cat("For the Pima diabetes dataset (8 predictors, ~390 complete cases):\n")
cat("- The number of predictors is small relative to sample size,\n")
cat("  so penalisation is not strictly necessary.\n")
cat("- LASSO is useful here for identifying the most predictive variables.\n")
cat("- All three methods produce similar cross-validated deviance,\n")
cat("  confirming that the choice of penalty type matters less than\n")
cat("  using penalisation vs not.\n")
cat("- For variable selection and a parsimonious model, LASSO at\n")
cat("  lambda.1se is a good choice.\n")
cat("- For best prediction, any of the three at lambda.min would work.\n")

# --- Bonus: Compare coefficients side by side ---
coef_comparison <- data.frame(
  Variable = colnames(X),
  LASSO = as.vector(coef(cv_lasso, s = "lambda.1se"))[-1],
  Ridge = as.vector(coef(cv_ridge, s = "lambda.1se"))[-1],
  ElasticNet = as.vector(coef(cv_enet, s = "lambda.1se"))[-1]
)

cat("\n=== Coefficient comparison (lambda.1se) ===\n")
print(round(coef_comparison, 4))

Show Python solution

# =============================================================================
# Chapter 5, Exercise 2: Predicting Diabetes Onset
# Pima Indians Diabetes dataset
# =============================================================================

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

# --- Load the Pima Indians Diabetes dataset ---
url = ("https://raw.githubusercontent.com/jbrownlee/Datasets/master/"
       "pima-indians-diabetes.data.csv")
columns = ['pregnancies', 'glucose', 'blood_pressure', 'skin_thickness',
           'insulin', 'bmi', 'diabetes_pedigree', 'age', 'outcome']
pima = pd.read_csv(url, names=columns)

# Remove rows with zero values where zero is not meaningful
cols_no_zero = ['glucose', 'blood_pressure', 'skin_thickness',
                'insulin', 'bmi']
pima[cols_no_zero] = pima[cols_no_zero].replace(0, np.nan)
pima = pima.dropna()

print(f"Complete cases: {len(pima)}\n")

X = pima[columns[:-1]].values
y = pima['outcome'].values

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- Task 1: Fit LASSO with cross-validation ---
lasso_cv = LogisticRegressionCV(
    penalty='l1', solver='saga', Cs=50, cv=10,
    scoring='neg_log_loss', max_iter=10000, random_state=42
)
lasso_cv.fit(X_scaled, y)

print("=== LASSO Cross-Validation ===")
print(f"Best C (= 1/lambda): {lasso_cv.C_[0]:.4f}")
print(f"Best lambda: {1/lasso_cv.C_[0]:.4f}\n")

print("LASSO selected features:")
n_selected_lasso = 0
for name, coef in zip(columns[:-1], lasso_cv.coef_[0]):
    if abs(coef) > 1e-6:
        print(f"  {name}: {coef:.4f}")
        n_selected_lasso += 1
print(f"\nNumber selected: {n_selected_lasso} of 8\n")

# --- Task 2: Fit Ridge with cross-validation ---
ridge_cv = LogisticRegressionCV(
    penalty='l2', solver='lbfgs', Cs=50, cv=10,
    scoring='neg_log_loss', max_iter=10000, random_state=42
)
ridge_cv.fit(X_scaled, y)

# --- Task 3: Fit Elastic Net with cross-validation ---
enet_cv = LogisticRegressionCV(
    penalty='elasticnet', solver='saga', Cs=50, cv=10,
    l1_ratios=[0.25, 0.5, 0.75], scoring='neg_log_loss',
    max_iter=10000, random_state=42
)
enet_cv.fit(X_scaled, y)

print(f"Elastic Net best l1_ratio (alpha): {enet_cv.l1_ratio_[0]:.2f}")
print(f"Elastic Net best C (1/lambda): {enet_cv.C_[0]:.4f}\n")

# --- Task 4: Compare cross-validated scores ---
print("=== Cross-validated log-loss (lower = better) ===")
for name, model in [('LASSO', lasso_cv), ('Ridge', ridge_cv),
                     ('Elastic Net', enet_cv)]:
    scores = cross_val_score(model, X_scaled, y, cv=10,
                              scoring='neg_log_loss')
    print(f"  {name}: {-scores.mean():.4f} (+/- {scores.std():.4f})")

# --- Task 5: Plot regularisation path for LASSO ---
Cs = np.logspace(-2, 2, 80)
coefs = []
for C in Cs:
    model = LogisticRegression(penalty='l1', C=C, solver='saga',
                                max_iter=10000, random_state=42)
    model.fit(X_scaled, y)
    coefs.append(model.coef_[0])

coefs = np.array(coefs)
lambdas = 1.0 / Cs

plt.figure(figsize=(10, 6))
for j, name in enumerate(columns[:-1]):
    plt.plot(np.log10(lambdas), coefs[:, j], label=name)
plt.axvline(x=np.log10(1/lasso_cv.C_[0]), color='k', linestyle='--',
            label='CV optimal lambda')
plt.xlabel('log10(lambda)')
plt.ylabel('Coefficient')
plt.title('LASSO Regularisation Path - Pima Diabetes')
plt.legend(loc='best', fontsize=8)
plt.tight_layout()
plt.show()

# --- Task 6: Recommendation ---
print("\n=== Recommendation ===")
print("For the Pima diabetes dataset (8 predictors, ~390 complete cases):")
print("- The number of predictors is small relative to sample size,")
print("  so penalisation is not strictly necessary.")
print("- LASSO is useful for identifying the most predictive variables.")
print("- All three methods produce similar cross-validated performance.")
print("- For variable selection, LASSO provides a parsimonious model.")
print("- For best prediction, any method works well here.")

# --- Bonus: Compare coefficients side by side ---
print("\n=== Coefficient comparison ===")
coef_df = pd.DataFrame({
    'Variable': columns[:-1],
    'LASSO': lasso_cv.coef_[0],
    'Ridge': ridge_cv.coef_[0],
    'Elastic Net': enet_cv.coef_[0]
})
print(coef_df.round(4).to_string(index=False))

D.2.3 Exercise 3

Show R solution

# =============================================================================
# Chapter 5, Exercise 3: The Effect of Correlation
# How LASSO and Ridge handle correlated predictors differently
# =============================================================================

library(glmnet)
library(MASS)

set.seed(2025)
n <- 200

# --- Create 4 correlated predictors (correlation ~ 0.9) ---
Sigma <- matrix(0.9, 4, 4)
diag(Sigma) <- 1
X_corr <- mvrnorm(n, mu = rep(0, 4), Sigma = Sigma)

# Add 6 independent predictors (noise)
X_indep <- matrix(rnorm(n * 6), n, 6)
X <- cbind(X_corr, X_indep)
colnames(X) <- c(paste0("Corr_", 1:4), paste0("Noise_", 1:6))

# True model: all 4 correlated predictors have effect = 0.5
true_beta <- c(rep(0.5, 4), rep(0, 6))
y <- rbinom(n, 1, plogis(X %*% true_beta))

# --- Task 1: Fit LASSO and examine which variables are selected ---
set.seed(42)
cv_lasso <- cv.glmnet(X, y, family = "binomial", alpha = 1)
cat("=== Task 1: LASSO coefficients (lambda.1se) ===\n")
print(round(as.matrix(coef(cv_lasso, s = "lambda.1se")), 3))

# --- Task 2: Fit Ridge and examine coefficients ---
set.seed(42)
cv_ridge <- cv.glmnet(X, y, family = "binomial", alpha = 0)
cat("\n=== Task 2: Ridge coefficients (lambda.1se) ===\n")
print(round(as.matrix(coef(cv_ridge, s = "lambda.1se")), 3))

# --- Task 3: Fit Elastic Net (alpha = 0.5) ---
set.seed(42)
cv_enet <- cv.glmnet(X, y, family = "binomial", alpha = 0.5)
cat("\n=== Task 3: Elastic Net coefficients (lambda.1se) ===\n")
print(round(as.matrix(coef(cv_enet, s = "lambda.1se")), 3))

# --- Question (a) ---
cat("\n=== Question (a): Does LASSO select all 4 correlated predictors? ===\n")
lasso_coefs <- as.vector(coef(cv_lasso, s = "lambda.1se"))[-1]
names(lasso_coefs) <- colnames(X)
selected_lasso <- names(lasso_coefs[abs(lasso_coefs) > 1e-6])
cat("LASSO selected:", paste(selected_lasso, collapse = ", "), "\n")
cat("LASSO typically selects only 1-2 of the 4 correlated predictors.\n")
cat("This is because the LASSO arbitrarily picks one predictor from a\n")
cat("correlated group and drops the rest. With correlation ~0.9, any\n")
cat("single correlated predictor carries almost the same information\n")
cat("as all four combined, so the LASSO keeps just one (or two) and\n")
cat("sets the others to zero.\n")

# --- Question (b) ---
cat("\n=== Question (b): How does Ridge handle correlated predictors? ===\n")
ridge_coefs <- as.vector(coef(cv_ridge, s = "lambda.1se"))[-1]
names(ridge_coefs) <- colnames(X)
cat("Ridge coefficients for correlated predictors:\n")
print(round(ridge_coefs[1:4], 4))
cat("\nRidge distributes the effect EVENLY across all correlated predictors.\n")
cat("All four correlated predictors retain non-zero (and similar) coefficients.\n")
cat("This is the 'grouping effect' -- Ridge shrinks correlated predictors\n")
cat("toward each other rather than picking one arbitrarily.\n")

# --- Question (c) ---
cat("\n=== Question (c): Does Elastic Net improve on LASSO? ===\n")
enet_coefs <- as.vector(coef(cv_enet, s = "lambda.1se"))[-1]
names(enet_coefs) <- colnames(X)
selected_enet <- names(enet_coefs[abs(enet_coefs) > 1e-6])
cat("Elastic Net selected:", paste(selected_enet, collapse = ", "), "\n")
cat("The Elastic Net typically selects MORE of the correlated predictors\n")
cat("than LASSO, thanks to the L2 component (grouping effect).\n")
cat("It provides a middle ground: some sparsity (like LASSO) but\n")
cat("better handling of correlated groups (like Ridge).\n")

# --- Question (d): Stability analysis ---
cat("\n=== Question (d): Stability analysis (10 different seeds) ===\n")
cat("LASSO variable selection across 10 bootstrap samples:\n")
for (seed in 1:10) {
  set.seed(seed)
  idx <- sample(n, n, replace = TRUE)
  cv_boot <- cv.glmnet(X[idx, ], y[idx], family = "binomial", alpha = 1,
                        nfolds = 5)
  boot_coefs <- as.vector(coef(cv_boot, s = "lambda.1se"))[-1]
  selected <- colnames(X)[abs(boot_coefs) > 1e-6]
  cat(sprintf("  Seed %2d: %s\n", seed, paste(selected, collapse = ", ")))
}

cat("\nRidge variable selection (all retained) across 10 bootstrap samples:\n")
for (seed in 1:10) {
  set.seed(seed)
  idx <- sample(n, n, replace = TRUE)
  cv_boot <- cv.glmnet(X[idx, ], y[idx], family = "binomial", alpha = 0,
                        nfolds = 5)
  boot_coefs <- as.vector(coef(cv_boot, s = "lambda.1se"))[-1]
  selected <- colnames(X)[abs(boot_coefs) > 0.01]
  cat(sprintf("  Seed %2d: %s\n", seed, paste(selected, collapse = ", ")))
}

cat("\nConclusion: LASSO variable selection is UNSTABLE for correlated\n")
cat("predictors -- the selected predictor(s) change across samples.\n")
cat("Ridge is stable because it always retains all predictors.\n")
cat("Elastic Net offers a compromise with better stability than LASSO.\n")

Show Python solution

# =============================================================================
# Chapter 5, Exercise 3: The Effect of Correlation
# How LASSO and Ridge handle correlated predictors differently
# =============================================================================

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from scipy.special import expit

np.random.seed(2025)
n = 200

# --- Create 4 correlated predictors (correlation ~ 0.9) ---
mean = np.zeros(4)
cov = np.full((4, 4), 0.9)
np.fill_diagonal(cov, 1.0)
X_corr = np.random.multivariate_normal(mean, cov, size=n)

# Add 6 independent noise predictors
X_indep = np.random.randn(n, 6)
X = np.hstack([X_corr, X_indep])

feature_names = ([f"Corr_{i+1}" for i in range(4)] +
                 [f"Noise_{i+1}" for i in range(6)])

# True model: all 4 correlated predictors contribute
true_beta = np.array([0.5]*4 + [0.0]*6)
y = np.random.binomial(1, expit(X @ true_beta))

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# --- Task 1: LASSO ---
lasso = LogisticRegressionCV(penalty='l1', solver='saga', Cs=50,
                              cv=10, max_iter=10000, random_state=42)
lasso.fit(X_scaled, y)

print("=== Task 1: LASSO coefficients ===")
for name, coef in zip(feature_names, lasso.coef_[0]):
    marker = " *" if abs(coef) > 1e-6 else ""
    print(f"  {name:10s}: {coef:8.4f}{marker}")

# --- Task 2: Ridge ---
ridge = LogisticRegressionCV(penalty='l2', solver='lbfgs', Cs=50,
                              cv=10, max_iter=10000, random_state=42)
ridge.fit(X_scaled, y)

print("\n=== Task 2: Ridge coefficients ===")
for name, coef in zip(feature_names, ridge.coef_[0]):
    print(f"  {name:10s}: {coef:8.4f}")

# --- Task 3: Elastic Net ---
enet = LogisticRegressionCV(penalty='elasticnet', solver='saga', Cs=50,
                             cv=10, l1_ratios=[0.5], max_iter=10000,
                             random_state=42)
enet.fit(X_scaled, y)

print("\n=== Task 3: Elastic Net coefficients ===")
for name, coef in zip(feature_names, enet.coef_[0]):
    marker = " *" if abs(coef) > 1e-6 else ""
    print(f"  {name:10s}: {coef:8.4f}{marker}")

# --- Question (a): Does LASSO select all 4 correlated predictors? ---
print("\n=== Question (a) ===")
selected_lasso = [name for name, coef in zip(feature_names, lasso.coef_[0])
                  if abs(coef) > 1e-6]
print(f"LASSO selected: {selected_lasso}")
print("LASSO typically selects only 1-2 of the 4 correlated predictors.")
print("This is because the LASSO arbitrarily picks one predictor from a")
print("correlated group and drops the rest. With correlation ~0.9, any")
print("single predictor carries almost the same information as all four.")

# --- Question (b): How does Ridge handle correlated predictors? ---
print("\n=== Question (b) ===")
print("Ridge coefficients for correlated predictors:")
for i in range(4):
    print(f"  {feature_names[i]}: {ridge.coef_[0][i]:.4f}")
print("Ridge distributes the effect EVENLY across all correlated predictors.")
print("All four retain non-zero (and similar) coefficients.")

# --- Question (c): Does Elastic Net improve on LASSO? ---
print("\n=== Question (c) ===")
selected_enet = [name for name, coef in zip(feature_names, enet.coef_[0])
                 if abs(coef) > 1e-6]
print(f"Elastic Net selected: {selected_enet}")
print("The Elastic Net typically selects MORE of the correlated predictors")
print("than LASSO, thanks to the L2 component (grouping effect).")

# --- Question (d): Stability analysis ---
print("\n=== Question (d): Stability analysis (10 bootstrap samples) ===")
print("\nLASSO variable selection:")
for seed in range(10):
    rng = np.random.RandomState(seed)
    idx = rng.choice(n, n, replace=True)
    lasso_boot = LogisticRegressionCV(penalty='l1', solver='saga',
                                       Cs=20, cv=5, max_iter=10000,
                                       random_state=42)
    lasso_boot.fit(X_scaled[idx], y[idx])
    selected = [feature_names[j] for j in range(10)
                if abs(lasso_boot.coef_[0][j]) > 1e-6]
    print(f"  Seed {seed}: {selected}")

print("\nRidge — all predictors retained (non-zero coefficients):")
for seed in range(10):
    rng = np.random.RandomState(seed)
    idx = rng.choice(n, n, replace=True)
    ridge_boot = LogisticRegressionCV(penalty='l2', solver='lbfgs',
                                       Cs=20, cv=5, max_iter=10000,
                                       random_state=42)
    ridge_boot.fit(X_scaled[idx], y[idx])
    # Ridge always keeps all predictors; show which have coef > 0.01
    selected = [feature_names[j] for j in range(10)
                if abs(ridge_boot.coef_[0][j]) > 0.01]
    print(f"  Seed {seed}: {selected}")

print("\nConclusion: LASSO variable selection is UNSTABLE for correlated")
print("predictors -- the selected predictor(s) change across samples.")
print("Ridge is stable because it always retains all predictors.")
print("Elastic Net offers a compromise with better stability than LASSO.")

D.3 Chapter 06: Survival Analysis

D.3.1 Exercise 1

Show R solution

# =============================================================================
# Chapter 6, Exercise 1: Kaplan-Meier Curves by Disease Stage
# PBC dataset: KM curves stratified by stage, log-rank test, median survival
# =============================================================================

library(survival)
library(survminer)
library(tidyverse)

# --- Load and prepare the PBC dataset ---
data(pbc, package = "survival")

pbc_clean <- pbc %>%
  filter(!is.na(trt)) %>%
  mutate(
    status_binary = ifelse(status == 2, 1, 0),
    time_years = time / 365.25,
    trt_label = factor(trt, labels = c("D-penicillamine", "Placebo"))
  ) %>%
  filter(!is.na(stage))  # Remove missing stage values

cat("N =", nrow(pbc_clean), "\n")
cat("Events (deaths):", sum(pbc_clean$status_binary), "\n")
cat("Stages:", paste(sort(unique(pbc_clean$stage)), collapse = ", "), "\n\n")

# --- Fit Kaplan-Meier curves stratified by disease stage ---
km_stage <- survfit(Surv(time_years, status_binary) ~ stage, data = pbc_clean)
print(km_stage)

# --- Report median survival time for each stage ---
cat("\n=== Median Survival Time by Stage ===\n")
median_surv <- summary(km_stage)$table
print(median_surv)

# Extract and display more clearly
cat("\nMedian survival (years) by stage:\n")
for (s in sort(unique(pbc_clean$stage))) {
  km_s <- survfit(Surv(time_years, status_binary) ~ 1,
                   data = pbc_clean[pbc_clean$stage == s, ])
  med <- surv_median(km_s)
  cat(sprintf("  Stage %d: %.2f years (95%% CI: %.2f - %.2f)\n",
              s, med$median, med$lower, med$upper))
}

# --- Perform the log-rank test ---
cat("\n=== Log-Rank Test ===\n")
logrank <- survdiff(Surv(time_years, status_binary) ~ stage, data = pbc_clean)
print(logrank)

# Extract p-value
pvalue <- 1 - pchisq(logrank$chisq, length(logrank$n) - 1)
cat(sprintf("\nLog-rank test p-value: %.6f\n", pvalue))

if (pvalue < 0.05) {
  cat("Result: Significant difference in survival across disease stages.\n")
} else {
  cat("Result: No significant difference detected.\n")
}

# --- Plot the Kaplan-Meier curves ---
p <- ggsurvplot(
  km_stage,
  data = pbc_clean,
  pval = TRUE,               # Show log-rank p-value
  conf.int = TRUE,            # Show confidence intervals
  risk.table = TRUE,          # Show number at risk table
  xlab = "Time (years)",
  ylab = "Survival probability",
  title = "Kaplan-Meier Curves by Disease Stage (PBC)",
  legend.title = "Stage",
  palette = c("#2ecc71", "#3498db", "#e67e22", "#e74c3c"),
  ggtheme = theme_minimal()
)

print(p)

# --- Interpretation ---
cat("\n=== Interpretation ===\n")
cat("The KM curves should show a clear separation by stage, with higher\n")
cat("stages having worse survival. The log-rank test formally confirms\n")
cat("whether these differences are statistically significant.\n")
cat("As expected in PBC, patients with Stage 4 disease have the worst\n")
cat("prognosis, while Stage 1-2 patients have substantially better survival.\n")

Show Python solution

# =============================================================================
# Chapter 6, Exercise 1: Kaplan-Meier Curves by Disease Stage
# PBC dataset (simulated): KM curves by stage, log-rank test, median survival
# =============================================================================

import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test, multivariate_logrank_test
import matplotlib.pyplot as plt

# --- Simulate PBC-like data with stage variable ---
# (Since PBC is an R-native dataset, we simulate its structure as done
#  in the chapter's Python code)
np.random.seed(42)
n = 312

# Simulate stage (1-4) with realistic proportions
stage = np.random.choice([1, 2, 3, 4], size=n, p=[0.05, 0.25, 0.40, 0.30])

# Simulate survival times that depend on stage (higher stage = worse survival)
# Base scale varies by stage
scale_by_stage = {1: 14, 2: 10, 3: 7, 4: 4}
time_years = np.array([np.random.exponential(scale=scale_by_stage[s]) for s in stage])
time_years = np.clip(time_years, 0.1, 15)  # Clip to realistic range

# Event probability increases with stage
event_prob = {1: 0.20, 2: 0.35, 3: 0.50, 4: 0.65}
event = np.array([np.random.binomial(1, event_prob[s]) for s in stage])

pbc = pd.DataFrame({
    'time_years': time_years,
    'event': event,
    'stage': stage
})

print(f"N = {len(pbc)}")
print(f"Events (deaths): {pbc['event'].sum()}")
print(f"Stages: {sorted(pbc['stage'].unique())}\n")

# --- Fit KM for each stage and report median survival ---
print("=== Median Survival Time by Stage ===")
kmf_dict = {}
for s in sorted(pbc['stage'].unique()):
    mask = pbc['stage'] == s
    kmf = KaplanMeierFitter()
    kmf.fit(pbc.loc[mask, 'time_years'],
            event_observed=pbc.loc[mask, 'event'],
            label=f'Stage {s}')
    kmf_dict[s] = kmf

    median_surv = kmf.median_survival_time_
    ci = kmf.confidence_interval_survival_function_
    print(f"  Stage {s}: Median survival = {median_surv:.2f} years "
          f"(n={mask.sum()}, events={pbc.loc[mask, 'event'].sum()})")

# --- Perform the log-rank test (overall comparison across all stages) ---
print("\n=== Log-Rank Test (overall) ===")
result = multivariate_logrank_test(
    pbc['time_years'], pbc['stage'], pbc['event']
)
print(f"Test statistic: {result.test_statistic:.3f}")
print(f"p-value: {result.p_value:.6f}")

if result.p_value < 0.05:
    print("Result: Significant difference in survival across disease stages.")
else:
    print("Result: No significant difference detected.")

# --- Pairwise log-rank tests (Stage 1 vs 4, etc.) ---
print("\n=== Pairwise Log-Rank Tests (selected) ===")
for s1, s2 in [(1, 4), (2, 3), (3, 4)]:
    mask1 = pbc['stage'] == s1
    mask2 = pbc['stage'] == s2
    if mask1.sum() > 0 and mask2.sum() > 0:
        lr = logrank_test(
            pbc.loc[mask1, 'time_years'], pbc.loc[mask2, 'time_years'],
            pbc.loc[mask1, 'event'], pbc.loc[mask2, 'event']
        )
        print(f"  Stage {s1} vs {s2}: p = {lr.p_value:.4f}")

# --- Plot KM curves for all stages ---
fig, ax = plt.subplots(figsize=(10, 6))

colors = {1: '#2ecc71', 2: '#3498db', 3: '#e67e22', 4: '#e74c3c'}
for s, kmf in kmf_dict.items():
    kmf.plot_survival_function(ax=ax, ci_show=True, color=colors[s])

ax.set_xlabel("Time (years)", fontsize=12)
ax.set_ylabel("Survival probability", fontsize=12)
ax.set_title("Kaplan-Meier Curves by Disease Stage (PBC)", fontsize=14)
ax.legend(fontsize=10)

# Add p-value annotation
ax.text(0.7, 0.95, f'Log-rank p = {result.p_value:.4f}',
        transform=ax.transAxes, fontsize=11,
        verticalalignment='top',
        bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

plt.tight_layout()
plt.show()

# --- Interpretation ---
print("\n=== Interpretation ===")
print("The KM curves show a clear separation by stage, with higher stages")
print("having worse survival. The log-rank test formally confirms whether")
print("these differences are statistically significant.")
print("Patients with Stage 4 disease have the worst prognosis, while")
print("Stage 1-2 patients have substantially better survival.")

D.3.2 Exercise 2

Show R solution

# =============================================================================
# Chapter 6, Exercise 2: Build and Evaluate a Cox Model
# PBC dataset: Cox PH with age, log(bili), albumin, protime, edema
# =============================================================================

library(survival)
library(survminer)
library(tidyverse)

# --- Load and prepare data ---
data(pbc, package = "survival")

pbc_clean <- pbc %>%
  filter(!is.na(trt)) %>%
  mutate(
    status_binary = ifelse(status == 2, 1, 0),
    time_years = time / 365.25,
    trt_label = factor(trt, labels = c("D-penicillamine", "Placebo")),
    log_bili = log(bili)
  ) %>%
  select(time_years, status_binary, age, log_bili, albumin, protime, edema) %>%
  drop_na()

cat("Complete cases:", nrow(pbc_clean), "\n")
cat("Events:", sum(pbc_clean$status_binary), "\n\n")

# --- Task 1: Fit the Cox model and report hazard ratios ---
cox_model <- coxph(
  Surv(time_years, status_binary) ~ age + log_bili + albumin + protime + edema,
  data = pbc_clean
)

cat("=== Task 1: Cox Model Summary ===\n")
print(summary(cox_model))

# Extract and display HRs with 95% CIs
cat("\n=== Hazard Ratios and 95% CIs ===\n")
hr_table <- data.frame(
  Variable = names(coef(cox_model)),
  HR = exp(coef(cox_model)),
  Lower_95 = exp(confint(cox_model))[, 1],
  Upper_95 = exp(confint(cox_model))[, 2],
  p_value = summary(cox_model)$coefficients[, 5]
)
print(round(hr_table, 4))

# Forest plot
ggforest(cox_model, data = pbc_clean)

# --- Task 2: Check the proportional hazards assumption ---
cat("\n=== Task 2: Proportional Hazards Assumption ===\n")
ph_test <- cox.zph(cox_model)
print(ph_test)

cat("\nInterpretation of PH test:\n")
cat("- A significant p-value (< 0.05) indicates violation of the PH assumption.\n")
cat("- Look at the GLOBAL test and individual covariate tests.\n")

# Which covariates violate PH?
ph_df <- data.frame(
  Variable = rownames(ph_test$table)[-nrow(ph_test$table)],
  p_value = ph_test$table[-nrow(ph_test$table), "p"]
)
violators <- ph_df$Variable[ph_df$p_value < 0.05]
if (length(violators) > 0) {
  cat("Covariates violating PH assumption (p < 0.05):",
      paste(violators, collapse = ", "), "\n")
} else {
  cat("No covariates significantly violate the PH assumption.\n")
}

# Plot Schoenfeld residuals
par(mfrow = c(2, 3))
plot(ph_test)
par(mfrow = c(1, 1))

# --- Task 3: Concordance index ---
cat("\n=== Task 3: Concordance Index ===\n")
c_index <- summary(cox_model)$concordance[1]
c_se <- summary(cox_model)$concordance[2]
cat(sprintf("C-index: %.3f (SE: %.3f)\n", c_index, c_se))
cat(sprintf("95%% CI: %.3f - %.3f\n", c_index - 1.96 * c_se, c_index + 1.96 * c_se))

cat("\nInterpretation:\n")
cat("- C-index = 0.5: model has no discriminative ability (random)\n")
cat("- C-index = 1.0: perfect discrimination\n")
cat("- C-index > 0.7: generally considered acceptable\n")
cat("- C-index > 0.8: considered strong discrimination\n")
cat(sprintf("This model's C-index of %.3f indicates %s discrimination.\n",
            c_index,
            ifelse(c_index > 0.8, "strong",
                   ifelse(c_index > 0.7, "acceptable", "modest"))))

# --- Task 4: Predict 5-year survival for a specific patient ---
cat("\n=== Task 4: 5-Year Survival Prediction ===\n")
new_patient <- data.frame(
  age = 55,
  log_bili = log(3),       # bilirubin = 3 mg/dL
  albumin = 3.2,
  protime = 11,
  edema = 0                # no edema
)

cat("Patient profile:\n")
cat("  Age: 55 years\n")
cat("  Bilirubin: 3 mg/dL (log = ", round(log(3), 2), ")\n")
cat("  Albumin: 3.2 g/dL\n")
cat("  Prothrombin time: 11 seconds\n")
cat("  Edema: none\n\n")

# Get predicted survival function
pred_surv <- survfit(cox_model, newdata = new_patient)

# Extract 5-year survival probability
surv_summary <- summary(pred_surv, times = 5)
cat(sprintf("Predicted 5-year survival probability: %.3f (%.1f%%)\n",
            surv_summary$surv, surv_summary$surv * 100))
cat(sprintf("95%% CI: %.3f - %.3f\n",
            surv_summary$lower, surv_summary$upper))

# Plot the predicted survival curve
plot(pred_surv,
     xlab = "Time (years)",
     ylab = "Survival probability",
     main = "Predicted Survival for Patient Profile",
     lwd = 2, col = "blue")
abline(h = 0.5, lty = 2, col = "grey")
abline(v = 5, lty = 2, col = "red")
legend("topright", c("Predicted survival", "5-year mark"),
       col = c("blue", "red"), lty = c(1, 2), lwd = c(2, 1))

Show Python solution

# =============================================================================
# Chapter 6, Exercise 2: Build and Evaluate a Cox Model
# PBC-like dataset: Cox PH with age, log(bili), albumin, protime, edema
# =============================================================================

import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
import matplotlib.pyplot as plt

# --- Simulate PBC-like clinical dataset ---
# (Following the chapter's Python approach with simulated data)
np.random.seed(42)
n = 276

pbc = pd.DataFrame({
    'time_years': np.abs(np.random.exponential(7, n) +
                          np.random.normal(0, 1, n)),
    'event': np.random.binomial(1, 0.45, n),
    'age': np.random.normal(50, 10, n),
    'log_bili': np.random.normal(0.5, 1.0, n),     # log(bilirubin)
    'albumin': np.random.normal(3.5, 0.4, n),
    'protime': np.random.normal(11, 1, n),
    'edema': np.random.choice([0, 0.5, 1], n, p=[0.75, 0.15, 0.10])
})

# Make survival time depend on covariates to create realistic associations
hazard_linear = (0.04 * pbc['age'] + 0.3 * pbc['log_bili'] -
                 0.5 * pbc['albumin'] + 0.2 * pbc['protime'] +
                 0.8 * pbc['edema'])
# Adjust times based on hazard
pbc['time_years'] = pbc['time_years'] * np.exp(-0.1 * hazard_linear)
pbc['time_years'] = np.clip(pbc['time_years'], 0.05, 15)

print(f"N = {len(pbc)}")
print(f"Events: {pbc['event'].sum()}")
print(f"Median follow-up: {pbc['time_years'].median():.2f} years\n")

# --- Task 1: Fit Cox model and report hazard ratios ---
cph = CoxPHFitter()
cph.fit(pbc, duration_col='time_years', event_col='event',
        formula='age + log_bili + albumin + protime + edema')

print("=== Task 1: Cox Model Summary ===")
cph.print_summary()

# Extract HR table
print("\n=== Hazard Ratios and 95% CIs ===")
hr_table = pd.DataFrame({
    'HR': np.exp(cph.params_),
    'Lower_95': np.exp(cph.confidence_intervals_.iloc[:, 0]),
    'Upper_95': np.exp(cph.confidence_intervals_.iloc[:, 1]),
    'p_value': cph.summary['p']
})
print(hr_table.round(4))

# Forest plot
fig, ax = plt.subplots(figsize=(8, 5))
cph.plot(ax=ax)
ax.set_title("Cox PH Model - Hazard Ratios")
plt.tight_layout()
plt.show()

# --- Task 2: Check proportional hazards assumption ---
print("\n=== Task 2: Proportional Hazards Assumption ===")
try:
    cph.check_assumptions(pbc, p_value_threshold=0.05, show_plots=True)
except Exception as e:
    print(f"PH assumption check result: {e}")
    print("If no warnings are raised, the PH assumption is not violated.")

# Interpretation
print("\nInterpretation:")
print("- A significant p-value (< 0.05) indicates violation of PH.")
print("- Look at the Schoenfeld residual plots for trends over time.")
print("- If a covariate violates PH, consider stratifying on it or")
print("  adding a time interaction.")

# --- Task 3: Concordance index ---
print(f"\n=== Task 3: Concordance Index ===")
c_index = cph.concordance_index_
print(f"C-index: {c_index:.3f}")
print(f"\nInterpretation:")
print(f"- C-index = 0.5: no discriminative ability (random)")
print(f"- C-index = 1.0: perfect discrimination")
print(f"- C-index > 0.7: generally considered acceptable")
print(f"- C-index > 0.8: considered strong discrimination")
quality = "strong" if c_index > 0.8 else ("acceptable" if c_index > 0.7 else "modest")
print(f"This model's C-index of {c_index:.3f} indicates {quality} discrimination.")

# --- Task 4: 5-year survival for a specific patient ---
print(f"\n=== Task 4: 5-Year Survival Prediction ===")
print("Patient profile:")
print("  Age: 55 years")
print(f"  Bilirubin: 3 mg/dL (log = {np.log(3):.2f})")
print("  Albumin: 3.2 g/dL")
print("  Prothrombin time: 11 seconds")
print("  Edema: none (0)")

new_patient = pd.DataFrame({
    'age': [55],
    'log_bili': [np.log(3)],    # bilirubin = 3 mg/dL
    'albumin': [3.2],
    'protime': [11],
    'edema': [0]
})

# Get predicted survival function
surv_func = cph.predict_survival_function(new_patient)

# Find 5-year survival
# Get the closest time point to 5 years
times = surv_func.index
closest_5yr = times[np.argmin(np.abs(times - 5))]
surv_5yr = surv_func.loc[closest_5yr].values[0]

print(f"\nPredicted 5-year survival probability: {surv_5yr:.3f} ({surv_5yr*100:.1f}%)")

# Plot predicted survival curve
fig, ax = plt.subplots(figsize=(8, 5))
surv_func.plot(ax=ax, color='blue', linewidth=2, label='Predicted survival')
ax.axhline(y=0.5, color='grey', linestyle='--', alpha=0.7, label='50% survival')
ax.axvline(x=5, color='red', linestyle='--', alpha=0.7, label='5-year mark')
ax.set_xlabel("Time (years)", fontsize=12)
ax.set_ylabel("Survival probability", fontsize=12)
ax.set_title("Predicted Survival for Patient Profile", fontsize=14)
ax.legend()
plt.tight_layout()
plt.show()

D.3.3 Exercise 3

Show R solution

# =============================================================================
# Chapter 6, Exercise 3: Competing Risks Thinking
# Kidney transplant rejection with death as a competing risk
# =============================================================================
# This exercise is primarily conceptual (parts 1-2) with a bonus coding
# component (part 3).

# --- Question 1 ---
# Explain why treating death as censoring would bias the results.
# What direction would the bias go?
#
# ANSWER:
# Treating death as simple censoring VIOLATES the assumption of
# NON-INFORMATIVE censoring. Non-informative censoring means that
# censored patients have the same future risk of the event as those
# who remain under observation. But patients who die are NOT like
# patients who remain alive -- they have ZERO future risk of rejection
# (because dead patients cannot experience rejection).
#
# The bias direction: treating death as censoring INFLATES (overestimates)
# the estimated probability of rejection. This happens because the
# Kaplan-Meier estimator assumes that censored patients would eventually
# experience the event at the same rate as those still at risk. But dead
# patients will NEVER experience rejection, so acting as if they could
# leads to an overestimate of the cumulative incidence of rejection.
#
# In the KM framework, when a patient dies and is "censored," they are
# removed from the risk set, effectively assuming they would have had
# the same rejection rate as surviving patients. Since they actually
# have zero rejection risk, this inflates the estimated rate.

# --- Question 2 ---
# Which approach for estimating probability of rejection within 2 years:
# cause-specific hazards or Fine-Gray?
#
# ANSWER:
# For estimating the PROBABILITY of rejection within 2 years, use the
# FINE-GRAY subdistribution hazard model. Here is why:
#
# - Cause-specific hazards model the RATE of rejection among those
#   currently alive and rejection-free. This answers: "Among patients
#   still alive, what is the instantaneous risk of rejection?" This is
#   useful for understanding ETIOLOGY (what causes rejection).
#
# - The Fine-Gray model directly models the CUMULATIVE INCIDENCE
#   FUNCTION, which gives the probability of rejection by time t,
#   accounting for the fact that some patients will die first. This
#   is the right quantity for PREDICTION and CLINICAL COMMUNICATION.
#
# When your goal is prediction ("What is the probability that this
# patient's transplant will be rejected within 2 years?"), you need
# the cumulative incidence, which properly accounts for the competing
# risk of death. The Fine-Gray model provides this directly.

# --- Question 3 (Bonus): Fine-Gray model in R ---

library(survival)
library(tidycmprsk)
library(ggplot2)

# Simulate kidney transplant data with competing risks
set.seed(2025)
n <- 500

# Covariates
age <- rnorm(n, mean = 50, sd = 12)
donor_type <- rbinom(n, 1, 0.4)  # 0 = living, 1 = deceased donor
hla_mismatch <- rpois(n, lambda = 2)

# Simulate competing event times
# Time to rejection (cause 1)
lambda_reject <- exp(-4 + 0.02 * age + 0.5 * donor_type + 0.2 * hla_mismatch)
time_reject <- rexp(n, rate = lambda_reject)

# Time to death without rejection (cause 2)
lambda_death <- exp(-5 + 0.03 * age + 0.3 * donor_type)
time_death <- rexp(n, rate = lambda_death)

# Administrative censoring at 10 years
time_censor <- runif(n, 5, 10)

# Determine observed time and event type
time_obs <- pmin(time_reject, time_death, time_censor)
event <- ifelse(time_obs == time_censor, 0,
                ifelse(time_obs == time_reject, 1, 2))
# 0 = censored, 1 = rejection, 2 = death

df <- data.frame(
  time = time_obs,
  event = event,
  age = age,
  donor_type = factor(donor_type, labels = c("Living", "Deceased")),
  hla_mismatch = hla_mismatch
)

cat("=== Dataset summary ===\n")
cat("N:", n, "\n")
cat("Rejections (event=1):", sum(event == 1), "\n")
cat("Deaths (event=2):", sum(event == 2), "\n")
cat("Censored (event=0):", sum(event == 0), "\n\n")

# --- Cumulative Incidence Function (CIF) ---
# Using tidycmprsk for a tidy interface
cuminc_fit <- cuminc(Surv(time, event) ~ donor_type, data = df)
print(cuminc_fit)

# Plot cumulative incidence by donor type
p <- ggcuminc(cuminc_fit, outcome = "1") +  # outcome 1 = rejection
  labs(x = "Time (years)",
       y = "Cumulative Incidence of Rejection",
       title = "Cumulative Incidence of Transplant Rejection by Donor Type") +
  theme_minimal(base_size = 14)
print(p)

# --- Fine-Gray subdistribution hazard model ---
cat("\n=== Fine-Gray Model ===\n")
# Using tidycmprsk::crr for Fine-Gray regression
fg_model <- crr(Surv(time, event) ~ age + donor_type + hla_mismatch,
                data = df, failcode = 1)  # failcode 1 = rejection
print(summary(fg_model))

cat("\nInterpretation:\n")
cat("- The Fine-Gray model estimates subdistribution hazard ratios.\n")
cat("- These quantify how covariates affect the cumulative incidence\n")
cat("  of rejection, accounting for the competing risk of death.\n")
cat("- HR > 1 means higher cumulative incidence of rejection.\n")

# --- Compare with cause-specific Cox model ---
cat("\n=== Cause-Specific Cox Model (for comparison) ===\n")
# Censor deaths for cause-specific analysis of rejection
df$event_cs <- ifelse(df$event == 1, 1, 0)  # only rejection is event
cox_cs <- coxph(Surv(time, event_cs) ~ age + donor_type + hla_mismatch,
                data = df)
print(summary(cox_cs))

cat("\n=== Comparison Summary ===\n")
cat("The cause-specific and Fine-Gray HRs can differ because they\n")
cat("answer different questions:\n")
cat("- Cause-specific: 'What affects the rate of rejection among\n")
cat("  those still alive?'\n")
cat("- Fine-Gray: 'What affects the probability of rejection by\n")
cat("  time t, accounting for death?'\n")
cat("Both are valid; the choice depends on whether your goal is\n")
cat("etiological understanding or clinical prediction.\n")

Show Python solution

# =============================================================================
# Chapter 6, Exercise 3: Competing Risks Thinking
# Kidney transplant rejection with death as a competing risk
# =============================================================================
# This exercise is primarily conceptual (parts 1-2) with a bonus coding
# component (part 3).

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# --- Question 1 ---
# Explain why treating death as censoring would bias the results.
# What direction would the bias go?
#
# ANSWER:
# Treating death as simple censoring VIOLATES the assumption of
# NON-INFORMATIVE censoring. Non-informative censoring means that
# censored patients have the same future risk of the event as those
# who remain under observation. But patients who die are NOT like
# patients who remain alive -- they have ZERO future risk of rejection
# (because dead patients cannot experience rejection).
#
# The bias direction: treating death as censoring INFLATES (overestimates)
# the estimated probability of rejection. This happens because the
# Kaplan-Meier estimator assumes that censored patients would eventually
# experience the event at the same rate as those still at risk. But dead
# patients will NEVER experience rejection, so acting as if they could
# leads to an overestimate of the cumulative incidence of rejection.
#
# In the KM framework, when a patient dies and is "censored," they are
# removed from the risk set, effectively assuming they would have had
# the same rejection rate as surviving patients. Since they actually
# have zero rejection risk, this inflates the estimated rate.

# --- Question 2 ---
# Which approach for estimating probability of rejection within 2 years:
# cause-specific hazards or Fine-Gray?
#
# ANSWER:
# For estimating the PROBABILITY of rejection within 2 years, use the
# FINE-GRAY subdistribution hazard model. Here is why:
#
# - Cause-specific hazards model the RATE of rejection among those
#   currently alive and rejection-free. This answers: "Among patients
#   still alive, what is the instantaneous risk of rejection?" This is
#   useful for understanding ETIOLOGY (what causes rejection).
#
# - The Fine-Gray model directly models the CUMULATIVE INCIDENCE
#   FUNCTION, which gives the probability of rejection by time t,
#   accounting for the fact that some patients will die first. This
#   is the right quantity for PREDICTION and CLINICAL COMMUNICATION.
#
# When your goal is prediction ("What is the probability that this
# patient's transplant will be rejected within 2 years?"), you need
# the cumulative incidence, which properly accounts for the competing
# risk of death. The Fine-Gray model provides this directly.

# --- Question 3 (Bonus): Competing risks analysis in Python ---

from lifelines import KaplanMeierFitter, CoxPHFitter

# Simulate kidney transplant data with competing risks
np.random.seed(2025)
n = 500

# Covariates
age = np.random.normal(50, 12, size=n)
donor_type = np.random.binomial(1, 0.4, size=n)  # 0=living, 1=deceased
hla_mismatch = np.random.poisson(2, size=n)

# Simulate competing event times
# Time to rejection (cause 1)
lambda_reject = np.exp(-4 + 0.02 * age + 0.5 * donor_type + 0.2 * hla_mismatch)
time_reject = np.random.exponential(1 / lambda_reject)

# Time to death without rejection (cause 2)
lambda_death = np.exp(-5 + 0.03 * age + 0.3 * donor_type)
time_death = np.random.exponential(1 / lambda_death)

# Administrative censoring at 5-10 years
time_censor = np.random.uniform(5, 10, size=n)

# Determine observed time and event type
time_obs = np.minimum(np.minimum(time_reject, time_death), time_censor)
event = np.where(time_obs == time_censor, 0,
                 np.where(time_obs == time_reject, 1, 2))
# 0 = censored, 1 = rejection, 2 = death

df = pd.DataFrame({
    'time': time_obs,
    'event': event,
    'age': age,
    'donor_type': donor_type,
    'hla_mismatch': hla_mismatch
})

print("=== Dataset summary ===")
print(f"N: {n}")
print(f"Rejections (event=1): {(event == 1).sum()}")
print(f"Deaths (event=2): {(event == 2).sum()}")
print(f"Censored (event=0): {(event == 0).sum()}\n")

# --- Approach 1: Naive KM (treating death as censoring) ---
# This is WRONG but illustrative
kmf_naive = KaplanMeierFitter()
event_naive = (event == 1).astype(int)  # only rejection = event; death = censored
kmf_naive.fit(time_obs, event_observed=event_naive, label="Naive KM (death = censored)")

# --- Approach 2: Cause-specific Cox model ---
# For rejection: censor deaths and administrative censoring
print("=== Cause-Specific Cox Model (rejection) ===")
df_cs = df.copy()
df_cs['event_rejection'] = (df_cs['event'] == 1).astype(int)

cph_cs = CoxPHFitter()
cph_cs.fit(df_cs, duration_col='time', event_col='event_rejection',
           formula='age + donor_type + hla_mismatch')
cph_cs.print_summary()

# --- Approach 3: Cumulative Incidence Function (Aalen-Johansen estimator) ---
# Compute the CIF manually using the Aalen-Johansen approach
# This properly accounts for competing risks

def compute_cif(times, events, cause=1):
    """
    Compute the cumulative incidence function for a specific cause
    using the Aalen-Johansen estimator.
    """
    # Sort by time
    order = np.argsort(times)
    t_sorted = times[order]
    e_sorted = events[order]

    unique_times = np.unique(t_sorted[e_sorted > 0])
    n_risk = len(times)

    cif = []
    surv = 1.0  # Overall survival (from all causes)
    cif_times = []

    for t in unique_times:
        at_risk = np.sum(t_sorted >= t)
        d_cause = np.sum((t_sorted == t) & (e_sorted == cause))
        d_all = np.sum((t_sorted == t) & (e_sorted > 0))

        if at_risk > 0:
            # CIF increment: S(t-) * d_cause/n_risk
            cif_increment = surv * (d_cause / at_risk)
            # Update overall survival
            surv *= (1 - d_all / at_risk)

            cif.append(cif_increment)
            cif_times.append(t)

    # Cumulative sum
    cif_cumulative = np.cumsum(cif)
    return np.array(cif_times), cif_cumulative

# Compute CIF for rejection (cause 1)
cif_times, cif_values = compute_cif(time_obs, event, cause=1)

# Compute CIF for death (cause 2)
cif_times_d, cif_values_d = compute_cif(time_obs, event, cause=2)

# --- Plot: Naive KM vs CIF ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left panel: Naive KM (1 - S(t)) vs CIF for rejection
ax1 = axes[0]
# Naive 1 - KM
naive_times = kmf_naive.survival_function_.index
naive_1_minus_s = 1 - kmf_naive.survival_function_.values.flatten()
ax1.step(naive_times, naive_1_minus_s, color='red', linewidth=2,
         label='Naive 1-KM (BIASED)', where='post')
# Proper CIF
ax1.step(cif_times, cif_values, color='blue', linewidth=2,
         label='Cumulative Incidence (correct)', where='post')
ax1.set_xlabel("Time (years)", fontsize=12)
ax1.set_ylabel("Probability of Rejection", fontsize=12)
ax1.set_title("Naive KM vs Cumulative Incidence", fontsize=13)
ax1.legend(fontsize=10)
ax1.set_xlim(0, 10)

# Right panel: Stacked CIF (rejection + death)
ax2 = axes[1]
# Interpolate to common time grid
from scipy import interpolate

t_grid = np.linspace(0.01, min(cif_times.max(), cif_times_d.max()), 200)
f_reject = interpolate.interp1d(cif_times, cif_values, kind='previous',
                                 bounds_error=False, fill_value=(0, cif_values[-1]))
f_death = interpolate.interp1d(cif_times_d, cif_values_d, kind='previous',
                                bounds_error=False, fill_value=(0, cif_values_d[-1]))

cif_r = f_reject(t_grid)
cif_d = f_death(t_grid)

ax2.fill_between(t_grid, 0, cif_r, alpha=0.4, color='#e74c3c', label='Rejection')
ax2.fill_between(t_grid, cif_r, cif_r + cif_d, alpha=0.4,
                 color='#3498db', label='Death (competing)')
ax2.plot(t_grid, cif_r + cif_d, color='black', linewidth=1)
ax2.set_xlabel("Time (years)", fontsize=12)
ax2.set_ylabel("Cumulative Incidence", fontsize=12)
ax2.set_title("Stacked Cumulative Incidence Functions", fontsize=13)
ax2.legend(fontsize=10)
ax2.set_xlim(0, 10)
ax2.set_ylim(0, 1)

plt.tight_layout()
plt.show()

# --- 2-year cumulative incidence of rejection ---
# Find closest time to 2 years
idx_2yr = np.argmin(np.abs(cif_times - 2))
ci_2yr = cif_values[idx_2yr]
print(f"\n=== 2-Year Cumulative Incidence of Rejection ===")
print(f"Proper CIF estimate: {ci_2yr:.3f} ({ci_2yr*100:.1f}%)")

# Compare with naive KM
naive_2yr = 1 - kmf_naive.predict(2)
print(f"Naive 1-KM estimate: {naive_2yr:.3f} ({naive_2yr*100:.1f}%)")
print(f"\nThe naive estimate is HIGHER (biased upward) because it treats")
print(f"deaths as censoring, overestimating the probability of rejection.")

# --- Summary ---
print("\n=== Summary ===")
print("1. Treating death as censoring overestimates rejection probability.")
print("2. For prediction, use the Fine-Gray model / cumulative incidence.")
print("3. For etiology, use cause-specific Cox models.")
print("4. Always report cumulative incidence functions, not 1-KM,")
print("   when competing risks are present.")

D.4 Chapter 07: ML Foundations

D.4.1 Exercise 1

Show R solution

# =============================================================================
# Chapter 7, Exercise 1: Cross-Validation Experiment
# Compare logistic regression and SVM (RBF kernel) using 10-fold stratified CV.
# Report AUC for both models.
# =============================================================================

library(tidyverse)
library(tidymodels)
library(kernlab)

# --- Simulate the clinical dataset ---
set.seed(123)
n <- 600
ex_data <- tibble(
  age = rnorm(n, 65, 10),
  creatinine = rlnorm(n, 0, 0.5),
  hemoglobin = rnorm(n, 12, 2),
  platelets = rnorm(n, 250, 70),
  wbc = rlnorm(n, 2, 0.4),
  icu = factor(rbinom(n, 1, plogis(-4 + 0.03 * rnorm(n, 65, 10) +
                                    0.5 * rlnorm(n, 0, 0.5))),
               labels = c("No", "Yes"))
)

cat("ICU admission rate:", mean(ex_data$icu == "Yes"), "\n")

# --- 10-fold stratified cross-validation ---
set.seed(42)
folds <- vfold_cv(ex_data, v = 10, strata = icu)

# --- Logistic Regression workflow ---
lr_spec <- logistic_reg() %>%
  set_engine("glm")

lr_recipe <- recipe(icu ~ ., data = ex_data) %>%
  step_normalize(all_numeric_predictors())

lr_wf <- workflow() %>%
  add_model(lr_spec) %>%
  add_recipe(lr_recipe)

lr_results <- fit_resamples(lr_wf, resamples = folds,
                            metrics = metric_set(roc_auc))

# --- SVM (RBF kernel) workflow ---
svm_spec <- svm_rbf(cost = 1, rbf_sigma = 0.5) %>%
  set_engine("kernlab") %>%
  set_mode("classification")

svm_recipe <- recipe(icu ~ ., data = ex_data) %>%
  step_normalize(all_numeric_predictors())

svm_wf <- workflow() %>%
  add_model(svm_spec) %>%
  add_recipe(svm_recipe)

svm_results <- fit_resamples(svm_wf, resamples = folds,
                             metrics = metric_set(roc_auc))

# --- Collect and compare results ---
lr_metrics  <- collect_metrics(lr_results) %>% mutate(model = "Logistic Regression")
svm_metrics <- collect_metrics(svm_results) %>% mutate(model = "SVM (RBF)")

comparison <- bind_rows(lr_metrics, svm_metrics) %>%
  select(model, .metric, mean, std_err)

print(comparison)

# --- Interpretation ---
# Both models are expected to show similar AUC because the simulated data
# has simple, roughly linear relationships between predictors and outcome.
# Logistic regression handles linear relationships well, and an RBF SVM
# adds flexibility that is not needed here.
# With real clinical data that has non-linear or interactive effects,
# SVM may outperform logistic regression.

Show Python solution

# =============================================================================
# Chapter 7, Exercise 1: Cross-Validation Experiment
# Compare logistic regression and SVM (RBF kernel) using 10-fold stratified CV.
# Report AUC for both models.
# =============================================================================

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# --- Simulate the clinical dataset ---
np.random.seed(123)
n = 600

X = pd.DataFrame({
    'age': np.random.normal(65, 10, n),
    'creatinine': np.random.lognormal(0, 0.5, n),
    'hemoglobin': np.random.normal(12, 2, n),
    'platelets': np.random.normal(250, 70, n),
    'wbc': np.random.lognormal(2, 0.4, n)
})

# Simulate ICU outcome (binary)
prob = 1 / (1 + np.exp(-(-4 + 0.03 * np.random.normal(65, 10, n) +
                          0.5 * np.random.lognormal(0, 0.5, n))))
y = np.random.binomial(1, prob)
print(f"ICU admission rate: {y.mean():.3f}")

# --- 10-fold stratified CV ---
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# --- Logistic Regression ---
lr_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr_scores = cross_val_score(lr_pipe, X, y, cv=cv, scoring='roc_auc')
print(f"\nLogistic Regression AUC: {lr_scores.mean():.3f} (+/- {lr_scores.std():.3f})")

# --- SVM (RBF kernel) ---
svm_pipe = make_pipeline(
    StandardScaler(),
    SVC(kernel='rbf', C=1.0, gamma=0.5, probability=True)
)
svm_scores = cross_val_score(svm_pipe, X, y, cv=cv, scoring='roc_auc')
print(f"SVM (RBF) AUC:           {svm_scores.mean():.3f} (+/- {svm_scores.std():.3f})")

# --- Comparison ---
results = pd.DataFrame({
    'Model': ['Logistic Regression', 'SVM (RBF)'],
    'Mean AUC': [lr_scores.mean(), svm_scores.mean()],
    'Std AUC': [lr_scores.std(), svm_scores.std()]
})
print("\n", results.to_string(index=False))

# --- Interpretation ---
# Both models are expected to show similar AUC because the simulated data
# has simple, roughly linear relationships. Logistic regression is well-suited
# for such data. The RBF SVM adds flexibility that is not needed here.
# With real clinical data containing non-linear effects, SVM may outperform
# logistic regression.

D.4.2 Exercise 2

Show R solution

# =============================================================================
# Chapter 7, Exercise 2: Feature Engineering Challenge
# Engineer clinically meaningful features from raw variables for a diabetes
# prediction model. Implement using tidymodels recipes.
# =============================================================================

library(tidyverse)
library(tidymodels)

# --- Simulate raw clinical data ---
set.seed(42)
n <- 500

raw_data <- tibble(
  height_cm = rnorm(n, 170, 10),
  weight_kg = rnorm(n, 80, 15),
  sbp = rnorm(n, 130, 18),
  dbp = rnorm(n, 82, 12),
  fasting_glucose = rnorm(n, 105, 25),
  hba1c = rnorm(n, 5.8, 0.8),
  age = rnorm(n, 55, 12),
  sex = sample(c("Male", "Female"), n, replace = TRUE),
  waist_circumference = rnorm(n, 95, 12),
  hip_circumference = rnorm(n, 100, 10),
  total_cholesterol = rnorm(n, 200, 40),
  hdl = rnorm(n, 50, 15),
  ldl = rnorm(n, 120, 35),
  triglycerides = rnorm(n, 150, 60),
  diabetes = factor(rbinom(n, 1, 0.3), labels = c("No", "Yes"))
)

# --- Part 1: Clinically meaningful engineered features ---
# 1. BMI = weight_kg / (height_m)^2
#    WHY: Standard obesity measure; strong risk factor for Type 2 diabetes.
#
# 2. Pulse Pressure = sbp - dbp
#    WHY: Reflects arterial stiffness; associated with cardiovascular risk
#         and metabolic syndrome.
#
# 3. Mean Arterial Pressure (MAP) = dbp + (sbp - dbp) / 3
#    WHY: Measures average perfusion pressure; linked to vascular health.
#
# 4. Waist-to-Hip Ratio (WHR) = waist_circumference / hip_circumference
#    WHY: Central adiposity is a stronger predictor of insulin resistance
#         than BMI alone.
#
# 5. Non-HDL Cholesterol = total_cholesterol - hdl
#    WHY: Captures all atherogenic lipoproteins; recommended by guidelines
#         as a secondary target in diabetes management.
#
# 6. Triglyceride-to-HDL Ratio = triglycerides / hdl
#    WHY: A proxy for insulin resistance; high TG/HDL ratio is associated
#         with increased diabetes risk.
#
# 7. LDL/HDL Ratio = ldl / hdl
#    WHY: Captures atherogenic dyslipidemia profile common in diabetes.

# --- Part 2: Implement feature engineering with recipes ---
diabetes_recipe <- recipe(diabetes ~ ., data = raw_data) %>%
  # BMI: weight / height_m^2
  step_mutate(
    height_m = height_cm / 100,
    bmi = weight_kg / height_m^2,
    # Pulse pressure
    pulse_pressure = sbp - dbp,
    # Mean arterial pressure
    map = dbp + (sbp - dbp) / 3,
    # Waist-to-hip ratio
    whr = waist_circumference / hip_circumference,
    # Non-HDL cholesterol
    non_hdl = total_cholesterol - hdl,
    # Triglyceride-to-HDL ratio
    tg_hdl_ratio = triglycerides / hdl,
    # LDL/HDL ratio
    ldl_hdl_ratio = ldl / hdl
  ) %>%
  # Remove intermediate and redundant variables
  step_rm(height_m) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

# Prepare (bake) the recipe to see the result
prepped <- prep(diabetes_recipe)
engineered_data <- bake(prepped, new_data = NULL)

cat("Original variables:", ncol(raw_data) - 1, "\n")
cat("Engineered dataset columns:", ncol(engineered_data) - 1, "\n")
cat("\nColumn names:\n")
print(names(engineered_data))

# --- Part 3: Discuss redundant features ---
# After engineering:
# - height_cm and weight_kg may become redundant once BMI is computed
#   (though height or weight alone may still carry predictive signal).
# - sbp and dbp are partially captured by pulse_pressure and MAP,
#   though keeping them may still be useful for tree-based models.
# - waist_circumference and hip_circumference are largely captured by WHR.
# - total_cholesterol is partly captured by non_hdl (since non_hdl = TC - HDL).
# - Individual lipids (hdl, ldl, triglycerides) overlap with the ratios,
#   but a LASSO or tree model can sort out which representation is most useful.
#
# In practice, include both raw and engineered features and let a
# regularised model (LASSO, elastic net) or tree-based model perform
# implicit feature selection.

Show Python solution

# =============================================================================
# Chapter 7, Exercise 2: Feature Engineering Challenge
# Engineer clinically meaningful features from raw variables for a diabetes
# prediction model. Implement using pandas and sklearn.
# =============================================================================

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# --- Simulate raw clinical data ---
np.random.seed(42)
n = 500

raw_data = pd.DataFrame({
    'height_cm': np.random.normal(170, 10, n),
    'weight_kg': np.random.normal(80, 15, n),
    'sbp': np.random.normal(130, 18, n),
    'dbp': np.random.normal(82, 12, n),
    'fasting_glucose': np.random.normal(105, 25, n),
    'hba1c': np.random.normal(5.8, 0.8, n),
    'age': np.random.normal(55, 12, n),
    'sex': np.random.choice(['Male', 'Female'], n),
    'waist_circumference': np.random.normal(95, 12, n),
    'hip_circumference': np.random.normal(100, 10, n),
    'total_cholesterol': np.random.normal(200, 40, n),
    'hdl': np.random.normal(50, 15, n),
    'ldl': np.random.normal(120, 35, n),
    'triglycerides': np.random.normal(150, 60, n),
})
y = np.random.binomial(1, 0.3, n)

# --- Part 1: Clinically meaningful engineered features ---
# 1. BMI = weight_kg / (height_m)^2
#    WHY: Standard obesity measure; strong risk factor for Type 2 diabetes.
#
# 2. Pulse Pressure = sbp - dbp
#    WHY: Reflects arterial stiffness; associated with cardiovascular risk
#         and metabolic syndrome.
#
# 3. Mean Arterial Pressure (MAP) = dbp + (sbp - dbp) / 3
#    WHY: Measures average perfusion pressure; linked to vascular health.
#
# 4. Waist-to-Hip Ratio (WHR) = waist / hip
#    WHY: Central adiposity is a stronger predictor of insulin resistance
#         than BMI alone.
#
# 5. Non-HDL Cholesterol = total_cholesterol - hdl
#    WHY: Captures all atherogenic lipoproteins; recommended by guidelines
#         as a secondary target in diabetes management.
#
# 6. Triglyceride-to-HDL Ratio = triglycerides / hdl
#    WHY: A proxy for insulin resistance; high TG/HDL ratio is associated
#         with increased diabetes risk.
#
# 7. LDL/HDL Ratio = ldl / hdl
#    WHY: Captures atherogenic dyslipidemia profile common in diabetes.

# --- Part 2: Implement feature engineering ---
df = raw_data.copy()

# Compute engineered features
df['bmi'] = df['weight_kg'] / (df['height_cm'] / 100) ** 2
df['pulse_pressure'] = df['sbp'] - df['dbp']
df['map'] = df['dbp'] + (df['sbp'] - df['dbp']) / 3
df['whr'] = df['waist_circumference'] / df['hip_circumference']
df['non_hdl'] = df['total_cholesterol'] - df['hdl']
df['tg_hdl_ratio'] = df['triglycerides'] / df['hdl']
df['ldl_hdl_ratio'] = df['ldl'] / df['hdl']

# Encode sex as binary
df['sex_male'] = (df['sex'] == 'Male').astype(int)
df = df.drop(columns=['sex'])

print(f"Original features: {raw_data.shape[1]}")
print(f"Engineered features: {df.shape[1]}")
print(f"\nColumn names:\n{list(df.columns)}")

# Scale numeric features
scaler = StandardScaler()
numeric_cols = df.select_dtypes(include=[np.number]).columns
df_scaled = df.copy()
df_scaled[numeric_cols] = scaler.fit_transform(df[numeric_cols])

print(f"\nFirst few rows of engineered data:")
print(df_scaled.head())

# --- Part 3: Discuss redundant features ---
# After engineering:
# - height_cm and weight_kg may become redundant once BMI is computed
#   (though they may still carry independent predictive signal).
# - sbp and dbp are partially captured by pulse_pressure and MAP,
#   though keeping them may still be useful for tree-based models.
# - waist_circumference and hip_circumference are largely captured by WHR.
# - total_cholesterol is partly captured by non_hdl (since non_hdl = TC - HDL).
# - Individual lipids (hdl, ldl, triglycerides) overlap with the ratios,
#   but a LASSO or tree model can sort out which representation is most useful.
#
# In practice, include both raw and engineered features and let a
# regularised model (LASSO, elastic net) or tree-based model perform
# implicit feature selection.

D.4.3 Exercise 3

Show R solution

# =============================================================================
# Chapter 7, Exercise 3: The Bias-Variance Tradeoff in Practice
# Fit polynomials of degree 1, 3, 5, 10, and 20 to a training set.
# Plot fitted curves, compute training/test error, identify optimal degree.
# =============================================================================

library(tidyverse)

# --- Generate data ---
set.seed(42)
x_train <- sort(runif(50, 0, 10))
y_train <- sin(x_train) + rnorm(50, 0, 0.3)
x_test <- sort(runif(200, 0, 10))
y_test <- sin(x_test) + rnorm(200, 0, 0.3)

train_df <- tibble(x = x_train, y = y_train)
test_df  <- tibble(x = x_test,  y = y_test)

# --- Fit polynomials and compute RMSE ---
degrees <- c(1, 3, 5, 10, 20)
results <- tibble(degree = integer(), train_rmse = double(), test_rmse = double())

# Also store predictions for plotting
plot_data <- tibble()

for (d in degrees) {
  # Fit polynomial of degree d
  fit <- lm(y ~ poly(x, degree = d, raw = TRUE), data = train_df)

  # Predictions on train and test
  pred_train <- predict(fit, newdata = train_df)
  pred_test  <- predict(fit, newdata = test_df)

  # Compute RMSE
  rmse_train <- sqrt(mean((y_train - pred_train)^2))
  rmse_test  <- sqrt(mean((y_test - pred_test)^2))

  results <- bind_rows(results,
                       tibble(degree = d, train_rmse = rmse_train, test_rmse = rmse_test))

  # Smooth curve for plotting
  x_grid <- seq(0, 10, length.out = 300)
  pred_grid <- predict(fit, newdata = tibble(x = x_grid))
  # Clip extreme predictions for high-degree polynomials
  pred_grid <- pmin(pmax(pred_grid, -3), 3)
  plot_data <- bind_rows(plot_data,
                         tibble(x = x_grid, y_pred = pred_grid,
                                degree = paste("Degree", d)))
}

# --- Print RMSE results ---
cat("Polynomial Regression: Training vs Test RMSE\n")
cat("=============================================\n")
print(results)

# --- Part 2: Plot fitted curves ---
p1 <- ggplot() +
  geom_point(data = train_df, aes(x, y), alpha = 0.5, size = 2) +
  geom_line(data = plot_data, aes(x, y_pred, color = degree), linewidth = 1) +
  geom_line(data = tibble(x = seq(0, 10, 0.01), y = sin(seq(0, 10, 0.01))),
            aes(x, y), linetype = "dashed", color = "black", linewidth = 0.8) +
  labs(title = "Polynomial Fits of Varying Complexity",
       subtitle = "Dashed line = true function sin(x)",
       x = "x", y = "y", color = "Polynomial") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "top")

print(p1)

# --- Part 3: Plot training vs test error ---
results_long <- results %>%
  pivot_longer(cols = c(train_rmse, test_rmse),
               names_to = "set", values_to = "rmse") %>%
  mutate(set = ifelse(set == "train_rmse", "Training", "Test"))

p2 <- ggplot(results_long, aes(x = degree, y = rmse, color = set)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  labs(title = "Training vs Test RMSE by Polynomial Degree",
       x = "Polynomial Degree", y = "RMSE", color = "Dataset") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "top")

print(p2)

# --- Part 4: Interpretation ---
best_degree <- results$degree[which.min(results$test_rmse)]
cat("\nBest polynomial degree (lowest test RMSE):", best_degree, "\n")
cat("\nInterpretation (Bias-Variance Tradeoff):\n")
cat("- Degree 1 (linear): High bias -- too simple to capture the sine curve.\n")
cat("  Underfits both training and test data.\n")
cat("- Degree 3-5: Good balance. Captures the main curvature of sin(x)\n")
cat("  without fitting noise. Test error is minimized here.\n")
cat("- Degree 10-20: High variance -- the polynomial wiggles to fit\n")
cat("  training noise. Training error drops but test error increases.\n")
cat("  This is classic overfitting.\n")
cat("\nThe optimal degree (~3-5) sits at the sweet spot of the bias-variance\n")
cat("tradeoff, where the model is flexible enough to capture the true\n")
cat("pattern but not so flexible that it memorizes noise.\n")

Show Python solution

# =============================================================================
# Chapter 7, Exercise 3: The Bias-Variance Tradeoff in Practice
# Fit polynomials of degree 1, 3, 5, 10, and 20 to a training set.
# Plot fitted curves, compute training/test error, identify optimal degree.
# =============================================================================

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# --- Generate data ---
np.random.seed(42)
x_train = np.sort(np.random.uniform(0, 10, 50))
y_train = np.sin(x_train) + np.random.normal(0, 0.3, 50)
x_test = np.sort(np.random.uniform(0, 10, 200))
y_test = np.sin(x_test) + np.random.normal(0, 0.3, 200)

# --- Fit polynomials and compute RMSE ---
degrees = [1, 3, 5, 10, 20]
results = []
predictions = {}

for d in degrees:
    # Create polynomial features and fit
    poly = PolynomialFeatures(degree=d, include_bias=False)
    X_train_poly = poly.fit_transform(x_train.reshape(-1, 1))
    X_test_poly = poly.transform(x_test.reshape(-1, 1))

    model = LinearRegression()
    model.fit(X_train_poly, y_train)

    # Predictions
    pred_train = model.predict(X_train_poly)
    pred_test = model.predict(X_test_poly)

    # RMSE
    rmse_train = np.sqrt(mean_squared_error(y_train, pred_train))
    rmse_test = np.sqrt(mean_squared_error(y_test, pred_test))

    results.append({'degree': d, 'train_rmse': rmse_train, 'test_rmse': rmse_test})

    # Smooth curve for plotting
    x_grid = np.linspace(0, 10, 300).reshape(-1, 1)
    X_grid_poly = poly.transform(x_grid)
    pred_grid = model.predict(X_grid_poly)
    # Clip extreme predictions for high-degree polynomials
    pred_grid = np.clip(pred_grid, -3, 3)
    predictions[d] = (x_grid.ravel(), pred_grid)

# --- Print RMSE results ---
print("Polynomial Regression: Training vs Test RMSE")
print("=" * 50)
print(f"{'Degree':>8s}  {'Train RMSE':>12s}  {'Test RMSE':>12s}")
print("-" * 50)
for r in results:
    print(f"{r['degree']:>8d}  {r['train_rmse']:>12.4f}  {r['test_rmse']:>12.4f}")

# --- Part 2: Plot fitted curves ---
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(x_train, y_train, alpha=0.5, s=30, label='Training data', zorder=5)
x_true = np.linspace(0, 10, 300)
ax.plot(x_true, np.sin(x_true), 'k--', linewidth=1.5, label='True function sin(x)')

colors = ['#E69F00', '#56B4E9', '#009E73', '#D55E00', '#CC79A7']
for (d, color) in zip(degrees, colors):
    x_g, pred_g = predictions[d]
    ax.plot(x_g, pred_g, linewidth=1.5, color=color, label=f'Degree {d}')

ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Polynomial Fits of Varying Complexity')
ax.legend(loc='upper right', fontsize=9)
ax.set_ylim(-3, 3)
plt.tight_layout()
plt.show()

# --- Part 3: Plot training vs test error ---
degs = [r['degree'] for r in results]
train_rmses = [r['train_rmse'] for r in results]
test_rmses = [r['test_rmse'] for r in results]

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(degs, train_rmses, 'o-', linewidth=2, label='Training RMSE', color='steelblue')
ax.plot(degs, test_rmses, 'o-', linewidth=2, label='Test RMSE', color='darkorange')
ax.set_xlabel('Polynomial Degree')
ax.set_ylabel('RMSE')
ax.set_title('Training vs Test RMSE by Polynomial Degree')
ax.legend()
plt.tight_layout()
plt.show()

# --- Part 4: Interpretation ---
best_idx = np.argmin(test_rmses)
best_degree = degs[best_idx]
print(f"\nBest polynomial degree (lowest test RMSE): {best_degree}")
print("\nInterpretation (Bias-Variance Tradeoff):")
print("- Degree 1 (linear): High bias -- too simple to capture the sine curve.")
print("  Underfits both training and test data.")
print("- Degree 3-5: Good balance. Captures the main curvature of sin(x)")
print("  without fitting noise. Test error is minimized here.")
print("- Degree 10-20: High variance -- the polynomial wiggles to fit")
print("  training noise. Training error drops but test error increases.")
print("  This is classic overfitting.")
print("\nThe optimal degree (~3-5) sits at the sweet spot of the bias-variance")
print("tradeoff, where the model is flexible enough to capture the true")
print("pattern but not so flexible that it memorizes noise.")

D.5 Chapter 08: Trees and Ensembles

D.5.1 Exercise 1

Show R solution

# =============================================================================
# Chapter 8, Exercise 1: Build and Prune a Classification Tree
# 1. Fit a full, unpruned tree and count terminal nodes.
# 2. Use cross-validation to find the optimal cp.
# 3. Prune the tree and plot it.
# 4. Identify the top 3 splitting variables.
# =============================================================================

library(tidyverse)
library(rpart)
library(rpart.plot)

# --- Simulate the readmission dataset (same as chapter) ---
set.seed(42)
n <- 1000

readmit_data <- tibble(
  age = rnorm(n, 68, 12),
  length_of_stay = rpois(n, 5) + 1,
  num_comorbidities = rpois(n, 3),
  prior_admissions = rpois(n, 1),
  discharge_hemoglobin = rnorm(n, 11, 2),
  discharge_creatinine = rlnorm(n, 0.2, 0.5),
  has_diabetes = rbinom(n, 1, 0.35),
  has_chf = rbinom(n, 1, 0.25),
  readmitted = factor(
    rbinom(n, 1, plogis(-3 + 0.02 * (rnorm(n, 68, 12) - 68) +
                          0.15 * rpois(n, 1) +
                          0.1 * rpois(n, 3) +
                          0.3 * rbinom(n, 1, 0.25) -
                          0.1 * rnorm(n, 11, 2))),
    labels = c("No", "Yes")
  )
)

cat("Readmission rate:", mean(readmit_data$readmitted == "Yes"), "\n")

# --- Part 1: Fit a full, unpruned tree ---
full_tree <- rpart(readmitted ~ ., data = readmit_data, method = "class",
                   control = rpart.control(cp = 0.001, minsplit = 2, minbucket = 1))

n_terminal_full <- sum(full_tree$frame$var == "<leaf>")
cat("\nFull tree terminal nodes:", n_terminal_full, "\n")

# --- Part 2: Cross-validation to find optimal cp ---
cat("\nCP Table:\n")
printcp(full_tree)

# Plot CV error vs cp
plotcp(full_tree)

# Find optimal cp (minimum xerror)
cp_table <- full_tree$cptable
optimal_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
cat("\nOptimal cp:", optimal_cp, "\n")

# Alternative: 1-SE rule (smallest tree within 1 SE of the minimum)
min_xerror <- min(cp_table[, "xerror"])
min_se <- cp_table[which.min(cp_table[, "xerror"]), "xstd"]
cp_1se <- cp_table[cp_table[, "xerror"] <= min_xerror + min_se, "CP"]
optimal_cp_1se <- max(cp_1se)  # largest cp (smallest tree) within 1 SE
cat("Optimal cp (1-SE rule):", optimal_cp_1se, "\n")

# --- Part 3: Prune and plot ---
pruned_tree <- prune(full_tree, cp = optimal_cp)
n_terminal_pruned <- sum(pruned_tree$frame$var == "<leaf>")
cat("\nPruned tree terminal nodes:", n_terminal_pruned, "\n")

rpart.plot(pruned_tree, type = 4, extra = 106, under = TRUE,
           box.palette = "RdYlGn", roundint = FALSE,
           main = "Pruned Decision Tree: 30-Day Readmission")

# --- Part 4: Top 3 splitting variables ---
cat("\nVariable Importance:\n")
vi <- sort(full_tree$variable.importance, decreasing = TRUE)
print(vi)

top3 <- names(vi)[1:min(3, length(vi))]
cat("\nTop 3 splitting variables:", paste(top3, collapse = ", "), "\n")
cat("\nClinical interpretation:\n")
cat("- These variables capture patient acuity and complexity.\n")
cat("- Discharge lab values (hemoglobin, creatinine) reflect the patient's\n")
cat("  clinical status at the time of discharge.\n")
cat("- Age, comorbidity count, and prior admissions reflect overall\n")
cat("  disease burden and frailty.\n")
cat("- These are well-established risk factors for 30-day readmission\n")
cat("  in the clinical literature.\n")

Show Python solution

# =============================================================================
# Chapter 8, Exercise 1: Build and Prune a Classification Tree
# 1. Fit trees with different max_depth values and count leaves.
# 2. Cross-validate each to find optimal max_depth.
# 3. Plot the optimal tree.
# 4. Examine feature importances.
# =============================================================================

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import cross_val_score, StratifiedKFold
import matplotlib.pyplot as plt

# --- Simulate the readmission dataset (same as chapter) ---
np.random.seed(42)
n = 1000

df = pd.DataFrame({
    'age': np.random.normal(68, 12, n),
    'length_of_stay': np.random.poisson(5, n) + 1,
    'num_comorbidities': np.random.poisson(3, n),
    'prior_admissions': np.random.poisson(1, n),
    'discharge_hemoglobin': np.random.normal(11, 2, n),
    'discharge_creatinine': np.random.lognormal(0.2, 0.5, n),
    'has_diabetes': np.random.binomial(1, 0.35, n),
    'has_chf': np.random.binomial(1, 0.25, n)
})
y = np.random.binomial(1, 0.18, n)

feature_names = list(df.columns)
print(f"Readmission rate: {y.mean():.3f}")

# --- Part 1: Fit a full, unpruned tree ---
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(df, y)
n_leaves_full = full_tree.get_n_leaves()
print(f"\nFull (unpruned) tree: {n_leaves_full} terminal nodes")
print(f"Full tree depth: {full_tree.get_depth()}")

# --- Part 2: Cross-validation to find optimal max_depth ---
depths = [2, 3, 4, 5, 7, 10, 15, 20, None]
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

print("\nCross-validation results by max_depth:")
print(f"{'max_depth':>10s}  {'Mean AUC':>10s}  {'Std AUC':>10s}")
print("-" * 35)

cv_results = []
for d in depths:
    tree = DecisionTreeClassifier(max_depth=d, random_state=42)
    scores = cross_val_score(tree, df, y, cv=cv, scoring='roc_auc')
    label = str(d) if d is not None else "None"
    cv_results.append({'max_depth': d, 'label': label,
                       'mean_auc': scores.mean(), 'std_auc': scores.std()})
    print(f"{label:>10s}  {scores.mean():>10.3f}  {scores.std():>10.3f}")

# Find the best depth
best = max(cv_results, key=lambda x: x['mean_auc'])
print(f"\nBest max_depth: {best['label']} (AUC = {best['mean_auc']:.3f})")

# --- Part 3: Fit and plot the optimal tree ---
best_depth = best['max_depth']
optimal_tree = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
optimal_tree.fit(df, y)

print(f"\nOptimal tree: {optimal_tree.get_n_leaves()} terminal nodes")

fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(optimal_tree, feature_names=feature_names, class_names=['No', 'Yes'],
          filled=True, rounded=True, ax=ax, fontsize=9,
          max_depth=4)  # show up to depth 4 for readability
ax.set_title(f"Pruned Decision Tree (max_depth={best['label']})")
plt.tight_layout()
plt.show()

# --- Part 4: Top 3 splitting variables ---
importances = pd.Series(optimal_tree.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

print("\nFeature Importances:")
print(importances.to_string())

top3 = importances.head(3).index.tolist()
print(f"\nTop 3 splitting variables: {top3}")

# Plot feature importances
fig, ax = plt.subplots(figsize=(8, 5))
importances.sort_values(ascending=True).plot(kind='barh', ax=ax, color='#2E86AB')
ax.set_xlabel("Feature Importance (Gini)")
ax.set_title("Decision Tree Feature Importance")
plt.tight_layout()
plt.show()

print("\nClinical interpretation:")
print("- These variables capture patient acuity and complexity.")
print("- Discharge lab values reflect the patient's clinical status at discharge.")
print("- Age, comorbidity count, and prior admissions reflect disease burden.")
print("- These are well-established readmission risk factors in the literature.")

D.5.2 Exercise 2

Show R solution

# =============================================================================
# Chapter 8, Exercise 2: Random Forest vs XGBoost Tuning Challenge
# 1. Split data 80/20. 2. Tune RF and XGBoost with 5-fold CV.
# 3. Select best hyperparameters. 4. Evaluate on test set.
# 5. Create variable importance plots.
# =============================================================================

library(tidyverse)
library(tidymodels)
library(vip)

# --- Simulate the readmission dataset (same as chapter) ---
set.seed(42)
n <- 1000

readmit_data <- tibble(
  age = rnorm(n, 68, 12),
  length_of_stay = rpois(n, 5) + 1,
  num_comorbidities = rpois(n, 3),
  prior_admissions = rpois(n, 1),
  discharge_hemoglobin = rnorm(n, 11, 2),
  discharge_creatinine = rlnorm(n, 0.2, 0.5),
  has_diabetes = rbinom(n, 1, 0.35),
  has_chf = rbinom(n, 1, 0.25),
  readmitted = factor(
    rbinom(n, 1, plogis(-3 + 0.02 * (rnorm(n, 68, 12) - 68) +
                          0.15 * rpois(n, 1) +
                          0.1 * rpois(n, 3) +
                          0.3 * rbinom(n, 1, 0.25) -
                          0.1 * rnorm(n, 11, 2))),
    labels = c("No", "Yes")
  )
)

# --- Part 1: Train/test split ---
set.seed(42)
split <- initial_split(readmit_data, prop = 0.8, strata = readmitted)
train <- training(split)
test  <- testing(split)

cat("Training set size:", nrow(train), "\n")
cat("Test set size:", nrow(test), "\n")

# 5-fold CV on training set
folds <- vfold_cv(train, v = 5, strata = readmitted)

# Recipe (shared)
base_recipe <- recipe(readmitted ~ ., data = train)

# --- Part 2a: Tune Random Forest ---
rf_spec <- rand_forest(
  trees = 500,
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

rf_wf <- workflow() %>%
  add_model(rf_spec) %>%
  add_recipe(base_recipe)

rf_grid <- grid_regular(
  mtry(range = c(2, 7)),
  min_n(range = c(5, 30)),
  levels = 5
)

set.seed(42)
rf_tune <- tune_grid(rf_wf, resamples = folds, grid = rf_grid,
                     metrics = metric_set(roc_auc))

cat("\n--- Random Forest Tuning Results (Top 5) ---\n")
print(show_best(rf_tune, metric = "roc_auc", n = 5))

best_rf <- select_best(rf_tune, metric = "roc_auc")
cat("\nBest RF params - mtry:", best_rf$mtry, "min_n:", best_rf$min_n, "\n")

# --- Part 2b: Tune XGBoost ---
xgb_spec <- boost_tree(
  trees = 500,
  tree_depth = tune(),
  learn_rate = tune(),
  min_n = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xgb_wf <- workflow() %>%
  add_model(xgb_spec) %>%
  add_recipe(base_recipe)

xgb_grid <- grid_regular(
  tree_depth(range = c(2, 6)),
  learn_rate(range = c(-3, -1)),   # log10 scale: 0.001 to 0.1
  min_n(range = c(5, 20)),
  levels = 4
)

set.seed(42)
xgb_tune <- tune_grid(xgb_wf, resamples = folds, grid = xgb_grid,
                      metrics = metric_set(roc_auc))

cat("\n--- XGBoost Tuning Results (Top 5) ---\n")
print(show_best(xgb_tune, metric = "roc_auc", n = 5))

best_xgb <- select_best(xgb_tune, metric = "roc_auc")
cat("\nBest XGB params - depth:", best_xgb$tree_depth,
    "learn_rate:", best_xgb$learn_rate,
    "min_n:", best_xgb$min_n, "\n")

# --- Part 3: Finalize and fit on full training set ---
final_rf_wf <- finalize_workflow(rf_wf, best_rf)
final_xgb_wf <- finalize_workflow(xgb_wf, best_xgb)

rf_final_fit <- fit(final_rf_wf, data = train)
xgb_final_fit <- fit(final_xgb_wf, data = train)

# --- Part 4: Evaluate on test set ---
rf_test_pred  <- predict(rf_final_fit, test, type = "prob") %>%
  bind_cols(test %>% select(readmitted))
xgb_test_pred <- predict(xgb_final_fit, test, type = "prob") %>%
  bind_cols(test %>% select(readmitted))

rf_auc  <- roc_auc(rf_test_pred, truth = readmitted, .pred_Yes)
xgb_auc <- roc_auc(xgb_test_pred, truth = readmitted, .pred_Yes)

cat("\n=== Test Set Performance ===\n")
cat("Random Forest AUC:", rf_auc$.estimate, "\n")
cat("XGBoost AUC:      ", xgb_auc$.estimate, "\n")

if (xgb_auc$.estimate > rf_auc$.estimate) {
  cat("\nXGBoost performs better on the test set.\n")
} else {
  cat("\nRandom Forest performs better on the test set.\n")
}

# --- Part 5: Variable importance plots ---
# Random Forest
rf_vi <- rf_final_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 8) +
  ggtitle("Random Forest Variable Importance")
print(rf_vi)

# XGBoost
xgb_vi <- xgb_final_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 8) +
  ggtitle("XGBoost Variable Importance")
print(xgb_vi)

cat("\nBoth models should generally agree on the most important features,\n")
cat("though rankings may differ. Continuous variables with more possible\n")
cat("split points (e.g., age, creatinine) often rank higher in tree-based\n")
cat("importance measures than binary variables (e.g., has_chf).\n")

Show Python solution

# =============================================================================
# Chapter 8, Exercise 2: Random Forest vs XGBoost Tuning Challenge
# 1. Split data 80/20. 2. Tune RF and XGBoost with 5-fold CV.
# 3. Select best hyperparameters. 4. Evaluate on test set.
# 5. Create variable importance plots.
# =============================================================================

import numpy as np
import pandas as pd
from sklearn.model_selection import (train_test_split, RandomizedSearchCV,
                                      StratifiedKFold)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
import xgboost as xgb
import matplotlib.pyplot as plt

# --- Simulate the readmission dataset (same as chapter) ---
np.random.seed(42)
n = 1000

df = pd.DataFrame({
    'age': np.random.normal(68, 12, n),
    'length_of_stay': np.random.poisson(5, n) + 1,
    'num_comorbidities': np.random.poisson(3, n),
    'prior_admissions': np.random.poisson(1, n),
    'discharge_hemoglobin': np.random.normal(11, 2, n),
    'discharge_creatinine': np.random.lognormal(0.2, 0.5, n),
    'has_diabetes': np.random.binomial(1, 0.35, n),
    'has_chf': np.random.binomial(1, 0.25, n)
})
y = np.random.binomial(1, 0.18, n)

feature_names = list(df.columns)

# --- Part 1: Train/test split ---
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)
print(f"Training set: {X_train.shape[0]} | Test set: {X_test.shape[0]}")

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# --- Part 2a: Tune Random Forest ---
rf_param_dist = {
    'n_estimators': [200, 500],
    'max_features': ['sqrt', 'log2', 3, 5, 7],
    'min_samples_leaf': [5, 10, 15, 20, 30],
    'max_depth': [5, 10, 15, 20, None]
}

rf = RandomForestClassifier(random_state=42, n_jobs=-1)

rf_search = RandomizedSearchCV(
    rf, rf_param_dist, n_iter=30, cv=cv, scoring='roc_auc',
    random_state=42, n_jobs=-1, verbose=0
)
rf_search.fit(X_train, y_train)

print(f"\n--- Random Forest Tuning ---")
print(f"Best CV AUC: {rf_search.best_score_:.3f}")
print(f"Best params: {rf_search.best_params_}")

# --- Part 2b: Tune XGBoost ---
xgb_param_dist = {
    'n_estimators': [200, 500, 1000],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [2, 3, 4, 5, 6],
    'subsample': [0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.7, 0.8, 0.9, 1.0],
    'min_child_weight': [1, 5, 10]
}

xgb_model = xgb.XGBClassifier(
    random_state=42, use_label_encoder=False, eval_metric='logloss'
)

xgb_search = RandomizedSearchCV(
    xgb_model, xgb_param_dist, n_iter=30, cv=cv, scoring='roc_auc',
    random_state=42, n_jobs=-1, verbose=0
)
xgb_search.fit(X_train, y_train)

print(f"\n--- XGBoost Tuning ---")
print(f"Best CV AUC: {xgb_search.best_score_:.3f}")
print(f"Best params: {xgb_search.best_params_}")

# --- Part 4: Evaluate on test set ---
rf_best = rf_search.best_estimator_
xgb_best = xgb_search.best_estimator_

rf_test_probs = rf_best.predict_proba(X_test)[:, 1]
xgb_test_probs = xgb_best.predict_proba(X_test)[:, 1]

rf_test_auc = roc_auc_score(y_test, rf_test_probs)
xgb_test_auc = roc_auc_score(y_test, xgb_test_probs)

print(f"\n=== Test Set Performance ===")
print(f"Random Forest AUC: {rf_test_auc:.3f}")
print(f"XGBoost AUC:       {xgb_test_auc:.3f}")

if xgb_test_auc > rf_test_auc:
    print("\nXGBoost performs better on the test set.")
else:
    print("\nRandom Forest performs better on the test set.")

# --- Part 5: Variable importance plots ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Random Forest importance
rf_imp = pd.Series(rf_best.feature_importances_, index=feature_names)
rf_imp = rf_imp.sort_values(ascending=True)
rf_imp.plot(kind='barh', ax=axes[0], color='#2E86AB')
axes[0].set_xlabel("Feature Importance (Gini)")
axes[0].set_title("Random Forest Variable Importance")

# XGBoost importance
xgb_imp = pd.Series(xgb_best.feature_importances_, index=feature_names)
xgb_imp = xgb_imp.sort_values(ascending=True)
xgb_imp.plot(kind='barh', ax=axes[1], color='#D55E00')
axes[1].set_xlabel("Feature Importance (Gain)")
axes[1].set_title("XGBoost Variable Importance")

plt.tight_layout()
plt.show()

# Compare top features
print("\nRF top 3 features:", rf_imp.sort_values(ascending=False).head(3).index.tolist())
print("XGB top 3 features:", xgb_imp.sort_values(ascending=False).head(3).index.tolist())
print("\nBoth models should generally agree on the most important features,")
print("though rankings may differ due to how each algorithm uses features.")

D.5.3 Exercise 3

Show R solution

# =============================================================================
# Chapter 8, Exercise 3: Interpreting a Clinical Prediction Model
# Conceptual exercise: write answers as comments.
# =============================================================================

# =============================================================================
# Part 1: Non-technical summary for a hospital quality committee
# =============================================================================
#
# "We developed a computer-based prediction model that estimates each patient's
# risk of being readmitted to the hospital within 30 days of discharge. The
# model uses information routinely collected during hospitalisation -- such as
# lab values, age, prior admissions, and existing medical conditions -- to
# assign each patient a risk score between 0% and 100%. In cross-validated
# testing, the model correctly distinguished between patients who were and
# were not readmitted approximately [AUC]% of the time. This tool could help
# care teams focus discharge planning resources on the patients at highest
# risk, potentially reducing readmission rates and associated penalties."

# =============================================================================
# Part 2: Explaining variable importance to a clinical audience
# =============================================================================
#
# "The model identified 'number of prior admissions' and 'discharge creatinine'
# as the two variables that contribute most to the model's predictions. This
# means that knowing these values provides the most useful information for
# distinguishing patients who will be readmitted from those who will not.
#
# WHAT THIS MEANS:
# - Patients with more prior admissions tend to have higher predicted risk.
# - Patients with elevated discharge creatinine (indicating impaired kidney
#   function) also tend to have higher predicted risk.
# - These findings align with clinical intuition: patients with a history of
#   recurrent hospitalisations and those with renal impairment are known to
#   be at elevated risk.
#
# WHAT THIS DOES NOT MEAN:
# - Variable importance does NOT imply causation. We cannot say that reducing
#   creatinine at discharge will reduce readmission risk. The model identifies
#   associations, not causes.
# - A variable with high importance may be a proxy for something else. For
#   example, 'prior admissions' may reflect underlying disease severity,
#   social determinants, or healthcare access patterns rather than being
#   a direct cause of readmission.
# - We should not intervene on these variables based on importance alone.
#   Clinical trials or causal inference methods would be needed to establish
#   whether modifying these factors actually changes outcomes."

# =============================================================================
# Part 3: Validation strategy before clinical deployment
# =============================================================================
#
# PROPOSED VALIDATION PLAN:
#
# 1. Temporal validation: Test the model on data from a time period AFTER the
#    training data (e.g., train on 2020-2022, validate on 2023-2024). This
#    tests whether the model's performance holds over time, as clinical
#    practices and patient populations may shift.
#
# 2. External validation: Apply the model to data from a different hospital
#    system. A model developed at one institution may not generalise to
#    another due to differences in patient demographics, coding practices,
#    discharge protocols, and local disease patterns.
#
# 3. Subgroup analysis: Evaluate performance across key demographic groups
#    (age, sex, race/ethnicity, insurance status). A model that performs
#    well overall but poorly for specific populations could worsen existing
#    health disparities.
#
# 4. Calibration assessment: Verify that predicted probabilities match
#    observed readmission rates. A model that says "30% risk" should be
#    right about 30% of the time across all risk levels.
#
# 5. Prospective pilot: Before full deployment, run the model in parallel
#    alongside current practice (silent mode) to monitor performance in
#    real-time without affecting clinical decisions.
#
# RISKS OF SKIPPING EXTERNAL VALIDATION:
# - The model may be overfit to idiosyncrasies of the development data
#   (specific EMR system, local coding conventions, patient mix).
# - Performance reported from internal validation (even cross-validation)
#   tends to be optimistically biased.
# - Deploying an unvalidated model could misallocate resources, either
#   missing high-risk patients (false negatives) or overwhelming care
#   teams with false alarms (false positives).
# - Regulatory and ethical risks: deploying a model without adequate
#   validation may violate institutional policies and could cause patient
#   harm.

cat("This exercise is conceptual. See the comments in this file for the\n")
cat("complete answers to all three parts.\n")

Show Python solution

# =============================================================================
# Chapter 8, Exercise 3: Interpreting a Clinical Prediction Model
# Conceptual exercise: write answers as comments.
# =============================================================================

# =============================================================================
# Part 1: Non-technical summary for a hospital quality committee
# =============================================================================
#
# "We developed a computer-based prediction model that estimates each patient's
# risk of being readmitted to the hospital within 30 days of discharge. The
# model uses information routinely collected during hospitalisation -- such as
# lab values, age, prior admissions, and existing medical conditions -- to
# assign each patient a risk score between 0% and 100%. In cross-validated
# testing, the model correctly distinguished between patients who were and
# were not readmitted approximately [AUC]% of the time. This tool could help
# care teams focus discharge planning resources on the patients at highest
# risk, potentially reducing readmission rates and associated penalties."

# =============================================================================
# Part 2: Explaining variable importance to a clinical audience
# =============================================================================
#
# "The model identified 'number of prior admissions' and 'discharge creatinine'
# as the two variables that contribute most to the model's predictions. This
# means that knowing these values provides the most useful information for
# distinguishing patients who will be readmitted from those who will not.
#
# WHAT THIS MEANS:
# - Patients with more prior admissions tend to have higher predicted risk.
# - Patients with elevated discharge creatinine (indicating impaired kidney
#   function) also tend to have higher predicted risk.
# - These findings align with clinical intuition: patients with a history of
#   recurrent hospitalisations and those with renal impairment are known to
#   be at elevated risk.
#
# WHAT THIS DOES NOT MEAN:
# - Variable importance does NOT imply causation. We cannot say that reducing
#   creatinine at discharge will reduce readmission risk. The model identifies
#   associations, not causes.
# - A variable with high importance may be a proxy for something else. For
#   example, 'prior admissions' may reflect underlying disease severity,
#   social determinants, or healthcare access patterns rather than being
#   a direct cause of readmission.
# - We should not intervene on these variables based on importance alone.
#   Clinical trials or causal inference methods would be needed to establish
#   whether modifying these factors actually changes outcomes."

# =============================================================================
# Part 3: Validation strategy before clinical deployment
# =============================================================================
#
# PROPOSED VALIDATION PLAN:
#
# 1. Temporal validation: Test the model on data from a time period AFTER the
#    training data (e.g., train on 2020-2022, validate on 2023-2024). This
#    tests whether the model's performance holds over time.
#
# 2. External validation: Apply the model to data from a different hospital
#    system. A model developed at one institution may not generalise to
#    another due to differences in patient demographics, coding practices,
#    discharge protocols, and local disease patterns.
#
# 3. Subgroup analysis: Evaluate performance across key demographic groups
#    (age, sex, race/ethnicity, insurance status). A model that performs
#    well overall but poorly for specific populations could worsen existing
#    health disparities.
#
# 4. Calibration assessment: Verify that predicted probabilities match
#    observed readmission rates. A model that says "30% risk" should be
#    right about 30% of the time across all risk levels.
#
# 5. Prospective pilot: Before full deployment, run the model in parallel
#    alongside current practice (silent mode) to monitor performance in
#    real-time without affecting clinical decisions.
#
# RISKS OF SKIPPING EXTERNAL VALIDATION:
# - The model may be overfit to idiosyncrasies of the development data.
# - Performance from internal validation tends to be optimistically biased.
# - Deploying an unvalidated model could misallocate resources, either
#   missing high-risk patients or overwhelming care teams with false alarms.
# - Regulatory and ethical risks: deploying without adequate validation may
#   violate institutional policies and could cause patient harm.

print("This exercise is conceptual. See the comments in this file for the")
print("complete answers to all three parts.")

D.6 Chapter 08b: Introduction to Deep Learning

D.6.1 Exercise 1

Show R solution

# =============================================================================
# Chapter 8b, Exercise 1: Neural Network vs XGBoost on Tabular Data
# Fit both a neural network and XGBoost on the readmission dataset.
# Compare 5-fold cross-validated AUC.
# =============================================================================

library(tidyverse)
library(tidymodels)
library(keras3)

# --- Simulate readmission data (same as chapter) ---
set.seed(42)
n <- 1000

readmit_data <- tibble(
  age = rnorm(n, 68, 12),
  length_of_stay = rpois(n, 5) + 1,
  num_comorbidities = rpois(n, 3),
  prior_admissions = rpois(n, 1),
  discharge_hgb = rnorm(n, 11, 2),
  discharge_creatinine = rlnorm(n, 0.2, 0.5),
  has_diabetes = rbinom(n, 1, 0.35),
  has_chf = rbinom(n, 1, 0.25)
)

readmit_prob <- plogis(-3 + 0.02 * (readmit_data$age - 68) +
                         0.15 * readmit_data$prior_admissions +
                         0.1 * readmit_data$num_comorbidities +
                         0.3 * readmit_data$has_chf -
                         0.1 * readmit_data$discharge_hgb)
readmit_data$readmitted <- factor(rbinom(n, 1, readmit_prob),
                                   labels = c("No", "Yes"))

cat("Readmission rate:", mean(readmit_data$readmitted == "Yes"), "\n")

# --- XGBoost with 5-fold CV (using tidymodels) ---
set.seed(42)
folds <- vfold_cv(readmit_data, v = 5, strata = readmitted)

xgb_spec <- boost_tree(trees = 500, tree_depth = 4, learn_rate = 0.05,
                        min_n = 10) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xgb_wf <- workflow() %>%
  add_model(xgb_spec) %>%
  add_recipe(recipe(readmitted ~ ., data = readmit_data))

xgb_res <- fit_resamples(xgb_wf, resamples = folds,
                         metrics = metric_set(roc_auc))

xgb_metrics <- collect_metrics(xgb_res)
cat("\nXGBoost CV AUC:", xgb_metrics$mean, "+/-", xgb_metrics$std_err, "\n")

# --- Neural Network with 5-fold CV (manual loop) ---
x_all <- readmit_data %>% select(-readmitted) %>% as.matrix()
y_all <- as.numeric(readmit_data$readmitted == "Yes")

# Standardize features
x_mean <- apply(x_all, 2, mean)
x_sd   <- apply(x_all, 2, sd)
x_scaled <- scale(x_all, center = x_mean, scale = x_sd)

set.seed(42)
fold_ids <- vfold_cv(readmit_data, v = 5, strata = readmitted)
nn_aucs <- numeric(5)

for (i in seq_len(5)) {
  # Get train/validation indices
  train_idx <- fold_ids$splits[[i]] %>% analysis() %>% rownames() %>% as.integer()
  val_idx   <- fold_ids$splits[[i]] %>% assessment() %>% rownames() %>% as.integer()

  x_train <- x_scaled[train_idx, ]
  y_train <- y_all[train_idx]
  x_val   <- x_scaled[val_idx, ]
  y_val   <- y_all[val_idx]

  # Build neural network
  model <- keras_model_sequential(input_shape = ncol(x_train)) %>%
    layer_dense(units = 32, activation = "relu") %>%
    layer_dropout(rate = 0.3) %>%
    layer_dense(units = 16, activation = "relu") %>%
    layer_dropout(rate = 0.3) %>%
    layer_dense(units = 1, activation = "sigmoid")

  model %>% compile(
    optimizer = optimizer_adam(learning_rate = 0.001),
    loss = "binary_crossentropy",
    metrics = "AUC"
  )

  # Train with early stopping
  history <- model %>% fit(
    x_train, y_train,
    epochs = 50,
    batch_size = 32,
    validation_data = list(x_val, y_val),
    callbacks = list(
      callback_early_stopping(patience = 5, restore_best_weights = TRUE)
    ),
    verbose = 0
  )

  # Evaluate
  results <- model %>% evaluate(x_val, y_val, verbose = 0)
  nn_aucs[i] <- results[[2]]  # AUC metric
  cat(sprintf("  Fold %d: NN AUC = %.3f\n", i, nn_aucs[i]))
}

cat("\nNeural Network CV AUC:", mean(nn_aucs), "+/-", sd(nn_aucs) / sqrt(5), "\n")

# --- Comparison ---
cat("\n=== Comparison ===\n")
cat("XGBoost CV AUC:        ", round(xgb_metrics$mean, 3), "\n")
cat("Neural Network CV AUC: ", round(mean(nn_aucs), 3), "\n")

cat("\nInterpretation:\n")
cat("XGBoost typically matches or outperforms neural networks on tabular\n")
cat("clinical data. This is expected -- the Grinsztajn et al. (2022)\n")
cat("NeurIPS benchmark showed that tree-based models consistently\n")
cat("outperform neural networks on typical tabular datasets. Deep learning\n")
cat("excels on images, text, and sequences, not spreadsheets.\n")

Show Python solution

# =============================================================================
# Chapter 8b, Exercise 1: Neural Network vs XGBoost on Tabular Data
# Fit both a neural network and XGBoost on the readmission dataset.
# Compare 5-fold cross-validated AUC.
# =============================================================================

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import xgboost as xgb
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

# --- Simulate readmission data (same as chapter) ---
np.random.seed(42)
n = 1000

age = np.random.normal(68, 12, n)
length_of_stay = np.random.poisson(5, n) + 1
num_comorbidities = np.random.poisson(3, n)
prior_admissions = np.random.poisson(1, n)
discharge_hgb = np.random.normal(11, 2, n)
discharge_creatinine = np.random.lognormal(0.2, 0.5, n)
has_diabetes = np.random.binomial(1, 0.35, n)
has_chf = np.random.binomial(1, 0.25, n)

X = np.column_stack([age, length_of_stay, num_comorbidities,
                     prior_admissions, discharge_hgb,
                     discharge_creatinine, has_diabetes, has_chf])

# Generate outcome with known logistic relationship
prob = 1 / (1 + np.exp(-(-3 + 0.02 * (age - 68) +
                          0.15 * prior_admissions +
                          0.1 * num_comorbidities +
                          0.3 * has_chf -
                          0.1 * discharge_hgb)))
y = np.random.binomial(1, prob)
print(f"Readmission rate: {y.mean():.3f}")

# --- XGBoost with 5-fold CV ---
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

xgb_model = xgb.XGBClassifier(
    n_estimators=500, learning_rate=0.05, max_depth=4,
    subsample=0.8, colsample_bytree=0.8,
    random_state=42, use_label_encoder=False, eval_metric='logloss'
)

xgb_scores = cross_val_score(xgb_model, X, y, cv=cv, scoring='roc_auc')
print(f"\nXGBoost CV AUC: {xgb_scores.mean():.3f} (+/- {xgb_scores.std():.3f})")

# --- Neural Network with 5-fold CV (manual loop) ---
nn_aucs = []
fold_idx = 0

for train_idx, val_idx in cv.split(X, y):
    fold_idx += 1

    # Split and scale
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X[train_idx])
    X_val = scaler.transform(X[val_idx])
    y_train = y[train_idx]
    y_val = y[val_idx]

    # Build neural network (same architecture as chapter)
    tf.random.set_seed(42)
    model = models.Sequential([
        layers.Dense(32, activation="relu", input_shape=(X_train.shape[1],)),
        layers.Dropout(0.3),
        layers.Dense(16, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(1, activation="sigmoid")
    ])

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC(name="auc")]
    )

    # Train with early stopping
    history = model.fit(
        X_train, y_train,
        epochs=50,
        batch_size=32,
        validation_data=(X_val, y_val),
        callbacks=[
            callbacks.EarlyStopping(patience=5, restore_best_weights=True)
        ],
        verbose=0
    )

    # Evaluate
    y_pred_prob = model.predict(X_val, verbose=0).ravel()
    fold_auc = roc_auc_score(y_val, y_pred_prob)
    nn_aucs.append(fold_auc)
    print(f"  Fold {fold_idx}: NN AUC = {fold_auc:.3f}")

nn_mean_auc = np.mean(nn_aucs)
nn_std_auc = np.std(nn_aucs)
print(f"\nNeural Network CV AUC: {nn_mean_auc:.3f} (+/- {nn_std_auc:.3f})")

# --- Comparison ---
print(f"\n=== Comparison ===")
print(f"XGBoost CV AUC:        {xgb_scores.mean():.3f}")
print(f"Neural Network CV AUC: {nn_mean_auc:.3f}")

print("\nInterpretation:")
print("XGBoost typically matches or outperforms neural networks on tabular")
print("clinical data. This is expected -- the Grinsztajn et al. (2022)")
print("NeurIPS benchmark showed that tree-based models consistently")
print("outperform neural networks on typical tabular datasets. Deep learning")
print("excels on images, text, and sequences, not spreadsheets.")

D.6.2 Exercise 2

Show R solution

# =============================================================================
# Chapter 8b, Exercise 2: Architecture Matching
# For each clinical task, identify the best DL architecture, whether DL
# is likely to outperform gradient-boosted trees, and the reporting guideline.
# =============================================================================

# =============================================================================
# Task 1: Predicting 30-day mortality from 15 structured EHR variables
# =============================================================================
#
# (a) Architecture: Feedforward neural network (multi-layer perceptron) or,
#     better yet, gradient-boosted trees (XGBoost/LightGBM). With only 15
#     structured variables, a simple architecture suffices.
#
# (b) DL likely to outperform GBT? NO. This is classic tabular data with a
#     modest number of features. The Grinsztajn et al. (2022) NeurIPS
#     benchmark and subsequent clinical benchmarks consistently show that
#     tree-based models match or outperform neural networks on tabular data.
#     Logistic regression or XGBoost is the appropriate starting point.
#
# (c) Reporting guideline: TRIPOD+AI (Collins et al., BMJ 2024). This is a
#     standard clinical prediction model using structured data.

# =============================================================================
# Task 2: Classifying skin lesions as benign/malignant from dermoscopy images
# =============================================================================
#
# (a) Architecture: Convolutional Neural Network (CNN), specifically a
#     pretrained model like ResNet, EfficientNet, or Inception fine-tuned
#     on the dermoscopy images. A domain-specific foundation model could
#     also be used if available.
#
# (b) DL likely to outperform GBT? YES. Image data contains spatial
#     structure (edges, textures, shapes) that CNNs are specifically designed
#     to exploit. GBTs cannot process raw images and would require manual
#     feature extraction, which is inferior to learned CNN features. Esteva
#     et al. (Nature 2017) demonstrated dermatologist-level performance.
#
# (c) Reporting guideline: CLAIM (Checklist for AI in Medical Imaging,
#     Mongan et al., Radiology: AI 2020), supplemented by TRIPOD+AI.

# =============================================================================
# Task 3: Detecting atrial fibrillation from 12-lead ECG tracings
# =============================================================================
#
# (a) Architecture: Transformer-based model or temporal CNN. The chapter
#     notes that transformers are the current preferred architecture for ECG
#     data. Medformer (NeurIPS 2024) is specifically designed for medical
#     time series classification.
#
# (b) DL likely to outperform GBT? YES. ECG data is sequential with complex
#     temporal patterns. GBTs would require extensive manual feature
#     engineering (interval measurements, morphology features), whereas DL
#     can learn directly from the raw waveform. Hannun et al. (Nature
#     Medicine 2019) demonstrated cardiologist-level arrhythmia detection.
#
# (c) Reporting guideline: TRIPOD+AI for the prediction model, potentially
#     supplemented by CLAIM if imaging is involved (e.g., ECG images rather
#     than signal data).

# =============================================================================
# Task 4: Extracting medication names from unstructured discharge summaries
# =============================================================================
#
# (a) Architecture: Transformer-based language model, such as ClinicalBERT
#     or PubMedBERT for named entity recognition (NER). These pretrained
#     models understand medical vocabulary and can be fine-tuned for NER.
#
# (b) DL likely to outperform GBT? YES, decisively. This is a natural
#     language processing task. GBTs cannot process raw text meaningfully.
#     Transformer-based models understand context, synonyms, and medical
#     abbreviations. Rule-based and dictionary approaches are alternatives,
#     but modern NER with transformers is superior.
#
# (c) Reporting guideline: TRIPOD+AI. No specific imaging guideline applies.
#     MINIMAR (Hernandez-Boussard et al., JAMIA 2020) may also be relevant
#     as a minimum reporting standard.

# =============================================================================
# Task 5: Predicting length of stay from structured EHR + chest X-ray
# =============================================================================
#
# (a) Architecture: Multimodal fusion model. A CNN (e.g., pretrained
#     ResNet) processes the chest X-ray to extract image features. These
#     are concatenated with the structured EHR features and fed into a
#     combined prediction head (either a feedforward network or GBT on
#     the fused features).
#
# (b) DL likely to outperform GBT? PARTIALLY. The DL component is necessary
#     for the image. For the structured data alone, GBTs may be equal or
#     better. The optimal approach may be a hybrid: use a CNN to extract
#     image features, then combine those features with structured data in
#     a GBT. Recent work on multimodal fusion suggests this hybrid approach
#     can outperform either modality alone.
#
# (c) Reporting guideline: TRIPOD+AI for the overall prediction model,
#     supplemented by CLAIM for the imaging component. Both should be
#     addressed since the model involves medical imaging.

cat("This exercise is conceptual. See the comments in this file for the\n")
cat("complete answers to all five clinical tasks.\n")

Show Python solution

# =============================================================================
# Chapter 8b, Exercise 2: Architecture Matching
# For each clinical task, identify the best DL architecture, whether DL
# is likely to outperform gradient-boosted trees, and the reporting guideline.
# =============================================================================

# =============================================================================
# Task 1: Predicting 30-day mortality from 15 structured EHR variables
# =============================================================================
#
# (a) Architecture: Feedforward neural network (MLP) or, better yet,
#     gradient-boosted trees (XGBoost/LightGBM). With only 15 structured
#     variables, a simple architecture suffices.
#
# (b) DL likely to outperform GBT? NO. This is classic tabular data with a
#     modest number of features. The Grinsztajn et al. (2022) NeurIPS
#     benchmark and subsequent clinical benchmarks consistently show that
#     tree-based models match or outperform neural networks on tabular data.
#
# (c) Reporting guideline: TRIPOD+AI (Collins et al., BMJ 2024). Standard
#     clinical prediction model using structured data.

# =============================================================================
# Task 2: Classifying skin lesions as benign/malignant from dermoscopy images
# =============================================================================
#
# (a) Architecture: CNN (e.g., pretrained ResNet, EfficientNet, or Inception
#     fine-tuned on dermoscopy images). A domain-specific foundation model
#     could also be used if available.
#
# (b) DL likely to outperform GBT? YES. Image data contains spatial
#     structure that CNNs exploit. GBTs cannot process raw images. Esteva
#     et al. (Nature 2017) demonstrated dermatologist-level performance.
#
# (c) Reporting guideline: CLAIM (Mongan et al., Radiology: AI 2020),
#     supplemented by TRIPOD+AI.

# =============================================================================
# Task 3: Detecting atrial fibrillation from 12-lead ECG tracings
# =============================================================================
#
# (a) Architecture: Transformer-based model or temporal CNN. Transformers
#     are the current preferred architecture for ECG data. Medformer
#     (NeurIPS 2024) is designed for medical time series classification.
#
# (b) DL likely to outperform GBT? YES. ECG data is sequential with complex
#     temporal patterns. DL can learn directly from raw waveforms, whereas
#     GBTs require extensive manual feature engineering.
#
# (c) Reporting guideline: TRIPOD+AI, potentially supplemented by CLAIM if
#     ECG images rather than signal data are used.

# =============================================================================
# Task 4: Extracting medication names from unstructured discharge summaries
# =============================================================================
#
# (a) Architecture: Transformer-based language model (ClinicalBERT or
#     PubMedBERT) for named entity recognition (NER). These models
#     understand medical vocabulary and context.
#
# (b) DL likely to outperform GBT? YES, decisively. This is an NLP task.
#     GBTs cannot process raw text meaningfully. Transformer-based NER
#     models understand context, synonyms, and medical abbreviations.
#
# (c) Reporting guideline: TRIPOD+AI. MINIMAR (Hernandez-Boussard et al.,
#     JAMIA 2020) also applicable as minimum reporting standard.

# =============================================================================
# Task 5: Predicting length of stay from structured EHR + chest X-ray
# =============================================================================
#
# (a) Architecture: Multimodal fusion model. A CNN (pretrained ResNet)
#     extracts image features from the chest X-ray. These are concatenated
#     with structured EHR features and fed into a combined prediction head.
#
# (b) DL likely to outperform GBT? PARTIALLY. DL is necessary for the image
#     component. For structured data alone, GBTs may be equal or better.
#     A hybrid approach (CNN for image features, then GBT on fused features)
#     may be optimal.
#
# (c) Reporting guideline: TRIPOD+AI for the prediction model, supplemented
#     by CLAIM for the imaging component.

print("This exercise is conceptual. See the comments in this file for the")
print("complete answers to all five clinical tasks.")

D.6.3 Exercise 3

Show R solution

# =============================================================================
# Chapter 8b, Exercise 3: Critical Appraisal of a Deep Learning Study
# Evaluate a DL paper against the CLAIM or TRIPOD+AI checklist.
# This is a conceptual/guided exercise -- the template below provides
# the framework for appraising any DL study.
# =============================================================================

# =============================================================================
# INSTRUCTIONS:
# Find a recent (2024 or later) paper applying deep learning to a clinical
# task in your area of interest. Use this template to evaluate it.
# =============================================================================

# =============================================================================
# Question 1: Was the model externally validated?
# =============================================================================
#
# Look for:
# - Was the model tested on data from a DIFFERENT institution, time period,
#   or geographic region than the training data?
# - If yes, how did external performance compare to internal (e.g., was there
#   a drop in AUC)?
#
# Example answer:
# "The model was validated on data from Hospital B after training on Hospital A.
#  Internal AUC was 0.92; external AUC dropped to 0.84 -- a 0.08 decrease.
#  This is consistent with the systematic review finding that 81% of DL models
#  show decreased accuracy on external datasets."
#
# Red flag: If NO external validation was performed, the results should be
# treated as preliminary. Internal CV alone is insufficient for clinical claims.

# =============================================================================
# Question 2: Were subgroup analyses reported?
# =============================================================================
#
# Look for:
# - Performance broken down by age, sex, race/ethnicity
# - Any mention of fairness or equity analysis
# - Performance in clinically important subgroups (e.g., patients with
#   comorbidities, different disease severity)
#
# Example answer:
# "The paper reported AUC by sex (male: 0.90, female: 0.87) but did not
#  report performance by race/ethnicity or age group. Given the known issue
#  of fairness non-transferability across sites (Nature Medicine 2024),
#  this is a significant omission."

# =============================================================================
# Question 3: Was the model compared to a simpler baseline?
# =============================================================================
#
# Look for:
# - Comparison against logistic regression, random forest, or XGBoost
# - If the DL model only marginally outperforms the baseline, the added
#   complexity may not be justified
#
# Example answer:
# "The paper compared DL (AUC 0.91) to logistic regression (AUC 0.85) and
#  random forest (AUC 0.88). The improvement over RF is modest (0.03),
#  raising questions about whether the DL model's added complexity and
#  reduced interpretability are justified."

# =============================================================================
# Question 4: Were training data, code, and model weights shared?
# =============================================================================
#
# Look for:
# - Public dataset or data sharing agreement
# - Code repository (GitHub, GitLab)
# - Pretrained model weights available for download
# - FAIR data principles
#
# Example answer:
# "Code was shared on GitHub. The training data is from a private hospital
#  system and is not publicly available, though the authors describe a
#  data sharing agreement. Model weights were not released."

# =============================================================================
# Question 5: How close is this model to clinical deployment?
# =============================================================================
#
# Consider:
# - Has it been externally validated across multiple sites?
# - Has it been tested prospectively (not just retrospectively)?
# - Has a clinical workflow been designed for how it would be used?
# - Has a regulatory pathway been identified (e.g., FDA 510(k), EU MDR)?
# - Has a monitoring plan for post-deployment performance been described?
# - What are the potential failure modes and harms?
#
# Example answer:
# "This model is at an early research stage. While the internal results are
#  promising, the model has been validated at only one external site with a
#  modest sample. Before clinical deployment, I would want to see:
#  (1) multi-site external validation,
#  (2) a prospective pilot study in clinical workflow,
#  (3) subgroup analysis across demographics,
#  (4) calibration assessment,
#  (5) a monitoring plan for detecting model drift over time."

cat("This exercise is a guided template for critical appraisal.\n")
cat("See the comments for the framework to evaluate any DL paper.\n")
cat("Students should find their own paper and fill in the answers.\n")

Show Python solution

# =============================================================================
# Chapter 8b, Exercise 3: Critical Appraisal of a Deep Learning Study
# Evaluate a DL paper against the CLAIM or TRIPOD+AI checklist.
# This is a conceptual/guided exercise -- the template below provides
# the framework for appraising any DL study.
# =============================================================================

# =============================================================================
# INSTRUCTIONS:
# Find a recent (2024 or later) paper applying deep learning to a clinical
# task in your area of interest. Use this template to evaluate it.
# =============================================================================

# =============================================================================
# Question 1: Was the model externally validated?
# =============================================================================
#
# Look for:
# - Was the model tested on data from a DIFFERENT institution, time period,
#   or geographic region than the training data?
# - If yes, how did external performance compare to internal (e.g., was there
#   a drop in AUC)?
#
# Example answer:
# "The model was validated on data from Hospital B after training on Hospital A.
#  Internal AUC was 0.92; external AUC dropped to 0.84 -- a 0.08 decrease.
#  This is consistent with the systematic review finding that 81% of DL models
#  show decreased accuracy on external datasets."
#
# Red flag: If NO external validation was performed, the results should be
# treated as preliminary.

# =============================================================================
# Question 2: Were subgroup analyses reported?
# =============================================================================
#
# Look for:
# - Performance broken down by age, sex, race/ethnicity
# - Any mention of fairness or equity analysis
# - Performance in clinically important subgroups
#
# Example answer:
# "The paper reported AUC by sex (male: 0.90, female: 0.87) but did not
#  report performance by race/ethnicity or age group. Given the known issue
#  of fairness non-transferability across sites, this is a significant
#  omission."

# =============================================================================
# Question 3: Was the model compared to a simpler baseline?
# =============================================================================
#
# Look for:
# - Comparison against logistic regression, random forest, or XGBoost
# - If the DL model only marginally outperforms the baseline, the added
#   complexity may not be justified
#
# Example answer:
# "The paper compared DL (AUC 0.91) to logistic regression (AUC 0.85) and
#  random forest (AUC 0.88). The improvement over RF is modest (0.03)."

# =============================================================================
# Question 4: Were training data, code, and model weights shared?
# =============================================================================
#
# Look for:
# - Public dataset or data sharing agreement
# - Code repository (GitHub, GitLab)
# - Pretrained model weights available
#
# Example answer:
# "Code was shared on GitHub. Training data is from a private hospital
#  system. Model weights were not released."

# =============================================================================
# Question 5: How close is this model to clinical deployment?
# =============================================================================
#
# Consider:
# - Has it been externally validated across multiple sites?
# - Has it been tested prospectively?
# - Has a clinical workflow been designed?
# - Has a regulatory pathway been identified?
# - Has a monitoring plan been described?
#
# Example answer:
# "This model is at an early research stage. Before clinical deployment,
#  I would want to see:
#  (1) multi-site external validation,
#  (2) a prospective pilot study,
#  (3) subgroup analysis across demographics,
#  (4) calibration assessment,
#  (5) a monitoring plan for model drift."

print("This exercise is a guided template for critical appraisal.")
print("See the comments for the framework to evaluate any DL paper.")
print("Students should find their own paper and fill in the answers.")

D.7 Chapter 09: Model Evaluation

D.7.1 Exercise 1

Show R solution

# =============================================================================
# Chapter 9, Exercise 1: Bayes' Theorem in Practice (PPV Calculations)
# Calculate PPV at different prevalence levels and plot the relationship.
# =============================================================================

library(tidyverse)

# --- PPV function using Bayes' theorem ---
# PPV = (Sensitivity * Prevalence) /
#       (Sensitivity * Prevalence + (1 - Specificity) * (1 - Prevalence))
calculate_ppv <- function(sensitivity, specificity, prevalence) {
  numerator <- sensitivity * prevalence
  denominator <- numerator + (1 - specificity) * (1 - prevalence)
  return(numerator / denominator)
}

# --- Calculate NPV ---
calculate_npv <- function(sensitivity, specificity, prevalence) {
  numerator <- specificity * (1 - prevalence)
  denominator <- numerator + (1 - sensitivity) * prevalence
  return(numerator / denominator)
}

# --- Example: HIV rapid test (sensitivity = 99.7%, specificity = 99.5%) ---

# In a general population (prevalence ~ 0.4%)
ppv_general <- calculate_ppv(0.997, 0.995, 0.004)
npv_general <- calculate_npv(0.997, 0.995, 0.004)
cat("=== HIV Rapid Test (Sens=99.7%, Spec=99.5%) ===\n\n")
cat("General population (prevalence 0.4%):\n")
cat("  PPV:", round(ppv_general * 100, 1), "%\n")
cat("  NPV:", round(npv_general * 100, 1), "%\n")

# In an STI clinic population (prevalence ~ 5%)
ppv_clinic <- calculate_ppv(0.997, 0.995, 0.05)
npv_clinic <- calculate_npv(0.997, 0.995, 0.05)
cat("\nSTI clinic (prevalence 5%):\n")
cat("  PPV:", round(ppv_clinic * 100, 1), "%\n")
cat("  NPV:", round(npv_clinic * 100, 1), "%\n")

# In a population with known exposure (prevalence ~ 30%)
ppv_exposed <- calculate_ppv(0.997, 0.995, 0.30)
npv_exposed <- calculate_npv(0.997, 0.995, 0.30)
cat("\nKnown exposure (prevalence 30%):\n")
cat("  PPV:", round(ppv_exposed * 100, 1), "%\n")
cat("  NPV:", round(npv_exposed * 100, 1), "%\n")

# --- Plot PPV across a range of prevalences ---
prevalences <- seq(0.001, 0.5, by = 0.001)
ppvs <- sapply(prevalences, function(p) calculate_ppv(0.997, 0.995, p))
npvs <- sapply(prevalences, function(p) calculate_npv(0.997, 0.995, p))

plot_df <- tibble(
  prevalence = rep(prevalences, 2),
  value = c(ppvs, npvs),
  metric = rep(c("PPV", "NPV"), each = length(prevalences))
)

ggplot(plot_df, aes(x = prevalence * 100, y = value * 100, color = metric)) +
  geom_line(linewidth = 1.5) +
  geom_hline(yintercept = 50, linetype = "dashed", color = "grey50") +
  scale_color_manual(values = c("PPV" = "steelblue", "NPV" = "darkorange")) +
  labs(x = "Prevalence (%)",
       y = "Predictive Value (%)",
       title = "PPV and NPV Depend Heavily on Prevalence",
       subtitle = "HIV rapid test: Sensitivity=99.7%, Specificity=99.5%",
       color = "Metric") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "top")

# --- Additional: Compare tests with different sensitivity/specificity ---
cat("\n\n=== Comparing Tests at 1% Prevalence ===\n")

tests <- tibble(
  Test = c("High Sens/Low Spec", "Balanced", "Low Sens/High Spec"),
  Sensitivity = c(0.99, 0.95, 0.80),
  Specificity = c(0.90, 0.95, 0.99)
)

for (i in seq_len(nrow(tests))) {
  ppv <- calculate_ppv(tests$Sensitivity[i], tests$Specificity[i], 0.01)
  npv <- calculate_npv(tests$Sensitivity[i], tests$Specificity[i], 0.01)
  cat(sprintf("%s (Sens=%.0f%%, Spec=%.0f%%): PPV=%.1f%%, NPV=%.1f%%\n",
              tests$Test[i], tests$Sensitivity[i]*100, tests$Specificity[i]*100,
              ppv*100, npv*100))
}

cat("\nKey takeaway: Even excellent tests have low PPV when prevalence is low.\n")
cat("This is why screening should target high-risk populations.\n")

Show Python solution

# =============================================================================
# Chapter 9, Exercise 1: Bayes' Theorem in Practice (PPV Calculations)
# Calculate PPV at different prevalence levels and plot the relationship.
# =============================================================================

import numpy as np
import matplotlib.pyplot as plt


# --- PPV function using Bayes' theorem ---
def calculate_ppv(sensitivity, specificity, prevalence):
    """PPV = (Sens * Prev) / (Sens * Prev + (1-Spec) * (1-Prev))"""
    numerator = sensitivity * prevalence
    denominator = numerator + (1 - specificity) * (1 - prevalence)
    return numerator / denominator


def calculate_npv(sensitivity, specificity, prevalence):
    """NPV = (Spec * (1-Prev)) / (Spec * (1-Prev) + (1-Sens) * Prev)"""
    numerator = specificity * (1 - prevalence)
    denominator = numerator + (1 - sensitivity) * prevalence
    return numerator / denominator


# --- Example: HIV rapid test (sensitivity = 99.7%, specificity = 99.5%) ---
print("=== HIV Rapid Test (Sens=99.7%, Spec=99.5%) ===\n")

# General population (prevalence ~ 0.4%)
ppv_general = calculate_ppv(0.997, 0.995, 0.004)
npv_general = calculate_npv(0.997, 0.995, 0.004)
print(f"General population (prevalence 0.4%):")
print(f"  PPV: {ppv_general*100:.1f}%")
print(f"  NPV: {npv_general*100:.1f}%")

# STI clinic (prevalence ~ 5%)
ppv_clinic = calculate_ppv(0.997, 0.995, 0.05)
npv_clinic = calculate_npv(0.997, 0.995, 0.05)
print(f"\nSTI clinic (prevalence 5%):")
print(f"  PPV: {ppv_clinic*100:.1f}%")
print(f"  NPV: {npv_clinic*100:.1f}%")

# Known exposure (prevalence ~ 30%)
ppv_exposed = calculate_ppv(0.997, 0.995, 0.30)
npv_exposed = calculate_npv(0.997, 0.995, 0.30)
print(f"\nKnown exposure (prevalence 30%):")
print(f"  PPV: {ppv_exposed*100:.1f}%")
print(f"  NPV: {npv_exposed*100:.1f}%")

# --- Plot PPV and NPV across a range of prevalences ---
prevalences = np.linspace(0.001, 0.5, 500)
ppvs = [calculate_ppv(0.997, 0.995, p) for p in prevalences]
npvs = [calculate_npv(0.997, 0.995, p) for p in prevalences]

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(prevalences * 100, np.array(ppvs) * 100, lw=2, color="steelblue",
        label="PPV")
ax.plot(prevalences * 100, np.array(npvs) * 100, lw=2, color="darkorange",
        label="NPV")
ax.axhline(y=50, linestyle="--", color="grey", alpha=0.7)
ax.set_xlabel("Prevalence (%)")
ax.set_ylabel("Predictive Value (%)")
ax.set_title("PPV and NPV Depend Heavily on Prevalence\n"
             "HIV rapid test: Sens=99.7%, Spec=99.5%")
ax.legend()
plt.tight_layout()
plt.show()

# --- Additional: Compare tests with different sensitivity/specificity ---
print("\n=== Comparing Tests at 1% Prevalence ===")

tests = [
    ("High Sens/Low Spec", 0.99, 0.90),
    ("Balanced",           0.95, 0.95),
    ("Low Sens/High Spec", 0.80, 0.99),
]

for name, sens, spec in tests:
    ppv = calculate_ppv(sens, spec, 0.01)
    npv = calculate_npv(sens, spec, 0.01)
    print(f"{name} (Sens={sens*100:.0f}%, Spec={spec*100:.0f}%): "
          f"PPV={ppv*100:.1f}%, NPV={npv*100:.1f}%")

print("\nKey takeaway: Even excellent tests have low PPV when prevalence is low.")
print("This is why screening should target high-risk populations.")

D.7.2 Exercise 2

Show R solution

# =============================================================================
# Chapter 9, Exercise 2: Comparing ROC and Precision-Recall Curves
# Simulate an imbalanced dataset (2% prevalence), plot ROC and PR curves,
# and discuss why PR curves are more informative for rare outcomes.
# =============================================================================

library(tidyverse)
library(pROC)
library(PRROC)

# --- Simulate an imbalanced dataset (2% prevalence) ---
set.seed(123)
n <- 5000
true_outcome <- rbinom(n, 1, 0.02)
# Simulate a moderately good model
pred_prob <- plogis(rnorm(n, mean = -2 + 3 * true_outcome, sd = 1.5))

cat("Number of observations:", n, "\n")
cat("Number of events:", sum(true_outcome), "\n")
cat("Prevalence:", mean(true_outcome), "\n")

# --- ROC Curve ---
roc_obj <- roc(true_outcome, pred_prob, quiet = TRUE)
auroc <- auc(roc_obj)

par(mfrow = c(1, 2))

# Plot ROC
plot(roc_obj,
     main = paste("ROC Curve\nAUROC =", round(auroc, 3)),
     col = "steelblue", lwd = 2,
     legacy.axes = TRUE)
abline(0, 1, lty = 2, col = "grey50")

# --- Precision-Recall Curve ---
pr_obj <- pr.curve(
  scores.class0 = pred_prob[true_outcome == 1],
  scores.class1 = pred_prob[true_outcome == 0],
  curve = TRUE
)
auprc <- pr_obj$auc.integral

# Plot PR curve
plot(pr_obj,
     main = paste("Precision-Recall Curve\nAUPRC =", round(auprc, 3)),
     color = "darkorange", lwd = 2)
abline(h = mean(true_outcome), lty = 2, col = "grey50")

par(mfrow = c(1, 1))

# --- Report metrics ---
cat("\n=== Summary ===\n")
cat("AUROC:", round(auroc, 3), "\n")
cat("AUPRC:", round(auprc, 3), "\n")
cat("Baseline AUPRC (prevalence):", round(mean(true_outcome), 3), "\n")

# --- Find optimal threshold (Youden's J) ---
coords_best <- coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))
cat("\nOptimal threshold (Youden's J):", round(coords_best$threshold, 3), "\n")
cat("Sensitivity:", round(coords_best$sensitivity, 3), "\n")
cat("Specificity:", round(coords_best$specificity, 3), "\n")

# PPV at this threshold
pred_class <- ifelse(pred_prob >= coords_best$threshold, 1, 0)
tp <- sum(pred_class == 1 & true_outcome == 1)
fp <- sum(pred_class == 1 & true_outcome == 0)
ppv_at_optimal <- tp / (tp + fp)
cat("PPV at optimal threshold:", round(ppv_at_optimal, 3), "\n")

# --- Interpretation ---
cat("\n=== Interpretation ===\n")
cat("The AUROC looks excellent (", round(auroc, 3), "), suggesting the model\n")
cat("discriminates well. However, the AUPRC (", round(auprc, 3), ") reveals\n")
cat("the real challenge: achieving high recall while maintaining reasonable\n")
cat("precision is difficult with a 2% prevalence rate.\n\n")
cat("At the Youden-optimal threshold, the PPV is only", round(ppv_at_optimal * 100, 1), "%.\n")
cat("This means that even at the 'best' threshold, most positive predictions\n")
cat("are false positives.\n\n")
cat("KEY LESSON: For rare outcomes, always examine the PR curve alongside\n")
cat("the ROC curve. AUROC can paint an overly optimistic picture because\n")
cat("specificity is calculated over the large majority class.\n")

Show Python solution

# =============================================================================
# Chapter 9, Exercise 2: Comparing ROC and Precision-Recall Curves
# Simulate an imbalanced dataset (2% prevalence), plot ROC and PR curves,
# and discuss why PR curves are more informative for rare outcomes.
# =============================================================================

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (roc_curve, roc_auc_score,
                              precision_recall_curve, average_precision_score)
from scipy.special import expit

# --- Simulate an imbalanced dataset (2% prevalence) ---
np.random.seed(123)
n = 5000
true_outcome = np.random.binomial(1, 0.02, n)
# Simulate a moderately good model
pred_prob = expit(np.random.normal(-2 + 3 * true_outcome, 1.5))

print(f"Number of observations: {n}")
print(f"Number of events: {true_outcome.sum()}")
print(f"Prevalence: {true_outcome.mean():.3f}")

# --- ROC Curve ---
fpr, tpr, roc_thresholds = roc_curve(true_outcome, pred_prob)
auroc = roc_auc_score(true_outcome, pred_prob)

# --- Precision-Recall Curve ---
precision, recall, pr_thresholds = precision_recall_curve(true_outcome, pred_prob)
auprc = average_precision_score(true_outcome, pred_prob)
prevalence = true_outcome.mean()

# --- Plot side by side ---
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# ROC curve
axes[0].plot(fpr, tpr, color="steelblue", lw=2)
axes[0].plot([0, 1], [0, 1], "--", color="grey")
axes[0].set_title(f"ROC Curve (AUROC = {auroc:.3f})")
axes[0].set_xlabel("False Positive Rate (1 - Specificity)")
axes[0].set_ylabel("True Positive Rate (Sensitivity)")

# PR curve
axes[1].plot(recall, precision, color="darkorange", lw=2)
axes[1].axhline(y=prevalence, linestyle="--", color="grey",
                label=f"Baseline (prevalence={prevalence:.3f})")
axes[1].set_title(f"Precision-Recall Curve (AUPRC = {auprc:.3f})")
axes[1].set_xlabel("Recall (Sensitivity)")
axes[1].set_ylabel("Precision (PPV)")
axes[1].legend()

plt.tight_layout()
plt.show()

# --- Report metrics ---
print(f"\n=== Summary ===")
print(f"AUROC: {auroc:.3f}")
print(f"AUPRC: {auprc:.3f}")
print(f"Baseline AUPRC (prevalence): {prevalence:.3f}")

# --- Find optimal threshold (Youden's J) ---
j_scores = tpr - fpr
best_idx = np.argmax(j_scores)
optimal_threshold = roc_thresholds[best_idx]
print(f"\nOptimal threshold (Youden's J): {optimal_threshold:.3f}")
print(f"Sensitivity: {tpr[best_idx]:.3f}")
print(f"Specificity: {1 - fpr[best_idx]:.3f}")

# PPV at this threshold
pred_class = (pred_prob >= optimal_threshold).astype(int)
tp = ((pred_class == 1) & (true_outcome == 1)).sum()
fp = ((pred_class == 1) & (true_outcome == 0)).sum()
ppv_at_optimal = tp / (tp + fp) if (tp + fp) > 0 else 0
print(f"PPV at optimal threshold: {ppv_at_optimal:.3f}")

# --- Interpretation ---
print(f"\n=== Interpretation ===")
print(f"The AUROC looks excellent ({auroc:.3f}), suggesting the model")
print(f"discriminates well. However, the AUPRC ({auprc:.3f}) reveals")
print(f"the real challenge: achieving high recall while maintaining")
print(f"reasonable precision is difficult with a 2% prevalence rate.")
print(f"\nAt the Youden-optimal threshold, the PPV is only {ppv_at_optimal*100:.1f}%.")
print(f"This means that even at the 'best' threshold, most positive")
print(f"predictions are false positives.")
print(f"\nKEY LESSON: For rare outcomes, always examine the PR curve")
print(f"alongside the ROC curve. AUROC can paint an overly optimistic")
print(f"picture because specificity is calculated over the large")
print(f"majority class.")

D.7.3 Exercise 3

Show R solution

# =============================================================================
# Chapter 9, Exercise 3: Evaluating Fairness
# Simulate data with two demographic groups, evaluate whether the model
# performs equitably across groups.
# =============================================================================

library(tidyverse)
library(pROC)

# --- Simulate data with two demographic groups ---
set.seed(42)
n <- 2000

group <- sample(c("A", "B"), n, replace = TRUE)

# Group B has slightly different disease prevalence and predictor distribution
true_outcome <- ifelse(group == "A",
                       rbinom(n, 1, 0.10),
                       rbinom(n, 1, 0.15))

# Model performs slightly worse for Group B (weaker signal)
pred_prob <- ifelse(group == "A",
                    plogis(rnorm(n, -1.5 + 2.5 * true_outcome, 1.0)),
                    plogis(rnorm(n, -1.5 + 1.8 * true_outcome, 1.2)))

df <- tibble(group = group, true_outcome = true_outcome, pred_prob = pred_prob)

# --- 1. Calculate AUC by group ---
cat("=== Discrimination (AUC) by Group ===\n")
for (g in c("A", "B")) {
  idx <- df$group == g
  roc_g <- roc(df$true_outcome[idx], df$pred_prob[idx], quiet = TRUE)
  ci_g <- ci.auc(roc_g)
  cat(sprintf("Group %s: AUC = %.3f (95%% CI: %.3f - %.3f), Prevalence = %.1f%%\n",
              g, auc(roc_g), ci_g[1], ci_g[3],
              mean(df$true_outcome[idx]) * 100))
}

# --- 2. Sensitivity and specificity at various thresholds ---
cat("\n=== Performance at Different Thresholds ===\n")
thresholds <- c(0.15, 0.20, 0.30, 0.40, 0.50)

for (threshold in thresholds) {
  cat(sprintf("\nThreshold = %.2f:\n", threshold))
  for (g in c("A", "B")) {
    idx <- df$group == g
    pred_class <- ifelse(df$pred_prob[idx] >= threshold, 1, 0)
    actual <- df$true_outcome[idx]

    tp <- sum(pred_class == 1 & actual == 1)
    fn <- sum(pred_class == 0 & actual == 1)
    tn <- sum(pred_class == 0 & actual == 0)
    fp <- sum(pred_class == 1 & actual == 0)

    sens <- ifelse(tp + fn > 0, tp / (tp + fn), NA)
    spec <- ifelse(tn + fp > 0, tn / (tn + fp), NA)
    ppv  <- ifelse(tp + fp > 0, tp / (tp + fp), NA)

    cat(sprintf("  Group %s: Sens=%.3f  Spec=%.3f  PPV=%.3f  (TP=%d FP=%d FN=%d TN=%d)\n",
                g, sens, spec, ppv, tp, fp, fn, tn))
  }
}

# --- 3. Positive prediction rate (demographic parity) ---
cat("\n=== Positive Prediction Rate (Demographic Parity) ===\n")
threshold <- 0.30
for (g in c("A", "B")) {
  idx <- df$group == g
  pred_class <- ifelse(df$pred_prob[idx] >= threshold, 1, 0)
  pos_rate <- mean(pred_class)
  cat(sprintf("Group %s: %.1f%% predicted positive at threshold %.2f\n",
              g, pos_rate * 100, threshold))
}

# --- 4. ROC curves overlaid ---
roc_a <- roc(df$true_outcome[df$group == "A"],
             df$pred_prob[df$group == "A"], quiet = TRUE)
roc_b <- roc(df$true_outcome[df$group == "B"],
             df$pred_prob[df$group == "B"], quiet = TRUE)

plot(roc_a, col = "steelblue", lwd = 2, legacy.axes = TRUE,
     main = "ROC Curves by Demographic Group")
plot(roc_b, col = "darkorange", lwd = 2, add = TRUE)
legend("bottomright",
       legend = c(paste("Group A (AUC =", round(auc(roc_a), 3), ")"),
                  paste("Group B (AUC =", round(auc(roc_b), 3), ")")),
       col = c("steelblue", "darkorange"), lwd = 2)

# --- Interpretation ---
cat("\n=== Interpretation ===\n")
cat("1. The model shows different AUC values across groups, indicating\n")
cat("   unequal discrimination. Group B (higher prevalence, noisier data)\n")
cat("   has lower AUC than Group A.\n\n")
cat("2. At any fixed threshold, sensitivity and specificity differ between\n")
cat("   groups. This means a single threshold does not provide equitable\n")
cat("   performance. Group-specific thresholds could equalize sensitivity\n")
cat("   but would raise questions about fairness.\n\n")
cat("3. Different positive prediction rates reflect both different prevalence\n")
cat("   and different model performance. Demographic parity (equal positive\n")
cat("   rates) may conflict with calibration if true prevalence differs.\n\n")
cat("4. CLINICAL IMPLICATION: Before deploying a model, always evaluate\n")
cat("   performance across demographic subgroups. A model that works well\n")
cat("   'on average' may perform poorly for specific populations, potentially\n")
cat("   widening health disparities. Report subgroup-specific metrics.\n")

Show Python solution

# =============================================================================
# Chapter 9, Exercise 3: Evaluating Fairness
# Simulate data with two demographic groups, evaluate whether the model
# performs equitably across groups.
# =============================================================================

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve
from scipy.special import expit

# --- Simulate data with two demographic groups ---
np.random.seed(42)
n = 2000

group = np.random.choice(["A", "B"], n)

# Group B has slightly different disease prevalence
true_outcome = np.where(group == "A",
                        np.random.binomial(1, 0.10, n),
                        np.random.binomial(1, 0.15, n))

# Model performs slightly worse for Group B (weaker signal, more noise)
pred_prob = np.where(group == "A",
                     expit(np.random.normal(-1.5 + 2.5 * true_outcome, 1.0)),
                     expit(np.random.normal(-1.5 + 1.8 * true_outcome, 1.2)))

# --- 1. Calculate AUC by group ---
print("=== Discrimination (AUC) by Group ===")
for g in ["A", "B"]:
    mask = group == g
    auc = roc_auc_score(true_outcome[mask], pred_prob[mask])
    prev = true_outcome[mask].mean()
    print(f"Group {g}: AUC = {auc:.3f}, Prevalence = {prev*100:.1f}%")

# --- 2. Sensitivity and specificity at various thresholds ---
print("\n=== Performance at Different Thresholds ===")
thresholds = [0.15, 0.20, 0.30, 0.40, 0.50]

for threshold in thresholds:
    print(f"\nThreshold = {threshold:.2f}:")
    for g in ["A", "B"]:
        mask = group == g
        pred_class = (pred_prob[mask] >= threshold).astype(int)
        actual = true_outcome[mask]

        tp = ((pred_class == 1) & (actual == 1)).sum()
        fn = ((pred_class == 0) & (actual == 1)).sum()
        tn = ((pred_class == 0) & (actual == 0)).sum()
        fp = ((pred_class == 1) & (actual == 0)).sum()

        sens = tp / (tp + fn) if (tp + fn) > 0 else 0
        spec = tn / (tn + fp) if (tn + fp) > 0 else 0
        ppv = tp / (tp + fp) if (tp + fp) > 0 else 0

        print(f"  Group {g}: Sens={sens:.3f}  Spec={spec:.3f}  "
              f"PPV={ppv:.3f}  (TP={tp} FP={fp} FN={fn} TN={tn})")

# --- 3. Positive prediction rate (demographic parity) ---
print("\n=== Positive Prediction Rate (Demographic Parity) ===")
threshold = 0.30
for g in ["A", "B"]:
    mask = group == g
    pred_class = (pred_prob[mask] >= threshold).astype(int)
    pos_rate = pred_class.mean()
    print(f"Group {g}: {pos_rate*100:.1f}% predicted positive "
          f"at threshold {threshold:.2f}")

# --- 4. ROC curves overlaid ---
fig, ax = plt.subplots(figsize=(7, 7))

for g, color, label_prefix in [("A", "steelblue", "Group A"),
                                ("B", "darkorange", "Group B")]:
    mask = group == g
    fpr, tpr, _ = roc_curve(true_outcome[mask], pred_prob[mask])
    auc = roc_auc_score(true_outcome[mask], pred_prob[mask])
    ax.plot(fpr, tpr, color=color, lw=2,
            label=f"{label_prefix} (AUC = {auc:.3f})")

ax.plot([0, 1], [0, 1], "--", color="grey")
ax.set_xlabel("False Positive Rate (1 - Specificity)")
ax.set_ylabel("True Positive Rate (Sensitivity)")
ax.set_title("ROC Curves by Demographic Group")
ax.legend(loc="lower right")
plt.tight_layout()
plt.show()

# --- Interpretation ---
print("\n=== Interpretation ===")
print("1. The model shows different AUC values across groups, indicating")
print("   unequal discrimination. Group B (higher prevalence, noisier data)")
print("   has lower AUC than Group A.")
print("\n2. At any fixed threshold, sensitivity and specificity differ between")
print("   groups. A single threshold does not provide equitable performance.")
print("   Group-specific thresholds could equalize sensitivity but raise")
print("   questions about fairness.")
print("\n3. Different positive prediction rates reflect both different prevalence")
print("   and different model performance. Demographic parity (equal positive")
print("   rates) may conflict with calibration if true prevalence differs.")
print("\n4. CLINICAL IMPLICATION: Before deploying a model, always evaluate")
print("   performance across demographic subgroups. A model that works well")
print("   'on average' may perform poorly for specific populations, potentially")
print("   widening health disparities. Report subgroup-specific metrics.")

D.8 Chapter 11: Performance and Validation

D.8.1 Exercise 1

Show R solution

# Chapter 11, Exercise 1: Calibration Assessment
# Using the stroke mortality model from the chapter

library(rms)

# ---- Simulate the stroke data (same as chapter) ----
set.seed(2024)
n <- 1500

stroke_data <- data.frame(
  age = round(rnorm(n, 72, 12)),
  nihss = round(pmax(0, rnorm(n, 8, 6))),
  glucose = round(rnorm(n, 140, 50)),
  afib = rbinom(n, 1, 0.25),
  thrombolysis = rbinom(n, 1, 0.30)
)

lp <- -5 + 0.04 * stroke_data$age +
  0.12 * stroke_data$nihss +
  0.003 * stroke_data$glucose +
  0.3 * stroke_data$afib -
  0.5 * stroke_data$thrombolysis

stroke_data$death_30d <- rbinom(n, 1, plogis(lp))

# Fit the prediction model
dd <- datadist(stroke_data)
options(datadist = "dd")

fit <- lrm(death_30d ~ age + nihss + glucose + afib + thrombolysis,
           data = stroke_data, x = TRUE, y = TRUE)

pred_prob <- predict(fit, type = "fitted")

# ---- (a) Calibration plot using deciles of predicted risk ----
cat("=== Part (a): Calibration Plot ===\n")
val.prob(pred_prob, stroke_data$death_30d, m = 150, cex = 0.5,
         main = "Calibration Plot (Deciles): 30-Day Stroke Mortality")
# The model appears well-calibrated since the points cluster near the diagonal.
# This is expected because we are evaluating apparent performance on the
# training data.

# ---- (b) O:E ratio ----
cat("\n=== Part (b): O:E Ratio ===\n")
obs_rate <- mean(stroke_data$death_30d)
mean_pred <- mean(pred_prob)
oe_ratio <- obs_rate / mean_pred

cat("Observed event rate:", round(obs_rate, 3), "\n")
cat("Mean predicted probability:", round(mean_pred, 3), "\n")
cat("O:E ratio:", round(oe_ratio, 3), "\n")
# An O:E ratio near 1.0 indicates good calibration-in-the-large.
# The model's average predictions match the observed event rate.

# ---- (c) Calibration slope ----
cat("\n=== Part (c): Calibration Slope ===\n")
cal_model <- glm(stroke_data$death_30d ~ qlogis(pred_prob), family = binomial)
cal_slope <- coef(cal_model)[2]
cal_intercept <- coef(cal_model)[1]

cat("Calibration slope:", round(cal_slope, 3), "\n")
cat("Calibration intercept:", round(cal_intercept, 3), "\n")
# A calibration slope of 1.0 indicates no overfitting.
# The apparent slope is typically close to 1 on the training data.
# Values < 1 on new data would suggest overfitting (predictions too extreme).

# ---- (d) Bootstrap optimism correction ----
cat("\n=== Part (d): Bootstrap Optimism Correction ===\n")
set.seed(42)
val <- validate(fit, B = 200)

cat("Bootstrap Validation Results:\n")
print(val)

# Extract optimism-corrected C-statistic
dxy_corrected <- val["Dxy", "index.corrected"]
c_apparent <- fit$stats["C"]
c_corrected <- (dxy_corrected + 1) / 2

cat("\nApparent C-statistic:", round(c_apparent, 4), "\n")
cat("Optimism-corrected C-statistic:", round(c_corrected, 4), "\n")
cat("C-statistic decrease:", round(c_apparent - c_corrected, 4), "\n")

# Calibration slope from bootstrap
slope_corrected <- val["Slope", "index.corrected"]
cat("\nApparent calibration slope: 1.000\n")
cat("Optimism-corrected calibration slope:", round(slope_corrected, 4), "\n")
cat("Slope decrease:", round(1 - slope_corrected, 4), "\n")

# The C-statistic decreases slightly after optimism correction, reflecting
# the mild optimism in apparent performance. The calibration slope also
# decreases below 1, indicating some overfitting that inflates apparent
# performance.

Show Python solution

# Chapter 11, Exercise 1: Calibration Assessment
# Using the stroke mortality model from the chapter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.calibration import calibration_curve
from scipy.special import expit, logit

# ---- Simulate the stroke data (same as chapter) ----
np.random.seed(2024)
n = 1500

stroke_data = pd.DataFrame({
    "age": np.round(np.random.normal(72, 12, n)),
    "nihss": np.round(np.maximum(0, np.random.normal(8, 6, n))),
    "glucose": np.round(np.random.normal(140, 50, n)),
    "afib": np.random.binomial(1, 0.25, n),
    "thrombolysis": np.random.binomial(1, 0.30, n)
})

lp = (-5 + 0.04 * stroke_data["age"] +
      0.12 * stroke_data["nihss"] +
      0.003 * stroke_data["glucose"] +
      0.3 * stroke_data["afib"] -
      0.5 * stroke_data["thrombolysis"])

stroke_data["death_30d"] = np.random.binomial(1, expit(lp))

predictors = ["age", "nihss", "glucose", "afib", "thrombolysis"]
X = stroke_data[predictors].values
y = stroke_data["death_30d"].values

model = LogisticRegression(max_iter=5000, random_state=42)
model.fit(X, y)
pred_prob = model.predict_proba(X)[:, 1]

# ---- (a) Calibration plot using deciles ----
print("=== Part (a): Calibration Plot ===")
prob_true, prob_pred = calibration_curve(y, pred_prob, n_bins=10, strategy="quantile")

fig, ax = plt.subplots(figsize=(7, 6))
ax.plot(prob_pred, prob_true, "o-", color="steelblue", lw=2, markersize=8, label="Model")
ax.plot([0, 1], [0, 1], "--", color="grey", label="Perfect calibration")
ax.set_xlabel("Mean Predicted Probability")
ax.set_ylabel("Observed Proportion")
ax.set_title("Calibration Plot (Deciles): 30-Day Stroke Mortality")
ax.legend()
plt.tight_layout()
plt.savefig("ch11_ex1_calibration.png", dpi=150)
plt.show()
# The points cluster near the diagonal, suggesting good apparent calibration.

# ---- (b) O:E ratio ----
print("\n=== Part (b): O:E Ratio ===")
obs_rate = y.mean()
mean_pred = pred_prob.mean()
oe_ratio = obs_rate / mean_pred

print(f"Observed event rate: {obs_rate:.3f}")
print(f"Mean predicted probability: {mean_pred:.3f}")
print(f"O:E ratio: {oe_ratio:.3f}")
# An O:E ratio near 1.0 indicates good calibration-in-the-large.

# ---- (c) Calibration slope ----
print("\n=== Part (c): Calibration Slope ===")
lp_pred = logit(np.clip(pred_prob, 1e-8, 1 - 1e-8))
cal_model = LogisticRegression(max_iter=5000, penalty=None)
cal_model.fit(lp_pred.reshape(-1, 1), y)

print(f"Calibration slope: {cal_model.coef_[0][0]:.3f}")
print(f"Calibration intercept: {cal_model.intercept_[0]:.3f}")
# A slope near 1.0 on training data is expected. Values < 1 on new data
# indicate overfitting.

# ---- (d) Bootstrap optimism correction ----
print("\n=== Part (d): Bootstrap Optimism Correction ===")
np.random.seed(42)

# Apparent performance
apparent_auc = roc_auc_score(y, pred_prob)
apparent_brier = brier_score_loss(y, pred_prob)

n_boot = 200
optimism_auc = []
optimism_brier = []

for b in range(n_boot):
    boot_idx = np.random.choice(n, n, replace=True)
    X_boot, y_boot = X[boot_idx], y[boot_idx]

    boot_model = LogisticRegression(max_iter=5000, random_state=42)
    boot_model.fit(X_boot, y_boot)

    # Apparent on bootstrap sample
    pred_boot = boot_model.predict_proba(X_boot)[:, 1]
    auc_boot = roc_auc_score(y_boot, pred_boot)

    # Test on original data
    pred_orig = boot_model.predict_proba(X)[:, 1]
    auc_orig = roc_auc_score(y, pred_orig)

    optimism_auc.append(auc_boot - auc_orig)

mean_opt_auc = np.mean(optimism_auc)
corrected_auc = apparent_auc - mean_opt_auc

print(f"Apparent C-statistic:          {apparent_auc:.4f}")
print(f"Mean optimism (AUC):           {mean_opt_auc:.4f}")
print(f"Optimism-corrected C-statistic:{corrected_auc:.4f}")
print(f"C-statistic decrease:          {mean_opt_auc:.4f}")

# The C-statistic decreases slightly after correction, reflecting mild
# optimism in the apparent performance. With 1500 observations and 5
# predictors, overfitting is modest.

D.8.2 Exercise 2

Show R solution

# Chapter 11, Exercise 2: External Validation Simulation
# Create three external validation populations and assess model performance

library(rms)
library(pROC)

# ---- Simulate development data and fit model (same as chapter) ----
set.seed(2024)
n <- 1500

stroke_data <- data.frame(
  age = round(rnorm(n, 72, 12)),
  nihss = round(pmax(0, rnorm(n, 8, 6))),
  glucose = round(rnorm(n, 140, 50)),
  afib = rbinom(n, 1, 0.25),
  thrombolysis = rbinom(n, 1, 0.30)
)

lp <- -5 + 0.04 * stroke_data$age +
  0.12 * stroke_data$nihss +
  0.003 * stroke_data$glucose +
  0.3 * stroke_data$afib -
  0.5 * stroke_data$thrombolysis

stroke_data$death_30d <- rbinom(n, 1, plogis(lp))

dd <- datadist(stroke_data)
options(datadist = "dd")

fit <- lrm(death_30d ~ age + nihss + glucose + afib + thrombolysis,
           data = stroke_data, x = TRUE, y = TRUE)

# ---- Helper function: evaluate model performance ----
evaluate_ext <- function(ext_data, fit, label) {
  ext_data$pred <- predict(fit, newdata = ext_data, type = "fitted")
  obs_rate <- mean(ext_data$death_30d)
  mean_pred <- mean(ext_data$pred)
  oe <- obs_rate / mean_pred

  # C-statistic
  roc_obj <- roc(ext_data$death_30d, ext_data$pred, quiet = TRUE)
  c_stat <- auc(roc_obj)


  # Calibration slope
  lp_ext <- qlogis(ext_data$pred)
  cal_model <- glm(death_30d ~ lp_ext, data = ext_data, family = binomial)
  cal_slope <- coef(cal_model)[2]
  cal_int <- coef(cal_model)[1]

  cat(sprintf("\n=== %s ===\n", label))
  cat(sprintf("  N = %d, Observed mortality: %.3f\n", nrow(ext_data), obs_rate))
  cat(sprintf("  Mean predicted: %.3f\n", mean_pred))
  cat(sprintf("  C-statistic: %.3f\n", c_stat))
  cat(sprintf("  O:E ratio: %.3f\n", oe))
  cat(sprintf("  Calibration slope: %.3f\n", cal_slope))
  cat(sprintf("  Calibration intercept: %.3f\n", cal_int))

  return(ext_data)
}

# ---- Population A: Temporal validation (3 years later) ----
# Same demographics, slightly different practice patterns
set.seed(101)
n_a <- 800
pop_a <- data.frame(
  age = round(rnorm(n_a, 73, 12)),        # Similar age
  nihss = round(pmax(0, rnorm(n_a, 8, 6))),
  glucose = round(rnorm(n_a, 138, 48)),
  afib = rbinom(n_a, 1, 0.27),
  thrombolysis = rbinom(n_a, 1, 0.40)     # More thrombolysis over time
)

lp_a <- -5 + 0.04 * pop_a$age + 0.12 * pop_a$nihss +
  0.003 * pop_a$glucose + 0.3 * pop_a$afib - 0.5 * pop_a$thrombolysis
pop_a$death_30d <- rbinom(n_a, 1, plogis(lp_a))

pop_a <- evaluate_ext(pop_a, fit, "Population A: Temporal (3 years later)")

# ---- Population B: Geographical (younger patients) ----
set.seed(202)
n_b <- 600
pop_b <- data.frame(
  age = round(rnorm(n_b, 62, 10)),        # Younger
  nihss = round(pmax(0, rnorm(n_b, 6, 5))),  # Lower severity
  glucose = round(rnorm(n_b, 130, 45)),
  afib = rbinom(n_b, 1, 0.15),            # Less AF
  thrombolysis = rbinom(n_b, 1, 0.35)
)

lp_b <- -5 + 0.04 * pop_b$age + 0.12 * pop_b$nihss +
  0.003 * pop_b$glucose + 0.3 * pop_b$afib - 0.5 * pop_b$thrombolysis
pop_b$death_30d <- rbinom(n_b, 1, plogis(lp_b))

pop_b <- evaluate_ext(pop_b, fit, "Population B: Geographical (younger patients)")

# ---- Population C: Domain (primary care, lower severity) ----
set.seed(303)
n_c <- 500
pop_c <- data.frame(
  age = round(rnorm(n_c, 68, 14)),
  nihss = round(pmax(0, rnorm(n_c, 3, 3))),  # Much lower severity
  glucose = round(rnorm(n_c, 120, 35)),
  afib = rbinom(n_c, 1, 0.20),
  thrombolysis = rbinom(n_c, 1, 0.10)        # Rarely used in primary care
)

# Different outcome model: lower baseline risk in primary care
lp_c <- -6 + 0.03 * pop_c$age + 0.10 * pop_c$nihss +
  0.002 * pop_c$glucose + 0.2 * pop_c$afib - 0.3 * pop_c$thrombolysis
pop_c$death_30d <- rbinom(n_c, 1, plogis(lp_c))

pop_c <- evaluate_ext(pop_c, fit, "Population C: Domain (primary care)")

# ---- (b) Calibration plots ----
par(mfrow = c(1, 3), mar = c(4, 4, 3, 1))
val.prob(pop_a$pred, pop_a$death_30d, m = 80, cex = 0.5,
         main = "A: Temporal")
val.prob(pop_b$pred, pop_b$death_30d, m = 60, cex = 0.5,
         main = "B: Geographical")
val.prob(pop_c$pred, pop_c$death_30d, m = 50, cex = 0.5,
         main = "C: Domain")

# ---- (c) Which population shows worst calibration? ----
cat("\n=== Part (c): Worst Calibration ===\n")
cat("Population C (primary care / domain validation) shows the worst calibration.\n")
cat("This is because the outcome model differs from the development setting:\n")
cat("  - Different baseline risk (intercept)\n")
cat("  - Different predictor-outcome relationships (coefficients)\n")
cat("  - Different case-mix (lower severity patients)\n")
cat("Domain validation is the most stringent test of transportability.\n")

# ---- (d) Logistic recalibration ----
cat("\n=== Part (d): Logistic Recalibration ===\n")

recalibrate <- function(ext_data, label) {
  lp_ext <- qlogis(ext_data$pred)
  cal_fit <- glm(death_30d ~ lp_ext, data = ext_data, family = binomial)
  ext_data$pred_recal <- predict(cal_fit, type = "response")

  oe_before <- mean(ext_data$death_30d) / mean(ext_data$pred)
  oe_after  <- mean(ext_data$death_30d) / mean(ext_data$pred_recal)

  cat(sprintf("\n%s:\n", label))
  cat(sprintf("  O:E before recalibration: %.3f\n", oe_before))
  cat(sprintf("  O:E after recalibration:  %.3f\n", oe_after))
}

recalibrate(pop_a, "Population A")
recalibrate(pop_b, "Population B")
recalibrate(pop_c, "Population C")

cat("\nLogistic recalibration corrects for differences in baseline risk\n")
cat("and prediction spread, bringing O:E ratios closer to 1.0.\n")

Show Python solution

# Chapter 11, Exercise 2: External Validation Simulation
# Create three external validation populations and assess model performance

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.calibration import calibration_curve
from scipy.special import expit, logit

# ---- Simulate development data and fit model (same as chapter) ----
np.random.seed(2024)
n = 1500

stroke_data = pd.DataFrame({
    "age": np.round(np.random.normal(72, 12, n)),
    "nihss": np.round(np.maximum(0, np.random.normal(8, 6, n))),
    "glucose": np.round(np.random.normal(140, 50, n)),
    "afib": np.random.binomial(1, 0.25, n),
    "thrombolysis": np.random.binomial(1, 0.30, n)
})

lp = (-5 + 0.04 * stroke_data["age"] +
      0.12 * stroke_data["nihss"] +
      0.003 * stroke_data["glucose"] +
      0.3 * stroke_data["afib"] -
      0.5 * stroke_data["thrombolysis"])

stroke_data["death_30d"] = np.random.binomial(1, expit(lp))

predictors = ["age", "nihss", "glucose", "afib", "thrombolysis"]
X = stroke_data[predictors].values
y = stroke_data["death_30d"].values

model = LogisticRegression(max_iter=5000, random_state=42)
model.fit(X, y)


# ---- Helper function ----
def evaluate_ext(ext_df, model, predictors, label):
    """Evaluate model on external data and return predictions."""
    X_ext = ext_df[predictors].values
    y_ext = ext_df["death_30d"].values
    pred = model.predict_proba(X_ext)[:, 1]

    obs_rate = y_ext.mean()
    mean_pred = pred.mean()
    oe = obs_rate / mean_pred
    auc = roc_auc_score(y_ext, pred)

    # Calibration slope
    lp_ext = logit(np.clip(pred, 1e-8, 1 - 1e-8))
    cal = LogisticRegression(max_iter=5000, penalty=None)
    cal.fit(lp_ext.reshape(-1, 1), y_ext)

    print(f"\n=== {label} ===")
    print(f"  N = {len(y_ext)}, Observed mortality: {obs_rate:.3f}")
    print(f"  Mean predicted: {mean_pred:.3f}")
    print(f"  C-statistic: {auc:.3f}")
    print(f"  O:E ratio: {oe:.3f}")
    print(f"  Calibration slope: {cal.coef_[0][0]:.3f}")
    print(f"  Calibration intercept: {cal.intercept_[0]:.3f}")

    return pred


# ---- Population A: Temporal validation (3 years later) ----
np.random.seed(101)
n_a = 800
pop_a = pd.DataFrame({
    "age": np.round(np.random.normal(73, 12, n_a)),
    "nihss": np.round(np.maximum(0, np.random.normal(8, 6, n_a))),
    "glucose": np.round(np.random.normal(138, 48, n_a)),
    "afib": np.random.binomial(1, 0.27, n_a),
    "thrombolysis": np.random.binomial(1, 0.40, n_a)  # More thrombolysis
})
lp_a = (-5 + 0.04 * pop_a["age"] + 0.12 * pop_a["nihss"] +
        0.003 * pop_a["glucose"] + 0.3 * pop_a["afib"] -
        0.5 * pop_a["thrombolysis"])
pop_a["death_30d"] = np.random.binomial(1, expit(lp_a))
pred_a = evaluate_ext(pop_a, model, predictors, "Population A: Temporal")

# ---- Population B: Geographical (younger patients) ----
np.random.seed(202)
n_b = 600
pop_b = pd.DataFrame({
    "age": np.round(np.random.normal(62, 10, n_b)),
    "nihss": np.round(np.maximum(0, np.random.normal(6, 5, n_b))),
    "glucose": np.round(np.random.normal(130, 45, n_b)),
    "afib": np.random.binomial(1, 0.15, n_b),
    "thrombolysis": np.random.binomial(1, 0.35, n_b)
})
lp_b = (-5 + 0.04 * pop_b["age"] + 0.12 * pop_b["nihss"] +
        0.003 * pop_b["glucose"] + 0.3 * pop_b["afib"] -
        0.5 * pop_b["thrombolysis"])
pop_b["death_30d"] = np.random.binomial(1, expit(lp_b))
pred_b = evaluate_ext(pop_b, model, predictors, "Population B: Geographical")

# ---- Population C: Domain (primary care, lower severity) ----
np.random.seed(303)
n_c = 500
pop_c = pd.DataFrame({
    "age": np.round(np.random.normal(68, 14, n_c)),
    "nihss": np.round(np.maximum(0, np.random.normal(3, 3, n_c))),
    "glucose": np.round(np.random.normal(120, 35, n_c)),
    "afib": np.random.binomial(1, 0.20, n_c),
    "thrombolysis": np.random.binomial(1, 0.10, n_c)
})
# Different outcome model in primary care
lp_c = (-6 + 0.03 * pop_c["age"] + 0.10 * pop_c["nihss"] +
        0.002 * pop_c["glucose"] + 0.2 * pop_c["afib"] -
        0.3 * pop_c["thrombolysis"])
pop_c["death_30d"] = np.random.binomial(1, expit(lp_c))
pred_c = evaluate_ext(pop_c, model, predictors, "Population C: Domain")

# ---- (b) Calibration plots ----
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
pops = [(pop_a, pred_a, "A: Temporal"), (pop_b, pred_b, "B: Geographical"),
        (pop_c, pred_c, "C: Domain")]

for ax, (pop, pred, title) in zip(axes, pops):
    y_ext = pop["death_30d"].values
    prob_true, prob_pred = calibration_curve(y_ext, pred, n_bins=10,
                                             strategy="quantile")
    ax.plot(prob_pred, prob_true, "o-", color="steelblue", lw=2)
    ax.plot([0, 1], [0, 1], "--", color="grey")
    ax.set_xlabel("Predicted Probability")
    ax.set_ylabel("Observed Proportion")
    ax.set_title(title)

plt.tight_layout()
plt.savefig("ch11_ex2_calibration_plots.png", dpi=150)
plt.show()

# ---- (c) Worst calibration ----
print("\n=== Part (c): Worst Calibration ===")
print("Population C (primary care / domain validation) shows worst calibration.")
print("The outcome model differs from the development setting:")
print("  - Different baseline risk and predictor-outcome relationships")
print("  - Much lower severity patients (NIHSS ~ 3 vs 8)")
print("  - Domain validation is the most stringent test of transportability.")

# ---- (d) Logistic recalibration ----
print("\n=== Part (d): Logistic Recalibration ===")

for pop, pred, label in pops:
    y_ext = pop["death_30d"].values
    lp_ext = logit(np.clip(pred, 1e-8, 1 - 1e-8))
    recal = LogisticRegression(max_iter=5000, penalty=None)
    recal.fit(lp_ext.reshape(-1, 1), y_ext)
    pred_recal = recal.predict_proba(lp_ext.reshape(-1, 1))[:, 1]

    oe_before = y_ext.mean() / pred.mean()
    oe_after = y_ext.mean() / pred_recal.mean()

    print(f"\n{label}:")
    print(f"  O:E before: {oe_before:.3f}")
    print(f"  O:E after:  {oe_after:.3f}")

print("\nLogistic recalibration adjusts intercept and slope, bringing")
print("O:E ratios closer to 1.0 in each external population.")

D.8.3 Exercise 3

Show R solution

# Chapter 11, Exercise 3: Decision Curve Interpretation
# Conceptual exercise about a preeclampsia prediction model

# This exercise is primarily conceptual. We simulate a plausible preeclampsia
# model and generate a decision curve to support the discussion.

set.seed(42)
n <- 2000

# Simulate data for preeclampsia prediction
preeclampsia_data <- data.frame(
  age = round(rnorm(n, 30, 5)),
  bmi = round(rnorm(n, 27, 5), 1),
  nulliparous = rbinom(n, 1, 0.4),
  history_pe = rbinom(n, 1, 0.05),
  map = round(rnorm(n, 85, 10))  # mean arterial pressure
)

lp <- -5.5 + 0.03 * preeclampsia_data$age +
  0.04 * preeclampsia_data$bmi +
  0.5 * preeclampsia_data$nulliparous +
  1.5 * preeclampsia_data$history_pe +
  0.02 * preeclampsia_data$map

preeclampsia_data$pe <- rbinom(n, 1, plogis(lp))
cat("Preeclampsia rate:", mean(preeclampsia_data$pe), "\n")

# Fit model
model <- glm(pe ~ age + bmi + nulliparous + history_pe + map,
             data = preeclampsia_data, family = binomial)

pred_prob <- predict(model, type = "response")
y_obs <- preeclampsia_data$pe

# ---- Generate decision curve ----
thresholds <- seq(0.01, 0.50, by = 0.01)
n_total <- length(y_obs)
nb_model <- nb_all <- numeric(length(thresholds))

for (i in seq_along(thresholds)) {
  t <- thresholds[i]
  pred_pos <- as.numeric(pred_prob >= t)
  tp <- sum(pred_pos == 1 & y_obs == 1)
  fp <- sum(pred_pos == 1 & y_obs == 0)
  nb_model[i] <- tp / n_total - fp / n_total * (t / (1 - t))
  nb_all[i] <- mean(y_obs) - (1 - mean(y_obs)) * (t / (1 - t))
}

plot(thresholds, nb_model, type = "l", lwd = 2, col = "steelblue",
     main = "Decision Curve: Preeclampsia Prediction Model\n(AUC ~ 0.82)",
     xlab = "Threshold Probability", ylab = "Net Benefit",
     ylim = c(-0.05, max(c(nb_model, nb_all)) * 1.1))
lines(thresholds, nb_all, lwd = 2, col = "coral", lty = 2)
abline(h = 0, col = "black")
legend("topright", legend = c("Model", "Treat All", "Treat None"),
       col = c("steelblue", "coral", "black"),
       lty = c(1, 2, 1), lwd = 2)

# ---- Part (a): Range of useful thresholds ----
cat("\n=== Part (a): Range of Useful Thresholds ===\n")
cat("The model is useful at thresholds where its net benefit curve\n")
cat("exceeds BOTH the 'treat all' and 'treat none' lines.\n")
cat("From the decision curve, the model appears useful approximately\n")
cat("in the range of threshold probabilities from about 2% to 30-40%.\n")
cat("Below ~2%, 'treat all' has similar or higher net benefit.\n")
cat("Above ~30-40%, the model's net benefit approaches zero.\n")

# ---- Part (b): Counter-argument to 'AUC is only 0.82' ----
cat("\n=== Part (b): Counter-argument ===\n")
cat("An AUC of 0.82 is not 'only' -- it is strong discrimination.\n")
cat("More importantly, the AUC does not directly tell you whether\n")
cat("using the model improves clinical decisions.\n\n")
cat("The decision curve shows that across the clinically relevant\n")
cat("threshold range (e.g., 5-20%), the model provides positive\n")
cat("net benefit above both default strategies.\n\n")
cat("This means that using the model to guide closer monitoring\n")
cat("decisions would identify more true preeclampsia cases per\n")
cat("'unnecessary' monitoring than either monitoring everyone or\n")
cat("monitoring no one. Clinical utility is what matters for\n")
cat("patient care -- not the AUC alone.\n")

# ---- Part (c): Effect of action cost on the decision curve ----
cat("\n=== Part (c): Effect of Action Cost ===\n")
cat("The threshold probability reflects the implicit cost-benefit\n")
cat("ratio of the action (closer monitoring).\n\n")
cat("LOW-COST action (e.g., closer monitoring, extra visits):\n")
cat("  - Clinicians would use a LOW threshold (e.g., 3-5%)\n")
cat("  - They accept many false positives to avoid missing cases\n")
cat("  - The relevant region of the decision curve shifts LEFT\n")
cat("  - 'Treat all' remains competitive at low thresholds\n\n")
cat("HIGH-COST action (e.g., prophylactic medication with side effects):\n")
cat("  - Clinicians would use a HIGHER threshold (e.g., 15-25%)\n")
cat("  - They want more certainty before acting\n")
cat("  - The relevant region shifts RIGHT\n")
cat("  - The model has more value here because it avoids unnecessary\n")
cat("    treatment of low-risk patients\n\n")
cat("In summary: the decision curve itself does not change, but the\n")
cat("clinically relevant REGION changes depending on the cost of action.\n")

Show Python solution

# Chapter 11, Exercise 3: Decision Curve Interpretation
# Conceptual exercise about a preeclampsia prediction model

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from scipy.special import expit

# Simulate a plausible preeclampsia model to generate a decision curve
np.random.seed(42)
n = 2000

age = np.round(np.random.normal(30, 5, n))
bmi = np.round(np.random.normal(27, 5, n), 1)
nulliparous = np.random.binomial(1, 0.4, n)
history_pe = np.random.binomial(1, 0.05, n)
map_val = np.round(np.random.normal(85, 10, n))

lp = (-5.5 + 0.03 * age + 0.04 * bmi + 0.5 * nulliparous +
      1.5 * history_pe + 0.02 * map_val)
pe = np.random.binomial(1, expit(lp))
print(f"Preeclampsia rate: {pe.mean():.3f}")

X = np.column_stack([age, bmi, nulliparous, history_pe, map_val])
model = LogisticRegression(max_iter=5000, random_state=42)
model.fit(X, pe)
pred_prob = model.predict_proba(X)[:, 1]
print(f"AUC: {roc_auc_score(pe, pred_prob):.3f}")

# ---- Generate decision curve ----
thresholds = np.arange(0.01, 0.51, 0.01)
n_total = len(pe)


def net_benefit(y_true, y_pred, threshold):
    pred_pos = (y_pred >= threshold).astype(int)
    tp = np.sum((pred_pos == 1) & (y_true == 1))
    fp = np.sum((pred_pos == 1) & (y_true == 0))
    return tp / n_total - fp / n_total * (threshold / (1 - threshold))


nb_model = [net_benefit(pe, pred_prob, t) for t in thresholds]
nb_all = [pe.mean() - (1 - pe.mean()) * t / (1 - t) for t in thresholds]

plt.figure(figsize=(9, 6))
plt.plot(thresholds, nb_model, color="steelblue", lw=2, label="Prediction Model")
plt.plot(thresholds, nb_all, color="coral", lw=2, ls="--", label="Treat All")
plt.axhline(y=0, color="black", lw=1, label="Treat None")
plt.xlabel("Threshold Probability")
plt.ylabel("Net Benefit")
plt.title("Decision Curve: Preeclampsia Prediction Model (AUC ~ 0.82)")
plt.legend()
plt.ylim(-0.05, max(max(nb_model), max(nb_all)) * 1.1)
plt.tight_layout()
plt.savefig("ch11_ex3_decision_curve.png", dpi=150)
plt.show()

# ---- Part (a): Range of useful thresholds ----
print("\n=== Part (a): Range of Useful Thresholds ===")
print("The model is useful at thresholds where its net benefit curve")
print("exceeds BOTH 'treat all' and 'treat none'.")
print("From the decision curve, the model is useful approximately")
print("from about 2% to 30-40% threshold probability.")
print("Below ~2%, 'treat all' has similar or higher net benefit.")
print("Above ~30-40%, the model's net benefit approaches zero.")

# ---- Part (b): Counter-argument to 'AUC is only 0.82' ----
print("\n=== Part (b): Counter-argument ===")
print("An AUC of 0.82 is strong discrimination, not 'only'.")
print("More importantly, AUC does not directly measure clinical utility.")
print("")
print("The decision curve shows the model provides positive net benefit")
print("above both default strategies across the clinically relevant")
print("threshold range (e.g., 5-20%). This means using the model to")
print("guide monitoring decisions identifies more true preeclampsia")
print("cases per 'unnecessary' monitoring episode than either monitoring")
print("everyone or no one. Clinical utility -- not AUC -- determines")
print("whether a model should be used in practice.")

# ---- Part (c): Effect of action cost ----
print("\n=== Part (c): Effect of Action Cost ===")
print("The threshold probability reflects the cost-benefit ratio of acting.")
print("")
print("LOW-COST action (closer monitoring, extra visits):")
print("  - Clinicians use a LOW threshold (e.g., 3-5%)")
print("  - Many false positives are tolerable")
print("  - The relevant region of the decision curve is on the LEFT")
print("  - 'Treat all' remains competitive at low thresholds")
print("")
print("HIGH-COST action (prophylactic medication with side effects):")
print("  - Clinicians use a HIGHER threshold (e.g., 15-25%)")
print("  - More certainty is needed before acting")
print("  - The relevant region shifts RIGHT")
print("  - The model adds more value by avoiding unnecessary treatment")
print("")
print("The curve itself does not change, but the clinically relevant")
print("region changes depending on the cost of the follow-up action.")

D.8.4 Exercise 4

Show R solution

# Chapter 11, Exercise 4: Complete Evaluation of a Published Model
# Example: QRISK3 Cardiovascular Risk Prediction Model
#
# This exercise is primarily a literature review exercise. Below we provide
# a structured answer based on published validation data for QRISK3.

# ---- Part (a): C-statistic with confidence interval ----
cat("=== Part (a): C-statistic ===\n")
cat("QRISK3 (Hippisley-Cox et al., 2017, BMJ) was developed to predict\n")
cat("10-year cardiovascular disease risk.\n\n")
cat("Development cohort:\n")
cat("  C-statistic (women): 0.880 (95% CI: 0.878-0.882)\n")
cat("  C-statistic (men):   0.858 (95% CI: 0.856-0.860)\n\n")
cat("Validation cohort (separate 25% held out):\n")
cat("  C-statistic (women): 0.879 (95% CI: 0.876-0.882)\n")
cat("  C-statistic (men):   0.858 (95% CI: 0.855-0.861)\n")
cat("The discrimination is excellent and stable between development\n")
cat("and validation sets.\n")

# ---- Part (b): Calibration ----
cat("\n=== Part (b): Calibration ===\n")
cat("QRISK3 provides calibration plots in the original publication.\n")
cat("In the validation cohort:\n")
cat("  - The calibration was generally good, with predicted risks\n")
cat("    closely matching observed event rates across deciles.\n")
cat("  - Some overestimation of risk was observed at higher risk\n")
cat("    levels, particularly in older age groups.\n")
cat("  - External validations in different populations (e.g., outside\n")
cat("    the UK) have shown variable calibration, often with\n")
cat("    overestimation in lower-risk populations.\n")

# ---- Part (c): Decision curve analysis ----
cat("\n=== Part (c): Decision Curve Analysis ===\n")
cat("The original QRISK3 paper does not include decision curve analysis.\n\n")
cat("Clinically relevant threshold range for statin initiation:\n")
cat("  - UK NICE guidelines: 10% 10-year CVD risk threshold\n")
cat("  - US ACC/AHA guidelines: 7.5% threshold\n")
cat("  - A decision curve would be most informative in the 5-20%\n")
cat("    threshold range, where the decision to start statins is\n")
cat("    most uncertain.\n")
cat("  - Below 5%, most clinicians would not start statins\n")
cat("  - Above 20%, most clinicians would start statins regardless\n")

# ---- Part (d): External validation ----
cat("\n=== Part (d): External Validation ===\n")
cat("QRISK3 has been validated in multiple populations:\n")
cat("  - Internal-external: UK CPRD data (separate time period)\n")
cat("  - Geographic: Various European populations\n")
cat("  - Ethnic subgroups: South Asian, Black, Chinese populations\n")
cat("  - Temporal: Validated across different calendar periods\n\n")
cat("Key findings from external validations:\n")
cat("  - Discrimination generally remains good (C > 0.80)\n")
cat("  - Calibration can deteriorate in non-UK populations\n")
cat("  - May overestimate risk in populations with lower baseline\n")
cat("    cardiovascular event rates\n")

# ---- Part (e): Recommendation for implementation ----
cat("\n=== Part (e): Recommendation ===\n")
cat("Recommendation depends on the clinical setting:\n\n")
cat("FOR a UK primary care setting:\n")
cat("  - YES, recommend implementation. QRISK3 was developed and\n")
cat("    validated in UK primary care, has excellent discrimination,\n")
cat("    good calibration, and is already integrated into UK\n")
cat("    clinical guidelines.\n\n")
cat("FOR a non-UK setting:\n")
cat("  - CONDITIONAL recommendation. Would first require:\n")
cat("    1. External validation in the local population\n")
cat("    2. Assessment of calibration (likely needs recalibration)\n")
cat("    3. Decision curve analysis at locally relevant thresholds\n")
cat("    4. Comparison with locally developed risk scores\n")
cat("  - If calibration is poor, logistic recalibration should be\n")
cat("    considered before implementation.\n")

Show Python solution

# Chapter 11, Exercise 4: Complete Evaluation of a Published Model
# Example: QRISK3 Cardiovascular Risk Prediction Model
#
# This exercise is primarily a literature review exercise. Below we provide
# a structured answer based on published validation data for QRISK3.

# ---- Part (a): C-statistic with confidence interval ----
print("=== Part (a): C-statistic ===")
print("QRISK3 (Hippisley-Cox et al., 2017, BMJ) was developed to predict")
print("10-year cardiovascular disease risk.")
print()
print("Development cohort:")
print("  C-statistic (women): 0.880 (95% CI: 0.878-0.882)")
print("  C-statistic (men):   0.858 (95% CI: 0.856-0.860)")
print()
print("Validation cohort (separate 25% held out):")
print("  C-statistic (women): 0.879 (95% CI: 0.876-0.882)")
print("  C-statistic (men):   0.858 (95% CI: 0.855-0.861)")
print("The discrimination is excellent and stable between development")
print("and validation sets.")

# ---- Part (b): Calibration ----
print("\n=== Part (b): Calibration ===")
print("QRISK3 provides calibration plots in the original publication.")
print("In the validation cohort:")
print("  - Calibration was generally good, with predicted risks closely")
print("    matching observed event rates across deciles.")
print("  - Some overestimation at higher risk levels, particularly in")
print("    older age groups.")
print("  - External validations outside the UK have shown variable")
print("    calibration, often with overestimation in lower-risk populations.")

# ---- Part (c): Decision curve analysis ----
print("\n=== Part (c): Decision Curve Analysis ===")
print("The original QRISK3 paper does not include decision curve analysis.")
print()
print("Clinically relevant threshold range for statin initiation:")
print("  - UK NICE guidelines: 10% 10-year CVD risk threshold")
print("  - US ACC/AHA guidelines: 7.5% threshold")
print("  - A decision curve would be most informative in the 5-20%")
print("    threshold range, where the statin decision is most uncertain.")
print("  - Below 5%, most clinicians would not start statins")
print("  - Above 20%, most clinicians would start statins regardless")

# ---- Part (d): External validation ----
print("\n=== Part (d): External Validation ===")
print("QRISK3 has been validated in multiple populations:")
print("  - Internal-external: UK CPRD data (separate time period)")
print("  - Geographic: Various European populations")
print("  - Ethnic subgroups: South Asian, Black, Chinese populations")
print("  - Temporal: Validated across different calendar periods")
print()
print("Key findings from external validations:")
print("  - Discrimination generally remains good (C > 0.80)")
print("  - Calibration can deteriorate in non-UK populations")
print("  - May overestimate risk in populations with lower baseline")
print("    cardiovascular event rates")

# ---- Part (e): Recommendation ----
print("\n=== Part (e): Recommendation ===")
print("Recommendation depends on the clinical setting:")
print()
print("FOR a UK primary care setting:")
print("  YES, recommend implementation. QRISK3 was developed and")
print("  validated in UK primary care, has excellent discrimination,")
print("  good calibration, and is integrated into UK clinical guidelines.")
print()
print("FOR a non-UK setting:")
print("  CONDITIONAL recommendation. Would first require:")
print("    1. External validation in the local population")
print("    2. Assessment of calibration (likely needs recalibration)")
print("    3. Decision curve analysis at locally relevant thresholds")
print("    4. Comparison with locally developed risk scores")
print("  If calibration is poor, logistic recalibration should be")
print("  considered before implementation.")

D.9 Chapter 13: Bayesian Inference

D.9.1 Exercise 1

Show R solution

# Chapter 13, Exercise 1: Diagnostic Test Updating
# Sequential Bayesian updating with mammogram and ultrasound

# ---- Part (a): Posterior probability after positive mammogram ----
cat("=== Part (a): Positive Mammogram ===\n")

sens_mammo <- 0.90   # Sensitivity of mammogram
spec_mammo <- 0.92   # Specificity of mammogram
prevalence <- 0.02   # Prior probability (prevalence in women aged 50-59)

# Apply Bayes' theorem:
# P(cancer | +) = P(+ | cancer) * P(cancer) / P(+)
# P(+) = P(+ | cancer)*P(cancer) + P(+ | no cancer)*P(no cancer)

p_pos <- sens_mammo * prevalence + (1 - spec_mammo) * (1 - prevalence)
post_mammo <- (sens_mammo * prevalence) / p_pos

cat("Prior (prevalence): ", prevalence, "\n")
cat("P(positive test):   ", round(p_pos, 4), "\n")
cat("P(cancer | positive mammogram):", round(post_mammo, 4), "\n")
cat("\nDespite 90% sensitivity and 92% specificity, the posterior\n")
cat("probability is only about", round(post_mammo * 100, 1), "% because\n")
cat("the prior probability (prevalence) is very low at 2%.\n")

# ---- Part (b): Sequential update with positive ultrasound ----
cat("\n=== Part (b): Sequential Update with Positive Ultrasound ===\n")

sens_us <- 0.95   # Sensitivity of ultrasound
spec_us <- 0.85   # Specificity of ultrasound

# The posterior from (a) becomes the new prior
prior_us <- post_mammo

p_pos_us <- sens_us * prior_us + (1 - spec_us) * (1 - prior_us)
post_us <- (sens_us * prior_us) / p_pos_us

cat("Prior (posterior from mammogram):", round(prior_us, 4), "\n")
cat("P(cancer | positive mammogram AND positive ultrasound):",
    round(post_us, 4), "\n")
cat("\nAfter two positive tests, the posterior probability rises to about",
    round(post_us * 100, 1), "%.\n")

# ---- Part (c): What does this illustrate? ----
cat("\n=== Part (c): What Sequential Updating Illustrates ===\n")
cat("1. NATURAL SEQUENTIAL UPDATING: In the Bayesian framework,\n")
cat("   today's posterior becomes tomorrow's prior. Each new piece\n")
cat("   of evidence updates our belief incrementally.\n\n")
cat("2. THE PRIOR MATTERS: Starting from a low prevalence (2%),\n")
cat("   even a sensitive test only raises the probability to ~19%.\n")
cat("   A second positive test raises it further to ~", round(post_us * 100), "%.\n\n")
cat("3. CLINICAL REASONING IS BAYESIAN: Clinicians intuitively do\n")
cat("   this -- they order confirmatory tests precisely because a\n")
cat("   single screening test in a low-prevalence population is\n")
cat("   insufficient. The Bayesian framework formalises this logic.\n\n")
cat("4. COHERENCE: The Bayesian approach naturally handles sequential\n")
cat("   evidence without the multiple-testing corrections that\n")
cat("   frequentist methods would require.\n")

# ---- Visualise the updating process ----
stages <- c("Prior\n(prevalence)", "After\nmammogram (+)", "After\nultrasound (+)")
probs <- c(prevalence, post_mammo, post_us)

barplot(probs, names.arg = stages, col = c("steelblue", "goldenrod", "firebrick"),
        main = "Sequential Bayesian Updating\nBreast Cancer Diagnosis",
        ylab = "P(Cancer)", ylim = c(0, max(probs) * 1.2),
        border = NA)
text(x = c(0.7, 1.9, 3.1), y = probs + 0.02,
     labels = paste0(round(probs * 100, 1), "%"), cex = 0.9)

Show Python solution

# Chapter 13, Exercise 1: Diagnostic Test Updating
# Sequential Bayesian updating with mammogram and ultrasound

import matplotlib.pyplot as plt

# ---- Part (a): Posterior probability after positive mammogram ----
print("=== Part (a): Positive Mammogram ===")

sens_mammo = 0.90   # Sensitivity of mammogram
spec_mammo = 0.92   # Specificity of mammogram
prevalence = 0.02   # Prior probability (prevalence in women aged 50-59)

# Apply Bayes' theorem:
# P(cancer | +) = P(+ | cancer) * P(cancer) / P(+)
p_pos = sens_mammo * prevalence + (1 - spec_mammo) * (1 - prevalence)
post_mammo = (sens_mammo * prevalence) / p_pos

print(f"Prior (prevalence):  {prevalence}")
print(f"P(positive test):    {p_pos:.4f}")
print(f"P(cancer | positive mammogram): {post_mammo:.4f}")
print(f"\nDespite 90% sensitivity and 92% specificity, the posterior")
print(f"probability is only about {post_mammo*100:.1f}% because the")
print(f"prior probability (prevalence) is very low at 2%.")

# ---- Part (b): Sequential update with positive ultrasound ----
print("\n=== Part (b): Sequential Update with Positive Ultrasound ===")

sens_us = 0.95   # Sensitivity of ultrasound
spec_us = 0.85   # Specificity of ultrasound

# The posterior from (a) becomes the new prior
prior_us = post_mammo
p_pos_us = sens_us * prior_us + (1 - spec_us) * (1 - prior_us)
post_us = (sens_us * prior_us) / p_pos_us

print(f"Prior (posterior from mammogram): {prior_us:.4f}")
print(f"P(cancer | pos mammogram AND pos ultrasound): {post_us:.4f}")
print(f"\nAfter two positive tests, the posterior probability rises to about {post_us*100:.1f}%.")

# ---- Part (c): What sequential updating illustrates ----
print("\n=== Part (c): What Sequential Updating Illustrates ===")
print("1. NATURAL SEQUENTIAL UPDATING: In the Bayesian framework,")
print("   today's posterior becomes tomorrow's prior. Each new piece")
print("   of evidence updates our belief incrementally.")
print()
print("2. THE PRIOR MATTERS: Starting from a low prevalence (2%),")
print(f"   even a sensitive test only raises the probability to ~{post_mammo*100:.0f}%.")
print(f"   A second positive test raises it further to ~{post_us*100:.0f}%.")
print()
print("3. CLINICAL REASONING IS BAYESIAN: Clinicians intuitively do")
print("   this -- they order confirmatory tests precisely because a")
print("   single screening test in a low-prevalence population is")
print("   insufficient. The Bayesian framework formalises this logic.")
print()
print("4. COHERENCE: The Bayesian approach naturally handles sequential")
print("   evidence without the multiple-testing corrections that")
print("   frequentist methods would require.")

# ---- Visualise the updating process ----
stages = ["Prior\n(prevalence)", "After\nmammogram (+)", "After\nultrasound (+)"]
probs = [prevalence, post_mammo, post_us]
colors = ["steelblue", "goldenrod", "firebrick"]

fig, ax = plt.subplots(figsize=(7, 5))
bars = ax.bar(stages, probs, color=colors, edgecolor="none", width=0.6)
for bar, p in zip(bars, probs):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.01,
            f"{p*100:.1f}%", ha="center", fontsize=11)
ax.set_ylabel("P(Cancer)")
ax.set_title("Sequential Bayesian Updating\nBreast Cancer Diagnosis")
ax.set_ylim(0, max(probs) * 1.3)
plt.tight_layout()
plt.savefig("ch13_ex1_updating.png", dpi=150)
plt.show()

D.9.2 Exercise 2

Show R solution

# Chapter 13, Exercise 2: Prior Sensitivity Analysis
# Phase II antibiotic trial: 21/30 patients achieve clinical cure

library(tidyverse)

y <- 21  # successes
n <- 30  # total patients

# Define three priors
priors <- list(
  list(name = "Beta(1,1) - Uninformative", a = 1, b = 1),
  list(name = "Beta(5,5) - Weakly informative", a = 5, b = 5),
  list(name = "Beta(20,10) - Informative", a = 20, b = 10)
)

theta <- seq(0, 1, length.out = 500)

# ---- Part (a): Posterior mean and 95% credible interval ----
cat("=== Part (a): Posterior Summaries ===\n\n")

plot_df <- tibble()

for (p in priors) {
  a_post <- p$a + y
  b_post <- p$b + n - y
  post_mean <- a_post / (a_post + b_post)
  ci <- qbeta(c(0.025, 0.975), a_post, b_post)

  cat(sprintf("Prior: %s\n", p$name))
  cat(sprintf("  Prior mean: %.3f\n", p$a / (p$a + p$b)))
  cat(sprintf("  Posterior: Beta(%d, %d)\n", a_post, b_post))
  cat(sprintf("  Posterior mean: %.3f\n", post_mean))
  cat(sprintf("  95%% credible interval: [%.3f, %.3f]\n\n", ci[1], ci[2]))

  plot_df <- bind_rows(plot_df,
    tibble(
      theta = theta,
      Density = dbeta(theta, a_post, b_post),
      Prior = p$name
    )
  )
}

# ---- Part (b): Plot all three posteriors ----
cat("=== Part (b): Posterior Plot ===\n")

ggplot(plot_df, aes(x = theta, y = Density, colour = Prior)) +
  geom_line(linewidth = 1.2) +
  geom_vline(xintercept = y / n, linetype = "dashed", colour = "grey40") +
  annotate("text", x = y / n + 0.02, y = max(plot_df$Density) * 0.9,
           label = paste0("MLE = ", round(y / n, 3)),
           hjust = 0, size = 3.5) +
  labs(x = expression(theta ~ "(cure rate)"),
       y = "Density",
       title = "Posterior Distributions Under Different Priors",
       subtitle = "21/30 patients cured in Phase II trial") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")

# ---- Part (c): Which prior has most influence? ----
cat("=== Part (c): Which Prior Has Most Influence? ===\n\n")
cat("The Beta(5,5) prior has the MOST influence on the posterior.\n\n")
cat("Why? The effective sample size of a Beta(a,b) prior is a + b.\n")
cat("  - Beta(1,1):   effective n = 2  (trivial influence)\n")
cat("  - Beta(5,5):   effective n = 10 (modest influence)\n")
cat("  - Beta(20,10): effective n = 30 (substantial influence)\n\n")
cat("However, while Beta(20,10) has the largest effective sample size,\n")
cat("its prior mean (0.667) is close to the data (21/30 = 0.700),\n")
cat("so the prior and data 'agree', and the posterior is not pulled\n")
cat("far from the MLE.\n\n")
cat("Beta(5,5) has a prior mean of 0.500, which is DIFFERENT from\n")
cat("the data. It pulls the posterior mean toward 0.5, producing the\n")
cat("most visible shift from the MLE. With n=30 observed data, even\n")
cat("an effective sample size of 10 produces noticeable pull when\n")
cat("the prior and data disagree.\n\n")
cat("Key insight: prior influence depends on BOTH the effective sample\n")
cat("size AND how much the prior disagrees with the data.\n")

Show Python solution

# Chapter 13, Exercise 2: Prior Sensitivity Analysis
# Phase II antibiotic trial: 21/30 patients achieve clinical cure

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

y, n = 21, 30  # successes and total
theta = np.linspace(0, 1, 500)

# Define three priors
priors = [
    ("Beta(1,1) - Uninformative", 1, 1),
    ("Beta(5,5) - Weakly informative", 5, 5),
    ("Beta(20,10) - Informative", 20, 10),
]

# ---- Part (a): Posterior mean and 95% credible interval ----
print("=== Part (a): Posterior Summaries ===\n")

fig, ax = plt.subplots(figsize=(8, 5))

for name, a0, b0 in priors:
    a_post = a0 + y
    b_post = b0 + n - y
    post_mean = a_post / (a_post + b_post)
    ci = beta.ppf([0.025, 0.975], a_post, b_post)

    print(f"Prior: {name}")
    print(f"  Prior mean: {a0 / (a0 + b0):.3f}")
    print(f"  Posterior: Beta({a_post}, {b_post})")
    print(f"  Posterior mean: {post_mean:.3f}")
    print(f"  95% credible interval: [{ci[0]:.3f}, {ci[1]:.3f}]")
    print()

    # ---- Part (b): Plot all three posteriors ----
    ax.plot(theta, beta.pdf(theta, a_post, b_post), label=name, linewidth=1.5)

ax.axvline(y / n, linestyle="--", color="grey", linewidth=1)
ax.text(y / n + 0.02, ax.get_ylim()[1] * 0.1,
        f"MLE = {y/n:.3f}", fontsize=9, color="grey")
ax.set_xlabel(r"$\theta$ (cure rate)")
ax.set_ylabel("Density")
ax.set_title("Posterior Distributions Under Different Priors\n"
             "21/30 patients cured in Phase II trial")
ax.legend(fontsize=9)
plt.tight_layout()
plt.savefig("ch13_ex2_posteriors.png", dpi=150)
plt.show()

# ---- Part (c): Which prior has the most influence? ----
print("=== Part (c): Which Prior Has Most Influence? ===\n")
print("The Beta(5,5) prior has the MOST influence on the posterior.")
print()
print("Why? The effective sample size of a Beta(a,b) prior is a + b.")
print("  - Beta(1,1):   effective n = 2  (trivial influence)")
print("  - Beta(5,5):   effective n = 10 (modest influence)")
print("  - Beta(20,10): effective n = 30 (substantial influence)")
print()
print("However, while Beta(20,10) has the largest effective sample size,")
print("its prior mean (0.667) is close to the data (21/30 = 0.700),")
print("so the prior and data 'agree', and the posterior is not pulled")
print("far from the MLE.")
print()
print("Beta(5,5) has a prior mean of 0.500, which DISAGREES with the")
print("data. It pulls the posterior mean toward 0.5, producing the most")
print("visible shift from the MLE. With n=30 observed data, even an")
print("effective sample size of 10 produces noticeable pull when the")
print("prior and data disagree.")
print()
print("Key insight: prior influence depends on BOTH the effective sample")
print("size AND how much the prior disagrees with the data.")

D.9.3 Exercise 3

Show R solution

# Chapter 13, Exercise 3: Interpreting MCMC Diagnostics
# Conceptual exercise about convergence problems

# Given diagnostics for a Bayesian logistic regression treatment effect:
#   - R-hat: 1.08
#   - Bulk ESS: 150
#   - Tail ESS: 85
#   - Trace plot: two chains exploring different regions

# ---- Part (a): Is there evidence of convergence problems? ----
cat("=== Part (a): Evidence of Convergence Problems ===\n\n")

cat("YES, there is clear evidence of convergence problems.\n")
cat("ALL THREE diagnostics are concerning:\n\n")

cat("1. R-hat = 1.08 (threshold: < 1.01)\n")
cat("   - R-hat compares between-chain to within-chain variance.\n")
cat("   - A value of 1.08 far exceeds the 1.01 threshold.\n")
cat("   - This means chains are NOT sampling from the same distribution.\n")
cat("   - The trace plot confirms this: chains explore different regions.\n\n")

cat("2. Bulk ESS = 150 (recommended: > 400 per chain, i.e., > 1600 total)\n")
cat("   - Bulk ESS estimates the effective number of independent samples\n")
cat("     for estimating posterior means and medians.\n")
cat("   - 150 is far too low for reliable posterior summaries.\n")
cat("   - High autocorrelation or poor mixing causes low ESS.\n\n")

cat("3. Tail ESS = 85 (recommended: > 400)\n")
cat("   - Tail ESS measures the effective samples for tail quantiles\n")
cat("     (e.g., 2.5th and 97.5th percentiles for credible intervals).\n")
cat("   - 85 is dangerously low -- credible intervals will be unreliable.\n")
cat("   - Tail ESS is almost always lower than bulk ESS, so if bulk is\n")
cat("     already too low, tail ESS will be even worse.\n\n")

cat("4. Trace plot: chains exploring different regions\n")
cat("   - This is the most visually obvious sign of non-convergence.\n")
cat("   - Well-behaved chains should overlap ('hairy caterpillar').\n")
cat("   - Chains stuck in different regions indicate multimodality or\n")
cat("     insufficient sampling to traverse the parameter space.\n")

# ---- Part (b): Steps to fix these problems ----
cat("\n=== Part (b): Steps to Fix These Problems ===\n\n")

cat("1. INCREASE WARMUP AND ITERATIONS:\n")
cat("   - Current chains may not have had enough warmup to find\n")
cat("     the typical set. Try warmup = 2000-4000, iter = 4000-8000.\n\n")

cat("2. INCREASE adapt_delta (Stan/brms):\n")
cat("   - Set control = list(adapt_delta = 0.95 or 0.99).\n")
cat("   - This makes the sampler take smaller steps, reducing\n")
cat("     divergent transitions at the cost of slower sampling.\n\n")

cat("3. CHECK THE MODEL SPECIFICATION:\n")
cat("   - Priors may be too vague or conflicting with the likelihood.\n")
cat("   - Use prior predictive checks to ensure priors are reasonable.\n")
cat("   - Consider stronger (more informative) priors if justified.\n\n")

cat("4. REPARAMETERISE THE MODEL:\n")
cat("   - Some parameterisations create difficult posterior geometries.\n")
cat("   - For hierarchical models, use non-centred parameterisation.\n")
cat("   - Standardise predictors to improve sampler efficiency.\n\n")

cat("5. SIMPLIFY THE MODEL:\n")
cat("   - If the model is too complex for the data, consider\n")
cat("     removing weak predictors or reducing random effects.\n\n")

cat("6. RUN MORE CHAINS:\n")
cat("   - Running 6-8 chains (instead of 4) with different\n")
cat("     starting values helps diagnose multimodality.\n")

# ---- Part (c): Trust the posterior summaries? ----
cat("\n=== Part (c): Can We Trust the Posterior Summaries? ===\n\n")

cat("NO, absolutely not. The posterior summaries should NOT be trusted.\n\n")

cat("Reasons:\n")
cat("1. With R-hat = 1.08, the chains have not converged to the same\n")
cat("   distribution. Any summary (mean, median, CI) is averaging\n")
cat("   across different distributions -- the result is meaningless.\n\n")

cat("2. With bulk ESS = 150, the posterior mean is estimated from\n")
cat("   only ~150 effective independent samples. The Monte Carlo\n")
cat("   error is too large for the estimates to be precise.\n\n")

cat("3. With tail ESS = 85, the 95% credible interval is based on\n")
cat("   only ~85 effective tail samples. The interval bounds could\n")
cat("   shift substantially with additional sampling.\n\n")

cat("4. Reporting results from a non-converged model is a serious\n")
cat("   methodological error. The correct action is to fix the\n")
cat("   convergence issues BEFORE interpreting any results.\n\n")

cat("Rule: If R-hat > 1.01 or ESS < 400, do NOT interpret results.\n")
cat("Fix the model first.\n")

Show Python solution

# Chapter 13, Exercise 3: Interpreting MCMC Diagnostics
# Conceptual exercise about convergence problems

# Given diagnostics for a Bayesian logistic regression treatment effect:
#   - R-hat: 1.08
#   - Bulk ESS: 150
#   - Tail ESS: 85
#   - Trace plot: two chains exploring different regions

# ---- Part (a): Is there evidence of convergence problems? ----
print("=== Part (a): Evidence of Convergence Problems ===\n")

print("YES, there is clear evidence of convergence problems.")
print("ALL THREE diagnostics are concerning:\n")

print("1. R-hat = 1.08 (threshold: < 1.01)")
print("   - R-hat compares between-chain to within-chain variance.")
print("   - A value of 1.08 far exceeds the 1.01 threshold.")
print("   - This means chains are NOT sampling from the same distribution.")
print("   - The trace plot confirms this: chains explore different regions.\n")

print("2. Bulk ESS = 150 (recommended: > 400 per chain, i.e., > 1600 total)")
print("   - Bulk ESS estimates the effective number of independent samples")
print("     for estimating posterior means and medians.")
print("   - 150 is far too low for reliable posterior summaries.")
print("   - High autocorrelation or poor mixing causes low ESS.\n")

print("3. Tail ESS = 85 (recommended: > 400)")
print("   - Tail ESS measures effective samples for tail quantiles")
print("     (2.5th and 97.5th percentiles for credible intervals).")
print("   - 85 is dangerously low -- credible intervals will be unreliable.")
print("   - Tail ESS is almost always lower than bulk ESS.\n")

print("4. Trace plot: chains exploring different regions")
print("   - Well-behaved chains should overlap ('hairy caterpillar').")
print("   - Chains in different regions indicate multimodality or")
print("     insufficient sampling.\n")

# ---- Part (b): Steps to fix ----
print("=== Part (b): Steps to Fix These Problems ===\n")

print("1. INCREASE WARMUP AND ITERATIONS:")
print("   Try warmup=2000-4000, draws=4000-8000.\n")

print("2. INCREASE target_accept (PyMC) / adapt_delta (Stan):")
print("   Set target_accept=0.95 or 0.99 to reduce divergences.\n")

print("3. CHECK MODEL SPECIFICATION:")
print("   - Are priors too vague or conflicting with the likelihood?")
print("   - Run prior predictive checks.")
print("   - Consider stronger priors if justified.\n")

print("4. REPARAMETERISE THE MODEL:")
print("   - Use non-centred parameterisation for hierarchical models.")
print("   - Standardise predictors.\n")

print("5. SIMPLIFY THE MODEL:")
print("   - Remove weak predictors or reduce random effects.\n")

print("6. RUN MORE CHAINS:")
print("   - 6-8 chains with different starting values to diagnose multimodality.\n")

# ---- Part (c): Trust the posterior summaries? ----
print("=== Part (c): Can We Trust the Posterior Summaries? ===\n")

print("NO, absolutely not. The posterior summaries should NOT be trusted.\n")

print("Reasons:")
print("1. With R-hat = 1.08, chains have not converged to the same")
print("   distribution. Any summary averages across different")
print("   distributions -- the result is meaningless.\n")

print("2. With bulk ESS = 150, the posterior mean is estimated from")
print("   only ~150 effective samples. Monte Carlo error is too large.\n")

print("3. With tail ESS = 85, the 95% credible interval is based on")
print("   ~85 effective tail samples. Bounds could shift substantially.\n")

print("4. Reporting results from a non-converged model is a serious")
print("   methodological error.\n")

print("Rule: If R-hat > 1.01 or ESS < 400, do NOT interpret results.")
print("Fix the model first.")

D.10 Chapter 14: Applied Bayesian Modelling

D.10.1 Exercise 1

Show R solution

# Chapter 14, Exercise 1: Bayesian Logistic Regression for ICU Mortality
# 500 ICU admissions with age, APACHE II, ventilation status, 28-day mortality

library(brms)
library(tidyverse)
library(bayesplot)

# ---- (a) Simulate dataset ----
set.seed(101)
n <- 500

icu_data <- tibble(
  age = round(rnorm(n, 62, 15)),
  apache = round(rnorm(n, 18, 7)),
  ventilated = rbinom(n, 1, 0.35)
)

# True model: mortality increases with age, APACHE, and ventilation
lp <- -4.5 + 0.02 * icu_data$age +
  0.12 * icu_data$apache +
  0.8 * icu_data$ventilated

icu_data$mortality <- rbinom(n, 1, plogis(lp))

cat("Dataset summary:\n")
cat("  N:", n, "\n")
cat("  Mortality rate:", mean(icu_data$mortality), "\n")
cat("  Mean age:", round(mean(icu_data$age), 1), "\n")
cat("  Mean APACHE:", round(mean(icu_data$apache), 1), "\n")
cat("  % Ventilated:", round(mean(icu_data$ventilated) * 100, 1), "%\n")

# ---- (b) Fit Bayesian logistic regression ----
fit_icu <- brm(
  mortality ~ age + apache + ventilated,
  data = icu_data,
  family = bernoulli(link = "logit"),
  prior = c(
    prior(normal(0, 2.5), class = "b"),       # weakly informative on log-odds
    prior(normal(0, 5), class = "Intercept")
  ),
  chains = 4,
  iter = 2000,
  warmup = 1000,
  seed = 42,
  silent = 2,
  refresh = 0
)

cat("\nModel summary:\n")
summary(fit_icu)

# ---- (c) Prior predictive check ----
cat("\n=== Part (c): Prior Predictive Check ===\n")

# Fit with priors only (no data influence)
fit_prior <- brm(
  mortality ~ age + apache + ventilated,
  data = icu_data,
  family = bernoulli(link = "logit"),
  prior = c(
    prior(normal(0, 2.5), class = "b"),
    prior(normal(0, 5), class = "Intercept")
  ),
  sample_prior = "only",
  chains = 4,
  iter = 2000,
  warmup = 1000,
  seed = 42,
  silent = 2,
  refresh = 0
)

# Simulate from prior predictive
pp_prior <- posterior_predict(fit_prior)
prior_mort_rates <- rowMeans(pp_prior)

cat("Prior predictive mortality rates:\n")
cat("  Mean:", round(mean(prior_mort_rates), 3), "\n")
cat("  SD:", round(sd(prior_mort_rates), 3), "\n")
cat("  Range:", round(min(prior_mort_rates), 3), "to",
    round(max(prior_mort_rates), 3), "\n")
cat("The priors allow mortality rates from near 0 to near 1,\n")
cat("which covers all plausible ICU mortality rates. The priors\n")
cat("are appropriately weakly informative.\n")

# ---- (d) Posterior odds ratios with 95% credible intervals ----
cat("\n=== Part (d): Posterior Odds Ratios ===\n")

post <- as_draws_df(fit_icu)

or_age <- exp(post$b_age)
or_apache <- exp(post$b_apache)
or_vent <- exp(post$b_ventilated)

cat(sprintf("OR Age:        %.3f [%.3f, %.3f]\n",
            mean(or_age), quantile(or_age, 0.025), quantile(or_age, 0.975)))
cat(sprintf("OR APACHE:     %.3f [%.3f, %.3f]\n",
            mean(or_apache), quantile(or_apache, 0.025), quantile(or_apache, 0.975)))
cat(sprintf("OR Ventilated: %.3f [%.3f, %.3f]\n",
            mean(or_vent), quantile(or_vent, 0.025), quantile(or_vent, 0.975)))

# Plot posterior OR distributions
mcmc_areas(fit_icu, pars = c("b_age", "b_apache", "b_ventilated"),
           prob = 0.95, prob_outer = 0.99,
           transformations = exp) +
  geom_vline(xintercept = 1, linetype = "dashed", colour = "grey40") +
  labs(title = "Posterior Odds Ratios (95% CrI)",
       x = "Odds Ratio") +
  theme_minimal(base_size = 13)

# ---- (e) P(OR_APACHE > 1.10 | data) ----
cat("\n=== Part (e): P(OR_APACHE > 1.10 | data) ===\n")

prob_or_gt_110 <- mean(or_apache > 1.10)
cat(sprintf("P(OR_APACHE > 1.10 | data) = %.3f\n", prob_or_gt_110))
cat("\nInterpretation: There is a", round(prob_or_gt_110 * 100, 1),
    "% posterior probability\n")
cat("that each unit increase in APACHE score increases the odds of\n")
cat("28-day mortality by more than 10%. This is a direct probability\n")
cat("statement about the parameter -- something only Bayesian inference\n")
cat("can provide.\n")

Show Python solution

# Chapter 14, Exercise 1: Bayesian Logistic Regression for ICU Mortality
# 500 ICU admissions with age, APACHE II, ventilation status, 28-day mortality

import numpy as np
import pandas as pd
import pymc as pm
import arviz as az
from scipy.special import expit

# ---- (a) Simulate dataset ----
np.random.seed(101)
n = 500

icu_data = pd.DataFrame({
    'age': np.round(np.random.normal(62, 15, n)),
    'apache': np.round(np.random.normal(18, 7, n)),
    'ventilated': np.random.binomial(1, 0.35, n)
})

# True model
lp = (-4.5 + 0.02 * icu_data['age'] +
      0.12 * icu_data['apache'] +
      0.8 * icu_data['ventilated'])
icu_data['mortality'] = np.random.binomial(1, expit(lp))

print("Dataset summary:")
print(f"  N: {n}")
print(f"  Mortality rate: {icu_data['mortality'].mean():.3f}")
print(f"  Mean age: {icu_data['age'].mean():.1f}")
print(f"  Mean APACHE: {icu_data['apache'].mean():.1f}")
print(f"  % Ventilated: {icu_data['ventilated'].mean()*100:.1f}%")

# ---- (b) Fit Bayesian logistic regression ----
with pm.Model() as icu_model:
    # Weakly informative priors
    intercept = pm.Normal("Intercept", mu=0, sigma=5)
    b_age = pm.Normal("b_age", mu=0, sigma=2.5)
    b_apache = pm.Normal("b_apache", mu=0, sigma=2.5)
    b_vent = pm.Normal("b_ventilated", mu=0, sigma=2.5)

    # Linear predictor
    logit_p = (intercept +
               b_age * icu_data['age'].values +
               b_apache * icu_data['apache'].values +
               b_vent * icu_data['ventilated'].values)

    # Likelihood
    y_obs = pm.Bernoulli("mortality", logit_p=logit_p,
                          observed=icu_data['mortality'].values)

    # Sample
    trace = pm.sample(1000, tune=1000, chains=4, random_seed=42,
                       progressbar=True)

print("\nModel summary:")
print(az.summary(trace, var_names=["Intercept", "b_age", "b_apache",
                                     "b_ventilated"]))

# ---- (c) Prior predictive check ----
print("\n=== Part (c): Prior Predictive Check ===")

# Simulate mortality rates from the priors
np.random.seed(42)
n_sim = 2000
intercepts = np.random.normal(0, 5, n_sim)
b_ages = np.random.normal(0, 2.5, n_sim)
b_apaches = np.random.normal(0, 2.5, n_sim)
b_vents = np.random.normal(0, 2.5, n_sim)

# For a "typical" patient: age=62, apache=18, ventilated=0
sim_lp = intercepts + b_ages * 62 + b_apaches * 18 + b_vents * 0
sim_mort = expit(sim_lp)

print(f"Prior predictive mortality rates (typical patient):")
print(f"  Mean: {sim_mort.mean():.3f}")
print(f"  SD: {sim_mort.std():.3f}")
print(f"  Range: {sim_mort.min():.3f} to {sim_mort.max():.3f}")
print("The priors allow mortality rates from near 0 to near 1,")
print("covering all plausible ICU mortality rates. The priors are")
print("appropriately weakly informative.")

# ---- (d) Posterior odds ratios with 95% credible intervals ----
print("\n=== Part (d): Posterior Odds Ratios ===")

posterior = az.extract(trace)

for var_name, label in [("b_age", "Age"), ("b_apache", "APACHE"),
                          ("b_ventilated", "Ventilated")]:
    or_vals = np.exp(posterior[var_name].values)
    mean_or = or_vals.mean()
    ci_low = np.percentile(or_vals, 2.5)
    ci_high = np.percentile(or_vals, 97.5)
    print(f"OR {label:>12s}: {mean_or:.3f} [{ci_low:.3f}, {ci_high:.3f}]")

# ---- (e) P(OR_APACHE > 1.10 | data) ----
print("\n=== Part (e): P(OR_APACHE > 1.10 | data) ===")

or_apache = np.exp(posterior["b_apache"].values)
prob_gt_110 = (or_apache > 1.10).mean()

print(f"P(OR_APACHE > 1.10 | data) = {prob_gt_110:.3f}")
print(f"\nInterpretation: There is a {prob_gt_110*100:.1f}% posterior probability")
print("that each unit increase in APACHE score increases the odds of")
print("28-day mortality by more than 10%. This is a direct probability")
print("statement about the parameter -- something only Bayesian inference")
print("can provide.")

D.10.2 Exercise 2

Show R solution

# Chapter 14, Exercise 2: Hierarchical Model for Multi-Site Drug Trial
# 12 hospitals, LDL cholesterol change, new statin vs standard care

library(brms)
library(tidyverse)

# ---- (a) Simulate data ----
set.seed(789)
n_hospitals <- 12
patients_per_hospital <- c(20, 25, 30, 40, 50, 60, 70, 80, 100, 120, 150, 200)
true_grand_effect <- -25  # mg/dL reduction in LDL
tau_true <- 5             # between-hospital SD

hospital_effects <- rnorm(n_hospitals, 0, tau_true)

trial_data <- map2_dfr(1:n_hospitals, patients_per_hospital, function(j, nj) {
  tibble(
    hospital = factor(j),
    treatment = rep(0:1, length.out = nj),
    ldl_change = (true_grand_effect + hospital_effects[j]) * treatment +
      rnorm(nj, 0, 20)  # residual SD = 20 mg/dL
  )
})

cat("Data summary:\n")
cat("  Total patients:", nrow(trial_data), "\n")
cat("  Hospitals:", n_hospitals, "\n")
cat("  Patients per hospital:", paste(patients_per_hospital, collapse = ", "), "\n")
cat("  True grand effect:", true_grand_effect, "mg/dL\n")
cat("  True between-hospital SD:", tau_true, "mg/dL\n")

# ---- (b) Fit Bayesian hierarchical model ----
fit_hier <- brm(
  ldl_change ~ treatment + (treatment | hospital),
  data = trial_data,
  prior = c(
    prior(normal(0, 30), class = "b"),
    prior(normal(0, 30), class = "Intercept"),
    prior(cauchy(0, 5), class = "sd"),
    prior(exponential(0.05), class = "sigma")
  ),
  chains = 4,
  iter = 2000,
  warmup = 1000,
  seed = 42,
  silent = 2,
  refresh = 0,
  control = list(adapt_delta = 0.95)
)

cat("\nHierarchical model summary:\n")
summary(fit_hier)

# Grand mean treatment effect
grand_mean <- fixef(fit_hier)["treatment", "Estimate"]
cat("\nGrand mean treatment effect:", round(grand_mean, 1), "mg/dL\n")

# ---- (c) Shrinkage plot ----
# Extract hospital-specific treatment effects (partial pooling)
ranefs <- ranef(fit_hier)$hospital[, , "treatment"]
partial_pool <- grand_mean + ranefs[, "Estimate"]

# No-pooling estimates (separate OLS per hospital)
no_pool <- trial_data %>%
  group_by(hospital) %>%
  summarise(
    no_pool_est = coef(lm(ldl_change ~ treatment))[2],
    n = n(),
    .groups = "drop"
  ) %>%
  mutate(
    hospital_num = as.numeric(hospital),
    partial_pool_est = partial_pool
  )

# Plot shrinkage
ggplot(no_pool, aes(y = reorder(hospital, n))) +
  geom_point(aes(x = no_pool_est, colour = "No pooling"), size = 3) +
  geom_point(aes(x = partial_pool_est, colour = "Partial pooling"), size = 3) +
  geom_vline(xintercept = grand_mean, linetype = "dashed", colour = "grey40") +
  geom_segment(aes(x = no_pool_est, xend = partial_pool_est,
                   yend = reorder(hospital, n)),
               arrow = arrow(length = unit(0.15, "cm")),
               colour = "grey60") +
  annotate("text", x = grand_mean + 1, y = 12.5,
           label = paste0("Grand mean = ", round(grand_mean, 1)),
           hjust = 0, size = 3.5) +
  scale_colour_manual(values = c("No pooling" = "steelblue",
                                  "Partial pooling" = "firebrick")) +
  labs(x = "Treatment Effect (mg/dL change in LDL)",
       y = "Hospital (ordered by sample size)",
       colour = "Estimate Type",
       title = "Shrinkage in Hierarchical Model",
       subtitle = "Arrows show partial pooling pulling estimates toward the grand mean") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")

# ---- (d) Compare hierarchical to separate OLS regressions ----
cat("\n=== Part (d): Comparison ===\n\n")
cat("Hospital  |  N   | No-Pooling | Partial Pooling | Shrinkage\n")
cat("----------|------|------------|-----------------|----------\n")

for (i in 1:nrow(no_pool)) {
  shrinkage <- abs(no_pool$no_pool_est[i] - no_pool$partial_pool_est[i])
  cat(sprintf("   %2d     | %3d  |   %6.1f   |     %6.1f      |  %5.1f\n",
              no_pool$hospital_num[i], no_pool$n[i],
              no_pool$no_pool_est[i], no_pool$partial_pool_est[i], shrinkage))
}

cat("\nThe hierarchical model is MORE APPROPRIATE because:\n")
cat("1. Small hospitals (n=20-30) have noisy OLS estimates that are\n")
cat("   shrunk toward the grand mean, reducing estimation error.\n")
cat("2. Large hospitals (n=150-200) retain their individual estimates\n")
cat("   since their data are informative enough.\n")
cat("3. The between-hospital SD (tau) is estimated from the data,\n")
cat("   quantifying the degree of heterogeneity across sites.\n")
cat("4. Hospital-by-hospital OLS ignores the shared structure --\n")
cat("   all hospitals are studying the same drug. The hierarchical\n")
cat("   model borrows strength across sites while allowing for\n")
cat("   genuine between-site variation.\n")

Show Python solution

# Chapter 14, Exercise 2: Hierarchical Model for Multi-Site Drug Trial
# 12 hospitals, LDL cholesterol change, new statin vs standard care

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# ---- (a) Simulate data ----
np.random.seed(789)
n_hospitals = 12
patients = [20, 25, 30, 40, 50, 60, 70, 80, 100, 120, 150, 200]
grand_effect = -25  # mg/dL
tau = 5             # between-hospital SD

hospital_effects = np.random.normal(0, tau, n_hospitals)

rows = []
for j in range(n_hospitals):
    nj = patients[j]
    trt = np.tile([0, 1], nj // 2 + 1)[:nj]
    ldl_change = ((grand_effect + hospital_effects[j]) * trt +
                  np.random.normal(0, 20, nj))
    for i in range(nj):
        rows.append({
            'hospital': j,
            'treatment': trt[i],
            'ldl_change': ldl_change[i]
        })

trial_data = pd.DataFrame(rows)

print("Data summary:")
print(f"  Total patients: {len(trial_data)}")
print(f"  Hospitals: {n_hospitals}")
print(f"  True grand effect: {grand_effect} mg/dL")
print(f"  True between-hospital SD: {tau} mg/dL")

# ---- No-pooling estimates (OLS per hospital) ----
no_pool = []
for j in range(n_hospitals):
    df_j = trial_data[trial_data['hospital'] == j]
    trt_mean = df_j[df_j['treatment'] == 1]['ldl_change'].mean()
    ctl_mean = df_j[df_j['treatment'] == 0]['ldl_change'].mean()
    no_pool.append(trt_mean - ctl_mean)

no_pool = np.array(no_pool)

# ---- (b) & (c) Partial pooling approximation ----
# For a full Bayesian fit, use PyMC. Here we approximate shrinkage
# to demonstrate the concept without requiring MCMC sampling.

# Estimate within-hospital variance from data
within_var = []
for j in range(n_hospitals):
    df_j = trial_data[trial_data['hospital'] == j]
    trt_vals = df_j[df_j['treatment'] == 1]['ldl_change'].values
    ctl_vals = df_j[df_j['treatment'] == 0]['ldl_change'].values
    # Variance of the treatment effect estimate
    se_j = np.sqrt(np.var(trt_vals, ddof=1) / len(trt_vals) +
                   np.var(ctl_vals, ddof=1) / len(ctl_vals))
    within_var.append(se_j**2)

within_var = np.array(within_var)

# Estimate between-hospital variance using method of moments
grand_mean_est = np.mean(no_pool)
tau_est_sq = max(0, np.var(no_pool, ddof=1) - np.mean(within_var))
tau_est = np.sqrt(tau_est_sq)

# Shrinkage factor: B_j = within_var_j / (within_var_j + tau^2)
shrinkage = within_var / (within_var + tau_est_sq)
partial_pool = grand_mean_est + (1 - shrinkage) * (no_pool - grand_mean_est)

print(f"\nEstimated grand mean effect: {grand_mean_est:.1f} mg/dL")
print(f"Estimated between-hospital SD: {tau_est:.1f} mg/dL")

# ---- (c) Shrinkage plot ----
fig, ax = plt.subplots(figsize=(10, 6))
order = np.argsort(patients)
y_pos = np.arange(n_hospitals)

for i, idx in enumerate(order):
    ax.annotate("", xy=(partial_pool[idx], i), xytext=(no_pool[idx], i),
                arrowprops=dict(arrowstyle="->", color="grey"))

ax.scatter(no_pool[order], y_pos, color="steelblue", s=60, zorder=2,
           label="No pooling (OLS)")
ax.scatter(partial_pool[order], y_pos, color="firebrick", s=60, zorder=2,
           label="Partial pooling")
ax.axvline(grand_mean_est, linestyle="--", color="grey", linewidth=1)
ax.text(grand_mean_est + 0.5, n_hospitals - 0.5,
        f"Grand mean = {grand_mean_est:.1f}", fontsize=9)

ax.set_yticks(y_pos)
ax.set_yticklabels([f"Hospital {order[i]+1}\n(n={patients[order[i]]})"
                     for i in range(n_hospitals)], fontsize=9)
ax.set_xlabel("Treatment Effect (mg/dL change in LDL)")
ax.set_title("Shrinkage in Hierarchical Model\n"
             "Arrows show estimates pulled toward the grand mean")
ax.legend()
plt.tight_layout()
plt.savefig("ch14_ex2_shrinkage.png", dpi=150)
plt.show()

# ---- (d) Compare approaches ----
print("\n=== Part (d): Comparison ===\n")
print(f"{'Hospital':>8} | {'N':>4} | {'No-Pool':>10} | {'Partial Pool':>12} | {'Shrinkage':>9}")
print("-" * 55)
for j in range(n_hospitals):
    shrink_amt = abs(no_pool[j] - partial_pool[j])
    print(f"   {j+1:>2}     | {patients[j]:>3}  | {no_pool[j]:>9.1f}  | {partial_pool[j]:>11.1f}  | {shrink_amt:>8.1f}")

print("\nThe hierarchical model is MORE APPROPRIATE because:")
print("1. Small hospitals have noisy OLS estimates that are shrunk")
print("   toward the grand mean, reducing estimation error.")
print("2. Large hospitals retain their individual estimates since")
print("   their data are informative enough.")
print("3. Between-hospital SD (tau) is estimated from data,")
print("   quantifying heterogeneity across sites.")
print("4. Hospital-by-hospital OLS ignores shared structure --")
print("   all hospitals study the same drug. The hierarchical model")
print("   borrows strength across sites while allowing genuine")
print("   between-site variation.")

# NOTE: For a full Bayesian hierarchical model, use PyMC:
#
# import pymc as pm
# with pm.Model() as hier_model:
#     mu_trt = pm.Normal("mu_trt", mu=0, sigma=30)
#     tau = pm.HalfCauchy("tau", beta=5)
#     trt_j = pm.Normal("trt_j", mu=mu_trt, sigma=tau, shape=n_hospitals)
#     sigma = pm.Exponential("sigma", lam=0.05)
#     mu = trt_j[hospital_idx] * treatment
#     y = pm.Normal("y", mu=mu, sigma=sigma, observed=ldl_change)
#     trace = pm.sample(1000, tune=1000, chains=4, random_seed=42)

D.10.3 Exercise 3

Show R solution

# Chapter 14, Exercise 3: Prior Sensitivity for Rare Events
# New surgical technique: 0 adverse events in 40 patients

# ---- (a) Compute posterior distributions under three priors ----
cat("=== Part (a): Posteriors Under Three Priors ===\n\n")

y <- 0   # adverse events
n <- 40  # total patients

priors <- list(
  list(name = "Beta(1,1) - Uniform", a = 1, b = 1),
  list(name = "Beta(0.5,0.5) - Jeffreys", a = 0.5, b = 0.5),
  list(name = "Beta(1,9) - Informative (~10%)", a = 1, b = 9)
)

theta <- seq(0, 0.3, length.out = 500)

par(mfrow = c(1, 1))
plot(NULL, xlim = c(0, 0.2), ylim = c(0, 50),
     xlab = "Adverse Event Rate", ylab = "Density",
     main = "Posterior Distributions: 0/40 Adverse Events")

colors <- c("steelblue", "firebrick", "forestgreen")

for (i in seq_along(priors)) {
  p <- priors[[i]]
  a_post <- p$a + y
  b_post <- p$b + n - y

  post_mean <- a_post / (a_post + b_post)
  ci <- qbeta(c(0.025, 0.975), a_post, b_post)
  upper_95 <- qbeta(0.95, a_post, b_post)

  cat(sprintf("Prior: %s\n", p$name))
  cat(sprintf("  Posterior: Beta(%.1f, %.1f)\n", a_post, b_post))
  cat(sprintf("  Posterior mean: %.4f (%.2f%%)\n", post_mean, post_mean * 100))
  cat(sprintf("  95%% credible interval: [%.4f, %.4f]\n", ci[1], ci[2]))
  cat(sprintf("  95%% upper credible bound: %.4f (%.2f%%)\n\n",
              upper_95, upper_95 * 100))

  lines(theta, dbeta(theta, a_post, b_post), col = colors[i], lwd = 2)
}

legend("topright", sapply(priors, function(p) p$name),
       col = colors, lwd = 2, cex = 0.8)

# ---- (b) Summary table ----
cat("\n=== Part (b): Summary Table ===\n\n")
cat("Prior              | Post Mean | 95% Upper Bound\n")
cat("-------------------|-----------|----------------\n")

for (p in priors) {
  a_post <- p$a + y
  b_post <- p$b + n - y
  post_mean <- a_post / (a_post + b_post)
  upper_95 <- qbeta(0.95, a_post, b_post)
  cat(sprintf("%-19s| %7.4f   | %7.4f (%.1f%%)\n",
              p$name, post_mean, upper_95, upper_95 * 100))
}

# ---- (c) Why Bayesian estimates are more useful ----
cat("\n=== Part (c): Why Bayesian Estimates Are More Useful ===\n\n")

cat("The frequentist point estimate is 0/40 = 0 (0%).\n\n")

cat("This is PROBLEMATIC for regulatory decision-making because:\n\n")

cat("1. ZERO IS NOT CREDIBLE: Just because no adverse events were\n")
cat("   observed in 40 patients does not mean the true rate is zero.\n")
cat("   The 'rule of 3' (frequentist) gives an upper bound of 3/40 = 7.5%,\n")
cat("   but this is ad hoc and does not provide a full distribution.\n\n")

cat("2. BAYESIAN ESTIMATES ARE HONEST: Each prior gives a non-zero\n")
cat("   estimate of the adverse event rate, which is more realistic.\n")
cat("   Even with the Jeffreys prior (minimal information), the\n")
cat("   posterior mean is about 1.2%, acknowledging that rare events\n")
cat("   can occur even if none were observed.\n\n")

cat("3. UPPER CREDIBLE BOUNDS: Regulators need worst-case estimates.\n")
cat("   The 95% upper credible bound provides a direct probability\n")
cat("   statement: 'There is a 95% probability that the true adverse\n")
cat("   event rate is below X%'. This is exactly what is needed for\n")
cat("   risk-benefit assessments.\n\n")

cat("4. PRIOR INFORMATION IS VALUABLE: If similar surgical procedures\n")
cat("   have known complication rates (~10%), the informative prior\n")
cat("   Beta(1,9) incorporates this, giving a more realistic estimate.\n")
cat("   The frequentist approach ignores all prior knowledge.\n\n")

cat("5. DECISION SUPPORT: The full posterior distribution allows\n")
cat("   calculation of quantities like P(rate < 5% | data), which\n")
cat("   directly supports regulatory decisions.\n")

# Compute P(rate < 5% | data) for each prior
cat("\n  P(rate < 5% | data):\n")
for (p in priors) {
  a_post <- p$a + y
  b_post <- p$b + n - y
  prob_lt_5 <- pbeta(0.05, a_post, b_post)
  cat(sprintf("    %s: %.3f\n", p$name, prob_lt_5))
}

Show Python solution

# Chapter 14, Exercise 3: Prior Sensitivity for Rare Events
# New surgical technique: 0 adverse events in 40 patients

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

y = 0   # adverse events
n = 40  # total patients

priors = [
    ("Beta(1,1) - Uniform", 1, 1),
    ("Beta(0.5,0.5) - Jeffreys", 0.5, 0.5),
    ("Beta(1,9) - Informative (~10%)", 1, 9),
]

theta = np.linspace(0, 0.25, 500)
colors = ["steelblue", "firebrick", "forestgreen"]

# ---- (a) Compute posterior distributions ----
print("=== Part (a): Posteriors Under Three Priors ===\n")

fig, ax = plt.subplots(figsize=(8, 5))

for (name, a0, b0), color in zip(priors, colors):
    a_post = a0 + y
    b_post = b0 + n - y
    post_mean = a_post / (a_post + b_post)
    ci = beta.ppf([0.025, 0.975], a_post, b_post)
    upper_95 = beta.ppf(0.95, a_post, b_post)

    print(f"Prior: {name}")
    print(f"  Posterior: Beta({a_post}, {b_post})")
    print(f"  Posterior mean: {post_mean:.4f} ({post_mean*100:.2f}%)")
    print(f"  95% credible interval: [{ci[0]:.4f}, {ci[1]:.4f}]")
    print(f"  95% upper credible bound: {upper_95:.4f} ({upper_95*100:.2f}%)")
    print()

    ax.plot(theta, beta.pdf(theta, a_post, b_post),
            color=color, lw=2, label=name)

ax.set_xlabel("Adverse Event Rate")
ax.set_ylabel("Density")
ax.set_title("Posterior Distributions: 0/40 Adverse Events")
ax.legend(fontsize=9)
ax.set_xlim(0, 0.2)
plt.tight_layout()
plt.savefig("ch14_ex3_rare_events.png", dpi=150)
plt.show()

# ---- (b) Summary table ----
print("=== Part (b): Summary Table ===\n")
print(f"{'Prior':<30s} | {'Post Mean':>9s} | {'95% Upper Bound':>15s}")
print("-" * 60)

for name, a0, b0 in priors:
    a_post = a0 + y
    b_post = b0 + n - y
    post_mean = a_post / (a_post + b_post)
    upper_95 = beta.ppf(0.95, a_post, b_post)
    print(f"{name:<30s} | {post_mean:>9.4f} | {upper_95:>9.4f} ({upper_95*100:.1f}%)")

# ---- (c) Why Bayesian estimates are more useful ----
print("\n=== Part (c): Why Bayesian Estimates Are More Useful ===\n")

print("The frequentist point estimate is 0/40 = 0 (0%).\n")

print("This is PROBLEMATIC for regulatory decision-making because:\n")

print("1. ZERO IS NOT CREDIBLE: Just because no adverse events were")
print("   observed in 40 patients does not mean the true rate is zero.")
print("   The 'rule of 3' gives an upper bound of 3/40 = 7.5%, but")
print("   this is ad hoc and does not provide a full distribution.\n")

print("2. BAYESIAN ESTIMATES ARE HONEST: Each prior gives a non-zero")
print("   estimate, which is more realistic. Even the Jeffreys prior")
print("   yields ~1.2%, acknowledging rare events can occur.\n")

print("3. UPPER CREDIBLE BOUNDS: Regulators need worst-case estimates.")
print("   The 95% upper bound provides a direct probability statement:")
print("   'There is 95% probability the true rate is below X%'.\n")

print("4. PRIOR INFORMATION IS VALUABLE: Known complication rates from")
print("   similar procedures can be formally incorporated. The frequentist")
print("   approach ignores all prior knowledge.\n")

print("5. DECISION SUPPORT: The full posterior enables quantities like")
print("   P(rate < 5% | data), directly supporting regulatory decisions.\n")

print("  P(rate < 5% | data):")
for name, a0, b0 in priors:
    a_post = a0 + y
    b_post = b0 + n - y
    prob_lt_5 = beta.cdf(0.05, a_post, b_post)
    print(f"    {name}: {prob_lt_5:.3f}")

D.11 Chapter 15: Dimensionality Reduction

D.11.1 Exercise 1

Show R solution

# Chapter 15, Exercise 1: PCA on Clinical Lab Data
# Using the simulated metabolic panel data from the chapter

library(tidyverse)
library(MASS)

# ---- Simulate metabolic data (same as chapter) ----
set.seed(42)
n <- 300

Sigma <- matrix(c(
  1.0, 0.7, 0.5, 0.3,-0.2, 0.1, 0.2,-0.1,
  0.7, 1.0, 0.4, 0.2,-0.2, 0.1, 0.1,-0.1,
  0.5, 0.4, 1.0, 0.4,-0.3, 0.1, 0.3,-0.1,
  0.3, 0.2, 0.4, 1.0,-0.5, 0.1, 0.2, 0.0,
 -0.2,-0.2,-0.3,-0.5, 1.0,-0.1,-0.1, 0.2,
  0.1, 0.1, 0.1, 0.1,-0.1, 1.0, 0.1,-0.3,
  0.2, 0.1, 0.3, 0.2,-0.1, 0.1, 1.0,-0.2,
 -0.1,-0.1,-0.1, 0.0, 0.2,-0.3,-0.2, 1.0
), nrow = 8)

z <- mvrnorm(n, mu = rep(0, 8), Sigma = Sigma)
metabolic <- tibble(
  glucose       = round(z[,1] * 30 + 100),
  hba1c         = round(z[,2] * 1.0 + 5.8, 1),
  triglycerides = round(z[,3] * 50 + 150),
  ldl           = round(z[,4] * 30 + 120),
  hdl           = round(z[,5] * 12 + 55),
  creatinine    = round(z[,6] * 0.3 + 1.0, 2),
  alt           = round(z[,7] * 15 + 30),
  albumin       = round(z[,8] * 0.4 + 4.0, 1)
)

# ---- (a) Perform PCA on standardised data ----
cat("=== Part (a): PCA on Standardised Data ===\n")
pca_fit <- prcomp(metabolic, scale. = TRUE)
cat("PCA completed. Summary:\n")
print(summary(pca_fit))

# ---- (b) Scree plot and number of components ----
cat("\n=== Part (b): Scree Plot and Component Selection ===\n")

var_prop <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
cum_var  <- cumsum(var_prop)

par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

# Scree plot
barplot(var_prop, names.arg = paste0("PC", 1:8), col = "steelblue",
        main = "Scree Plot", ylab = "Proportion of Variance",
        xlab = "Component", ylim = c(0, 0.4))
abline(h = 1/8, lty = 2, col = "firebrick")
text(9, 1/8 + 0.015, "Kaiser threshold (1/p)", col = "firebrick", cex = 0.8)

# Cumulative variance
plot(1:8, cum_var, type = "b", pch = 16, col = "steelblue",
     main = "Cumulative Variance", xlab = "Number of Components",
     ylab = "Cumulative Proportion", ylim = c(0, 1))
abline(h = 0.80, col = "firebrick", lty = 2)
text(6, 0.82, "80% threshold", col = "firebrick", cex = 0.8)

# Kaiser criterion: eigenvalues > 1 (which equals variance > 1/p for scaled data)
eigenvalues <- pca_fit$sdev^2
n_kaiser <- sum(eigenvalues > 1)
n_80 <- which(cum_var >= 0.80)[1]

cat("\nKaiser criterion (eigenvalue > 1):", n_kaiser, "components\n")
cat("80% cumulative variance threshold:", n_80, "components\n")
cat("Eigenvalues:", round(eigenvalues, 3), "\n")
cat("Cumulative variance:", round(cum_var, 3), "\n")

# ---- (c) Loadings of first two PCs ----
cat("\n=== Part (c): Loadings and Interpretation ===\n")

cat("\nPC1 loadings (sorted by absolute value):\n")
pc1_loadings <- pca_fit$rotation[, 1]
print(round(sort(abs(pc1_loadings), decreasing = TRUE), 3))
cat("Signed loadings:\n")
print(round(pc1_loadings[order(abs(pc1_loadings), decreasing = TRUE)], 3))

cat("\nPC2 loadings (sorted by absolute value):\n")
pc2_loadings <- pca_fit$rotation[, 2]
print(round(sort(abs(pc2_loadings), decreasing = TRUE), 3))
cat("Signed loadings:\n")
print(round(pc2_loadings[order(abs(pc2_loadings), decreasing = TRUE)], 3))

cat("\nClinical interpretation:\n")
cat("PC1: Dominated by glucose, HbA1c, triglycerides (positive) and\n")
cat("     HDL (negative). This is a 'metabolic syndrome' axis.\n")
cat("     Patients scoring high on PC1 tend to have elevated glucose,\n")
cat("     HbA1c, triglycerides and lower HDL.\n\n")
cat("PC2: Dominated by creatinine and albumin (with opposite signs).\n")
cat("     This captures a 'renal/hepatic function' dimension.\n")
cat("     Patients with high creatinine and low albumin (suggesting\n")
cat("     renal impairment or liver dysfunction) score high on PC2.\n")

# ---- (d) Biplot coloured by diabetes status ----
cat("\n=== Part (d): Biplot with Diabetes Status ===\n")

set.seed(99)
metabolic$diabetes <- factor(
  ifelse(metabolic$glucose > 110 & metabolic$hba1c > 6.2,
         "Diabetic", "Non-diabetic")
)

cat("Diabetes prevalence:", mean(metabolic$diabetes == "Diabetic"), "\n")

# Extract PC scores
pc_scores <- as.data.frame(pca_fit$x[, 1:2])
pc_scores$diabetes <- metabolic$diabetes

par(mfrow = c(1, 1))

# Biplot with diabetes colouring
cols <- ifelse(metabolic$diabetes == "Diabetic", "firebrick", "steelblue")
plot(pc_scores$PC1, pc_scores$PC2, col = cols, pch = 16, cex = 0.7,
     xlab = paste0("PC1 (", round(var_prop[1] * 100, 1), "% var)"),
     ylab = paste0("PC2 (", round(var_prop[2] * 100, 1), "% var)"),
     main = "PCA Biplot Coloured by Diabetes Status")

# Add loading arrows
loadings <- pca_fit$rotation[, 1:2]
scale_factor <- 3
for (i in 1:nrow(loadings)) {
  arrows(0, 0, loadings[i, 1] * scale_factor, loadings[i, 2] * scale_factor,
         col = "grey30", length = 0.1, lwd = 1.5)
  text(loadings[i, 1] * scale_factor * 1.15,
       loadings[i, 2] * scale_factor * 1.15,
       rownames(loadings)[i], cex = 0.7, col = "grey20")
}

legend("topright", c("Diabetic", "Non-diabetic"),
       col = c("firebrick", "steelblue"), pch = 16, cex = 0.8)

cat("\nDiabetic patients tend to cluster toward higher PC1 values,\n")
cat("which aligns with the 'metabolic syndrome' interpretation.\n")
cat("This confirms that PC1 captures the metabolic dimension\n")
cat("along which diabetic patients differ from non-diabetic patients.\n")

Show Python solution

# Chapter 15, Exercise 1: PCA on Clinical Lab Data
# Using the simulated metabolic panel data from the chapter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# ---- Simulate metabolic data (same as chapter) ----
np.random.seed(42)
n = 300

Sigma = np.array([
    [1.0, 0.7, 0.5, 0.3,-0.2, 0.1, 0.2,-0.1],
    [0.7, 1.0, 0.4, 0.2,-0.2, 0.1, 0.1,-0.1],
    [0.5, 0.4, 1.0, 0.4,-0.3, 0.1, 0.3,-0.1],
    [0.3, 0.2, 0.4, 1.0,-0.5, 0.1, 0.2, 0.0],
    [-0.2,-0.2,-0.3,-0.5, 1.0,-0.1,-0.1, 0.2],
    [0.1, 0.1, 0.1, 0.1,-0.1, 1.0, 0.1,-0.3],
    [0.2, 0.1, 0.3, 0.2,-0.1, 0.1, 1.0,-0.2],
    [-0.1,-0.1,-0.1, 0.0, 0.2,-0.3,-0.2, 1.0]
])

z = np.random.multivariate_normal(np.zeros(8), Sigma, n)
labels = ["glucose", "hba1c", "triglycerides", "ldl",
          "hdl", "creatinine", "alt", "albumin"]

metabolic = pd.DataFrame({
    "glucose": np.round(z[:, 0] * 30 + 100),
    "hba1c": np.round(z[:, 1] * 1.0 + 5.8, 1),
    "triglycerides": np.round(z[:, 2] * 50 + 150),
    "ldl": np.round(z[:, 3] * 30 + 120),
    "hdl": np.round(z[:, 4] * 12 + 55),
    "creatinine": np.round(z[:, 5] * 0.3 + 1.0, 2),
    "alt": np.round(z[:, 6] * 15 + 30),
    "albumin": np.round(z[:, 7] * 0.4 + 4.0, 1)
})

# ---- (a) PCA on standardised data ----
print("=== Part (a): PCA on Standardised Data ===")
X = metabolic.values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA()
scores = pca.fit_transform(X_scaled)

print("Explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))

# ---- (b) Scree plot and component selection ----
print("\n=== Part (b): Scree Plot and Component Selection ===")

var_prop = pca.explained_variance_ratio_
cum_var = np.cumsum(var_prop)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Scree plot
axes[0].bar(range(1, 9), var_prop, color="steelblue", edgecolor="white")
axes[0].axhline(y=1/8, color="firebrick", linestyle="--", label="Kaiser threshold")
axes[0].set_xlabel("Component")
axes[0].set_ylabel("Proportion of Variance")
axes[0].set_title("Scree Plot")
axes[0].set_xticks(range(1, 9))
axes[0].legend()

# Cumulative variance
axes[1].plot(range(1, 9), cum_var, "o-", color="steelblue", lw=2)
axes[1].axhline(y=0.80, color="firebrick", linestyle="--", label="80% threshold")
axes[1].set_xlabel("Number of Components")
axes[1].set_ylabel("Cumulative Proportion")
axes[1].set_title("Cumulative Variance")
axes[1].set_xticks(range(1, 9))
axes[1].legend()

plt.tight_layout()
plt.savefig("ch15_ex1_scree.png", dpi=150)
plt.show()

# Kaiser criterion: eigenvalues > 1
eigenvalues = pca.explained_variance_
n_kaiser = np.sum(eigenvalues > 1)
n_80 = np.argmax(cum_var >= 0.80) + 1

print(f"\nKaiser criterion (eigenvalue > 1): {n_kaiser} components")
print(f"80% cumulative variance threshold: {n_80} components")
print(f"Eigenvalues: {np.round(eigenvalues, 3)}")
print(f"Cumulative variance: {np.round(cum_var, 3)}")

# ---- (c) Loadings of first two PCs ----
print("\n=== Part (c): Loadings and Interpretation ===")

for pc_idx in [0, 1]:
    loadings = pca.components_[pc_idx]
    order = np.argsort(np.abs(loadings))[::-1]
    print(f"\nPC{pc_idx+1} loadings (sorted by |loading|):")
    for idx in order:
        print(f"  {labels[idx]:>15s}: {loadings[idx]:+.3f}")

print("\nClinical interpretation:")
print("PC1: Dominated by glucose, HbA1c, triglycerides (positive) and")
print("     HDL (negative). This is a 'metabolic syndrome' axis.")
print("PC2: Dominated by creatinine and albumin (opposite signs).")
print("     This captures a 'renal/hepatic function' dimension.")

# ---- (d) Biplot coloured by diabetes status ----
print("\n=== Part (d): Biplot with Diabetes Status ===")

diabetes = np.where(
    (metabolic["glucose"] > 110) & (metabolic["hba1c"] > 6.2),
    "Diabetic", "Non-diabetic"
)
print(f"Diabetes prevalence: {(diabetes == 'Diabetic').mean():.3f}")

fig, ax = plt.subplots(figsize=(9, 7))

# Plot scores
for label, color in [("Non-diabetic", "steelblue"), ("Diabetic", "firebrick")]:
    mask = diabetes == label
    ax.scatter(scores[mask, 0], scores[mask, 1], c=color, s=20,
               alpha=0.6, label=label)

# Add loading arrows
loadings_2d = pca.components_[:2].T
scale_factor = 3
for i, lab in enumerate(labels):
    ax.annotate("", xy=(loadings_2d[i, 0]*scale_factor,
                          loadings_2d[i, 1]*scale_factor),
                 xytext=(0, 0),
                 arrowprops=dict(arrowstyle="->", color="grey30", lw=1.5))
    ax.text(loadings_2d[i, 0]*scale_factor*1.15,
             loadings_2d[i, 1]*scale_factor*1.15,
             lab, fontsize=8, color="grey20", ha="center")

ax.set_xlabel(f"PC1 ({var_prop[0]:.1%} var)")
ax.set_ylabel(f"PC2 ({var_prop[1]:.1%} var)")
ax.set_title("PCA Biplot Coloured by Diabetes Status")
ax.axhline(0, color="grey", linewidth=0.5)
ax.axvline(0, color="grey", linewidth=0.5)
ax.legend()
plt.tight_layout()
plt.savefig("ch15_ex1_biplot.png", dpi=150)
plt.show()

print("\nDiabetic patients tend to cluster toward higher PC1 values,")
print("aligning with the 'metabolic syndrome' interpretation.")
print("PC1 captures the metabolic dimension along which diabetic")
print("patients differ from non-diabetic patients.")

D.11.2 Exercise 2

Show R solution

# Chapter 15, Exercise 2: t-SNE Sensitivity to Perplexity
# Using the three-group simulated dataset from the chapter

library(Rtsne)

# ---- Simulate data (same as chapter) ----
set.seed(42)
n_per_group <- 100
p <- 20

group1 <- matrix(rnorm(n_per_group * p, mean = 0, sd = 1), ncol = p)
group2 <- matrix(rnorm(n_per_group * p, mean = 2, sd = 1.2), ncol = p)
group3 <- matrix(rnorm(n_per_group * p, mean = -1, sd = 0.8), ncol = p)
group2[, 1:5] <- group2[, 1:5] + 3
group3[, 10:15] <- group3[, 10:15] - 4

X <- rbind(group1, group2, group3)
labels <- factor(rep(c("Type A", "Type B", "Type C"), each = n_per_group))
cols <- c("steelblue", "firebrick", "forestgreen")

# ---- (a) & (b) Run t-SNE with 4 perplexity values ----
perplexities <- c(5, 15, 30, 50)

par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))

for (perp in perplexities) {
  set.seed(42)  # Same seed for comparability
  tsne_result <- Rtsne(X, perplexity = perp, dims = 2,
                        verbose = FALSE, max_iter = 1000)

  plot(tsne_result$Y, col = cols[labels], pch = 16, cex = 0.8,
       main = paste("t-SNE (perplexity =", perp, ")"),
       xlab = "t-SNE 1", ylab = "t-SNE 2")

  if (perp == 5) {
    legend("topright", levels(labels), col = cols, pch = 16, cex = 0.7)
  }
}

# ---- (c) When do groups become clearly separated? ----
cat("=== Part (c): When Are Groups Clearly Separated? ===\n\n")
cat("The three groups become clearly separated at perplexity = 15.\n")
cat("At perplexity = 5, the embedding focuses on very local structure,\n")
cat("which can fragment the groups into smaller sub-clusters.\n")
cat("At perplexity = 15 and above, the groups are well-separated\n")
cat("with clear boundaries between them.\n\n")
cat("Higher perplexities (30, 50) also separate the groups clearly\n")
cat("but with slightly different visual arrangements and more\n")
cat("spread-out clusters.\n")

# ---- (d) Are distances between clusters consistent? ----
cat("\n=== Part (d): Consistency of Inter-Cluster Distances ===\n\n")
cat("NO, the distances between clusters are NOT consistent across\n")
cat("perplexity values. Key observations:\n\n")
cat("1. The relative positions of the three clusters change across\n")
cat("   perplexity settings. Type A might be nearest to Type C at\n")
cat("   one perplexity but nearest to Type B at another.\n\n")
cat("2. The absolute distances between cluster centres vary\n")
cat("   substantially. At low perplexity, clusters may appear close;\n")
cat("   at high perplexity, they may be farther apart (or vice versa).\n\n")
cat("3. Cluster sizes (spread) also change with perplexity.\n\n")
cat("What this tells us about interpreting t-SNE:\n")
cat("- Distances between clusters in t-SNE are MEANINGLESS.\n")
cat("- t-SNE preserves local neighbourhood structure, not global\n")
cat("  distances.\n")
cat("- You should NEVER conclude that two clusters are 'more similar'\n")
cat("  because they appear closer in a t-SNE plot.\n")
cat("- The only reliable information is WHETHER clusters exist,\n")
cat("  not HOW FAR APART they are.\n")
cat("- Always try multiple perplexity values. If the same clusters\n")
cat("  appear consistently, the structure is likely real.\n")

Show Python solution

# Chapter 15, Exercise 2: t-SNE Sensitivity to Perplexity
# Using the three-group simulated dataset from the chapter

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# ---- Simulate data (same as chapter) ----
np.random.seed(42)
n_per = 100
p = 20

g1 = np.random.normal(0, 1, (n_per, p))
g2 = np.random.normal(2, 1.2, (n_per, p))
g3 = np.random.normal(-1, 0.8, (n_per, p))
g2[:, :5] += 3
g3[:, 10:15] -= 4

X = np.vstack([g1, g2, g3])
labels = np.repeat(["Type A", "Type B", "Type C"], n_per)
colors = {"Type A": "steelblue", "Type B": "firebrick", "Type C": "forestgreen"}

# ---- (a) & (b) Run t-SNE with 4 perplexity values ----
perplexities = [5, 15, 30, 50]

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for ax, perp in zip(axes, perplexities):
    tsne = TSNE(n_components=2, perplexity=perp, random_state=42, max_iter=1000)
    emb = tsne.fit_transform(X)

    for lab in ["Type A", "Type B", "Type C"]:
        mask = labels == lab
        ax.scatter(emb[mask, 0], emb[mask, 1], c=colors[lab],
                   s=15, alpha=0.7, label=lab)

    ax.set_title(f"t-SNE (perplexity = {perp})")
    ax.set_xlabel("t-SNE 1")
    ax.set_ylabel("t-SNE 2")
    ax.legend(fontsize=8)

plt.tight_layout()
plt.savefig("ch15_ex2_tsne_perplexity.png", dpi=150)
plt.show()

# ---- (c) When do groups become clearly separated? ----
print("=== Part (c): When Are Groups Clearly Separated? ===\n")
print("The three groups become clearly separated at perplexity = 15.")
print("At perplexity = 5, the embedding focuses on very local structure,")
print("which can fragment groups into smaller sub-clusters.")
print("At perplexity = 15 and above, groups are well-separated.")
print("Higher perplexities (30, 50) also separate groups clearly but")
print("with different visual arrangements and more spread-out clusters.")

# ---- (d) Are distances consistent? ----
print("\n=== Part (d): Consistency of Inter-Cluster Distances ===\n")
print("NO, the distances between clusters are NOT consistent across")
print("perplexity values. Key observations:\n")
print("1. Relative positions of the three clusters change across")
print("   perplexity settings.\n")
print("2. Absolute distances between cluster centres vary substantially.\n")
print("3. Cluster sizes (spread) also change with perplexity.\n")
print("What this tells us about interpreting t-SNE:")
print("- Distances between clusters are MEANINGLESS.")
print("- t-SNE preserves local neighbourhood structure, not global distances.")
print("- NEVER conclude two clusters are 'more similar' because they")
print("  appear closer in a t-SNE plot.")
print("- The only reliable information is WHETHER clusters exist,")
print("  not HOW FAR APART they are.")
print("- Always try multiple perplexity values. If clusters appear")
print("  consistently, the structure is likely real.")

D.11.3 Exercise 3

Show R solution

# Chapter 15, Exercise 3: UMAP Parameter Exploration
# Using the three-group simulated dataset from the chapter

library(uwot)

# ---- Simulate data (same as chapter) ----
set.seed(42)
n_per_group <- 100
p <- 20

group1 <- matrix(rnorm(n_per_group * p, mean = 0, sd = 1), ncol = p)
group2 <- matrix(rnorm(n_per_group * p, mean = 2, sd = 1.2), ncol = p)
group3 <- matrix(rnorm(n_per_group * p, mean = -1, sd = 0.8), ncol = p)
group2[, 1:5] <- group2[, 1:5] + 3
group3[, 10:15] <- group3[, 10:15] - 4

X <- rbind(group1, group2, group3)
labels <- factor(rep(c("Type A", "Type B", "Type C"), each = n_per_group))
cols <- c("steelblue", "firebrick", "forestgreen")

# ---- (a) UMAP with varying n_neighbors, min_dist = 0.1 ----
n_neighbors_vals <- c(5, 15, 50, 100)

par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))

for (nn in n_neighbors_vals) {
  set.seed(42)
  umap_result <- umap(X, n_neighbors = nn, min_dist = 0.1, verbose = FALSE)

  plot(umap_result, col = cols[labels], pch = 16, cex = 0.8,
       main = paste("n_neighbors =", nn, ", min_dist = 0.1"),
       xlab = "UMAP 1", ylab = "UMAP 2")

  if (nn == 5) {
    legend("topright", levels(labels), col = cols, pch = 16, cex = 0.7)
  }
}

# ---- (b) UMAP with varying min_dist, n_neighbors = 15 ----
min_dist_vals <- c(0.0, 0.1, 0.5, 1.0)

par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))

for (md in min_dist_vals) {
  set.seed(42)
  umap_result <- umap(X, n_neighbors = 15, min_dist = md, verbose = FALSE)

  plot(umap_result, col = cols[labels], pch = 16, cex = 0.8,
       main = paste("n_neighbors = 15, min_dist =", md),
       xlab = "UMAP 1", ylab = "UMAP 2")

  if (md == 0.0) {
    legend("topright", levels(labels), col = cols, pch = 16, cex = 0.7)
  }
}

# ---- (c) Already done above via the two panels of plots ----

# ---- (d) Description and recommendation ----
cat("=== Part (d): How Each Parameter Affects Visual Appearance ===\n\n")

cat("n_neighbors (with min_dist = 0.1 fixed):\n")
cat("  n_neighbors = 5:   Very tight, fragmented clusters. Each point\n")
cat("    considers only 5 neighbors, emphasising micro-structure.\n")
cat("    May split true clusters into sub-groups.\n")
cat("  n_neighbors = 15:  Well-separated, compact clusters. Good\n")
cat("    balance between local and global structure.\n")
cat("  n_neighbors = 50:  Broader clusters, more connected. Groups\n")
cat("    start to merge slightly as the algorithm considers wider\n")
cat("    neighbourhoods.\n")
cat("  n_neighbors = 100: Even more spread out. Global structure is\n")
cat("    emphasised over local detail. Clusters are less tightly packed.\n\n")

cat("min_dist (with n_neighbors = 15 fixed):\n")
cat("  min_dist = 0.0:  Very tight, dense clusters. Points are packed\n")
cat("    as close together as possible. Maximum visual separation.\n")
cat("  min_dist = 0.1:  Slightly looser clusters. Good default.\n")
cat("  min_dist = 0.5:  More spread-out embedding. Internal cluster\n")
cat("    structure becomes more visible but inter-cluster gaps shrink.\n")
cat("  min_dist = 1.0:  Very spread out, almost uniform. Clusters\n")
cat("    overlap, and the embedding loses much of its structure.\n\n")

cat("Recommendation for this dataset:\n")
cat("  n_neighbors = 15, min_dist = 0.1\n")
cat("  This combination provides well-separated, compact clusters\n")
cat("  that clearly reveal the three-group structure. It is also\n")
cat("  the default in most implementations, and for good reason:\n")
cat("  it balances local detail with global structure effectively.\n")

Show Python solution

# Chapter 15, Exercise 3: UMAP Parameter Exploration
# Using the three-group simulated dataset from the chapter

import numpy as np
import matplotlib.pyplot as plt
from umap import UMAP

# ---- Simulate data (same as chapter) ----
np.random.seed(42)
n_per = 100
p = 20

g1 = np.random.normal(0, 1, (n_per, p))
g2 = np.random.normal(2, 1.2, (n_per, p))
g3 = np.random.normal(-1, 0.8, (n_per, p))
g2[:, :5] += 3
g3[:, 10:15] -= 4

X = np.vstack([g1, g2, g3])
labels = np.repeat(["Type A", "Type B", "Type C"], n_per)
colors = {"Type A": "steelblue", "Type B": "firebrick", "Type C": "forestgreen"}

# ---- (a) UMAP with varying n_neighbors, min_dist = 0.1 ----
n_neighbors_vals = [5, 15, 50, 100]

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for ax, nn in zip(axes, n_neighbors_vals):
    umap_emb = UMAP(n_components=2, n_neighbors=nn, min_dist=0.1,
                      random_state=42).fit_transform(X)
    for lab in ["Type A", "Type B", "Type C"]:
        mask = labels == lab
        ax.scatter(umap_emb[mask, 0], umap_emb[mask, 1], c=colors[lab],
                   s=15, alpha=0.7, label=lab)
    ax.set_title(f"n_neighbors = {nn}, min_dist = 0.1")
    ax.set_xlabel("UMAP 1")
    ax.set_ylabel("UMAP 2")
    ax.legend(fontsize=8)

plt.suptitle("UMAP: Varying n_neighbors", fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig("ch15_ex3_umap_neighbors.png", dpi=150)
plt.show()

# ---- (b) UMAP with varying min_dist, n_neighbors = 15 ----
min_dist_vals = [0.0, 0.1, 0.5, 1.0]

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for ax, md in zip(axes, min_dist_vals):
    umap_emb = UMAP(n_components=2, n_neighbors=15, min_dist=md,
                      random_state=42).fit_transform(X)
    for lab in ["Type A", "Type B", "Type C"]:
        mask = labels == lab
        ax.scatter(umap_emb[mask, 0], umap_emb[mask, 1], c=colors[lab],
                   s=15, alpha=0.7, label=lab)
    ax.set_title(f"n_neighbors = 15, min_dist = {md}")
    ax.set_xlabel("UMAP 1")
    ax.set_ylabel("UMAP 2")
    ax.legend(fontsize=8)

plt.suptitle("UMAP: Varying min_dist", fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig("ch15_ex3_umap_mindist.png", dpi=150)
plt.show()

# ---- (d) Description and recommendation ----
print("=== Part (d): How Each Parameter Affects Visual Appearance ===\n")

print("n_neighbors (with min_dist = 0.1 fixed):")
print("  n_neighbors = 5:   Very tight, fragmented clusters. Emphasises")
print("    micro-structure, may split true clusters.")
print("  n_neighbors = 15:  Well-separated, compact clusters. Good balance")
print("    between local and global structure.")
print("  n_neighbors = 50:  Broader clusters, more connected. Groups")
print("    start to merge as wider neighbourhoods are considered.")
print("  n_neighbors = 100: Spread out. Global structure emphasised.\n")

print("min_dist (with n_neighbors = 15 fixed):")
print("  min_dist = 0.0:  Very tight, dense clusters. Maximum separation.")
print("  min_dist = 0.1:  Slightly looser. Good default.")
print("  min_dist = 0.5:  Spread-out embedding. Internal structure visible")
print("    but inter-cluster gaps shrink.")
print("  min_dist = 1.0:  Very spread out, near-uniform. Clusters overlap.\n")

print("Recommendation for this dataset:")
print("  n_neighbors = 15, min_dist = 0.1")
print("  This provides well-separated, compact clusters that clearly")
print("  reveal the three-group structure. It is the default for good")
print("  reason: it balances local detail with global structure.")

D.11.4 Exercise 4

Show R solution

# Chapter 15, Exercise 4: Full Workflow on Simulated Genomic Data
# 1000 patients, 500 gene expression features, 4 cancer subtypes

library(tidyverse)
library(Rtsne)
library(uwot)

# ---- Simulate data ----
set.seed(42)
n <- 1000
p <- 500

# 4 cancer subtypes with different prevalences (one rare at 5%)
subtype_probs <- c(0.35, 0.30, 0.30, 0.05)
subtype <- sample(1:4, n, replace = TRUE, prob = subtype_probs)

cat("Subtype distribution:\n")
print(table(subtype))

# Generate base gene expression
X <- matrix(rnorm(n * p), ncol = p)

# Add subtype-specific signals in different gene subsets
# Subtype 1: upregulated in genes 1-30
X[subtype == 1, 1:30] <- X[subtype == 1, 1:30] + 2.0

# Subtype 2: upregulated in genes 31-60, downregulated in 61-80
X[subtype == 2, 31:60] <- X[subtype == 2, 31:60] + 2.5
X[subtype == 2, 61:80] <- X[subtype == 2, 61:80] - 1.5

# Subtype 3: upregulated in genes 81-120
X[subtype == 3, 81:120] <- X[subtype == 3, 81:120] + 2.0

# Subtype 4 (rare): strong signal in genes 121-160
X[subtype == 4, 121:160] <- X[subtype == 4, 121:160] + 3.5

# ---- (a) Scale and PCA ----
cat("\n=== Part (a): PCA ===\n")
X_scaled <- scale(X)
pca_result <- prcomp(X_scaled)

var_prop <- pca_result$sdev^2 / sum(pca_result$sdev^2)
cum_var <- cumsum(var_prop)
n_80 <- which(cum_var >= 0.80)[1]

cat("PCs needed for 80% variance:", n_80, "\n")
cat("First 10 cumulative variance:", round(cum_var[1:10], 3), "\n")

# Scree plot (first 30 components)
par(mfrow = c(1, 1))
barplot(var_prop[1:30], names.arg = 1:30, col = "steelblue",
        main = "Scree Plot (first 30 PCs)",
        xlab = "Component", ylab = "Proportion of Variance")
abline(h = 1/p, col = "firebrick", lty = 2)

# ---- (b) UMAP on first 30 PCs ----
cat("\n=== Part (b): UMAP on First 30 PCs ===\n")
pca_30 <- pca_result$x[, 1:30]

set.seed(42)
umap_result <- umap(pca_30, n_neighbors = 15, min_dist = 0.1, verbose = FALSE)

subtype_labels <- factor(subtype, labels = c("Subtype 1", "Subtype 2",
                                              "Subtype 3", "Subtype 4 (rare)"))
cols <- c("steelblue", "firebrick", "forestgreen", "goldenrod")

plot_df <- tibble(
  UMAP1 = umap_result[, 1],
  UMAP2 = umap_result[, 2],
  Subtype = subtype_labels
)

# UMAP plot
par(mfrow = c(1, 1))
plot(umap_result, col = cols[subtype], pch = 16, cex = 0.7,
     main = "UMAP of Simulated Genomic Data (4 Cancer Subtypes)",
     xlab = "UMAP 1", ylab = "UMAP 2")
legend("topright", levels(subtype_labels), col = cols, pch = 16, cex = 0.8)

cat("The four subtypes are visually separable in the UMAP embedding.\n")
cat("Each subtype forms a distinct cluster, reflecting the different\n")
cat("gene expression profiles we simulated.\n")

# ---- (c) t-SNE comparison ----
cat("\n=== Part (c): t-SNE Comparison ===\n")

set.seed(42)
tsne_result <- Rtsne(pca_30, perplexity = 30, dims = 2,
                      verbose = FALSE, max_iter = 1000)

par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

plot(umap_result, col = cols[subtype], pch = 16, cex = 0.7,
     main = "UMAP (n_neighbors=15)", xlab = "UMAP 1", ylab = "UMAP 2")
legend("topright", levels(subtype_labels), col = cols, pch = 16, cex = 0.6)

plot(tsne_result$Y, col = cols[subtype], pch = 16, cex = 0.7,
     main = "t-SNE (perplexity=30)", xlab = "t-SNE 1", ylab = "t-SNE 2")
legend("topright", levels(subtype_labels), col = cols, pch = 16, cex = 0.6)

cat("Both UMAP and t-SNE successfully separate the four subtypes.\n")
cat("UMAP tends to preserve more global structure (relative distances\n")
cat("between clusters are more meaningful), while t-SNE may produce\n")
cat("more compact and well-separated clusters but with unreliable\n")
cat("inter-cluster distances.\n")

# ---- (d) Can the rare subtype be identified? ----
cat("\n=== Part (d): Identifying the Rare Subtype ===\n")

n_rare <- sum(subtype == 4)
cat("Rare subtype (Subtype 4) has", n_rare, "patients (5% of total).\n\n")

cat("YES, the rare subtype can be identified in the UMAP plot.\n")
cat("Despite having only ~50 patients, Subtype 4 forms a distinct\n")
cat("cluster (shown in goldenrod/yellow). This is because:\n")
cat("  1. The signal is strong (effect size = 3.5 in 40 genes)\n")
cat("  2. UMAP preserves local structure well, so even small groups\n")
cat("     remain cohesive\n")
cat("  3. The subtype's gene expression pattern is qualitatively\n")
cat("     different from the other subtypes (different genes)\n\n")
cat("In practice, rare subtypes CAN be identified if:\n")
cat("  - Their molecular profile is sufficiently distinct\n")
cat("  - The sample size is not too small (>20-30 patients)\n")
cat("  - Dimensionality reduction preserves the relevant structure\n")
cat("If the signal were weaker, the rare subtype might merge with\n")
cat("another group or be scattered as outliers.\n")

Show Python solution

# Chapter 15, Exercise 4: Full Workflow on Simulated Genomic Data
# 1000 patients, 500 gene expression features, 4 cancer subtypes

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from umap import UMAP

# ---- Simulate data ----
np.random.seed(42)
n = 1000
p = 500

# 4 cancer subtypes (one rare at 5%)
subtype = np.random.choice(4, n, p=[0.35, 0.30, 0.30, 0.05])
print("Subtype distribution:", {i: (subtype == i).sum() for i in range(4)})

# Generate base gene expression
X = np.random.normal(0, 1, (n, p))

# Add subtype-specific signals
X[subtype == 0, :30] += 2.0       # Subtype 1: genes 0-29
X[subtype == 1, 30:60] += 2.5     # Subtype 2: genes 30-59
X[subtype == 1, 60:80] -= 1.5     # Subtype 2: downregulated genes 60-79
X[subtype == 2, 80:120] += 2.0    # Subtype 3: genes 80-119
X[subtype == 3, 120:160] += 3.5   # Subtype 4 (rare): genes 120-159

# ---- (a) Scale and PCA ----
print("\n=== Part (a): PCA ===")
X_scaled = StandardScaler().fit_transform(X)
pca = PCA()
scores = pca.fit_transform(X_scaled)

cum_var = np.cumsum(pca.explained_variance_ratio_)
n_80 = np.argmax(cum_var >= 0.80) + 1
print(f"PCs needed for 80% variance: {n_80}")
print(f"First 10 cumulative variance: {np.round(cum_var[:10], 3)}")

# Scree plot
fig, ax = plt.subplots(figsize=(10, 4))
ax.bar(range(1, 31), pca.explained_variance_ratio_[:30],
       color="steelblue", edgecolor="white")
ax.axhline(y=1/p, color="firebrick", linestyle="--", label=f"1/p = {1/p:.4f}")
ax.set_xlabel("Component")
ax.set_ylabel("Proportion of Variance")
ax.set_title("Scree Plot (first 30 PCs)")
ax.legend()
plt.tight_layout()
plt.savefig("ch15_ex4_scree.png", dpi=150)
plt.show()

# ---- (b) UMAP on first 30 PCs ----
print("\n=== Part (b): UMAP on First 30 PCs ===")
pca_30 = scores[:, :30]
umap_2d = UMAP(n_components=2, n_neighbors=15, min_dist=0.1,
                random_state=42).fit_transform(pca_30)

subtype_names = ["Subtype 1", "Subtype 2", "Subtype 3", "Subtype 4 (rare)"]
colors_4 = ["steelblue", "firebrick", "forestgreen", "goldenrod"]

fig, ax = plt.subplots(figsize=(8, 6))
for s in range(4):
    mask = subtype == s
    ax.scatter(umap_2d[mask, 0], umap_2d[mask, 1], c=colors_4[s],
               s=12, alpha=0.6, label=subtype_names[s])
ax.set_xlabel("UMAP 1")
ax.set_ylabel("UMAP 2")
ax.set_title("UMAP of Simulated Genomic Data (4 Cancer Subtypes)")
ax.legend()
plt.tight_layout()
plt.savefig("ch15_ex4_umap.png", dpi=150)
plt.show()

print("The four subtypes are visually separable in the UMAP embedding.")

# ---- (c) t-SNE comparison ----
print("\n=== Part (c): t-SNE Comparison ===")
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=42,
                max_iter=1000).fit_transform(pca_30)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for s in range(4):
    mask = subtype == s
    axes[0].scatter(umap_2d[mask, 0], umap_2d[mask, 1], c=colors_4[s],
                    s=12, alpha=0.6, label=subtype_names[s])
    axes[1].scatter(tsne_2d[mask, 0], tsne_2d[mask, 1], c=colors_4[s],
                    s=12, alpha=0.6, label=subtype_names[s])

axes[0].set_title("UMAP (n_neighbors=15)")
axes[0].set_xlabel("UMAP 1")
axes[0].set_ylabel("UMAP 2")
axes[0].legend(fontsize=8)

axes[1].set_title("t-SNE (perplexity=30)")
axes[1].set_xlabel("t-SNE 1")
axes[1].set_ylabel("t-SNE 2")
axes[1].legend(fontsize=8)

plt.tight_layout()
plt.savefig("ch15_ex4_comparison.png", dpi=150)
plt.show()

print("Both UMAP and t-SNE successfully separate the four subtypes.")
print("UMAP preserves more global structure; t-SNE may produce more")
print("compact clusters but with unreliable inter-cluster distances.")

# ---- (d) Identifying the rare subtype ----
print("\n=== Part (d): Identifying the Rare Subtype ===")
n_rare = (subtype == 3).sum()
print(f"Rare subtype (Subtype 4) has {n_rare} patients (5% of total).\n")

print("YES, the rare subtype can be identified in the UMAP plot.")
print("Despite having only ~50 patients, Subtype 4 forms a distinct")
print("cluster (shown in goldenrod). This is because:")
print("  1. The signal is strong (effect size = 3.5 in 40 genes)")
print("  2. UMAP preserves local structure, so small groups remain cohesive")
print("  3. The gene expression pattern is qualitatively different\n")
print("In practice, rare subtypes CAN be identified if:")
print("  - Their molecular profile is sufficiently distinct")
print("  - Sample size is not too small (>20-30 patients)")
print("  - Dimensionality reduction preserves the relevant structure")
print("If the signal were weaker, the rare subtype might merge with")
print("another group or scatter as outliers.")

D.12 Chapter 16: Clustering

D.12.1 Exercise 1

Show R solution

# Chapter 16, Exercise 1: K-Means on Simulated Patient Data
# 600 patients, 6 clinical variables, 3 underlying phenotypes

library(tidyverse)
library(cluster)

# ---- Simulate data ----
set.seed(42)
n <- 600

# Phenotype 1: Septic shock (high HR, RR, lactate, low SBP)
p1 <- tibble(
  hr = rnorm(200, 115, 12), rr = rnorm(200, 28, 5),
  temp = rnorm(200, 38.5, 0.8), sbp = rnorm(200, 80, 12),
  creat = rnorm(200, 2.5, 0.8), lactate = rnorm(200, 5, 2)
)

# Phenotype 2: Stable critical (moderate vitals)
p2 <- tibble(
  hr = rnorm(200, 90, 10), rr = rnorm(200, 20, 3),
  temp = rnorm(200, 37.2, 0.5), sbp = rnorm(200, 110, 15),
  creat = rnorm(200, 1.2, 0.3), lactate = rnorm(200, 1.5, 0.5)
)

# Phenotype 3: Febrile, preserved hemodynamics
p3 <- tibble(
  hr = rnorm(200, 100, 10), rr = rnorm(200, 22, 4),
  temp = rnorm(200, 39.2, 0.7), sbp = rnorm(200, 120, 10),
  creat = rnorm(200, 1.0, 0.2), lactate = rnorm(200, 2.0, 0.8)
)

dat <- bind_rows(p1, p2, p3)
true_labels <- rep(1:3, each = 200)

# ---- (a) Scale and apply K-means for k = 2 to 6 ----
dat_scaled <- scale(dat)

wcss <- numeric(6)
sil_avg <- numeric(6)

for (k in 2:6) {
  km <- kmeans(dat_scaled, centers = k, nstart = 25)
  wcss[k] <- km$tot.withinss
  sil_avg[k] <- mean(silhouette(km$cluster, dist(dat_scaled))[, 3])
}

# ---- (b) Elbow and silhouette plots ----
par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

plot(2:6, wcss[2:6], type = "b", pch = 16, col = "steelblue",
     xlab = "Number of clusters (k)", ylab = "Within-cluster SS",
     main = "Elbow Method")

plot(2:6, sil_avg[2:6], type = "b", pch = 16, col = "firebrick",
     xlab = "Number of clusters (k)", ylab = "Average Silhouette",
     main = "Silhouette Scores")

cat("=== Part (b): Optimal k ===\n")
cat("Elbow plot: The elbow appears at k = 3, where WCSS decreases\n")
cat("  sharply and then flattens.\n")
cat("Silhouette: Maximum average silhouette at k =", which.max(sil_avg[2:6]) + 1, "\n")
cat("Both methods suggest k = 3, consistent with the 3 simulated phenotypes.\n")

# ---- (c) Visualise k=3 solution using PCA ----
km3 <- kmeans(dat_scaled, centers = 3, nstart = 25)
pca2 <- prcomp(dat_scaled)$x[, 1:2]

par(mfrow = c(1, 1))
cols <- c("steelblue", "firebrick", "forestgreen")
plot(pca2, col = cols[km3$cluster], pch = 16, cex = 0.7,
     main = "K-means (k=3) on First 2 PCs",
     xlab = "PC1", ylab = "PC2")
legend("topright", paste("Cluster", 1:3), col = cols, pch = 16, cex = 0.8)

# ---- (d) Profile the clusters ----
cat("\n=== Part (d): Cluster Profiles ===\n\n")

dat$cluster <- km3$cluster
profiles <- dat %>%
  group_by(cluster) %>%
  summarise(across(hr:lactate, mean), n = n(), .groups = "drop")

print(profiles)

cat("\nClinical interpretation of cluster profiles:\n\n")

# Identify which cluster matches which phenotype
for (cl in 1:3) {
  prof <- profiles %>% filter(cluster == cl)
  cat(sprintf("Cluster %d (n = %d):\n", cl, prof$n))
  cat(sprintf("  HR=%.0f, RR=%.0f, Temp=%.1f, SBP=%.0f, Creat=%.1f, Lactate=%.1f\n",
              prof$hr, prof$rr, prof$temp, prof$sbp, prof$creat, prof$lactate))

  if (prof$lactate > 3 && prof$sbp < 90) {
    cat("  -> Matches SEPTIC SHOCK phenotype: high HR, RR, lactate; low SBP\n")
  } else if (prof$temp > 38.5) {
    cat("  -> Matches FEBRILE phenotype: high temp, preserved hemodynamics\n")
  } else {
    cat("  -> Matches STABLE CRITICAL phenotype: moderate vitals\n")
  }
  cat("\n")
}

cat("The cluster profiles make clinical sense. The algorithm has\n")
cat("successfully recovered the three simulated phenotypes, each\n")
cat("with a distinct clinical profile that a clinician would recognise.\n")

Show Python solution

# Chapter 16, Exercise 1: K-Means on Simulated Patient Data
# 600 patients, 6 clinical variables, 3 underlying phenotypes

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

# ---- Simulate data ----
np.random.seed(42)

# Phenotype 1: Septic shock
p1 = np.column_stack([
    np.random.normal(115, 12, 200),   # hr
    np.random.normal(28, 5, 200),     # rr
    np.random.normal(38.5, 0.8, 200), # temp
    np.random.normal(80, 12, 200),    # sbp
    np.random.normal(2.5, 0.8, 200),  # creat
    np.random.normal(5, 2, 200)       # lactate
])

# Phenotype 2: Stable critical
p2 = np.column_stack([
    np.random.normal(90, 10, 200),
    np.random.normal(20, 3, 200),
    np.random.normal(37.2, 0.5, 200),
    np.random.normal(110, 15, 200),
    np.random.normal(1.2, 0.3, 200),
    np.random.normal(1.5, 0.5, 200)
])

# Phenotype 3: Febrile
p3 = np.column_stack([
    np.random.normal(100, 10, 200),
    np.random.normal(22, 4, 200),
    np.random.normal(39.2, 0.7, 200),
    np.random.normal(120, 10, 200),
    np.random.normal(1.0, 0.2, 200),
    np.random.normal(2.0, 0.8, 200)
])

X = np.vstack([p1, p2, p3])
cols = ['hr', 'rr', 'temp', 'sbp', 'creat', 'lactate']
df = pd.DataFrame(X, columns=cols)

# ---- (a) Scale and K-means for k = 2 to 6 ----
X_scaled = StandardScaler().fit_transform(X)

ks = range(2, 7)
wcss_list = []
sil_list = []

for k in ks:
    km = KMeans(n_clusters=k, n_init=25, random_state=42)
    km.fit(X_scaled)
    wcss_list.append(km.inertia_)
    sil_list.append(silhouette_score(X_scaled, km.labels_))

# ---- (b) Elbow and silhouette plots ----
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].plot(list(ks), wcss_list, "o-", color="steelblue")
axes[0].set_xlabel("Number of clusters (k)")
axes[0].set_ylabel("Within-cluster SS")
axes[0].set_title("Elbow Method")

axes[1].plot(list(ks), sil_list, "o-", color="firebrick")
axes[1].set_xlabel("Number of clusters (k)")
axes[1].set_ylabel("Average Silhouette")
axes[1].set_title("Silhouette Scores")

plt.tight_layout()
plt.savefig("ch16_ex1_elbow_silhouette.png", dpi=150)
plt.show()

best_k = list(ks)[np.argmax(sil_list)]
print(f"=== Part (b): Optimal k ===")
print(f"Elbow: Elbow appears at k = 3")
print(f"Silhouette: Maximum at k = {best_k}")
print(f"Both methods suggest k = 3, matching the 3 simulated phenotypes.")

# ---- (c) Visualise k=3 with PCA ----
km3 = KMeans(n_clusters=3, n_init=25, random_state=42).fit(X_scaled)
pca2 = PCA(n_components=2).fit_transform(X_scaled)

fig, ax = plt.subplots(figsize=(8, 6))
colors_3 = ["steelblue", "firebrick", "forestgreen"]
for c in range(3):
    mask = km3.labels_ == c
    ax.scatter(pca2[mask, 0], pca2[mask, 1], c=colors_3[c],
               s=15, alpha=0.6, label=f"Cluster {c+1}")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("K-means (k=3) on First 2 PCs")
ax.legend()
plt.tight_layout()
plt.savefig("ch16_ex1_pca_clusters.png", dpi=150)
plt.show()

# ---- (d) Profile the clusters ----
print("\n=== Part (d): Cluster Profiles ===\n")

df['cluster'] = km3.labels_
profiles = df.groupby('cluster')[cols].mean()
print(profiles.round(1))

print("\nClinical interpretation:")
for cl in range(3):
    prof = profiles.loc[cl]
    n_cl = (km3.labels_ == cl).sum()
    print(f"\nCluster {cl+1} (n = {n_cl}):")
    print(f"  HR={prof['hr']:.0f}, RR={prof['rr']:.0f}, Temp={prof['temp']:.1f}, "
          f"SBP={prof['sbp']:.0f}, Creat={prof['creat']:.1f}, Lactate={prof['lactate']:.1f}")

    if prof['lactate'] > 3 and prof['sbp'] < 90:
        print("  -> SEPTIC SHOCK: high HR, RR, lactate; low SBP")
    elif prof['temp'] > 38.5:
        print("  -> FEBRILE: high temperature, preserved hemodynamics")
    else:
        print("  -> STABLE CRITICAL: moderate vitals")

print("\nThe cluster profiles make clinical sense. The algorithm")
print("successfully recovered the three simulated phenotypes.")

D.12.2 Exercise 2

Show R solution

# Chapter 16, Exercise 2: Hierarchical Clustering with Different Linkage Methods
# Using the same dataset from Exercise 1

library(tidyverse)
library(cluster)
library(mclust)  # for adjustedRandIndex

# ---- Simulate data (same as Exercise 1) ----
set.seed(42)

p1 <- tibble(hr = rnorm(200, 115, 12), rr = rnorm(200, 28, 5),
             temp = rnorm(200, 38.5, 0.8), sbp = rnorm(200, 80, 12),
             creat = rnorm(200, 2.5, 0.8), lactate = rnorm(200, 5, 2))
p2 <- tibble(hr = rnorm(200, 90, 10), rr = rnorm(200, 20, 3),
             temp = rnorm(200, 37.2, 0.5), sbp = rnorm(200, 110, 15),
             creat = rnorm(200, 1.2, 0.3), lactate = rnorm(200, 1.5, 0.5))
p3 <- tibble(hr = rnorm(200, 100, 10), rr = rnorm(200, 22, 4),
             temp = rnorm(200, 39.2, 0.7), sbp = rnorm(200, 120, 10),
             creat = rnorm(200, 1.0, 0.2), lactate = rnorm(200, 2.0, 0.8))

dat <- bind_rows(p1, p2, p3)
dat_scaled <- scale(dat)

# K-means reference solution
km3 <- kmeans(dat_scaled, centers = 3, nstart = 25)

# ---- (a) Hierarchical clustering with 4 linkage methods ----
cat("=== Part (a): Hierarchical Clustering ===\n\n")

d <- dist(dat_scaled)
methods <- c("single", "complete", "average", "ward.D2")
titles <- c("Single", "Complete", "Average", "Ward's")

# Use a subset for readable dendrograms
set.seed(42)
idx <- sample(nrow(dat_scaled), 80)

par(mfrow = c(2, 2), mar = c(2, 3, 3, 1))
for (i in seq_along(methods)) {
  hc <- hclust(dist(dat_scaled[idx, ]), method = methods[i])
  plot(hc, labels = FALSE, main = titles[i], xlab = "", sub = "",
       hang = -1, cex = 0.5)
  rect.hclust(hc, k = 3, border = "firebrick")
}

# ---- (b) Cut each dendrogram at k=3, compare assignments ----
cat("=== Part (b): Cluster Assignments ===\n\n")

hc_clusters <- list()
for (i in seq_along(methods)) {
  hc <- hclust(d, method = methods[i])
  hc_clusters[[methods[i]]] <- cutree(hc, k = 3)
}

# Crosstabulation between methods
cat("Crosstab: Ward's vs Complete linkage:\n")
print(table(Ward = hc_clusters[["ward.D2"]],
            Complete = hc_clusters[["complete"]]))

cat("\nCrosstab: Ward's vs Single linkage:\n")
print(table(Ward = hc_clusters[["ward.D2"]],
            Single = hc_clusters[["single"]]))

# ---- (c) ARI comparison with K-means ----
cat("\n=== Part (c): Adjusted Rand Index vs K-means ===\n\n")

for (i in seq_along(methods)) {
  ari <- adjustedRandIndex(km3$cluster, hc_clusters[[methods[i]]])
  cat(sprintf("ARI(%s vs K-means): %.3f\n", titles[i], ari))
}

cat("\nWard's method produces clusters most similar to K-means (highest ARI).\n")
cat("This is expected because both Ward's method and K-means minimise\n")
cat("within-cluster variance (WCSS), so they have similar objectives.\n")

# ---- (d) Recommendation for clinical data ----
cat("\n=== Part (d): Recommendation ===\n\n")

cat("Ward's method is the recommended default for clinical data because:\n\n")
cat("1. OBJECTIVE: Ward's minimises within-cluster variance, which\n")
cat("   aligns with the goal of finding compact, homogeneous clusters.\n\n")
cat("2. CONSISTENCY: It produces results most similar to K-means,\n")
cat("   which is the most widely used method. Using both and checking\n")
cat("   for agreement strengthens confidence in the results.\n\n")
cat("3. ROBUSTNESS: Complete linkage can be distorted by outliers;\n")
cat("   single linkage creates chaining effects (long, straggling\n")
cat("   clusters). Ward's avoids both issues.\n\n")
cat("4. COMPACT CLUSTERS: Clinical phenotypes are typically expected\n")
cat("   to be compact groups of similar patients, which Ward's\n")
cat("   naturally produces.\n\n")
cat("5. INTERPRETABILITY: The dendrogram from Ward's method is\n")
cat("   usually the easiest to interpret, with clear height gaps\n")
cat("   indicating natural cluster boundaries.\n\n")
cat("CAVEAT: If you suspect non-spherical clusters (e.g., disease\n")
cat("trajectories that form elongated shapes), average linkage\n")
cat("might be more appropriate. But for most clinical phenotyping\n")
cat("applications, Ward's is the safe default.\n")

Show Python solution

# Chapter 16, Exercise 2: Hierarchical Clustering with Different Linkage Methods
# Using the same dataset from Exercise 1

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

# ---- Simulate data (same as Exercise 1) ----
np.random.seed(42)

p1 = np.column_stack([
    np.random.normal(115, 12, 200), np.random.normal(28, 5, 200),
    np.random.normal(38.5, 0.8, 200), np.random.normal(80, 12, 200),
    np.random.normal(2.5, 0.8, 200), np.random.normal(5, 2, 200)
])
p2 = np.column_stack([
    np.random.normal(90, 10, 200), np.random.normal(20, 3, 200),
    np.random.normal(37.2, 0.5, 200), np.random.normal(110, 15, 200),
    np.random.normal(1.2, 0.3, 200), np.random.normal(1.5, 0.5, 200)
])
p3 = np.column_stack([
    np.random.normal(100, 10, 200), np.random.normal(22, 4, 200),
    np.random.normal(39.2, 0.7, 200), np.random.normal(120, 10, 200),
    np.random.normal(1.0, 0.2, 200), np.random.normal(2.0, 0.8, 200)
])

X = np.vstack([p1, p2, p3])
X_scaled = StandardScaler().fit_transform(X)

# K-means reference
km3 = KMeans(n_clusters=3, n_init=25, random_state=42).fit(X_scaled)

# ---- (a) Hierarchical clustering with 4 linkage methods ----
print("=== Part (a): Hierarchical Clustering ===\n")

methods = ['single', 'complete', 'average', 'ward']
titles = ['Single', 'Complete', 'Average', "Ward's"]

# Dendrograms (use subset for readability)
np.random.seed(42)
idx = np.random.choice(len(X_scaled), 80, replace=False)
X_sub = X_scaled[idx]

fig, axes = plt.subplots(2, 2, figsize=(16, 10))
axes = axes.flatten()

for ax, method, title in zip(axes, methods, titles):
    Z = linkage(X_sub, method=method)
    dendrogram(Z, ax=ax, no_labels=True, color_threshold=0)
    ax.set_title(title)
    ax.set_xlabel("")

plt.suptitle("Hierarchical Clustering: Linkage Method Comparison", fontsize=14)
plt.tight_layout()
plt.savefig("ch16_ex2_dendrograms.png", dpi=150)
plt.show()

# ---- (b) Cut at k=3 and compare ----
print("=== Part (b): Cluster Assignments ===\n")

hc_labels = {}
for method in methods:
    Z = linkage(X_scaled, method=method)
    hc_labels[method] = fcluster(Z, t=3, criterion='maxclust')

# Crosstabs
print("Crosstab: Ward's vs Complete:")
ct = pd.crosstab(pd.Series(hc_labels['ward'], name='Ward'),
                  pd.Series(hc_labels['complete'], name='Complete'))
print(ct)

print("\nCrosstab: Ward's vs Single:")
ct2 = pd.crosstab(pd.Series(hc_labels['ward'], name='Ward'),
                   pd.Series(hc_labels['single'], name='Single'))
print(ct2)

# ---- (c) ARI comparison with K-means ----
print("\n=== Part (c): Adjusted Rand Index vs K-means ===\n")

for method, title in zip(methods, titles):
    ari = adjusted_rand_score(km3.labels_, hc_labels[method])
    print(f"ARI({title} vs K-means): {ari:.3f}")

print("\nWard's method produces clusters most similar to K-means (highest ARI).")
print("Both minimise within-cluster variance, so they share similar objectives.")

# ---- (d) Recommendation ----
print("\n=== Part (d): Recommendation ===\n")

print("Ward's method is recommended for clinical data because:\n")
print("1. OBJECTIVE: Minimises within-cluster variance, producing")
print("   compact, homogeneous clusters.\n")
print("2. CONSISTENCY: Most similar to K-means, strengthening")
print("   confidence when both methods agree.\n")
print("3. ROBUSTNESS: Avoids chaining (single linkage) and outlier")
print("   sensitivity (complete linkage).\n")
print("4. COMPACT CLUSTERS: Clinical phenotypes are typically compact")
print("   groups, which Ward's naturally produces.\n")
print("5. INTERPRETABILITY: Dendrogram with clear height gaps at")
print("   natural cluster boundaries.\n")
print("CAVEAT: For non-spherical clusters, average linkage may be")
print("more appropriate. But Ward's is the safe default for most")
print("clinical phenotyping applications.")

D.12.3 Exercise 3

Show R solution

# Chapter 16, Exercise 3: DBSCAN for Outlier Detection
# Add outlier patients to the Exercise 1 dataset

library(tidyverse)
library(cluster)
library(dbscan)

# ---- Simulate data (same as Exercise 1) ----
set.seed(42)

p1 <- tibble(hr = rnorm(200, 115, 12), rr = rnorm(200, 28, 5),
             temp = rnorm(200, 38.5, 0.8), sbp = rnorm(200, 80, 12),
             creat = rnorm(200, 2.5, 0.8), lactate = rnorm(200, 5, 2))
p2 <- tibble(hr = rnorm(200, 90, 10), rr = rnorm(200, 20, 3),
             temp = rnorm(200, 37.2, 0.5), sbp = rnorm(200, 110, 15),
             creat = rnorm(200, 1.2, 0.3), lactate = rnorm(200, 1.5, 0.5))
p3 <- tibble(hr = rnorm(200, 100, 10), rr = rnorm(200, 22, 4),
             temp = rnorm(200, 39.2, 0.7), sbp = rnorm(200, 120, 10),
             creat = rnorm(200, 1.0, 0.2), lactate = rnorm(200, 2.0, 0.8))

dat <- bind_rows(p1, p2, p3)
is_outlier <- rep(FALSE, 600)

# Add 30 outlier patients with extreme values
set.seed(99)
outliers <- tibble(
  hr = rnorm(30, 180, 10),       # Extreme heart rate
  rr = rnorm(30, 40, 5),         # Extreme respiratory rate
  temp = rnorm(30, 40.5, 0.5),   # Very high temperature
  sbp = rnorm(30, 50, 10),       # Very low blood pressure
  creat = rnorm(30, 8, 1.5),     # Very high creatinine
  lactate = rnorm(30, 15, 3)     # Very high lactate
)

dat <- bind_rows(dat, outliers)
is_outlier <- c(is_outlier, rep(TRUE, 30))
dat_scaled <- scale(dat)

cat("Total patients:", nrow(dat), "(600 normal + 30 outliers)\n\n")

# ---- (a) K-means with k=3 ----
cat("=== Part (a): K-means with k=3 ===\n")

km3 <- kmeans(dat_scaled, centers = 3, nstart = 25)

# Where do outliers end up?
outlier_clusters <- km3$cluster[is_outlier]
cat("Outlier distribution across K-means clusters:\n")
print(table(outlier_clusters))

cat("\nK-means assigns outliers to one of the existing clusters,\n")
cat("typically the cluster whose centroid is nearest (even if far).\n")
cat("This can distort the centroid and corrupt the cluster profiles.\n")

# Show how outliers affect cluster means
dat$km_cluster <- km3$cluster
dat$is_outlier <- is_outlier

for (cl in 1:3) {
  n_outlier_in_cl <- sum(dat$km_cluster == cl & dat$is_outlier)
  n_total_in_cl <- sum(dat$km_cluster == cl)
  cat(sprintf("Cluster %d: %d patients (%d outliers)\n",
              cl, n_total_in_cl, n_outlier_in_cl))
}

# ---- (b) DBSCAN ----
cat("\n=== Part (b): DBSCAN ===\n")

# Use kNNdistplot to help choose eps
kNNdistplot(dat_scaled, k = 5)
abline(h = 2.5, col = "firebrick", lty = 2)

# Run DBSCAN
db <- dbscan(dat_scaled, eps = 2.5, minPts = 10)

cat("DBSCAN results:\n")
cat("  Number of clusters:", max(db$cluster), "\n")
cat("  Noise points (outliers):", sum(db$cluster == 0), "\n")
cat("  Actual outliers detected as noise:",
    sum(db$cluster == 0 & is_outlier), "out of 30\n")

cat("\nDBSCAN cluster sizes:\n")
print(table(db$cluster))

# ---- (c) Compare non-outlier assignments ----
cat("\n=== Part (c): Comparison for Non-Outlier Patients ===\n")

# Get cluster assignments for non-outlier patients only
normal_idx <- !is_outlier
km_normal <- km3$cluster[normal_idx]
db_normal <- db$cluster[normal_idx]

# For DBSCAN, remove any normal patients assigned as noise
db_noise_normal <- sum(db_normal == 0)
cat("Normal patients misclassified as noise by DBSCAN:", db_noise_normal, "\n")

# ARI for non-noise, non-outlier patients
both_assigned <- db_normal > 0
if (sum(both_assigned) > 0) {
  # Use mclust for ARI
  library(mclust)
  ari <- adjustedRandIndex(km_normal[both_assigned], db_normal[both_assigned])
  cat("ARI between K-means and DBSCAN (non-outlier, non-noise):", round(ari, 3), "\n")
}

# ---- (d) Which approach for outlier handling? ----
cat("\n=== Part (d): Preferred Approach for Outlier Handling ===\n\n")

cat("DBSCAN is preferred for clinical phenotyping when outliers are expected.\n\n")

cat("Reasons:\n")
cat("1. EXPLICIT OUTLIER IDENTIFICATION: DBSCAN labels outliers as noise,\n")
cat("   making them visible for clinical review. K-means silently absorbs\n")
cat("   them into clusters, distorting the results.\n\n")

cat("2. CLINICAL RELEVANCE: Outlier patients (e.g., with extreme vitals)\n")
cat("   may represent data errors, rare presentations, or patients who\n")
cat("   need individual assessment. Identifying them is valuable.\n\n")

cat("3. CLUSTER INTEGRITY: By excluding outliers, DBSCAN preserves the\n")
cat("   purity of the main clusters. K-means cluster centroids can be\n")
cat("   pulled by outliers, producing misleading profiles.\n\n")

cat("4. PRACTICAL WORKFLOW:\n")
cat("   - Use DBSCAN to identify outliers first.\n")
cat("   - Review outliers clinically (data errors? rare cases?).\n")
cat("   - Apply K-means to the remaining non-outlier patients for\n")
cat("     cleaner phenotyping.\n\n")

cat("5. CAVEAT: DBSCAN requires tuning eps and minPts, which can be\n")
cat("   challenging. The kNN distance plot helps, but the choice\n")
cat("   affects how many points are labelled as noise.\n")

# Visualise comparison
pca2 <- prcomp(dat_scaled)$x[, 1:2]

par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

# K-means
cols_km <- c("steelblue", "firebrick", "forestgreen")[km3$cluster]
pch_km <- ifelse(is_outlier, 4, 16)
plot(pca2, col = cols_km, pch = pch_km, cex = 0.7,
     main = "K-means (k=3)\n(X = outlier patients)",
     xlab = "PC1", ylab = "PC2")

# DBSCAN
db_cols <- c("grey50", "steelblue", "firebrick", "forestgreen")
cols_db <- db_cols[db$cluster + 1]
pch_db <- ifelse(db$cluster == 0, 4, 16)
plot(pca2, col = cols_db, pch = pch_db, cex = 0.7,
     main = "DBSCAN\n(X/grey = noise)",
     xlab = "PC1", ylab = "PC2")
legend("topright", c("Noise", "Cluster 1", "Cluster 2", "Cluster 3"),
       col = db_cols, pch = c(4, 16, 16, 16), cex = 0.7)

Show Python solution

# Chapter 16, Exercise 3: DBSCAN for Outlier Detection
# Add outlier patients to the Exercise 1 dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

# ---- Simulate data (same as Exercise 1) ----
np.random.seed(42)

p1 = np.column_stack([
    np.random.normal(115, 12, 200), np.random.normal(28, 5, 200),
    np.random.normal(38.5, 0.8, 200), np.random.normal(80, 12, 200),
    np.random.normal(2.5, 0.8, 200), np.random.normal(5, 2, 200)
])
p2 = np.column_stack([
    np.random.normal(90, 10, 200), np.random.normal(20, 3, 200),
    np.random.normal(37.2, 0.5, 200), np.random.normal(110, 15, 200),
    np.random.normal(1.2, 0.3, 200), np.random.normal(1.5, 0.5, 200)
])
p3 = np.column_stack([
    np.random.normal(100, 10, 200), np.random.normal(22, 4, 200),
    np.random.normal(39.2, 0.7, 200), np.random.normal(120, 10, 200),
    np.random.normal(1.0, 0.2, 200), np.random.normal(2.0, 0.8, 200)
])

X_normal = np.vstack([p1, p2, p3])
is_outlier = np.array([False] * 600)

# Add 30 outlier patients
np.random.seed(99)
outliers = np.column_stack([
    np.random.normal(180, 10, 30),     # Extreme HR
    np.random.normal(40, 5, 30),       # Extreme RR
    np.random.normal(40.5, 0.5, 30),   # Very high temp
    np.random.normal(50, 10, 30),      # Very low SBP
    np.random.normal(8, 1.5, 30),      # Very high creatinine
    np.random.normal(15, 3, 30)        # Very high lactate
])

X = np.vstack([X_normal, outliers])
is_outlier = np.concatenate([is_outlier, np.array([True] * 30)])
X_scaled = StandardScaler().fit_transform(X)

print(f"Total patients: {len(X)} (600 normal + 30 outliers)\n")

# ---- (a) K-means with k=3 ----
print("=== Part (a): K-means with k=3 ===")

km3 = KMeans(n_clusters=3, n_init=25, random_state=42).fit(X_scaled)

outlier_clusters = km3.labels_[is_outlier]
print("Outlier distribution across K-means clusters:")
for c in range(3):
    n_out = (outlier_clusters == c).sum()
    n_tot = (km3.labels_ == c).sum()
    print(f"  Cluster {c+1}: {n_tot} patients ({n_out} outliers)")

print("\nK-means assigns outliers to existing clusters, distorting centroids.")

# ---- (b) DBSCAN ----
print("\n=== Part (b): DBSCAN ===")

# kNN distance plot to choose eps
nn = NearestNeighbors(n_neighbors=5).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
knn_dist = np.sort(distances[:, -1])

plt.figure(figsize=(8, 4))
plt.plot(knn_dist, color="steelblue")
plt.axhline(y=2.5, color="firebrick", linestyle="--", label="eps = 2.5")
plt.xlabel("Points (sorted)")
plt.ylabel("5-NN Distance")
plt.title("kNN Distance Plot for eps Selection")
plt.legend()
plt.tight_layout()
plt.savefig("ch16_ex3_knn_dist.png", dpi=150)
plt.show()

# Run DBSCAN
db = DBSCAN(eps=2.5, min_samples=10).fit(X_scaled)

n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_noise = (db.labels_ == -1).sum()
outliers_as_noise = ((db.labels_ == -1) & is_outlier).sum()

print(f"Number of clusters: {n_clusters}")
print(f"Noise points (outliers): {n_noise}")
print(f"Actual outliers detected as noise: {outliers_as_noise} out of 30")

# ---- (c) Compare non-outlier assignments ----
print("\n=== Part (c): Comparison for Non-Outlier Patients ===")

normal_mask = ~is_outlier
km_normal = km3.labels_[normal_mask]
db_normal = db.labels_[normal_mask]

db_noise_normal = (db_normal == -1).sum()
print(f"Normal patients misclassified as noise: {db_noise_normal}")

# ARI for non-noise, non-outlier patients
both_assigned = db_normal >= 0
if both_assigned.sum() > 0:
    ari = adjusted_rand_score(km_normal[both_assigned], db_normal[both_assigned])
    print(f"ARI (K-means vs DBSCAN, non-outlier, non-noise): {ari:.3f}")

# ---- Visualise comparison ----
pca2 = PCA(n_components=2).fit_transform(X_scaled)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# K-means
colors_km = ["steelblue", "firebrick", "forestgreen"]
for c in range(3):
    mask = (km3.labels_ == c) & ~is_outlier
    axes[0].scatter(pca2[mask, 0], pca2[mask, 1], c=colors_km[c], s=12, alpha=0.6)
    mask_out = (km3.labels_ == c) & is_outlier
    axes[0].scatter(pca2[mask_out, 0], pca2[mask_out, 1], c=colors_km[c],
                    marker="x", s=50, linewidths=2)
axes[0].set_title("K-means (k=3)\n(X = outlier patients)")
axes[0].set_xlabel("PC1")
axes[0].set_ylabel("PC2")

# DBSCAN
color_map = {-1: "grey", 0: "steelblue", 1: "firebrick", 2: "forestgreen"}
for c_val in sorted(set(db.labels_)):
    mask = db.labels_ == c_val
    marker = "x" if c_val == -1 else "o"
    label = "Noise" if c_val == -1 else f"Cluster {c_val+1}"
    axes[1].scatter(pca2[mask, 0], pca2[mask, 1],
                    c=color_map.get(c_val, "purple"),
                    marker=marker, s=15 if c_val >= 0 else 40,
                    alpha=0.6, label=label)
axes[1].set_title("DBSCAN\n(X/grey = noise)")
axes[1].set_xlabel("PC1")
axes[1].set_ylabel("PC2")
axes[1].legend(fontsize=8)

plt.tight_layout()
plt.savefig("ch16_ex3_comparison.png", dpi=150)
plt.show()

# ---- (d) Preferred approach ----
print("\n=== Part (d): Preferred Approach ===\n")
print("DBSCAN is preferred for clinical phenotyping when outliers are expected.\n")
print("1. EXPLICIT OUTLIER IDENTIFICATION: DBSCAN labels outliers as noise,")
print("   making them visible. K-means silently absorbs them.\n")
print("2. CLINICAL RELEVANCE: Outlier patients may represent data errors,")
print("   rare presentations, or patients needing individual assessment.\n")
print("3. CLUSTER INTEGRITY: DBSCAN preserves cluster purity by excluding")
print("   outliers. K-means centroids can be pulled by outliers.\n")
print("4. PRACTICAL WORKFLOW: Use DBSCAN first to identify outliers,")
print("   review them clinically, then apply K-means to non-outliers.\n")
print("5. CAVEAT: DBSCAN requires tuning eps and min_samples, which")
print("   affects how many points are labelled as noise.")

D.12.4 Exercise 4

Show R solution

# Chapter 16, Exercise 4: Full Pipeline - PCA + Clustering + UMAP Visualisation
# 800 patients, 30 clinical variables, 4 underlying subtypes

library(tidyverse)
library(cluster)
library(uwot)

# ---- Simulate data ----
set.seed(42)
n <- 800
p <- 30

# 4 subtypes with different prevalences
subtype_probs <- c(0.30, 0.30, 0.25, 0.15)
true_subtype <- sample(1:4, n, replace = TRUE, prob = subtype_probs)

cat("True subtype distribution:\n")
print(table(true_subtype))

# Generate base data
X <- matrix(rnorm(n * p), ncol = p)

# Add subtype-specific signals in different feature subsets
for (s in 1:4) {
  feature_start <- (s - 1) * 6 + 1
  feature_end <- min(s * 6, p)
  X[true_subtype == s, feature_start:feature_end] <-
    X[true_subtype == s, feature_start:feature_end] + 2.5
}

# ---- (a) Standardise and PCA ----
cat("\n=== Part (a): PCA ===\n")

X_scaled <- scale(X)
pca_result <- prcomp(X_scaled)

var_prop <- pca_result$sdev^2 / sum(pca_result$sdev^2)
cum_var <- cumsum(var_prop)

# Scree plot
par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))
barplot(var_prop[1:15], names.arg = 1:15, col = "steelblue",
        main = "Scree Plot (first 15 PCs)",
        xlab = "Component", ylab = "Proportion of Variance")
abline(h = 1/p, col = "firebrick", lty = 2)

plot(1:15, cum_var[1:15], type = "b", pch = 16, col = "steelblue",
     main = "Cumulative Variance", xlab = "Components",
     ylab = "Cumulative Proportion", ylim = c(0, 1))
abline(h = 0.80, col = "firebrick", lty = 2)

n_80 <- which(cum_var >= 0.80)[1]
cat("PCs needed for 80% variance:", n_80, "\n")

# Use first 10 PCs for clustering
n_pcs <- 10
pca_scores <- pca_result$x[, 1:n_pcs]

# ---- (b) K-means on PCA scores ----
cat("\n=== Part (b): K-means on PCA Scores ===\n")

sil_scores <- numeric(5)
for (k in 2:5) {
  km <- kmeans(pca_scores, centers = k, nstart = 50)
  sil_scores[k] <- mean(silhouette(km$cluster, dist(pca_scores))[, 3])
  cat(sprintf("k = %d: Silhouette = %.3f\n", k, sil_scores[k]))
}

best_k <- which.max(sil_scores[2:5]) + 1
cat("Best k by silhouette:", best_k, "\n")

# Fit final clustering
km_final <- kmeans(pca_scores, centers = best_k, nstart = 50)

# ---- (c) UMAP visualisation ----
cat("\n=== Part (c): UMAP Visualisation ===\n")

set.seed(42)
umap_result <- umap(pca_scores, n_neighbors = 15, min_dist = 0.1,
                     verbose = FALSE)

plot_df <- tibble(
  UMAP1 = umap_result[, 1],
  UMAP2 = umap_result[, 2],
  Cluster = factor(km_final$cluster),
  True_subtype = factor(true_subtype)
)

cols4 <- c("steelblue", "firebrick", "forestgreen", "goldenrod")

par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

# Discovered clusters
plot(umap_result, col = cols4[km_final$cluster], pch = 16, cex = 0.6,
     main = "K-means Clusters on UMAP", xlab = "UMAP 1", ylab = "UMAP 2")
legend("topright", paste("Cluster", 1:best_k), col = cols4[1:best_k],
       pch = 16, cex = 0.7)

# True subtypes
plot(umap_result, col = cols4[true_subtype], pch = 16, cex = 0.6,
     main = "True Subtypes on UMAP", xlab = "UMAP 1", ylab = "UMAP 2")
legend("topright", paste("Subtype", 1:4), col = cols4, pch = 16, cex = 0.7)

# ---- (d) Cluster profiles ----
cat("\n=== Part (d): Cluster Profiles ===\n\n")

# Name variables for interpretability
var_names <- paste0("V", 1:p)
dat_df <- as.data.frame(X)
colnames(dat_df) <- var_names
dat_df$cluster <- km_final$cluster

# Compute mean of each variable by cluster
profiles <- dat_df %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean), n = n(), .groups = "drop")

cat("Cluster sizes:\n")
print(table(km_final$cluster))

cat("\nCluster means for key variables (first 24, showing subtype signals):\n\n")

# Show means for the signal variables
signal_vars <- paste0("V", 1:24)
profile_table <- profiles %>%
  select(cluster, n, all_of(signal_vars))

# Print in a compact format
for (cl in 1:best_k) {
  prof <- profile_table %>% filter(cluster == cl)
  cat(sprintf("Cluster %d (n = %d):\n", cl, prof$n))
  for (s in 1:4) {
    start <- (s - 1) * 6 + 1
    end <- s * 6
    vars <- paste0("V", start:end)
    means <- round(unlist(prof[vars]), 2)
    cat(sprintf("  Subtype %d signal vars (V%d-V%d): mean = %.2f\n",
                s, start, end, mean(means)))
  }
  cat("\n")
}

# ---- (e) Discussion ----
cat("=== Part (e): Validation Discussion ===\n\n")

cat("To validate these clusters in a real clinical study:\n\n")

cat("1. EXTERNAL OUTCOMES: Compare clusters on outcomes NOT used in\n")
cat("   clustering (mortality, length of stay, treatment response).\n")
cat("   If clusters predict outcomes they were not trained on, the\n")
cat("   structure is likely clinically meaningful.\n\n")

cat("2. STABILITY ANALYSIS: Resample the data (bootstrap) and re-run\n")
cat("   clustering. If the same patients consistently end up in the\n")
cat("   same clusters, the results are robust. If clusters change\n")
cat("   substantially across resamples, they may be artefacts.\n\n")

cat("3. INDEPENDENT REPLICATION: Apply the same pipeline to an\n")
cat("   independent dataset (different hospital, different time period).\n")
cat("   If similar clusters emerge, the findings are generalisable.\n\n")

cat("4. CLINICAL EXPERT REVIEW: Present cluster profiles to clinicians.\n")
cat("   Do the clusters correspond to recognisable patient types?\n")
cat("   Do they suggest different management strategies?\n\n")

cat("5. MULTIPLE ALGORITHMS: Compare results from K-means, hierarchical\n")
cat("   clustering, and DBSCAN. Agreement across methods strengthens\n")
cat("   confidence. Disagreement suggests fragile structure.\n\n")

cat("6. AVOID CIRCULAR REASONING: Do NOT validate clusters by showing\n")
cat("   they differ on the same variables used to create them.\n")
cat("   This is guaranteed by construction and proves nothing.\n")

Show Python solution

# Chapter 16, Exercise 4: Full Pipeline - PCA + Clustering + UMAP Visualisation
# 800 patients, 30 clinical variables, 4 underlying subtypes

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from umap import UMAP

# ---- Simulate data ----
np.random.seed(42)
n = 800
p = 30

# 4 subtypes
true_subtype = np.random.choice(4, n, p=[0.30, 0.30, 0.25, 0.15])
print("True subtype distribution:", {i: (true_subtype == i).sum() for i in range(4)})

X = np.random.normal(0, 1, (n, p))

# Add subtype-specific signals
for s in range(4):
    start = s * 6
    end = min((s + 1) * 6, p)
    X[true_subtype == s, start:end] += 2.5

# ---- (a) Standardise and PCA ----
print("\n=== Part (a): PCA ===")

X_scaled = StandardScaler().fit_transform(X)
pca = PCA()
scores = pca.fit_transform(X_scaled)

cum_var = np.cumsum(pca.explained_variance_ratio_)
n_80 = np.argmax(cum_var >= 0.80) + 1
print(f"PCs needed for 80% variance: {n_80}")

# Scree plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].bar(range(1, 16), pca.explained_variance_ratio_[:15],
            color="steelblue", edgecolor="white")
axes[0].axhline(y=1/p, color="firebrick", linestyle="--", label="1/p")
axes[0].set_xlabel("Component")
axes[0].set_ylabel("Proportion of Variance")
axes[0].set_title("Scree Plot")
axes[0].legend()

axes[1].plot(range(1, 16), cum_var[:15], "o-", color="steelblue")
axes[1].axhline(y=0.80, color="firebrick", linestyle="--", label="80%")
axes[1].set_xlabel("Components")
axes[1].set_ylabel("Cumulative Proportion")
axes[1].set_title("Cumulative Variance")
axes[1].legend()

plt.tight_layout()
plt.savefig("ch16_ex4_scree.png", dpi=150)
plt.show()

# Use 10 PCs
n_pcs = 10
pca_scores = scores[:, :n_pcs]

# ---- (b) K-means on PCA scores ----
print("\n=== Part (b): K-means on PCA Scores ===")

sil_list = []
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=50, random_state=42).fit(pca_scores)
    sil = silhouette_score(pca_scores, km.labels_)
    sil_list.append(sil)
    print(f"k = {k}: Silhouette = {sil:.3f}")

best_k = list(range(2, 6))[np.argmax(sil_list)]
print(f"Best k by silhouette: {best_k}")

# Fit final clustering
km_final = KMeans(n_clusters=best_k, n_init=50, random_state=42).fit(pca_scores)

# ---- (c) UMAP visualisation ----
print("\n=== Part (c): UMAP Visualisation ===")

umap_2d = UMAP(n_components=2, n_neighbors=15, min_dist=0.1,
                random_state=42).fit_transform(pca_scores)

colors_4 = ["steelblue", "firebrick", "forestgreen", "goldenrod"]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Discovered clusters
for c in range(best_k):
    mask = km_final.labels_ == c
    axes[0].scatter(umap_2d[mask, 0], umap_2d[mask, 1], c=colors_4[c],
                    s=10, alpha=0.6, label=f"Cluster {c+1}")
axes[0].set_title("K-means Clusters on UMAP")
axes[0].set_xlabel("UMAP 1")
axes[0].set_ylabel("UMAP 2")
axes[0].legend(fontsize=8)

# True subtypes
for s in range(4):
    mask = true_subtype == s
    axes[1].scatter(umap_2d[mask, 0], umap_2d[mask, 1], c=colors_4[s],
                    s=10, alpha=0.6, label=f"Subtype {s+1}")
axes[1].set_title("True Subtypes on UMAP")
axes[1].set_xlabel("UMAP 1")
axes[1].set_ylabel("UMAP 2")
axes[1].legend(fontsize=8)

plt.tight_layout()
plt.savefig("ch16_ex4_umap.png", dpi=150)
plt.show()

# ---- (d) Cluster profiles ----
print("\n=== Part (d): Cluster Profiles ===\n")

var_names = [f"V{i+1}" for i in range(p)]
df = pd.DataFrame(X, columns=var_names)
df['cluster'] = km_final.labels_

print("Cluster sizes:")
print(df['cluster'].value_counts().sort_index())

print("\nCluster means for signal variables:")
for cl in range(best_k):
    mask = df['cluster'] == cl
    n_cl = mask.sum()
    print(f"\nCluster {cl+1} (n = {n_cl}):")
    for s in range(4):
        start = s * 6
        end = (s + 1) * 6
        signal_vars = [f"V{i+1}" for i in range(start, min(end, p))]
        mean_val = df.loc[mask, signal_vars].mean().mean()
        print(f"  Subtype {s+1} signal vars (V{start+1}-V{min(end,p)}): mean = {mean_val:.2f}")

# ---- (e) Discussion ----
print("\n=== Part (e): Validation Discussion ===\n")

print("To validate these clusters in a real clinical study:\n")
print("1. EXTERNAL OUTCOMES: Compare clusters on outcomes NOT used in")
print("   clustering (mortality, LOS, treatment response). Clusters")
print("   that predict external outcomes are clinically meaningful.\n")
print("2. STABILITY ANALYSIS: Bootstrap resampling + re-clustering.")
print("   Consistent membership = robust; variable = artefact.\n")
print("3. INDEPENDENT REPLICATION: Apply pipeline to an independent")
print("   dataset (different hospital/time). Similar clusters = generalisable.\n")
print("4. CLINICAL EXPERT REVIEW: Do clusters correspond to")
print("   recognisable patient types? Do they suggest different management?\n")
print("5. MULTIPLE ALGORITHMS: Compare K-means, hierarchical, DBSCAN.")
print("   Agreement across methods strengthens confidence.\n")
print("6. AVOID CIRCULAR REASONING: Do NOT validate on the same")
print("   variables used for clustering. That proves nothing.")

D.13 Chapter 17: Causal Inference

D.13.1 Exercise 1

Show R solution

# =============================================================================
# Chapter 17 - Exercise 1: DAG Construction (Conceptual)
# Studying the relationship between ACE inhibitor use and Acute Kidney Injury
# =============================================================================

# This is a conceptual exercise. The answers are provided as detailed comments
# with optional code to illustrate DAG concepts using the dagitty package.

# --- Part (a): List at least five relevant variables ---
#
# 1. Baseline kidney function (eGFR / serum creatinine)
#    - Patients with worse kidney function are more likely to receive ACE inhibitors
#      (for renoprotective effects) AND are at higher risk of AKI.
#
# 2. Heart failure severity
#    - Heart failure is a major indication for ACE inhibitors AND independently
#      increases AKI risk (through haemodynamic changes).
#
# 3. Diabetes
#    - Diabetes is an indication for ACE inhibitors (nephroprotection) AND a risk
#      factor for AKI.
#
# 4. Age
#    - Older patients are more likely to be on ACE inhibitors and are at higher
#      risk of AKI.
#
# 5. Hypertension
#    - Primary indication for ACE inhibitors and can contribute to kidney injury.
#
# 6. Concomitant nephrotoxic drugs (e.g., NSAIDs, contrast agents)
#    - May be prescribed alongside or instead of ACE inhibitors and directly
#      cause AKI.
#
# 7. Volume status / dehydration
#    - Dehydration increases AKI risk and may influence whether ACE inhibitors
#      are held or continued.
#
# 8. Proteinuria
#    - An indication for ACE inhibitors and a marker of kidney disease severity
#      (related to AKI risk).

# --- Part (b): Draw a DAG ---
# We use the dagitty package to encode the DAG programmatically.

# install.packages("dagitty")
library(dagitty)

dag <- dagitty('dag {
  ACEi [exposure]
  AKI [outcome]
  eGFR [confounder]
  HeartFailure [confounder]
  Diabetes [confounder]
  Age [confounder]
  Hypertension [confounder]
  NephrotoxicDrugs [confounder]
  VolumeStatus [confounder]
  ICUAdmission [collider]

  Age -> ACEi

  Age -> AKI
  Age -> eGFR
  Age -> HeartFailure
  Age -> Diabetes

  eGFR -> ACEi
  eGFR -> AKI

  HeartFailure -> ACEi
  HeartFailure -> AKI
  HeartFailure -> VolumeStatus

  Diabetes -> ACEi
  Diabetes -> AKI
  Diabetes -> eGFR

  Hypertension -> ACEi
  Hypertension -> AKI

  NephrotoxicDrugs -> AKI

  VolumeStatus -> AKI

  ACEi -> AKI

  # Collider: ICU Admission is caused by both ACEi use patterns
  # and by AKI itself (patients with AKI go to ICU)
  ACEi -> ICUAdmission
  AKI -> ICUAdmission
}')

# Visualise the DAG (if running interactively)
# plot(dag)

# --- Part (c): Minimal sufficient adjustment set ---
# Using dagitty to compute the adjustment set automatically
adj_set <- adjustmentSets(dag, exposure = "ACEi", outcome = "AKI", type = "minimal")
cat("Minimal sufficient adjustment sets:\n")
print(adj_set)

# Explanation:
# The minimal adjustment set includes confounders that block all backdoor paths
# from ACEi to AKI. Based on our DAG, this includes:
# - Age, eGFR, HeartFailure, Diabetes, Hypertension
# These are the common causes of both ACEi use and AKI.
# We do NOT need to adjust for VolumeStatus (it's not a confounder of ACEi->AKI
# unless there's a direct path from ACEi to VolumeStatus).
# We do NOT need to adjust for NephrotoxicDrugs (it only affects AKI, not ACEi).

# --- Part (d): Identify a collider ---
#
# ICU Admission is a COLLIDER: it is caused by both ACEi use (patients on
# ACE inhibitors who develop complications may be admitted to ICU) and AKI
# (AKI itself is a reason for ICU admission).
#
# What happens if we adjust for ICU Admission?
# Conditioning on a collider OPENS a spurious path between ACEi and AKI.
# This would create "collider bias" (also known as Berkson's bias in clinical
# settings). Even if ACEi had no effect on AKI, adjusting for ICU admission
# would create an artificial association because:
# - Among ICU patients, if they were NOT admitted for AKI, they were more
#   likely admitted for ACEi-related reasons (and vice versa).
# - This induces a negative correlation between the two causes of the collider.
#
# Practical lesson: Do NOT adjust for variables that are consequences of both
# the exposure and the outcome (or that lie on causal paths from both).

cat("\n--- Summary ---\n")
cat("Key confounders to adjust for: Age, eGFR, Heart Failure, Diabetes, Hypertension\n")
cat("Collider to AVOID adjusting for: ICU Admission\n")
cat("Mediators to consider carefully: Volume status changes caused by ACEi\n")

Show Python solution

# =============================================================================
# Chapter 17 - Exercise 1: DAG Construction (Conceptual)
# Studying the relationship between ACE inhibitor use and Acute Kidney Injury
# =============================================================================

# This is a conceptual exercise. The answers are provided as detailed comments
# with optional code to illustrate DAG concepts.

# --- Part (a): List at least five relevant variables ---
#
# 1. Baseline kidney function (eGFR / serum creatinine)
#    - Patients with worse kidney function are more likely to receive ACE inhibitors
#      (for renoprotective effects) AND are at higher risk of AKI.
#
# 2. Heart failure severity
#    - Heart failure is a major indication for ACE inhibitors AND independently
#      increases AKI risk (through haemodynamic changes).
#
# 3. Diabetes
#    - Diabetes is an indication for ACE inhibitors (nephroprotection) AND a risk
#      factor for AKI.
#
# 4. Age
#    - Older patients are more likely to be on ACE inhibitors and are at higher
#      risk of AKI.
#
# 5. Hypertension
#    - Primary indication for ACE inhibitors and can contribute to kidney injury.
#
# 6. Concomitant nephrotoxic drugs (e.g., NSAIDs, contrast agents)
#    - May be prescribed alongside or instead of ACE inhibitors and directly
#      cause AKI.
#
# 7. Volume status / dehydration
#    - Dehydration increases AKI risk and may influence whether ACE inhibitors
#      are held or continued.
#
# 8. Proteinuria
#    - An indication for ACE inhibitors and a marker of kidney disease severity
#      (related to AKI risk).

# --- Part (b): Draw a DAG ---
# We represent the DAG as an adjacency list and visualise with networkx.

import networkx as nx
import matplotlib.pyplot as plt

# Define the DAG
dag = nx.DiGraph()

# Add edges representing causal relationships
edges = [
    # Age affects many variables
    ("Age", "ACEi"), ("Age", "AKI"), ("Age", "eGFR"),
    ("Age", "HeartFailure"), ("Age", "Diabetes"),
    # eGFR confounds ACEi-AKI
    ("eGFR", "ACEi"), ("eGFR", "AKI"),
    # Heart failure confounds ACEi-AKI
    ("HeartFailure", "ACEi"), ("HeartFailure", "AKI"),
    ("HeartFailure", "VolumeStatus"),
    # Diabetes confounds ACEi-AKI
    ("Diabetes", "ACEi"), ("Diabetes", "AKI"), ("Diabetes", "eGFR"),
    # Hypertension confounds ACEi-AKI
    ("Hypertension", "ACEi"), ("Hypertension", "AKI"),
    # Other causes of AKI
    ("NephrotoxicDrugs", "AKI"),
    ("VolumeStatus", "AKI"),
    # Treatment -> Outcome (causal effect of interest)
    ("ACEi", "AKI"),
    # Collider: ICU admission caused by both ACEi complications and AKI
    ("ACEi", "ICUAdmission"), ("AKI", "ICUAdmission"),
]

dag.add_edges_from(edges)

# Visualise the DAG
plt.figure(figsize=(12, 8))
pos = nx.spring_layout(dag, seed=42, k=2)
nx.draw(dag, pos, with_labels=True, node_color='lightblue',
        node_size=2000, font_size=9, font_weight='bold',
        arrows=True, arrowsize=20, edge_color='gray')
plt.title("DAG: ACE Inhibitor Use and Acute Kidney Injury")
plt.tight_layout()
plt.savefig("dag_acei_aki.png", dpi=300)
plt.show()

# --- Part (c): Minimal sufficient adjustment set ---
#
# To estimate the causal effect of ACEi on AKI, we need to block all
# backdoor paths (non-causal paths from ACEi to AKI).
#
# Backdoor paths from ACEi to AKI:
# 1. ACEi <- Age -> AKI
# 2. ACEi <- eGFR -> AKI
# 3. ACEi <- HeartFailure -> AKI
# 4. ACEi <- Diabetes -> AKI
# 5. ACEi <- Diabetes -> eGFR -> AKI
# 6. ACEi <- Hypertension -> AKI
# 7. ACEi <- Age -> eGFR -> AKI
# 8. ACEi <- Age -> HeartFailure -> AKI
# 9. ACEi <- Age -> Diabetes -> AKI
# 10. ACEi <- HeartFailure -> VolumeStatus -> AKI
#
# Minimal sufficient adjustment set:
# {Age, eGFR, HeartFailure, Diabetes, Hypertension}
#
# This blocks all backdoor paths. Note:
# - NephrotoxicDrugs only affects AKI (not a confounder), so no need to adjust.
# - VolumeStatus is blocked by conditioning on HeartFailure (the path goes
#   through HeartFailure).

print("Minimal sufficient adjustment set:")
print("  {Age, eGFR, HeartFailure, Diabetes, Hypertension}")

# --- Part (d): Identify a collider ---
#
# ICU Admission is a COLLIDER in the DAG:
#   ACEi -> ICUAdmission <- AKI
#
# Both ACEi use (through complications or monitoring) and AKI (as a
# critical condition) can cause ICU admission.
#
# What happens if we adjust for ICU Admission?
#
# Conditioning on a collider OPENS a spurious path between its causes.
# This creates "collider bias" (Berkson's bias):
#
# - Among ICU patients, if a patient was NOT admitted because of AKI,
#   they were more likely admitted due to ACEi-related issues (and vice versa).
# - This induces an artificial negative association between ACEi and AKI,
#   even if no true causal relationship exists.
#
# Practical lesson: Never adjust for variables that are consequences of both
# the exposure and the outcome. Restricting the analysis to ICU patients
# only would similarly introduce collider bias.

print("\nCollider identified: ICU Admission")
print("  ACEi -> ICUAdmission <- AKI")
print("  Adjusting for ICU admission would OPEN a spurious path (collider bias)")

print("\n--- Summary ---")
print("Key confounders to adjust for: Age, eGFR, Heart Failure, Diabetes, Hypertension")
print("Collider to AVOID adjusting for: ICU Admission")
print("Mediators to consider: Volume status changes caused by ACEi")

D.13.2 Exercise 2

Show R solution

# =============================================================================
# Chapter 17 - Exercise 2: Propensity Score Matching in R
# Beta-blocker use and 1-year mortality
# =============================================================================

library(tidyverse)
library(MatchIt)
library(cobalt)
library(broom)

# --- Simulate the dataset (from the exercise) ---
set.seed(123)
n <- 1500
exercise_dat <- tibble(
  age = rnorm(n, 70, 8),
  creatinine = rnorm(n, 1.2, 0.4),
  heart_failure = rbinom(n, 1, 0.35),
  prior_mi = rbinom(n, 1, 0.20),
  # Confounded treatment (beta-blocker use)
  treatment = rbinom(n, 1, plogis(-1 + 0.02*age + 0.3*heart_failure +
                                     0.5*prior_mi - 0.8*creatinine)),
  # Outcome: 1-year mortality
  death_1yr = rbinom(n, 1, plogis(-2 + 0.05*age + 0.4*heart_failure +
                                     0.6*prior_mi + 1.0*creatinine -
                                     0.7*treatment))
)

cat("Dataset dimensions:", nrow(exercise_dat), "patients\n")
cat("Treatment prevalence:", mean(exercise_dat$treatment), "\n")
cat("Outcome prevalence:", mean(exercise_dat$death_1yr), "\n\n")

# --- Part (a): Estimate propensity scores using logistic regression ---
ps_model <- glm(treatment ~ age + creatinine + heart_failure + prior_mi,
                data = exercise_dat, family = binomial)

exercise_dat$ps <- predict(ps_model, type = "response")

cat("Propensity score summary:\n")
print(summary(exercise_dat$ps))

# Visualise propensity score overlap
ggplot(exercise_dat, aes(x = ps, fill = factor(treatment))) +
  geom_density(alpha = 0.5) +
  labs(x = "Propensity Score", fill = "Beta-blocker",
       title = "Propensity Score Distribution by Treatment Group") +
  scale_fill_manual(values = c("#0072B2", "#D55E00"),
                    labels = c("No", "Yes")) +
  theme_minimal()

# --- Part (b): 1:1 nearest-neighbour matching with caliper of 0.2 SD ---
m_out <- matchit(treatment ~ age + creatinine + heart_failure + prior_mi,
                 data = exercise_dat,
                 method = "nearest",
                 distance = "glm",        # logistic regression PS
                 caliper = 0.2,           # 0.2 SD of logit PS
                 ratio = 1)               # 1:1 matching

cat("\nMatching summary:\n")
summary(m_out)

# --- Part (c): Love plot to assess balance ---
love.plot(m_out,
          thresholds = c(m = 0.1),    # SMD threshold of 0.1
          binary = "std",
          var.order = "unadjusted",
          title = "Covariate Balance: Before and After Matching",
          colors = c("#D55E00", "#0072B2"))

# Also check numeric balance
cat("\nBalance table:\n")
print(bal.tab(m_out, thresholds = c(m = 0.1)))

# --- Part (d): Estimate ATT for beta-blocker use on 1-year mortality ---
m_data <- match.data(m_out)

cat("\nMatched sample size:", nrow(m_data), "patients\n")
cat("Treated in matched sample:", sum(m_data$treatment == 1), "\n")
cat("Control in matched sample:", sum(m_data$treatment == 0), "\n")

# Outcome model in matched sample
outcome_model <- glm(death_1yr ~ treatment,
                     data = m_data,
                     family = binomial,
                     weights = weights)

cat("\nATT estimate (logistic regression in matched sample):\n")
result <- tidy(outcome_model, conf.int = TRUE, exponentiate = TRUE)
print(result)

# Risk difference (linear probability model)
rd_model <- lm(death_1yr ~ treatment, data = m_data, weights = weights)
cat("\nRisk difference estimate:\n")
print(tidy(rd_model, conf.int = TRUE))

# Mortality rates in matched sample
cat("\nMortality in matched sample:\n")
cat("  Treated:", mean(m_data$death_1yr[m_data$treatment == 1]), "\n")
cat("  Control:", mean(m_data$death_1yr[m_data$treatment == 0]), "\n")

# --- Part (e): E-value calculation ---
# Extract the odds ratio from the matched analysis
or_est <- result$estimate[result$term == "treatment"]
or_lo <- result$conf.low[result$term == "treatment"]
or_hi <- result$conf.high[result$term == "treatment"]

cat("\nOdds Ratio:", round(or_est, 3),
    "(95% CI:", round(or_lo, 3), "-", round(or_hi, 3), ")\n")

# Convert OR to RR approximation (for rare outcomes, OR ~ RR)
# For the E-value, we need RR on the above-null scale
# Since treatment is protective (OR < 1), convert: RR_above_null = 1/OR
rr_est <- 1 / or_est
rr_lo_bound <- 1 / or_hi  # Note: CI bounds flip when inverting

# E-value formula: E = RR + sqrt(RR * (RR - 1))
e_value <- function(rr) {
  if (rr < 1) rr <- 1 / rr  # Convert to above-null
  return(rr + sqrt(rr * (rr - 1)))
}

e_val_point <- e_value(rr_est)

# E-value for the CI bound closest to null
# If protective, the upper bound of RR (lower bound of 1/OR) is closest to null
e_val_ci <- e_value(rr_lo_bound)

cat("\nE-value for point estimate:", round(e_val_point, 2), "\n")
cat("E-value for CI bound closest to null:", round(e_val_ci, 2), "\n")

# Interpretation
cat("\n--- E-value Interpretation ---\n")
cat("The E-value of", round(e_val_point, 2), "means that an unmeasured\n")
cat("confounder would need to be associated with BOTH beta-blocker use AND\n")
cat("1-year mortality by a risk ratio of at least", round(e_val_point, 2), "\n")
cat("(above and beyond measured confounders) to fully explain away the\n")
cat("observed protective association.\n")
cat("\nThe E-value for the confidence interval bound is", round(e_val_ci, 2), ",\n")
cat("meaning a confounder of that strength could shift the CI to include the null.\n")
cat("If these values are larger than plausible confounders, the result is robust.\n")

# Optional: use EValue package if available
# library(EValue)
# evalues.OR(or_est, lo = or_lo, hi = or_hi, rare = FALSE)

Show Python solution

# =============================================================================
# Chapter 17 - Exercise 2: Propensity Score Matching in Python
# Beta-blocker use and 1-year mortality
# =============================================================================

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from scipy import stats
import matplotlib.pyplot as plt

# --- Simulate the dataset (same DGP as the R exercise) ---
np.random.seed(123)
n = 1500

age = np.random.normal(70, 8, n)
creatinine = np.random.normal(1.2, 0.4, n)
heart_failure = np.random.binomial(1, 0.35, n)
prior_mi = np.random.binomial(1, 0.20, n)

# Confounded treatment assignment
ps_true = 1 / (1 + np.exp(-(-1 + 0.02*age + 0.3*heart_failure +
                               0.5*prior_mi - 0.8*creatinine)))
treatment = np.random.binomial(1, ps_true)

# Outcome
mort_prob = 1 / (1 + np.exp(-(-2 + 0.05*age + 0.4*heart_failure +
                                 0.6*prior_mi + 1.0*creatinine -
                                 0.7*treatment)))
death_1yr = np.random.binomial(1, mort_prob)

df = pd.DataFrame({
    'age': age, 'creatinine': creatinine, 'heart_failure': heart_failure,
    'prior_mi': prior_mi, 'treatment': treatment, 'death_1yr': death_1yr
})

print(f"Dataset: {len(df)} patients")
print(f"Treatment prevalence: {df['treatment'].mean():.3f}")
print(f"Outcome prevalence: {df['death_1yr'].mean():.3f}\n")

# --- Part (a): Estimate propensity scores ---
covariates = ['age', 'creatinine', 'heart_failure', 'prior_mi']
ps_model = LogisticRegression(max_iter=1000)
ps_model.fit(df[covariates], df['treatment'])
df['ps'] = ps_model.predict_proba(df[covariates])[:, 1]

# Compute logit of PS for caliper matching
df['logit_ps'] = np.log(df['ps'] / (1 - df['ps']))

print("Propensity score summary:")
print(df['ps'].describe())

# Visualise PS overlap
fig, ax = plt.subplots(figsize=(8, 5))
df[df['treatment'] == 0]['ps'].plot.kde(ax=ax, label='Control', color='#0072B2')
df[df['treatment'] == 1]['ps'].plot.kde(ax=ax, label='Treated', color='#D55E00')
ax.set_xlabel('Propensity Score')
ax.set_ylabel('Density')
ax.set_title('Propensity Score Distribution by Treatment Group')
ax.legend()
plt.tight_layout()
plt.savefig('ps_overlap.png', dpi=300)
plt.show()

# --- Part (b): 1:1 nearest-neighbour matching with caliper ---
treated = df[df['treatment'] == 1].copy()
control = df[df['treatment'] == 0].copy()

# Caliper: 0.2 * SD of logit PS
caliper = 0.2 * df['logit_ps'].std()
print(f"\nCaliper (0.2 SD of logit PS): {caliper:.4f}")

# Nearest-neighbour matching on logit PS
nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(control[['logit_ps']].values)

distances, indices = nn.kneighbors(treated[['logit_ps']].values)

# Apply caliper: keep only matches within the caliper
matched_treated_idx = []
matched_control_idx = []
used_controls = set()

for i in range(len(treated)):
    dist = distances[i, 0]
    ctrl_idx = indices[i, 0]
    if dist <= caliper and ctrl_idx not in used_controls:
        matched_treated_idx.append(treated.index[i])
        matched_control_idx.append(control.index[ctrl_idx])
        used_controls.add(ctrl_idx)

# Create matched dataset
matched_df = pd.concat([
    df.loc[matched_treated_idx],
    df.loc[matched_control_idx]
])

print(f"Matched sample: {len(matched_df)} patients")
print(f"  Treated: {len(matched_treated_idx)}")
print(f"  Control: {len(matched_control_idx)}")
print(f"  Unmatched treated: {len(treated) - len(matched_treated_idx)}")

# --- Part (c): Assess balance using SMD (Love plot equivalent) ---
def compute_smd(data, var, group_col):
    """Compute standardised mean difference."""
    treated = data[data[group_col] == 1][var]
    control = data[data[group_col] == 0][var]
    pooled_sd = np.sqrt((treated.var() + control.var()) / 2)
    if pooled_sd == 0:
        return 0
    return (treated.mean() - control.mean()) / pooled_sd

print("\nCovariate balance (SMD):")
print(f"{'Variable':<20} {'Before':>10} {'After':>10} {'Balanced?':>10}")
print("-" * 52)

smd_before = {}
smd_after = {}
for var in covariates:
    smd_b = compute_smd(df, var, 'treatment')
    smd_a = compute_smd(matched_df, var, 'treatment')
    smd_before[var] = smd_b
    smd_after[var] = smd_a
    balanced = "Yes" if abs(smd_a) < 0.1 else "No"
    print(f"{var:<20} {smd_b:>10.4f} {smd_a:>10.4f} {balanced:>10}")

# Love plot
fig, ax = plt.subplots(figsize=(8, 5))
y_pos = range(len(covariates))

ax.scatter([abs(smd_before[v]) for v in covariates], y_pos,
           color='#D55E00', label='Before matching', s=80, zorder=5)
ax.scatter([abs(smd_after[v]) for v in covariates], y_pos,
           color='#0072B2', label='After matching', s=80, zorder=5)

# Connect before and after
for i, var in enumerate(covariates):
    ax.plot([abs(smd_before[var]), abs(smd_after[var])], [i, i],
            'k-', alpha=0.3)

ax.axvline(x=0.1, color='red', linestyle='--', alpha=0.5, label='SMD = 0.1 threshold')
ax.set_yticks(y_pos)
ax.set_yticklabels(covariates)
ax.set_xlabel('Absolute Standardised Mean Difference')
ax.set_title('Love Plot: Covariate Balance Before and After Matching')
ax.legend(loc='lower right')
plt.tight_layout()
plt.savefig('love_plot.png', dpi=300)
plt.show()

# --- Part (d): Estimate ATT ---
# Mortality rates in matched sample
mort_treated = matched_df[matched_df['treatment'] == 1]['death_1yr'].mean()
mort_control = matched_df[matched_df['treatment'] == 0]['death_1yr'].mean()

print(f"\nMortality in matched sample:")
print(f"  Treated (beta-blocker): {mort_treated:.4f}")
print(f"  Control (no beta-blocker): {mort_control:.4f}")
print(f"  Risk difference (ATT): {mort_treated - mort_control:.4f}")

# Odds ratio
from statsmodels.api import Logit, add_constant

X = add_constant(matched_df['treatment'])
logit_model = Logit(matched_df['death_1yr'], X).fit(disp=0)
print(f"\nLogistic regression in matched sample:")
print(logit_model.summary2().tables[1])

or_est = np.exp(logit_model.params['treatment'])
ci = np.exp(logit_model.conf_int().loc['treatment'])
print(f"\nOdds Ratio: {or_est:.3f} (95% CI: {ci[0]:.3f} - {ci[1]:.3f})")

# --- Part (e): E-value ---
# Convert protective OR to RR on above-null scale
rr_est = 1 / or_est  # Flip because protective
rr_ci_closest_null = 1 / ci[1]  # Upper CI of OR -> lower bound of 1/OR

def e_value(rr):
    """Compute E-value for a risk ratio."""
    if rr < 1:
        rr = 1 / rr
    return rr + np.sqrt(rr * (rr - 1))

e_val_point = e_value(rr_est)
e_val_ci = e_value(rr_ci_closest_null)

print(f"\n--- E-value Analysis ---")
print(f"E-value for point estimate: {e_val_point:.2f}")
print(f"E-value for CI bound closest to null: {e_val_ci:.2f}")

print(f"\nInterpretation:")
print(f"An unmeasured confounder would need to be associated with BOTH")
print(f"beta-blocker use AND 1-year mortality by a risk ratio of at least")
print(f"{e_val_point:.2f} to fully explain away the observed protective effect.")
print(f"For the confidence interval, a confounder with RR >= {e_val_ci:.2f}")
print(f"could shift the CI to include the null.")

D.13.3 Exercise 3

Show R solution

# =============================================================================
# Chapter 17 - Exercise 3: IPTW in R
# (Exercise specifies Python, but we provide an R version as well)
# Beta-blocker use and 1-year mortality using IPTW
# =============================================================================

library(tidyverse)
library(WeightIt)
library(cobalt)
library(survey)
library(broom)

# --- Simulate the dataset (same DGP as Exercise 2) ---
set.seed(123)
n <- 1500

exercise_dat <- tibble(
  age = rnorm(n, 70, 8),
  creatinine = rnorm(n, 1.2, 0.4),
  heart_failure = rbinom(n, 1, 0.35),
  prior_mi = rbinom(n, 1, 0.20),
  treatment = rbinom(n, 1, plogis(-1 + 0.02*age + 0.3*heart_failure +
                                     0.5*prior_mi - 0.8*creatinine)),
  death_1yr = rbinom(n, 1, plogis(-2 + 0.05*age + 0.4*heart_failure +
                                     0.6*prior_mi + 1.0*creatinine -
                                     0.7*treatment))
)

# --- Part (a): Fit PS model and compute stabilised IPTW weights ---
w_out <- weightit(treatment ~ age + creatinine + heart_failure + prior_mi,
                  data = exercise_dat,
                  method = "ps",
                  estimand = "ATE")

cat("Weight summary:\n")
summary(w_out)

# Check for extreme weights
cat("\nWeight distribution:\n")
cat("  Min:", min(w_out$weights), "\n")
cat("  Max:", max(w_out$weights), "\n")
cat("  Mean:", mean(w_out$weights), "\n")
cat("  SD:", sd(w_out$weights), "\n")

# --- Part (b): Assess covariate balance using weighted SMDs ---
cat("\nBalance table:\n")
bt <- bal.tab(w_out, thresholds = c(m = 0.1))
print(bt)

# Love plot
love.plot(w_out,
          thresholds = c(m = 0.1),
          binary = "std",
          var.order = "unadjusted",
          title = "Covariate Balance: Before and After IPTW",
          colors = c("#D55E00", "#0072B2"))

# --- Part (c): Estimate ATE using weighted regression ---
d_weighted <- svydesign(ids = ~1, weights = w_out$weights, data = exercise_dat)

# Logistic regression (odds ratio)
ate_model <- svyglm(death_1yr ~ treatment, design = d_weighted,
                    family = quasibinomial)
cat("\nIPTW ATE estimate (logistic):\n")
print(tidy(ate_model, conf.int = TRUE, exponentiate = TRUE))

# Linear probability model (risk difference)
rd_model <- svyglm(death_1yr ~ treatment, design = d_weighted,
                   family = gaussian())
cat("\nIPTW ATE estimate (risk difference):\n")
print(tidy(rd_model, conf.int = TRUE))

# --- Part (d): Sensitivity analysis with simulated unmeasured confounder ---
cat("\n--- Sensitivity Analysis: Unmeasured Confounder ---\n")

# Simulate an unmeasured confounder U that affects both treatment and outcome
set.seed(456)
U <- rnorm(n, 0, 1)

# Re-simulate treatment and outcome with U included
exercise_dat_u <- tibble(
  age = exercise_dat$age,
  creatinine = exercise_dat$creatinine,
  heart_failure = exercise_dat$heart_failure,
  prior_mi = exercise_dat$prior_mi,
  U = U,
  treatment = rbinom(n, 1, plogis(-1 + 0.02*age + 0.3*heart_failure +
                                     0.5*prior_mi - 0.8*creatinine +
                                     0.6*U)),  # U affects treatment
  death_1yr = rbinom(n, 1, plogis(-2 + 0.05*age + 0.4*heart_failure +
                                     0.6*prior_mi + 1.0*creatinine -
                                     0.7*treatment +
                                     0.8*U))   # U affects outcome
)

# Analysis WITHOUT adjusting for U (biased)
w_no_u <- weightit(treatment ~ age + creatinine + heart_failure + prior_mi,
                   data = exercise_dat_u,
                   method = "ps", estimand = "ATE")

d_no_u <- svydesign(ids = ~1, weights = w_no_u$weights, data = exercise_dat_u)
model_no_u <- svyglm(death_1yr ~ treatment, design = d_no_u,
                     family = quasibinomial)
res_no_u <- tidy(model_no_u, conf.int = TRUE, exponentiate = TRUE)

cat("\nWithout adjusting for U:\n")
print(res_no_u)

# Analysis WITH adjusting for U (unbiased)
w_with_u <- weightit(treatment ~ age + creatinine + heart_failure + prior_mi + U,
                     data = exercise_dat_u,
                     method = "ps", estimand = "ATE")

d_with_u <- svydesign(ids = ~1, weights = w_with_u$weights, data = exercise_dat_u)
model_with_u <- svyglm(death_1yr ~ treatment, design = d_with_u,
                       family = quasibinomial)
res_with_u <- tidy(model_with_u, conf.int = TRUE, exponentiate = TRUE)

cat("\nWith adjusting for U:\n")
print(res_with_u)

cat("\n--- Interpretation ---\n")
cat("The unmeasured confounder U (with associations of 0.6 with treatment\n")
cat("and 0.8 with outcome on the log-odds scale) shifts the treatment effect\n")
cat("estimate when not adjusted for. This demonstrates:\n")
cat("  1. IPTW cannot correct for unmeasured confounding.\n")
cat("  2. Sensitivity analysis helps quantify how strong a confounder\n")
cat("     would need to be to change conclusions.\n")
cat("  3. The E-value provides a formal framework for this assessment.\n")

Show Python solution

# =============================================================================
# Chapter 17 - Exercise 3: IPTW in Python
# Beta-blocker use and 1-year mortality using IPTW
# =============================================================================

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm

# --- Simulate the dataset (same DGP as Exercise 2) ---
np.random.seed(123)
n = 1500

age = np.random.normal(70, 8, n)
creatinine = np.random.normal(1.2, 0.4, n)
heart_failure = np.random.binomial(1, 0.35, n)
prior_mi = np.random.binomial(1, 0.20, n)

ps_true = 1 / (1 + np.exp(-(-1 + 0.02*age + 0.3*heart_failure +
                               0.5*prior_mi - 0.8*creatinine)))
treatment = np.random.binomial(1, ps_true)

mort_prob = 1 / (1 + np.exp(-(-2 + 0.05*age + 0.4*heart_failure +
                                 0.6*prior_mi + 1.0*creatinine -
                                 0.7*treatment)))
death_1yr = np.random.binomial(1, mort_prob)

df = pd.DataFrame({
    'age': age, 'creatinine': creatinine, 'heart_failure': heart_failure,
    'prior_mi': prior_mi, 'treatment': treatment, 'death_1yr': death_1yr
})

print(f"Dataset: {len(df)} patients")
print(f"Treatment prevalence: {df['treatment'].mean():.3f}")
print(f"Outcome prevalence: {df['death_1yr'].mean():.3f}\n")

# --- Part (a): Fit PS model and compute stabilised IPTW weights ---
covariates = ['age', 'creatinine', 'heart_failure', 'prior_mi']

ps_model = LogisticRegression(max_iter=1000)
ps_model.fit(df[covariates], df['treatment'])
df['ps'] = ps_model.predict_proba(df[covariates])[:, 1]

# Unstabilised ATE weights: w = A/ps + (1-A)/(1-ps)
df['iptw'] = np.where(
    df['treatment'] == 1,
    1 / df['ps'],
    1 / (1 - df['ps'])
)

# Stabilised weights: replace numerator with marginal treatment probability
p_treat = df['treatment'].mean()
df['sw'] = np.where(
    df['treatment'] == 1,
    p_treat / df['ps'],
    (1 - p_treat) / (1 - df['ps'])
)

print("Stabilised weight summary:")
print(f"  Min: {df['sw'].min():.4f}")
print(f"  Max: {df['sw'].max():.4f}")
print(f"  Mean: {df['sw'].mean():.4f}")
print(f"  SD: {df['sw'].std():.4f}")

# Check for extreme weights
extreme = (df['sw'] > 10).sum()
print(f"  Extreme weights (>10): {extreme}")

# --- Part (b): Assess covariate balance using weighted SMDs ---
def weighted_smd(data, var, treatment_col, weight_col):
    """Compute weighted standardised mean difference."""
    treated = data[data[treatment_col] == 1]
    control = data[data[treatment_col] == 0]
    mean_t = np.average(treated[var], weights=treated[weight_col])
    mean_c = np.average(control[var], weights=control[weight_col])
    # Use unweighted pooled SD for standardisation
    sd_pooled = np.sqrt((treated[var].var() + control[var].var()) / 2)
    if sd_pooled == 0:
        return 0
    return (mean_t - mean_c) / sd_pooled

def unweighted_smd(data, var, treatment_col):
    """Compute unweighted SMD."""
    treated = data[data[treatment_col] == 1][var]
    control = data[data[treatment_col] == 0][var]
    sd_pooled = np.sqrt((treated.var() + control.var()) / 2)
    if sd_pooled == 0:
        return 0
    return (treated.mean() - control.mean()) / sd_pooled

print("\nCovariate balance (weighted SMD):")
print(f"{'Variable':<20} {'Unweighted':>12} {'Weighted':>12} {'Balanced?':>10}")
print("-" * 56)

smd_before = {}
smd_after = {}
for var in covariates:
    smd_b = unweighted_smd(df, var, 'treatment')
    smd_a = weighted_smd(df, var, 'treatment', 'sw')
    smd_before[var] = smd_b
    smd_after[var] = smd_a
    balanced = "Yes" if abs(smd_a) < 0.1 else "No"
    print(f"{var:<20} {smd_b:>12.4f} {smd_a:>12.4f} {balanced:>10}")

# Love plot equivalent
fig, ax = plt.subplots(figsize=(8, 5))
y_pos = range(len(covariates))

ax.scatter([abs(smd_before[v]) for v in covariates], y_pos,
           color='#D55E00', label='Before IPTW', s=80, zorder=5)
ax.scatter([abs(smd_after[v]) for v in covariates], y_pos,
           color='#0072B2', label='After IPTW', s=80, zorder=5)

for i, var in enumerate(covariates):
    ax.plot([abs(smd_before[var]), abs(smd_after[var])], [i, i],
            'k-', alpha=0.3)

ax.axvline(x=0.1, color='red', linestyle='--', alpha=0.5, label='SMD = 0.1')
ax.set_yticks(y_pos)
ax.set_yticklabels(covariates)
ax.set_xlabel('Absolute Standardised Mean Difference')
ax.set_title('Covariate Balance: Before and After IPTW')
ax.legend(loc='lower right')
plt.tight_layout()
plt.savefig('iptw_balance.png', dpi=300)
plt.show()

# --- Part (c): Estimate ATE using weighted regression ---
X = sm.add_constant(df['treatment'])

# Weighted least squares (linear probability model for risk difference)
wls_model = sm.WLS(df['death_1yr'], X, weights=df['sw']).fit()
print("\nIPTW ATE Estimate (Risk Difference):")
print(f"  Coefficient: {wls_model.params.iloc[1]:.4f}")
print(f"  95% CI: ({wls_model.conf_int().iloc[1, 0]:.4f}, "
      f"{wls_model.conf_int().iloc[1, 1]:.4f})")
print(f"  p-value: {wls_model.pvalues.iloc[1]:.4f}")

# Weighted logistic regression (for odds ratio)
from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod.families import Binomial

glm_model = GLM(df['death_1yr'], X,
                family=Binomial(),
                freq_weights=df['sw']).fit()
or_est = np.exp(glm_model.params.iloc[1])
ci = np.exp(glm_model.conf_int().iloc[1])
print(f"\nIPTW ATE Estimate (Odds Ratio):")
print(f"  OR: {or_est:.3f} (95% CI: {ci[0]:.3f} - {ci[1]:.3f})")

# --- Part (d): Sensitivity analysis with unmeasured confounder ---
print("\n--- Sensitivity Analysis: Unmeasured Confounder ---")
print("Varying confounder strength to show impact on ATE estimate:\n")

# We simulate data with an unmeasured confounder U of varying strength
results = []
gamma_values = [0, 0.2, 0.4, 0.6, 0.8, 1.0]

for gamma in gamma_values:
    np.random.seed(42)  # Same seed for comparability
    U = np.random.normal(0, 1, n)

    # Re-simulate with confounder U
    ps_u = 1 / (1 + np.exp(-(-1 + 0.02*age + 0.3*heart_failure +
                                0.5*prior_mi - 0.8*creatinine + gamma*U)))
    trt_u = np.random.binomial(1, ps_u)
    mort_u = 1 / (1 + np.exp(-(-2 + 0.05*age + 0.4*heart_failure +
                                  0.6*prior_mi + 1.0*creatinine -
                                  0.7*trt_u + gamma*U)))
    death_u = np.random.binomial(1, mort_u)

    df_u = pd.DataFrame({
        'age': age, 'creatinine': creatinine, 'heart_failure': heart_failure,
        'prior_mi': prior_mi, 'treatment': trt_u, 'death_1yr': death_u
    })

    # IPTW without U
    ps_mod = LogisticRegression(max_iter=1000)
    ps_mod.fit(df_u[covariates], df_u['treatment'])
    ps_est = ps_mod.predict_proba(df_u[covariates])[:, 1]

    p_t = df_u['treatment'].mean()
    sw = np.where(df_u['treatment'] == 1, p_t / ps_est, (1 - p_t) / (1 - ps_est))

    X_u = sm.add_constant(df_u['treatment'])
    wls_u = sm.WLS(df_u['death_1yr'], X_u, weights=sw).fit()

    results.append({
        'gamma': gamma,
        'ate': wls_u.params.iloc[1],
        'ci_lo': wls_u.conf_int().iloc[1, 0],
        'ci_hi': wls_u.conf_int().iloc[1, 1]
    })

results_df = pd.DataFrame(results)
print(f"{'Gamma (U strength)':<20} {'ATE':>10} {'95% CI':>25}")
print("-" * 57)
for _, row in results_df.iterrows():
    print(f"{row['gamma']:<20.1f} {row['ate']:>10.4f} "
          f"({row['ci_lo']:.4f}, {row['ci_hi']:.4f})")

# Plot sensitivity analysis
fig, ax = plt.subplots(figsize=(8, 5))
ax.errorbar(results_df['gamma'], results_df['ate'],
            yerr=[results_df['ate'] - results_df['ci_lo'],
                  results_df['ci_hi'] - results_df['ate']],
            fmt='o-', color='#0072B2', capsize=5, markersize=8)
ax.axhline(y=0, color='red', linestyle='--', alpha=0.5, label='Null effect')
ax.set_xlabel('Unmeasured Confounder Strength (gamma)')
ax.set_ylabel('ATE Estimate (Risk Difference)')
ax.set_title('Sensitivity Analysis: Impact of Unmeasured Confounding')
ax.legend()
plt.tight_layout()
plt.savefig('sensitivity_analysis.png', dpi=300)
plt.show()

print("\n--- Interpretation ---")
print("As the strength of the unmeasured confounder increases (gamma),")
print("the IPTW estimate of the treatment effect becomes more biased")
print("because IPTW cannot account for unmeasured confounding.")
print("When gamma = 0 (no unmeasured confounder), the estimate is closest")
print("to the true effect. This demonstrates why sensitivity analysis")
print("is essential for observational causal inference.")

D.13.4 Exercise 4

Show R solution

# =============================================================================
# Chapter 17 - Exercise 4: Target Trial Emulation (Conceptual)
# Early vs delayed metformin initiation and 5-year cardiovascular events
# =============================================================================

# This is a conceptual exercise. All answers are provided as detailed comments.

# --- Part (a): Complete target trial protocol table ---
#
# +----------------------------+---------------------------------------------+-------------------------------------------+
# | Protocol component         | Target trial                                | Observational emulation                   |
# +----------------------------+---------------------------------------------+-------------------------------------------+
# | Eligibility criteria       | Adults aged 30-80 with new T2DM diagnosis,  | Same criteria applied to EHR database:    |
# |                            | no prior CVD, no prior metformin use,        | first T2DM diagnosis code (ICD-10 E11),   |
# |                            | eGFR >= 30, no contraindications to          | no prior MACE, no metformin dispensing     |
# |                            | metformin (e.g., severe renal impairment).   | before index date, eGFR >= 30 in prior    |
# |                            |                                             | 6 months.                                 |
# +----------------------------+---------------------------------------------+-------------------------------------------+
# | Treatment strategies       | Strategy 1: Initiate metformin within       | Same. Defined by first metformin          |
# |                            | 3 months of T2DM diagnosis.                 | prescription fill date relative to T2DM   |
# |                            | Strategy 2: Do not initiate metformin       | diagnosis date.                           |
# |                            | within 3 months (delayed, 3-12 months).     |                                           |
# +----------------------------+---------------------------------------------+-------------------------------------------+
# | Treatment assignment       | Random assignment at time of T2DM           | Not random. Adjusted using IPTW or        |
# |                            | diagnosis.                                  | propensity score matching, conditioning   |
# |                            |                                             | on baseline confounders.                  |
# +----------------------------+---------------------------------------------+-------------------------------------------+
# | Start of follow-up         | Date of randomisation (= date of T2DM       | Date of T2DM diagnosis (time zero).       |
# | (time zero)                | diagnosis).                                 | All eligible patients enter at this date. |
# +----------------------------+---------------------------------------------+-------------------------------------------+
# | Outcome                    | First major adverse cardiovascular event     | Same. MACE defined by ICD-10 codes for    |
# |                            | (MACE): composite of MI, stroke, or CV      | MI (I21), stroke (I63, I64), or CV death  |
# |                            | death within 5 years.                       | (cause of death codes). Censored at 5     |
# |                            |                                             | years, loss to follow-up, or death from   |
# |                            |                                             | non-CV causes.                            |
# +----------------------------+---------------------------------------------+-------------------------------------------+
# | Causal contrast            | Intention-to-treat: effect of being          | Intention-to-treat: compare early vs      |
# |                            | assigned to early vs delayed initiation.     | delayed initiation regardless of          |
# |                            | Per-protocol: effect of adhering to the      | subsequent adherence. Per-protocol:       |
# |                            | assigned strategy.                           | use clone-censor-weight approach.         |
# +----------------------------+---------------------------------------------+-------------------------------------------+
# | Analysis plan              | Cox proportional hazards model.              | IPTW-weighted Cox PH model for ITT.       |
# |                            | ITT analysis is primary.                     | For per-protocol: clone each patient      |
# |                            | Per-protocol as sensitivity analysis.        | into both strategies, censor when they    |
# |                            |                                             | deviate, and apply IPCW to correct for    |
# |                            |                                             | informative censoring.                    |
# +----------------------------+---------------------------------------------+-------------------------------------------+

# --- Part (b): Sources of immortal time bias in a naive analysis ---
#
# In a naive analysis comparing "early initiators" (metformin within 3 months)
# vs "delayed initiators" (3-12 months), several sources of immortal time
# bias can arise:
#
# 1. SURVIVAL REQUIREMENT FOR CLASSIFICATION:
#    To be classified as a "delayed initiator," a patient must survive at least
#    3 months (to reach the delayed window). The time between T2DM diagnosis
#    and 3 months is "immortal" for the delayed group -- they cannot experience
#    the outcome during this period by definition. This artificially inflates
#    survival in the delayed group.
#
# 2. MISCLASSIFICATION OF PERSON-TIME:
#    If follow-up starts at T2DM diagnosis for both groups but treatment
#    classification depends on future events (whether and when metformin is
#    started), person-time before treatment initiation is misclassified.
#    A patient who starts metformin at month 2 contributes 2 months of
#    "untreated" time that is credited to the "early" group.
#
# 3. EXCLUSION OF NON-INITIATORS WHO DIE EARLY:
#    Patients who die before 3 months and never start metformin may be excluded
#    entirely if the study requires treatment initiation. This creates
#    survivorship bias.
#
# HOW TARGET TRIAL EMULATION AVOIDS THIS:
# - Time zero is aligned with eligibility (T2DM diagnosis), not treatment start
# - The clone-censor-weight approach assigns each patient to BOTH strategies
#   at time zero, censors them when they deviate from their assigned strategy,
#   and uses inverse probability of censoring weights (IPCW) to correct for
#   the informative censoring

# --- Part (c): How new-user, active comparator design addresses confounding ---
#
# The NEW-USER (INCIDENT USER) DESIGN:
# - Only includes patients at the TIME OF FIRST metformin prescription
#   (or the decision point: T2DM diagnosis)
# - Avoids "prevalent user bias" where including patients already on metformin
#   selectively includes those who tolerated and responded to the drug
#   (survivors of the initial treatment period)
# - Captures early effects (including side effects) that might be missed
#   in prevalent user analyses
#
# The ACTIVE COMPARATOR DESIGN:
# - Compares early metformin vs delayed metformin (not metformin vs no treatment)
# - Patients choosing delayed initiation are more comparable to early initiators
#   than patients who never initiate (who may differ fundamentally in disease
#   severity, healthcare access, or physician preferences)
# - Reduces confounding by indication because both groups are deemed to need
#   metformin -- the question is only about TIMING
# - Makes the positivity assumption more plausible: most patients have a
#   realistic probability of being in either group
# - Mimics the clinical question: "Should I start metformin now or wait?"

# --- Part (d): Unmeasured confounders and sensitivity analysis ---
#
# POTENTIAL UNMEASURED CONFOUNDERS:
#
# 1. HbA1c trajectory / diabetes severity:
#    Patients with more rapidly rising HbA1c may receive metformin earlier.
#    If severity also affects CVD risk, this confounds the comparison.
#    (May be partially captured in EHR but with measurement timing issues.)
#
# 2. Patient health literacy and adherence behaviour:
#    Patients who seek care promptly and fill prescriptions early may also
#    engage in other health-promoting behaviours (exercise, diet) that
#    independently reduce CVD risk. This is a form of "healthy user bias."
#
# 3. Physician prescribing patterns / quality of care:
#    Physicians who prescribe metformin early may also provide better
#    overall cardiovascular risk management.
#
# 4. Socioeconomic status / insurance:
#    Patients with better insurance or higher SES may fill prescriptions
#    faster and have better access to follow-up care.
#
# 5. Diet, exercise, and lifestyle factors:
#    Rarely captured in EHR data but strongly associated with both
#    treatment initiation patterns and CVD outcomes.
#
# SENSITIVITY ANALYSIS APPROACHES:
#
# 1. E-VALUE:
#    Calculate the E-value for the primary estimate to quantify how strong
#    an unmeasured confounder would need to be (in terms of associations with
#    both treatment and outcome) to fully explain away the observed effect.
#
# 2. QUANTITATIVE BIAS ANALYSIS:
#    Use external data (e.g., surveys with lifestyle data) to estimate the
#    likely magnitude of confounding by specific unmeasured factors. Apply
#    the bias formula to adjust the estimate.
#
# 3. NEGATIVE CONTROL OUTCOMES:
#    Test outcomes that should NOT be affected by metformin timing (e.g.,
#    accidental injuries). If an association is found, it suggests residual
#    confounding.
#
# 4. NEGATIVE CONTROL EXPOSURES:
#    Test treatments that share the same confounding structure but should
#    not affect CVD (e.g., timing of proton pump inhibitor initiation).
#
# 5. INSTRUMENTAL VARIABLE ANALYSIS:
#    If a valid instrument exists (e.g., physician preference for early
#    prescribing), IV analysis can estimate the causal effect even with
#    unmeasured confounding.

cat("Exercise 4 is a conceptual exercise.\n")
cat("All answers are provided as detailed comments in this script.\n")
cat("Review the comments for:\n")
cat("  (a) Complete target trial protocol table\n")
cat("  (b) Sources of immortal time bias\n")
cat("  (c) How new-user, active comparator design helps\n")
cat("  (d) Unmeasured confounders and sensitivity approaches\n")

Show Python solution

# =============================================================================
# Chapter 17 - Exercise 4: Target Trial Emulation (Conceptual)
# Early vs delayed metformin initiation and 5-year cardiovascular events
# =============================================================================

# This is a conceptual exercise. All answers are provided as detailed comments.

# --- Part (a): Complete target trial protocol table ---
#
# Protocol Component     | Target Trial                          | Observational Emulation
# -----------------------|---------------------------------------|---------------------------------------
# Eligibility criteria   | Adults aged 30-80 with new T2DM       | Same criteria in EHR: first T2DM
#                        | diagnosis, no prior CVD, no prior      | ICD-10 E11 code, no prior MACE,
#                        | metformin use, eGFR >= 30, no          | no metformin dispensing before index,
#                        | contraindications to metformin.        | eGFR >= 30 within prior 6 months.
#                        |                                       |
# Treatment strategies   | Strategy 1: Initiate metformin        | Same. Defined by first metformin
#                        | within 3 months of T2DM diagnosis.    | prescription fill date relative to
#                        | Strategy 2: Do not initiate within     | T2DM diagnosis date.
#                        | 3 months (delayed: 3-12 months).      |
#                        |                                       |
# Treatment assignment   | Random assignment at T2DM diagnosis.  | Not random. Adjusted via IPTW or
#                        |                                       | PS matching on baseline confounders.
#                        |                                       |
# Start of follow-up     | Date of randomisation (= date of      | Date of T2DM diagnosis (time zero).
# (time zero)            | T2DM diagnosis).                      | All eligible patients enter at this
#                        |                                       | date, regardless of treatment timing.
#                        |                                       |
# Outcome                | First MACE (MI, stroke, or CV death)  | Same. MACE via ICD-10 codes.
#                        | within 5 years.                       | Censored at 5 years, loss to
#                        |                                       | follow-up, or non-CV death.
#                        |                                       |
# Causal contrast        | ITT: effect of assignment to early    | ITT: compare groups as assigned.
#                        | vs delayed. Per-protocol: effect of   | Per-protocol: clone-censor-weight.
#                        | adhering to assignment.                |
#                        |                                       |
# Analysis plan          | Cox PH model; ITT primary.            | IPTW-weighted Cox PH for ITT.
#                        |                                       | Clone-censor-weight for per-protocol.

# --- Part (b): Sources of immortal time bias ---
#
# 1. SURVIVAL REQUIREMENT FOR CLASSIFICATION:
#    To be classified as a "delayed initiator," a patient must survive at
#    least 3 months without starting metformin. This period is "immortal
#    time" -- the patient cannot die (by definition) during this window.
#    This artificially inflates survival in the delayed group.
#
# 2. MISCLASSIFICATION OF PERSON-TIME:
#    If follow-up starts at T2DM diagnosis but treatment classification
#    depends on future behavior, person-time before treatment initiation
#    is misclassified. The period before a patient's first metformin
#    prescription is attributed to the wrong treatment group.
#
# 3. EXCLUSION OF EARLY DEATHS:
#    Patients who die before initiating metformin may be excluded from
#    the analysis entirely, creating survivorship bias.
#
# HOW TARGET TRIAL EMULATION AVOIDS THIS:
# - Aligns time zero with eligibility (T2DM diagnosis), not treatment
# - Uses clone-censor-weight: clones each patient into both strategies,
#   censors when they deviate, and applies IPCW to correct for the
#   informative censoring introduced by this approach.

# --- Part (c): New-user, active comparator design ---
#
# NEW-USER (INCIDENT USER) DESIGN:
# - Includes patients only at the point of first treatment decision
# - Avoids prevalent user bias (selective survival of tolerators)
# - Captures early effects including initial side effects
# - Provides a clear "time zero" aligned with treatment decision
#
# ACTIVE COMPARATOR DESIGN:
# - Compares early metformin vs delayed metformin (not vs no treatment)
# - Patients in both groups are deemed to need metformin
# - Reduces confounding by indication: the comparison is about timing,
#   not about treatment vs no treatment
# - Improves positivity: most T2DM patients can plausibly be in either
#   timing group
# - Reduces healthy user bias: both groups seek treatment
# - Mimics the clinically relevant question

# --- Part (d): Unmeasured confounders and sensitivity analysis ---
#
# POTENTIAL UNMEASURED CONFOUNDERS:
# 1. Diabetes severity trajectory (HbA1c trends)
# 2. Health literacy and medication adherence behaviour
# 3. Physician quality / prescribing culture
# 4. Socioeconomic status / insurance coverage
# 5. Diet, exercise, and lifestyle factors
# 6. Patient preferences and shared decision-making
#
# SENSITIVITY ANALYSIS APPROACHES:
# 1. E-value: quantify minimum confounder strength to nullify the result
# 2. Quantitative bias analysis: use external data to model bias
# 3. Negative control outcomes: test outcomes unrelated to metformin
# 4. Negative control exposures: test unrelated treatments
# 5. Instrumental variable analysis: e.g., physician preference

print("=" * 70)
print("Exercise 4: Target Trial Emulation (Conceptual)")
print("=" * 70)
print()
print("All answers are provided as comments in this script.")
print("Review the comments for:")
print("  (a) Complete target trial protocol table")
print("  (b) Sources of immortal time bias in a naive analysis")
print("  (c) How new-user, active comparator design addresses confounding")
print("  (d) Unmeasured confounders and sensitivity analysis approaches")
print()
print("Key takeaway: Target trial emulation forces explicit specification")
print("of all protocol components, making assumptions transparent and")
print("reducing common biases like immortal time bias and prevalent user bias.")

D.14 Chapter 18: Meta-Analysis

D.14.1 Exercise 1

Show R solution

# =============================================================================
# Chapter 18 - Exercise 1: Basic Meta-Analysis in R
# New anticoagulant vs warfarin for stroke prevention in atrial fibrillation
# =============================================================================

library(tidyverse)
library(meta)
library(metafor)

# --- Dataset ---
af_trials <- data.frame(
  study = c("TRAIL-1", "GUARD-AF", "SHIELD", "ORBIT-AF",
            "VENTURE", "COMPASS-AF", "PIONEER-2", "ATLAS-AF"),
  events_new = c(28, 45, 112, 67, 33, 89, 52, 41),
  n_new      = c(1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100),
  events_warf = c(42, 58, 148, 84, 29, 102, 61, 53),
  n_warf      = c(1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100)
)

# --- Part (a): Compute RR and 95% CI for each trial ---
# Using metafor's escalc function
es <- escalc(measure = "RR",
             ai = events_new, n1i = n_new,
             ci = events_warf, n2i = n_warf,
             data = af_trials)

# Add human-readable columns
es$RR <- exp(es$yi)
es$ci_lo <- exp(es$yi - 1.96 * sqrt(es$vi))
es$ci_hi <- exp(es$yi + 1.96 * sqrt(es$vi))

cat("Individual trial risk ratios:\n")
print(es[, c("study", "RR", "ci_lo", "ci_hi")], digits = 3)

# By hand verification for first trial
rr_trail1 <- (28/1200) / (42/1200)
se_trail1 <- sqrt(1/28 - 1/1200 + 1/42 - 1/1200)
cat("\nManual check (TRAIL-1):")
cat("\n  RR =", round(rr_trail1, 3))
cat("\n  95% CI: (", round(exp(log(rr_trail1) - 1.96*se_trail1), 3),
    ",", round(exp(log(rr_trail1) + 1.96*se_trail1), 3), ")\n")

# --- Part (b): Random-effects meta-analysis ---
# Using the meta package
m1 <- metabin(
  event.e = events_new, n.e = n_new,
  event.c = events_warf, n.c = n_warf,
  studlab = study,
  data = af_trials,
  sm = "RR",
  method.tau = "REML",
  prediction = TRUE
)

cat("\n--- Random-effects meta-analysis ---\n")
print(summary(m1))

# Also using metafor for comparison
res <- rma(yi, vi, data = es, method = "REML")
cat("\n--- metafor results ---\n")
print(summary(res))

# --- Part (c): Forest plot ---
forest(m1,
       sortvar = TE,
       label.left = "Favours new anticoagulant",
       label.right = "Favours warfarin",
       col.diamond = "steelblue",
       col.square = "darkblue",
       print.tau2 = TRUE,
       print.I2 = TRUE,
       print.pval.Q = TRUE,
       prediction = TRUE,
       main = "New Anticoagulant vs Warfarin: Stroke Prevention in AF")

cat("\n--- Does the pooled effect favour the new anticoagulant? ---\n")
pooled_rr <- exp(m1$TE.random)
cat("Pooled RR:", round(pooled_rr, 3), "\n")
cat("95% CI: (", round(exp(m1$lower.random), 3), ",",
    round(exp(m1$upper.random), 3), ")\n")
if (pooled_rr < 1 & exp(m1$upper.random) < 1) {
  cat("Yes, the pooled effect significantly favours the new anticoagulant.\n")
} else if (pooled_rr < 1) {
  cat("The point estimate favours the new anticoagulant, but the CI includes 1.\n")
} else {
  cat("The pooled effect does not favour the new anticoagulant.\n")
}

# --- Part (d): I-squared and prediction interval ---
cat("\n--- Heterogeneity ---\n")
cat("I-squared:", round(m1$I2, 1), "%\n")
cat("tau-squared:", round(m1$tau2, 4), "\n")
cat("Q statistic:", round(m1$Q, 2), ", df =", m1$df.Q,
    ", p =", round(m1$pval.Q, 4), "\n")

cat("\n--- Prediction interval ---\n")
cat("Prediction interval for RR: (",
    round(exp(m1$lower.predict), 3), ",",
    round(exp(m1$upper.predict), 3), ")\n")

# Interpretation
if (m1$I2 < 25) {
  cat("I-squared suggests LOW heterogeneity.\n")
} else if (m1$I2 < 50) {
  cat("I-squared suggests MODERATE heterogeneity.\n")
} else if (m1$I2 < 75) {
  cat("I-squared suggests SUBSTANTIAL heterogeneity.\n")
} else {
  cat("I-squared suggests CONSIDERABLE heterogeneity.\n")
}

cat("\nThe prediction interval tells us the range within which the true\n")
cat("effect in a future study is expected to fall. If it crosses 1,\n")
cat("some settings might see no benefit from the new anticoagulant.\n")

# --- Part (e): Funnel plot and Egger's test ---
funnel(m1,
       xlab = "Risk Ratio (log scale)",
       studlab = TRUE,
       col = "steelblue",
       pch = 16,
       main = "Funnel Plot")

cat("\n--- Egger's test for funnel plot asymmetry ---\n")
egger <- metabias(m1, method.bias = "Egger")
print(egger)

if (egger$p.value < 0.05) {
  cat("Egger's test is significant (p < 0.05), suggesting potential\n")
  cat("publication bias or small-study effects.\n")
} else {
  cat("Egger's test is not significant, no strong evidence of\n")
  cat("publication bias. However, with only", length(af_trials$study),
      "studies,\n")
  cat("the test has limited power (recommended: >= 10 studies).\n")
}

# Trim-and-fill as additional check
tf <- trimfill(m1)
cat("\n--- Trim-and-fill ---\n")
cat("Imputed missing studies:", tf$k0, "\n")
cat("Adjusted pooled RR:", round(exp(tf$TE.random), 3), "\n")

# --- Part (f): Leave-one-out sensitivity analysis ---
cat("\n--- Leave-one-out sensitivity analysis ---\n")
l1o <- leave1out(res)
print(l1o)

# Check if removing any single study changes the conclusion
cat("\nIs the result robust to removing any single study?\n")
all_sig <- all(exp(l1o$ci.lb) < 1 | exp(l1o$ci.ub) > 1)
# Check if all leave-one-out CIs exclude 1 (if pooled is significant)
if (exp(res$ci.ub) < 1) {
  robust <- all(exp(l1o$ci.ub) < 1)
} else {
  robust <- TRUE
}

if (robust) {
  cat("Yes - no single study removal changes the direction or significance.\n")
} else {
  cat("No - removing at least one study changes the conclusion.\n")
  cat("The result may be driven by specific influential studies.\n")
}

# Influence diagnostics
inf <- influence(res)
plot(inf, main = "Influence Diagnostics")

Show Python solution

# =============================================================================
# Chapter 18 - Exercise 1: Basic Meta-Analysis in Python
# (Python version of the R Exercise 1 for completeness)
# New anticoagulant vs warfarin for stroke prevention in AF
# =============================================================================

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

# --- Dataset ---
af_trials = pd.DataFrame({
    'study': ['TRAIL-1', 'GUARD-AF', 'SHIELD', 'ORBIT-AF',
              'VENTURE', 'COMPASS-AF', 'PIONEER-2', 'ATLAS-AF'],
    'events_new':  [28, 45, 112, 67, 33, 89, 52, 41],
    'n_new':       [1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100],
    'events_warf': [42, 58, 148, 84, 29, 102, 61, 53],
    'n_warf':      [1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100]
})

# --- Part (a): Compute log RR and variances ---
af_trials['rr'] = (af_trials['events_new'] / af_trials['n_new']) / \
                  (af_trials['events_warf'] / af_trials['n_warf'])
af_trials['log_rr'] = np.log(af_trials['rr'])
af_trials['var_log_rr'] = (
    1/af_trials['events_new'] - 1/af_trials['n_new'] +
    1/af_trials['events_warf'] - 1/af_trials['n_warf']
)
af_trials['se_log_rr'] = np.sqrt(af_trials['var_log_rr'])

# 95% CI for each study
af_trials['ci_lo'] = np.exp(af_trials['log_rr'] - 1.96 * af_trials['se_log_rr'])
af_trials['ci_hi'] = np.exp(af_trials['log_rr'] + 1.96 * af_trials['se_log_rr'])

print("Individual trial risk ratios:")
print(af_trials[['study', 'rr', 'ci_lo', 'ci_hi']].to_string(
    float_format=lambda x: f"{x:.3f}", index=False))

# --- Part (b): Fixed-effect (inverse variance) ---
w_fe = 1 / af_trials['var_log_rr']
pooled_fe = np.sum(w_fe * af_trials['log_rr']) / np.sum(w_fe)
se_fe = np.sqrt(1 / np.sum(w_fe))
ci_fe = (pooled_fe - 1.96*se_fe, pooled_fe + 1.96*se_fe)

print(f"\n--- Fixed-effect meta-analysis ---")
print(f"Pooled log RR: {pooled_fe:.4f}")
print(f"Pooled RR: {np.exp(pooled_fe):.4f} "
      f"(95% CI: {np.exp(ci_fe[0]):.4f} - {np.exp(ci_fe[1]):.4f})")

# --- Part (c): DerSimonian-Laird random-effects ---
k = len(af_trials)
Q = np.sum(w_fe * (af_trials['log_rr'] - pooled_fe)**2)
C = np.sum(w_fe) - np.sum(w_fe**2) / np.sum(w_fe)
tau2 = max(0, (Q - (k - 1)) / C)

w_re = 1 / (af_trials['var_log_rr'] + tau2)
pooled_re = np.sum(w_re * af_trials['log_rr']) / np.sum(w_re)
se_re = np.sqrt(1 / np.sum(w_re))
ci_re = (pooled_re - 1.96*se_re, pooled_re + 1.96*se_re)

# --- Part (d): Q, I-squared, tau-squared ---
I2 = max(0, (Q - (k-1)) / Q) * 100
p_Q = 1 - stats.chi2.cdf(Q, k-1)

print(f"\n--- Random-effects meta-analysis (DerSimonian-Laird) ---")
print(f"Pooled log RR: {pooled_re:.4f}")
print(f"Pooled RR: {np.exp(pooled_re):.4f} "
      f"(95% CI: {np.exp(ci_re[0]):.4f} - {np.exp(ci_re[1]):.4f})")
print(f"\nHeterogeneity:")
print(f"  tau^2: {tau2:.4f}")
print(f"  I^2: {I2:.1f}%")
print(f"  Q: {Q:.2f}, df = {k-1}, p = {p_Q:.4f}")

# Prediction interval
t_crit = stats.t.ppf(0.975, k - 2)
pi_lo = pooled_re - t_crit * np.sqrt(se_re**2 + tau2)
pi_hi = pooled_re + t_crit * np.sqrt(se_re**2 + tau2)
print(f"\nPrediction interval for RR: ({np.exp(pi_lo):.4f}, {np.exp(pi_hi):.4f})")

if np.exp(pi_lo) < 1 and np.exp(pi_hi) > 1:
    print("The prediction interval crosses 1, meaning some future settings")
    print("might not see a benefit from the new anticoagulant.")
else:
    print("The prediction interval does not cross 1.")

# --- Part (e): Forest plot ---
fig, ax = plt.subplots(figsize=(10, 8))

y_pos = list(range(k, 0, -1))
weights = w_re / w_re.max()

# Individual studies
for i, (_, row) in enumerate(af_trials.iterrows()):
    ci_lo_i = row['ci_lo']
    ci_hi_i = row['ci_hi']
    ax.plot([ci_lo_i, ci_hi_i], [y_pos[i], y_pos[i]], 'k-', linewidth=1)
    size = weights.iloc[i] * 200
    ax.scatter(row['rr'], y_pos[i], s=size, c='steelblue',
               zorder=5, edgecolors='darkblue')

# Null line
ax.axvline(x=1, color='black', linestyle='-', linewidth=0.5)

# Pooled estimate line
ax.axvline(x=np.exp(pooled_re), color='steelblue', linestyle='--', alpha=0.5)

# Pooled diamond
diamond_y = 0
dx = [np.exp(ci_re[0]), np.exp(pooled_re), np.exp(ci_re[1]), np.exp(pooled_re)]
dy = [diamond_y, diamond_y + 0.3, diamond_y, diamond_y - 0.3]
ax.fill(dx, dy, color='steelblue', alpha=0.7)

ax.set_yticks(y_pos + [0])
ax.set_yticklabels(list(af_trials['study']) + ['Pooled RE'])
ax.set_xlabel('Risk Ratio')
ax.set_title('Forest Plot: New Anticoagulant vs Warfarin for AF')
ax.set_xscale('log')
plt.tight_layout()
plt.savefig('forest_plot_af.png', dpi=300)
plt.show()

# --- Part (f): Funnel plot and Egger's test ---
fig, ax = plt.subplots(figsize=(8, 6))

ax.scatter(af_trials['log_rr'], af_trials['se_log_rr'],
           c='steelblue', s=80, edgecolors='darkblue', zorder=5)
ax.axvline(x=pooled_re, color='grey', linestyle='--', alpha=0.7)

# Pseudo-confidence region
se_range = np.linspace(0, af_trials['se_log_rr'].max() * 1.1, 100)
ax.plot(pooled_re - 1.96*se_range, se_range, 'k--', alpha=0.3)
ax.plot(pooled_re + 1.96*se_range, se_range, 'k--', alpha=0.3)

# Add study labels
for _, row in af_trials.iterrows():
    ax.annotate(row['study'], (row['log_rr'], row['se_log_rr']),
                textcoords="offset points", xytext=(5, 5), fontsize=7)

ax.invert_yaxis()
ax.set_xlabel('Log Risk Ratio')
ax.set_ylabel('Standard Error')
ax.set_title('Funnel Plot')
plt.tight_layout()
plt.savefig('funnel_plot_af.png', dpi=300)
plt.show()

# Egger's test: weighted regression of effect on SE
import statsmodels.api as sm

X_egger = sm.add_constant(af_trials['se_log_rr'])
egger_model = sm.WLS(af_trials['log_rr'], X_egger, weights=w_fe).fit()

print(f"\n--- Egger's test ---")
print(f"Intercept: {egger_model.params.iloc[0]:.4f}")
print(f"p-value: {egger_model.pvalues.iloc[0]:.4f}")

if egger_model.pvalues.iloc[0] < 0.05:
    print("Egger's test is significant, suggesting potential publication bias.")
else:
    print("Egger's test is not significant.")
    print(f"Note: With only {k} studies, the test has limited power.")

# --- Leave-one-out sensitivity analysis ---
print(f"\n--- Leave-one-out sensitivity analysis ---")
print(f"{'Excluded study':<15} {'Pooled RR':>10} {'95% CI':>25} {'I2':>8}")
print("-" * 60)

for i in range(k):
    # Exclude study i
    mask = af_trials.index != i
    w_i = 1 / af_trials.loc[mask, 'var_log_rr']

    # Fixed-effect pooled
    pooled_i_fe = np.sum(w_i * af_trials.loc[mask, 'log_rr']) / np.sum(w_i)

    # Q and tau2
    Q_i = np.sum(w_i * (af_trials.loc[mask, 'log_rr'] - pooled_i_fe)**2)
    C_i = np.sum(w_i) - np.sum(w_i**2) / np.sum(w_i)
    tau2_i = max(0, (Q_i - (k - 2)) / C_i)

    # Random-effects
    w_re_i = 1 / (af_trials.loc[mask, 'var_log_rr'] + tau2_i)
    pooled_i = np.sum(w_re_i * af_trials.loc[mask, 'log_rr']) / np.sum(w_re_i)
    se_i = np.sqrt(1 / np.sum(w_re_i))
    ci_i = (pooled_i - 1.96*se_i, pooled_i + 1.96*se_i)
    I2_i = max(0, (Q_i - (k-2)) / Q_i) * 100

    print(f"{af_trials.loc[i, 'study']:<15} {np.exp(pooled_i):>10.4f} "
          f"({np.exp(ci_i[0]):.4f}, {np.exp(ci_i[1]):.4f})  {I2_i:>6.1f}%")

print("\nIf removing any single study changes the conclusion (CI crosses 1),")
print("the result is sensitive to that study.")

D.14.2 Exercise 2

Show R solution

# =============================================================================
# Chapter 18 - Exercise 2: Meta-Analysis from Scratch in R
# (Exercise 2 is Python-focused, but we provide an R version as well)
# Manual implementation of meta-analysis computations
# =============================================================================

library(tidyverse)

# --- Dataset (same as Exercise 1) ---
af_trials <- data.frame(
  study = c("TRAIL-1", "GUARD-AF", "SHIELD", "ORBIT-AF",
            "VENTURE", "COMPASS-AF", "PIONEER-2", "ATLAS-AF"),
  events_new  = c(28, 45, 112, 67, 33, 89, 52, 41),
  n_new       = c(1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100),
  events_warf = c(42, 58, 148, 84, 29, 102, 61, 53),
  n_warf      = c(1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100)
)

# --- Part (a): Compute log RR and variances by hand ---
af_trials$rr <- (af_trials$events_new / af_trials$n_new) /
                (af_trials$events_warf / af_trials$n_warf)
af_trials$log_rr <- log(af_trials$rr)
af_trials$var_log_rr <- (1/af_trials$events_new - 1/af_trials$n_new +
                         1/af_trials$events_warf - 1/af_trials$n_warf)
af_trials$se_log_rr <- sqrt(af_trials$var_log_rr)

cat("Log risk ratios and SEs:\n")
print(af_trials[, c("study", "log_rr", "se_log_rr", "rr")])

# --- Part (b): Fixed-effect inverse-variance method ---
w_fe <- 1 / af_trials$var_log_rr
pooled_fe <- sum(w_fe * af_trials$log_rr) / sum(w_fe)
se_fe <- sqrt(1 / sum(w_fe))
ci_fe_lo <- pooled_fe - 1.96 * se_fe
ci_fe_hi <- pooled_fe + 1.96 * se_fe
z_fe <- pooled_fe / se_fe
p_fe <- 2 * (1 - pnorm(abs(z_fe)))

cat("\n--- Fixed-effect meta-analysis (by hand) ---\n")
cat("Pooled log RR:", round(pooled_fe, 4), "\n")
cat("Pooled RR:", round(exp(pooled_fe), 4),
    "(95% CI:", round(exp(ci_fe_lo), 4), "-", round(exp(ci_fe_hi), 4), ")\n")
cat("z =", round(z_fe, 3), ", p =", round(p_fe, 4), "\n")

# --- Part (c): DerSimonian-Laird random-effects ---
k <- nrow(af_trials)

# Cochran's Q
Q <- sum(w_fe * (af_trials$log_rr - pooled_fe)^2)

# C constant
C_val <- sum(w_fe) - sum(w_fe^2) / sum(w_fe)

# Between-study variance
tau2 <- max(0, (Q - (k - 1)) / C_val)

# Random-effects weights
w_re <- 1 / (af_trials$var_log_rr + tau2)
pooled_re <- sum(w_re * af_trials$log_rr) / sum(w_re)
se_re <- sqrt(1 / sum(w_re))
ci_re_lo <- pooled_re - 1.96 * se_re
ci_re_hi <- pooled_re + 1.96 * se_re
z_re <- pooled_re / se_re
p_re <- 2 * (1 - pnorm(abs(z_re)))

cat("\n--- Random-effects meta-analysis (DerSimonian-Laird, by hand) ---\n")
cat("Pooled log RR:", round(pooled_re, 4), "\n")
cat("Pooled RR:", round(exp(pooled_re), 4),
    "(95% CI:", round(exp(ci_re_lo), 4), "-", round(exp(ci_re_hi), 4), ")\n")
cat("z =", round(z_re, 3), ", p =", round(p_re, 4), "\n")

# --- Part (d): Q, I-squared, tau-squared ---
p_Q <- 1 - pchisq(Q, df = k - 1)
I2 <- max(0, (Q - (k - 1)) / Q) * 100

cat("\n--- Heterogeneity statistics ---\n")
cat("Q =", round(Q, 2), ", df =", k - 1, ", p =", round(p_Q, 4), "\n")
cat("tau^2 =", round(tau2, 4), "\n")
cat("I^2 =", round(I2, 1), "%\n")

# Prediction interval
t_crit <- qt(0.975, df = k - 2)
pi_lo <- pooled_re - t_crit * sqrt(se_re^2 + tau2)
pi_hi <- pooled_re + t_crit * sqrt(se_re^2 + tau2)
cat("Prediction interval for RR: (", round(exp(pi_lo), 4),
    ",", round(exp(pi_hi), 4), ")\n")

# --- Part (e): Forest plot (base R, no packages) ---
af_trials$w_re <- w_re
af_trials$ci_lo <- exp(af_trials$log_rr - 1.96 * af_trials$se_log_rr)
af_trials$ci_hi <- exp(af_trials$log_rr + 1.96 * af_trials$se_log_rr)

par(mar = c(5, 10, 4, 2))
y_pos <- k:1

plot(af_trials$rr, y_pos,
     xlim = c(0.3, 2.0), ylim = c(-0.5, k + 0.5),
     pch = 15, cex = af_trials$w_re / max(af_trials$w_re) * 2,
     xlab = "Risk Ratio", ylab = "", yaxt = "n",
     main = "Forest Plot (Built from Scratch)",
     log = "x", col = "steelblue")

# Study CIs
segments(af_trials$ci_lo, y_pos, af_trials$ci_hi, y_pos, lwd = 1.5)

# Study labels
axis(2, at = y_pos, labels = af_trials$study, las = 1)

# Null line
abline(v = 1, lty = 2, col = "grey50")

# Pooled estimate diamond
diamond_x <- c(exp(ci_re_lo), exp(pooled_re), exp(ci_re_hi), exp(pooled_re))
diamond_y <- c(0, 0.3, 0, -0.3)
polygon(diamond_x, diamond_y, col = "steelblue", border = "darkblue")

# Pooled reference line
abline(v = exp(pooled_re), col = "steelblue", lty = 3, lwd = 1)

# --- Part (f): Funnel plot and Egger's test (by hand) ---
par(mar = c(5, 5, 4, 2))
plot(af_trials$log_rr, af_trials$se_log_rr,
     pch = 16, col = "steelblue", cex = 1.5,
     xlim = c(min(af_trials$log_rr) - 0.3, max(af_trials$log_rr) + 0.3),
     ylim = rev(c(0, max(af_trials$se_log_rr) * 1.1)),
     xlab = "Log Risk Ratio", ylab = "Standard Error",
     main = "Funnel Plot (Built from Scratch)")

abline(v = pooled_re, col = "grey50", lty = 2)

# Pseudo-CI region
se_seq <- seq(0, max(af_trials$se_log_rr) * 1.1, length.out = 100)
lines(pooled_re - 1.96 * se_seq, se_seq, lty = 3, col = "grey70")
lines(pooled_re + 1.96 * se_seq, se_seq, lty = 3, col = "grey70")

# Egger's test: weighted regression of standardised effect on precision
# Regression of yi/sei on 1/sei, weighted by 1/vi
# Equivalent: regress yi on sei, weighted by 1/vi
egger_fit <- lm(log_rr ~ se_log_rr, data = af_trials, weights = w_fe)
cat("\n--- Egger's test (by hand) ---\n")
cat("Intercept:", round(coef(egger_fit)[1], 4), "\n")
cat("SE:", round(summary(egger_fit)$coefficients[1, 2], 4), "\n")
cat("p-value:", round(summary(egger_fit)$coefficients[1, 4], 4), "\n")

if (summary(egger_fit)$coefficients[1, 4] < 0.05) {
  cat("Significant asymmetry detected.\n")
} else {
  cat("No significant asymmetry (limited power with", k, "studies).\n")
}

Show Python solution

# =============================================================================
# Chapter 18 - Exercise 2: Meta-Analysis from Scratch in Python
# New anticoagulant vs warfarin for stroke prevention in AF
# =============================================================================

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

# --- Dataset ---
af_trials = pd.DataFrame({
    'study': ['TRAIL-1', 'GUARD-AF', 'SHIELD', 'ORBIT-AF',
              'VENTURE', 'COMPASS-AF', 'PIONEER-2', 'ATLAS-AF'],
    'events_new':  [28, 45, 112, 67, 33, 89, 52, 41],
    'n_new':       [1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100],
    'events_warf': [42, 58, 148, 84, 29, 102, 61, 53],
    'n_warf':      [1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100]
})

# --- Part (a): Compute log risk ratios and variances ---
af_trials['rr'] = (af_trials['events_new'] / af_trials['n_new']) / \
                  (af_trials['events_warf'] / af_trials['n_warf'])
af_trials['log_rr'] = np.log(af_trials['rr'])

# Variance of log RR: 1/a - 1/n1 + 1/c - 1/n2
af_trials['var_log_rr'] = (
    1/af_trials['events_new'] - 1/af_trials['n_new'] +
    1/af_trials['events_warf'] - 1/af_trials['n_warf']
)
af_trials['se_log_rr'] = np.sqrt(af_trials['var_log_rr'])

print("Part (a): Log risk ratios and standard errors")
print(af_trials[['study', 'log_rr', 'se_log_rr', 'rr']].to_string(
    float_format=lambda x: f"{x:.4f}", index=False))

# --- Part (b): Fixed-effect inverse-variance method ---
w_fe = 1 / af_trials['var_log_rr']
pooled_fe = np.sum(w_fe * af_trials['log_rr']) / np.sum(w_fe)
se_fe = np.sqrt(1 / np.sum(w_fe))
ci_fe = (pooled_fe - 1.96 * se_fe, pooled_fe + 1.96 * se_fe)
z_fe = pooled_fe / se_fe
p_fe = 2 * (1 - stats.norm.cdf(abs(z_fe)))

print(f"\n--- Part (b): Fixed-effect meta-analysis ---")
print(f"Pooled log RR: {pooled_fe:.4f}")
print(f"Pooled RR: {np.exp(pooled_fe):.4f} "
      f"(95% CI: {np.exp(ci_fe[0]):.4f} - {np.exp(ci_fe[1]):.4f})")
print(f"z = {z_fe:.3f}, p = {p_fe:.4f}")

# Study weights (as percentages)
af_trials['weight_fe_pct'] = (w_fe / w_fe.sum() * 100).round(1)
print(f"\nStudy weights (fixed-effect):")
print(af_trials[['study', 'weight_fe_pct']].to_string(index=False))

# --- Part (c): DerSimonian-Laird random-effects ---
k = len(af_trials)

# Step 1: Cochran's Q statistic
Q = np.sum(w_fe * (af_trials['log_rr'] - pooled_fe)**2)

# Step 2: C constant
C = np.sum(w_fe) - np.sum(w_fe**2) / np.sum(w_fe)

# Step 3: Between-study variance tau^2
tau2 = max(0, (Q - (k - 1)) / C)

# Step 4: Random-effects weights
w_re = 1 / (af_trials['var_log_rr'] + tau2)
pooled_re = np.sum(w_re * af_trials['log_rr']) / np.sum(w_re)
se_re = np.sqrt(1 / np.sum(w_re))
ci_re = (pooled_re - 1.96 * se_re, pooled_re + 1.96 * se_re)
z_re = pooled_re / se_re
p_re = 2 * (1 - stats.norm.cdf(abs(z_re)))

print(f"\n--- Part (c): Random-effects meta-analysis (DerSimonian-Laird) ---")
print(f"Pooled log RR: {pooled_re:.4f}")
print(f"Pooled RR: {np.exp(pooled_re):.4f} "
      f"(95% CI: {np.exp(ci_re[0]):.4f} - {np.exp(ci_re[1]):.4f})")
print(f"z = {z_re:.3f}, p = {p_re:.4f}")

af_trials['weight_re_pct'] = (w_re / w_re.sum() * 100).round(1)
print(f"\nStudy weights (random-effects):")
print(af_trials[['study', 'weight_fe_pct', 'weight_re_pct']].to_string(index=False))

# --- Part (d): Q, I-squared, tau-squared ---
p_Q = 1 - stats.chi2.cdf(Q, k - 1)
I2 = max(0, (Q - (k - 1)) / Q) * 100

print(f"\n--- Part (d): Heterogeneity statistics ---")
print(f"Q = {Q:.2f}, df = {k-1}, p = {p_Q:.4f}")
print(f"tau^2 = {tau2:.4f}")
print(f"tau = {np.sqrt(tau2):.4f}")
print(f"I^2 = {I2:.1f}%")

if I2 < 25:
    print("Interpretation: LOW heterogeneity")
elif I2 < 50:
    print("Interpretation: MODERATE heterogeneity")
elif I2 < 75:
    print("Interpretation: SUBSTANTIAL heterogeneity")
else:
    print("Interpretation: CONSIDERABLE heterogeneity")

# Prediction interval
t_crit = stats.t.ppf(0.975, k - 2)
pi_lo = pooled_re - t_crit * np.sqrt(se_re**2 + tau2)
pi_hi = pooled_re + t_crit * np.sqrt(se_re**2 + tau2)
print(f"\nPrediction interval for RR: ({np.exp(pi_lo):.4f}, {np.exp(pi_hi):.4f})")

# --- Part (e): Forest plot ---
fig, ax = plt.subplots(figsize=(10, 8))

y_pos = list(range(k, 0, -1))
weights_norm = w_re / w_re.max()

for i, (_, row) in enumerate(af_trials.iterrows()):
    ci_lo_i = np.exp(row['log_rr'] - 1.96 * row['se_log_rr'])
    ci_hi_i = np.exp(row['log_rr'] + 1.96 * row['se_log_rr'])

    # CI line
    ax.plot([ci_lo_i, ci_hi_i], [y_pos[i], y_pos[i]], 'k-', linewidth=1)

    # Study point (size proportional to weight)
    size = weights_norm.iloc[i] * 200
    ax.scatter(row['rr'], y_pos[i], s=size, c='steelblue',
               zorder=5, edgecolors='darkblue')

    # Weight annotation
    ax.text(2.0, y_pos[i], f"{row['weight_re_pct']:.1f}%",
            ha='left', va='center', fontsize=9)

# Null line
ax.axvline(x=1, color='black', linestyle='-', linewidth=0.5)

# Pooled estimate line
ax.axvline(x=np.exp(pooled_re), color='steelblue', linestyle='--', alpha=0.5)

# Pooled diamond
diamond_x = [np.exp(ci_re[0]), np.exp(pooled_re),
             np.exp(ci_re[1]), np.exp(pooled_re)]
diamond_y = [0, 0.3, 0, -0.3]
ax.fill(diamond_x, diamond_y, color='steelblue', alpha=0.7)

# Labels
ax.set_yticks(y_pos + [0])
ax.set_yticklabels(list(af_trials['study']) + ['Pooled RE'])
ax.set_xlabel('Risk Ratio')
ax.set_title('Forest Plot: New Anticoagulant vs Warfarin for AF\n'
             f'RE Model: RR = {np.exp(pooled_re):.3f} '
             f'(95% CI: {np.exp(ci_re[0]):.3f}-{np.exp(ci_re[1]):.3f}), '
             f'I\u00B2 = {I2:.1f}%')
ax.set_xscale('log')

# Add "Favours" labels
ax.text(0.5, -1.2, 'Favours new anticoagulant', ha='center',
        fontsize=9, style='italic')
ax.text(1.8, -1.2, 'Favours warfarin', ha='center',
        fontsize=9, style='italic')

plt.tight_layout()
plt.savefig('forest_plot_af.png', dpi=300)
plt.show()

# --- Part (f): Funnel plot and Egger's test ---
fig, ax = plt.subplots(figsize=(8, 6))

ax.scatter(af_trials['log_rr'], af_trials['se_log_rr'],
           c='steelblue', s=80, edgecolors='darkblue', zorder=5)
ax.axvline(x=pooled_re, color='grey', linestyle='--', alpha=0.7,
           label='Pooled effect')

# Pseudo-confidence regions
se_range = np.linspace(0, af_trials['se_log_rr'].max() * 1.1, 100)
ax.plot(pooled_re - 1.96*se_range, se_range, 'k--', alpha=0.3)
ax.plot(pooled_re + 1.96*se_range, se_range, 'k--', alpha=0.3)

# Add study labels
for _, row in af_trials.iterrows():
    ax.annotate(row['study'], (row['log_rr'], row['se_log_rr']),
                textcoords="offset points", xytext=(5, 5), fontsize=7)

ax.invert_yaxis()
ax.set_xlabel('Log Risk Ratio')
ax.set_ylabel('Standard Error')
ax.set_title('Funnel Plot: Checking for Publication Bias')
ax.legend()
plt.tight_layout()
plt.savefig('funnel_plot_af.png', dpi=300)
plt.show()

# Egger's regression test
# Weighted regression of effect estimate on standard error
X_egger = sm.add_constant(af_trials['se_log_rr'])
egger_model = sm.WLS(af_trials['log_rr'], X_egger,
                     weights=1/af_trials['var_log_rr']).fit()

print(f"\n--- Part (f): Egger's test ---")
print(f"Intercept: {egger_model.params.iloc[0]:.4f}")
print(f"SE of intercept: {egger_model.bse.iloc[0]:.4f}")
print(f"t-statistic: {egger_model.tvalues.iloc[0]:.3f}")
print(f"p-value: {egger_model.pvalues.iloc[0]:.4f}")

if egger_model.pvalues.iloc[0] < 0.05:
    print("Egger's test is significant (p < 0.05).")
    print("This suggests potential publication bias or small-study effects.")
else:
    print("Egger's test is not significant.")
    print(f"Note: With only {k} studies, the test has limited power.")
    print("At least 10 studies are recommended for reliable Egger's test.")

D.14.3 Exercise 3

Show R solution

# =============================================================================
# Chapter 18 - Exercise 3: Subgroup Analysis and Meta-Regression in R
# =============================================================================

library(tidyverse)
library(meta)
library(metafor)

# --- Dataset ---
af_trials <- data.frame(
  study = c("TRAIL-1", "GUARD-AF", "SHIELD", "ORBIT-AF",
            "VENTURE", "COMPASS-AF", "PIONEER-2", "ATLAS-AF"),
  events_new  = c(28, 45, 112, 67, 33, 89, 52, 41),
  n_new       = c(1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100),
  events_warf = c(42, 58, 148, 84, 29, 102, 61, 53),
  n_warf      = c(1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100),
  region      = c("Europe", "North America", "Europe", "Asia",
                   "North America", "Europe", "Asia", "North America"),
  mean_age    = c(72, 68, 74, 65, 70, 71, 63, 69),
  pct_female  = c(38, 42, 35, 48, 40, 37, 52, 44)
)

# Compute effect sizes using metafor
es <- escalc(measure = "RR",
             ai = events_new, n1i = n_new,
             ci = events_warf, n2i = n_warf,
             data = af_trials)

# Base meta-analysis
m1 <- metabin(
  event.e = events_new, n.e = n_new,
  event.c = events_warf, n.c = n_warf,
  studlab = study, data = af_trials,
  sm = "RR", method.tau = "REML"
)

# --- Part (a): Subgroup analysis by region ---
cat("--- Part (a): Subgroup analysis by region ---\n\n")

# Using meta package
m_sub <- update(m1, subgroup = af_trials$region)
cat("Subgroup analysis results:\n")
print(summary(m_sub))

# Forest plot by subgroup
forest(m_sub,
       sortvar = TE,
       label.left = "Favours new anticoagulant",
       label.right = "Favours warfarin",
       col.diamond = "steelblue",
       print.tau2 = TRUE,
       print.I2 = TRUE,
       print.subgroup.name = TRUE,
       main = "Subgroup Analysis by Region")

# Using metafor for the test of subgroup differences
res_sub <- rma(yi, vi, mods = ~ region, data = es, method = "REML")
cat("\nTest for subgroup differences (metafor):\n")
print(summary(res_sub))

cat("\nDo treatment effects differ by region?\n")
cat("Test for moderation: QM =", round(res_sub$QM, 2),
    ", df =", res_sub$m, ", p =", round(res_sub$QMp, 4), "\n")
if (res_sub$QMp < 0.05) {
  cat("Yes, there is a statistically significant difference between regions.\n")
} else {
  cat("No, there is no statistically significant difference between regions.\n")
  cat("However, with only", nrow(es), "studies split across",
      length(unique(es$region)), "regions,\n")
  cat("statistical power for detecting subgroup differences is limited.\n")
}

# --- Part (b): Meta-regression with mean age ---
cat("\n--- Part (b): Meta-regression with mean age ---\n\n")

res_age <- rma(yi, vi, mods = ~ mean_age, data = es, method = "REML")
print(summary(res_age))

cat("\nInterpretation:\n")
cat("Coefficient for mean_age:", round(coef(res_age)["mean_age"], 4), "\n")
cat("p-value:", round(res_age$pval[2], 4), "\n")
cat("R^2 (proportion of heterogeneity explained):",
    round(max(0, res_age$R2), 1), "%\n")

if (res_age$pval[2] < 0.05) {
  cat("There IS a statistically significant relationship between mean age\n")
  cat("and treatment effect. Each 1-year increase in mean age is associated\n")
  cat("with a", round(coef(res_age)["mean_age"], 4),
      "change in log RR.\n")
} else {
  cat("There is NO statistically significant relationship between mean age\n")
  cat("and treatment effect (p =", round(res_age$pval[2], 3), ").\n")
}

# --- Part (c): Meta-regression with percentage female ---
cat("\n--- Part (c): Meta-regression with percentage female ---\n\n")

res_fem <- rma(yi, vi, mods = ~ pct_female, data = es, method = "REML")
print(summary(res_fem))

cat("\nInterpretation:\n")
cat("Coefficient for pct_female:", round(coef(res_fem)["pct_female"], 4), "\n")
cat("p-value:", round(res_fem$pval[2], 4), "\n")
cat("R^2:", round(max(0, res_fem$R2), 1), "%\n")

cat("\n*** ECOLOGICAL FALLACY WARNING ***\n")
cat("Even if there is an association between trial-level percentage female\n")
cat("and the treatment effect, this does NOT prove that individual women\n")
cat("respond differently to the treatment than individual men.\n")
cat("Trials with higher percentage female may differ in other ways:\n")
cat("  - Geographic region (cultural differences in enrolment)\n")
cat("  - Mean age (women often present with AF at older ages)\n")
cat("  - Comorbidity profiles\n")
cat("  - Trial methodology\n")
cat("Only an individual participant data (IPD) meta-analysis with a\n")
cat("treatment-by-sex interaction term can properly assess whether\n")
cat("sex modifies the treatment effect.\n")

# --- Part (d): Bubble plot of meta-regression on mean age ---
cat("\n--- Part (d): Bubble plot ---\n")

# Method 1: Using metafor's built-in bubble plot
regplot(res_age,
        xlab = "Mean Age (years)",
        ylab = "Log Risk Ratio",
        main = "Meta-Regression: Treatment Effect vs Mean Age",
        ci = TRUE,
        pi = TRUE,  # prediction interval
        col = "steelblue",
        bg = "lightblue",
        las = 1)

# Method 2: Using ggplot2 for more control
es$w_re <- 1 / (es$vi + res_age$tau2)
es$w_norm <- es$w_re / max(es$w_re)

# Predicted line from meta-regression
age_seq <- seq(min(es$mean_age) - 2, max(es$mean_age) + 2, length.out = 100)
pred <- predict(res_age, newmods = age_seq)

pred_df <- data.frame(
  mean_age = age_seq,
  pred = pred$pred,
  ci_lo = pred$ci.lb,
  ci_hi = pred$ci.ub,
  pi_lo = pred$pi.lb,
  pi_hi = pred$pi.ub
)

p_bubble <- ggplot() +
  # Prediction interval
  geom_ribbon(data = pred_df, aes(x = mean_age, ymin = pi_lo, ymax = pi_hi),
              fill = "grey90", alpha = 0.5) +
  # Confidence interval for regression line
  geom_ribbon(data = pred_df, aes(x = mean_age, ymin = ci_lo, ymax = ci_hi),
              fill = "steelblue", alpha = 0.2) +
  # Regression line
  geom_line(data = pred_df, aes(x = mean_age, y = pred),
            colour = "steelblue", linewidth = 1) +
  # Study points (bubbles)
  geom_point(data = es, aes(x = mean_age, y = yi, size = w_norm),
             colour = "steelblue", alpha = 0.7) +
  # Study labels
  geom_text(data = es, aes(x = mean_age, y = yi, label = study),
            size = 2.5, vjust = -1.5) +
  # Reference line at null
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey50") +
  # Aesthetics
  scale_size_continuous(range = c(3, 10), guide = "none") +
  labs(x = "Mean Age (years)",
       y = "Log Risk Ratio",
       title = "Bubble Plot: Meta-Regression of Treatment Effect on Mean Age",
       subtitle = paste0("Slope = ", round(coef(res_age)["mean_age"], 4),
                        ", p = ", round(res_age$pval[2], 3),
                        "; bubble size proportional to study weight")) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 12),
    axis.title = element_text(face = "bold")
  )

print(p_bubble)

ggsave("bubble_plot_age.png", p_bubble, width = 8, height = 6, dpi = 300)

cat("\nBubble plot saved. Larger bubbles indicate more precise studies.\n")
cat("The regression line shows the relationship between mean age\n")
cat("and treatment effect, with shaded confidence and prediction intervals.\n")

Show Python solution

# =============================================================================
# Chapter 18 - Exercise 3: Subgroup Analysis and Meta-Regression in Python
# =============================================================================

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

# --- Dataset ---
af_trials = pd.DataFrame({
    'study': ['TRAIL-1', 'GUARD-AF', 'SHIELD', 'ORBIT-AF',
              'VENTURE', 'COMPASS-AF', 'PIONEER-2', 'ATLAS-AF'],
    'events_new':  [28, 45, 112, 67, 33, 89, 52, 41],
    'n_new':       [1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100],
    'events_warf': [42, 58, 148, 84, 29, 102, 61, 53],
    'n_warf':      [1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100],
    'region':      ['Europe', 'North America', 'Europe', 'Asia',
                    'North America', 'Europe', 'Asia', 'North America'],
    'mean_age':    [72, 68, 74, 65, 70, 71, 63, 69],
    'pct_female':  [38, 42, 35, 48, 40, 37, 52, 44]
})

# Compute effect sizes
af_trials['rr'] = (af_trials['events_new'] / af_trials['n_new']) / \
                  (af_trials['events_warf'] / af_trials['n_warf'])
af_trials['log_rr'] = np.log(af_trials['rr'])
af_trials['var_log_rr'] = (
    1/af_trials['events_new'] - 1/af_trials['n_new'] +
    1/af_trials['events_warf'] - 1/af_trials['n_warf']
)
af_trials['se_log_rr'] = np.sqrt(af_trials['var_log_rr'])


def dl_random_effects(log_rr, var):
    """DerSimonian-Laird random-effects meta-analysis."""
    k = len(log_rr)
    w_fe = 1 / var
    pooled_fe = np.sum(w_fe * log_rr) / np.sum(w_fe)
    Q = np.sum(w_fe * (log_rr - pooled_fe)**2)
    C = np.sum(w_fe) - np.sum(w_fe**2) / np.sum(w_fe)
    tau2 = max(0, (Q - (k - 1)) / C)
    w_re = 1 / (var + tau2)
    pooled_re = np.sum(w_re * log_rr) / np.sum(w_re)
    se_re = np.sqrt(1 / np.sum(w_re))
    I2 = max(0, (Q - (k-1)) / Q) * 100 if Q > 0 else 0
    return pooled_re, se_re, tau2, Q, I2, w_re


# Overall random-effects analysis
pooled_re, se_re, tau2, Q, I2, w_re = dl_random_effects(
    af_trials['log_rr'].values, af_trials['var_log_rr'].values
)

# --- Part (a): Subgroup analysis by region ---
print("=" * 60)
print("Part (a): Subgroup Analysis by Region")
print("=" * 60)

regions = af_trials['region'].unique()
subgroup_results = []

for region in regions:
    mask = af_trials['region'] == region
    sub = af_trials[mask]

    if len(sub) >= 2:
        p_re, s_re, t2, q, i2, w = dl_random_effects(
            sub['log_rr'].values, sub['var_log_rr'].values
        )
        ci = (p_re - 1.96*s_re, p_re + 1.96*s_re)
    else:
        # Single study: use study estimate directly
        p_re = sub['log_rr'].values[0]
        s_re = sub['se_log_rr'].values[0]
        t2, q, i2 = 0, 0, 0
        ci = (p_re - 1.96*s_re, p_re + 1.96*s_re)

    subgroup_results.append({
        'region': region,
        'k': len(sub),
        'pooled_rr': np.exp(p_re),
        'ci_lo': np.exp(ci[0]),
        'ci_hi': np.exp(ci[1]),
        'I2': i2,
        'tau2': t2
    })

    print(f"\n{region} (k={len(sub)}):")
    print(f"  Pooled RR: {np.exp(p_re):.4f} "
          f"(95% CI: {np.exp(ci[0]):.4f} - {np.exp(ci[1]):.4f})")
    print(f"  I^2: {i2:.1f}%")

# Test for subgroup differences using meta-regression with region dummies
region_dummies = pd.get_dummies(af_trials['region'], drop_first=True).astype(float)
X_sub = sm.add_constant(region_dummies)
wls_sub = sm.WLS(af_trials['log_rr'], X_sub,
                 weights=1/af_trials['var_log_rr']).fit()

# Wald test for region coefficients
from scipy.stats import chi2
f_stat = wls_sub.f_test(np.eye(len(wls_sub.params))[1:])
print(f"\nTest for subgroup differences:")
print(f"  F-statistic: {f_stat.fvalue[0][0]:.3f}")
print(f"  p-value: {f_stat.pvalue:.4f}")

if f_stat.pvalue < 0.05:
    print("  Significant difference between regions.")
else:
    print("  No significant difference between regions.")
    print(f"  (Limited power with only {len(af_trials)} studies)")

# --- Part (b): Meta-regression with mean age ---
print(f"\n{'=' * 60}")
print("Part (b): Meta-Regression with Mean Age")
print("=" * 60)

X_age = sm.add_constant(af_trials['mean_age'])
wls_age = sm.WLS(af_trials['log_rr'], X_age,
                 weights=1/af_trials['var_log_rr']).fit()

print(f"\nMeta-regression: log_rr ~ mean_age")
print(f"  Intercept: {wls_age.params.iloc[0]:.4f} (p = {wls_age.pvalues.iloc[0]:.4f})")
print(f"  Mean age slope: {wls_age.params.iloc[1]:.4f} (p = {wls_age.pvalues.iloc[1]:.4f})")
print(f"  R-squared: {wls_age.rsquared:.3f}")

if wls_age.pvalues.iloc[1] < 0.05:
    print("\n  There IS a significant relationship between mean age and effect size.")
    if wls_age.params.iloc[1] < 0:
        print("  Trials with older populations show larger treatment benefits.")
    else:
        print("  Trials with older populations show smaller treatment benefits.")
else:
    print(f"\n  No significant relationship (p = {wls_age.pvalues.iloc[1]:.3f}).")

# --- Part (c): Meta-regression with percentage female ---
print(f"\n{'=' * 60}")
print("Part (c): Meta-Regression with Percentage Female")
print("=" * 60)

X_fem = sm.add_constant(af_trials['pct_female'])
wls_fem = sm.WLS(af_trials['log_rr'], X_fem,
                 weights=1/af_trials['var_log_rr']).fit()

print(f"\nMeta-regression: log_rr ~ pct_female")
print(f"  Intercept: {wls_fem.params.iloc[0]:.4f} (p = {wls_fem.pvalues.iloc[0]:.4f})")
print(f"  Pct female slope: {wls_fem.params.iloc[1]:.4f} (p = {wls_fem.pvalues.iloc[1]:.4f})")
print(f"  R-squared: {wls_fem.rsquared:.3f}")

print("\n  *** ECOLOGICAL FALLACY WARNING ***")
print("  This analysis examines TRIAL-LEVEL associations between percentage")
print("  female and treatment effect. Even if significant, this does NOT")
print("  prove that individual women respond differently to treatment.")
print("  Trials with more women may differ in other characteristics:")
print("    - Geographic region and healthcare systems")
print("    - Mean age and comorbidity profiles")
print("    - Trial methodology and inclusion criteria")
print("  Only individual participant data (IPD) meta-analysis with a")
print("  treatment-by-sex interaction can properly test effect modification.")

# --- Part (d): Bubble plot ---
print(f"\n{'=' * 60}")
print("Part (d): Bubble Plot")
print("=" * 60)

fig, ax = plt.subplots(figsize=(10, 7))

# Study weights for bubble sizes
w_plot = 1 / af_trials['var_log_rr']
sizes = (w_plot / w_plot.max()) * 500

# Scatter: bubbles
ax.scatter(af_trials['mean_age'], af_trials['log_rr'],
           s=sizes, alpha=0.6, c='steelblue', edgecolors='darkblue',
           linewidth=1, zorder=5)

# Add study labels
for _, row in af_trials.iterrows():
    ax.annotate(row['study'],
                (row['mean_age'], row['log_rr']),
                textcoords="offset points", xytext=(8, 8),
                fontsize=8, color='grey30')

# Meta-regression line
age_range = np.linspace(af_trials['mean_age'].min() - 2,
                        af_trials['mean_age'].max() + 2, 100)
pred = wls_age.params.iloc[0] + wls_age.params.iloc[1] * age_range
ax.plot(age_range, pred, 'steelblue', linewidth=2,
        label=f'Meta-regression line (slope={wls_age.params.iloc[1]:.4f})')

# Confidence band (approximate)
X_pred = sm.add_constant(age_range)
pred_full = wls_age.get_prediction(X_pred)
pred_ci = pred_full.conf_int(alpha=0.05)
ax.fill_between(age_range, pred_ci[:, 0], pred_ci[:, 1],
                alpha=0.15, color='steelblue', label='95% CI')

# Reference line
ax.axhline(y=0, color='grey', linestyle='--', linewidth=0.8,
           label='Null (log RR = 0)')

ax.set_xlabel('Mean Age (years)', fontweight='bold', fontsize=12)
ax.set_ylabel('Log Risk Ratio', fontweight='bold', fontsize=12)
ax.set_title('Bubble Plot: Meta-Regression of Treatment Effect on Mean Age\n'
             f'(Slope p = {wls_age.pvalues.iloc[1]:.3f}; '
             f'bubble size = study weight)',
             fontweight='bold', fontsize=13)
ax.legend(loc='best', frameon=True, framealpha=0.9)

plt.tight_layout()
plt.savefig('bubble_plot_age.png', dpi=300)
plt.show()

print("\nBubble plot saved as 'bubble_plot_age.png'")
print("Larger bubbles represent more precise (higher-weight) studies.")

D.14.4 Exercise 4

Show R solution

# =============================================================================
# Chapter 18 - Exercise 4: Critical Appraisal of a Meta-Analysis (Conceptual)
# 12 trials of new surgical technique vs standard care for knee OA
# =============================================================================

# This is a conceptual exercise. Answers are provided as detailed comments.

# Given information:
# - Pooled SMD for pain: -0.62 (95% CI: -0.89 to -0.35), p < 0.001
# - I^2 = 78%, tau^2 = 0.15, Q test p < 0.001
# - Prediction interval: -1.42 to 0.18
# - Egger's test: p = 0.03
# - 8 of 12 trials were single-centre with fewer than 100 participants

# --- Part (a): Interpret the pooled effect and clinical significance ---
#
# The pooled SMD of -0.62 favours the new surgical technique (negative =
# lower pain scores = better outcome).
#
# Using Cohen's conventions for interpreting SMD:
#   - Small effect: |d| = 0.2
#   - Medium effect: |d| = 0.5
#   - Large effect: |d| = 0.8
#
# An SMD of -0.62 is a MEDIUM-TO-LARGE effect size. The 95% CI (-0.89 to
# -0.35) excludes zero, and the p-value is < 0.001, indicating statistical
# significance.
#
# HOWEVER, statistical significance does not imply clinical significance.
# To assess clinical importance, we should consider:
# 1. The MINIMUM CLINICALLY IMPORTANT DIFFERENCE (MCID) for the pain
#    outcome. For common knee OA pain scales, MCID is typically SMD ~0.3-0.5.
#    The point estimate exceeds this, but the CI includes values near 0.35.
# 2. The ABSOLUTE DIFFERENCE on the original scale would be more
#    interpretable (e.g., "5 points on a 0-100 VAS scale").
# 3. Whether the pain reduction justifies the RISKS AND COSTS of a new
#    surgical procedure (risk-benefit analysis).

# --- Part (b): Prediction interval vs confidence interval ---
#
# The CONFIDENCE INTERVAL (-0.89 to -0.35) tells us about the AVERAGE
# true effect across all settings. It says: "We are 95% confident that
# the mean true effect lies between -0.89 and -0.35."
#
# The PREDICTION INTERVAL (-1.42 to 0.18) tells us where the true effect
# of a FUTURE STUDY (in a new setting) is likely to fall. It incorporates
# BOTH sampling uncertainty AND between-study heterogeneity.
#
# Key insight: The prediction interval CROSSES ZERO (includes 0.18).
# This means:
# - While the average effect is beneficial, in some settings the new
#   technique might provide NO BENEFIT or even be slightly harmful.
# - A new centre implementing this technique cannot be confident of
#   seeing a benefit.
# - The prediction interval is the more honest representation of
#   uncertainty for clinical decision-making.
#
# This discrepancy (significant CI but prediction interval crossing null)
# is common when I^2 is high and highlights why reporting only the pooled
# CI can be misleading.

# --- Part (c): Implications of I^2 = 78% ---
#
# I^2 = 78% means that 78% of the observed variability in effect sizes
# is due to TRUE BETWEEN-STUDY HETEROGENEITY rather than chance.
# This is classified as CONSIDERABLE heterogeneity (>75%).
#
# Implications:
# 1. The assumption of a single common effect is clearly violated.
# 2. The RANDOM-EFFECTS model is appropriate (and was used), but the
#    pooled estimate should be interpreted as an AVERAGE across heterogeneous
#    true effects, not as a single definitive answer.
# 3. The heterogeneity is statistically significant (Q test p < 0.001),
#    confirming this is not due to sampling variation.
# 4. INVESTIGATING SOURCES of heterogeneity is essential:
#    - Subgroup analysis by study characteristics (e.g., surgical
#      technique variation, patient severity, follow-up duration)
#    - Meta-regression to model effect modifiers
#    - Sensitivity analysis excluding outlier or high-risk-of-bias studies
# 5. The random-effects model gives MORE WEIGHT TO SMALL STUDIES,
#    which in this case (8 of 12 are small) means the pooled estimate
#    is heavily influenced by potentially biased small studies.

# --- Part (d): Concerns about Egger's test and small trials ---
#
# Egger's test is significant (p = 0.03), suggesting FUNNEL PLOT ASYMMETRY.
# Combined with the fact that 8 of 12 trials are small single-centre
# studies, this raises several concerns:
#
# 1. PUBLICATION BIAS: Small trials showing no benefit may not have been
#    published. The available evidence may overestimate the true effect.
#    This is the most common interpretation of funnel plot asymmetry.
#
# 2. SMALL-STUDY EFFECTS: Small studies may have different effect sizes
#    for legitimate reasons:
#    - More selected patient populations (more severe cases)
#    - More enthusiastic, expert surgeons (performance bias)
#    - Less rigorous outcome assessment (detection bias)
#    - Higher risk of bias in general
#
# 3. SINGLE-CENTRE BIAS: Single-centre trials often report larger effects
#    than multicentre trials because of:
#    - Surgeon expertise (learning curve effects)
#    - Patient selection
#    - Lack of external validity
#    - Potential unblinding or outcome assessment bias
#
# 4. The combination of significant Egger's test + predominantly small
#    single-centre trials is a RED FLAG. The pooled effect may be
#    substantially overestimated.

# --- Part (e): Additional analyses wanted ---
#
# 1. TRIM-AND-FILL ANALYSIS: To estimate the adjusted pooled effect after
#    accounting for potentially missing studies.
#
# 2. RISK OF BIAS ASSESSMENT: Using Cochrane Risk of Bias tool (RoB 2) for
#    each trial. Present a risk of bias summary plot.
#
# 3. SENSITIVITY ANALYSIS RESTRICTED TO LARGER TRIALS: Re-run the meta-
#    analysis using only the 4 larger/multicentre trials to see if the
#    effect persists.
#
# 4. SENSITIVITY ANALYSIS BY RISK OF BIAS: Exclude high-risk-of-bias
#    studies and re-estimate.
#
# 5. LEAVE-ONE-OUT ANALYSIS: Check if any single study is driving the
#    result.
#
# 6. SUBGROUP ANALYSIS by:
#    - Study size (small vs large)
#    - Single-centre vs multicentre
#    - Blinding status
#    - Follow-up duration
#    - OA severity at baseline
#
# 7. META-REGRESSION: Examine whether study-level characteristics
#    (sample size, year, risk of bias score) moderate the effect.
#
# 8. FUNCTIONAL OUTCOMES: Pain is subjective and susceptible to placebo
#    effects (especially in surgical trials). Objective outcomes (e.g.,
#    range of motion, need for total knee replacement) would be more
#    convincing.
#
# 9. SHAM SURGERY COMPARISON: Were any trials sham-controlled? Without
#    sham surgery, the "placebo effect of surgery" cannot be distinguished
#    from a true treatment effect.

# --- Part (f): Would you change clinical practice? ---
#
# NO, I would NOT change clinical practice based on this meta-analysis
# alone. The reasons are:
#
# 1. HIGH HETEROGENEITY (I^2 = 78%): The treatment effect is highly
#    variable across settings. Some settings may see no benefit.
#
# 2. PREDICTION INTERVAL CROSSES NULL: A new centre cannot be confident
#    of achieving benefit.
#
# 3. EVIDENCE OF PUBLICATION BIAS: The significant Egger's test and
#    predominance of small positive studies suggest the true effect may
#    be substantially smaller than the pooled estimate.
#
# 4. MOSTLY SMALL SINGLE-CENTRE TRIALS: The evidence base lacks large,
#    multicentre, adequately blinded trials that demonstrate external
#    validity and generalisability.
#
# 5. SURGICAL INTERVENTION: Given the invasiveness, costs, and risks
#    of surgery, a higher evidence bar is appropriate. One would want
#    at least one large, multicentre, preferably sham-controlled RCT
#    showing clinically meaningful benefit.
#
# 6. NO SHAM CONTROL: Surgical procedures are known to have strong
#    placebo effects (cf. Moseley et al., 2002 on arthroscopic surgery
#    for knee OA). Without sham-controlled trials, the observed benefit
#    could be largely or entirely due to placebo.
#
# RECOMMENDED NEXT STEPS:
# - Commission a large, multicentre, sham-controlled RCT
# - Include objective outcomes alongside patient-reported pain
# - Longer follow-up (surgical effects may wane over time)
# - Standardise the surgical technique across centres
# - Register the protocol prospectively

cat("Exercise 4 is a conceptual exercise.\n")
cat("All answers are provided as detailed comments in this script.\n")
cat("Review the comments for:\n")
cat("  (a) Interpretation of pooled SMD and clinical significance\n")
cat("  (b) Prediction interval vs confidence interval\n")
cat("  (c) Implications of I^2 = 78%\n")
cat("  (d) Concerns about Egger's test and small trials\n")
cat("  (e) Additional analyses wanted\n")
cat("  (f) Whether to change clinical practice\n")

Show Python solution

# =============================================================================
# Chapter 18 - Exercise 4: Critical Appraisal of a Meta-Analysis (Conceptual)
# 12 trials of new surgical technique vs standard care for knee OA
# =============================================================================

# This is a conceptual exercise. Answers are provided as detailed comments.

# Given information:
# - Pooled SMD for pain: -0.62 (95% CI: -0.89 to -0.35), p < 0.001
# - I^2 = 78%, tau^2 = 0.15, Q test p < 0.001
# - Prediction interval: -1.42 to 0.18
# - Egger's test: p = 0.03
# - 8 of 12 trials were single-centre with fewer than 100 participants

# --- Part (a): Interpret the pooled effect and clinical significance ---
#
# The pooled SMD of -0.62 favours the new surgical technique (lower pain).
# By Cohen's conventions: |d|=0.2 is small, 0.5 is medium, 0.8 is large.
# So -0.62 represents a MEDIUM-TO-LARGE effect.
#
# The 95% CI (-0.89 to -0.35) excludes zero, and p < 0.001.
#
# HOWEVER, clinical significance requires considering:
# 1. The Minimum Clinically Important Difference (MCID) -- typically
#    SMD ~0.3-0.5 for knee OA pain scales. The CI includes values
#    near MCID threshold.
# 2. The absolute difference on the original pain scale.
# 3. Risk-benefit ratio for a surgical procedure.
# 4. Patient values and preferences.

# --- Part (b): Prediction interval vs confidence interval ---
#
# CI (-0.89 to -0.35): Describes uncertainty about the AVERAGE effect.
# PI (-1.42 to 0.18): Describes where the NEXT STUDY'S true effect
#                      is likely to fall.
#
# Critical difference: The PI CROSSES ZERO (upper bound = 0.18).
# This means in some settings, the technique might provide NO BENEFIT.
# A new centre implementing this cannot be confident of benefit.
# The PI is more relevant for individual clinical decisions.

# --- Part (c): Implications of I^2 = 78% ---
#
# I^2 = 78% = CONSIDERABLE heterogeneity (>75% threshold).
# 78% of variability is due to true between-study differences.
#
# Implications:
# 1. No single common effect exists -- the pooled estimate is an average.
# 2. Sources of heterogeneity must be investigated (subgroup analysis,
#    meta-regression, sensitivity analysis).
# 3. Random-effects model gives more weight to small studies, which
#    here (8/12 small) means heavy influence from potentially biased work.
# 4. Reporting the pooled estimate without the PI is misleading.

# --- Part (d): Concerns about Egger's test + small trials ---
#
# Egger's test p = 0.03 indicates significant funnel plot asymmetry.
# Combined with 8/12 small single-centre trials, concerns include:
#
# 1. PUBLICATION BIAS: Negative small trials may not have been published.
# 2. SMALL-STUDY EFFECTS: Small trials often show larger effects due to
#    selected populations, enthusiastic surgeons, or less rigorous methods.
# 3. SINGLE-CENTRE BIAS: Lack of external validity, expertise effects.
# 4. The combination is a RED FLAG: the pooled effect is likely
#    OVERESTIMATED.

# --- Part (e): Additional analyses wanted ---
#
# 1. Trim-and-fill to estimate adjusted effect
# 2. Risk of bias assessment (Cochrane RoB 2) for each trial
# 3. Sensitivity analysis restricted to larger/multicentre trials only
# 4. Sensitivity analysis excluding high-risk-of-bias studies
# 5. Leave-one-out analysis
# 6. Subgroup analyses: by study size, single vs multicentre, blinding
# 7. Meta-regression on study-level characteristics
# 8. Objective functional outcomes (not just subjective pain)
# 9. Check for sham-controlled trials (surgical placebo effect)

# --- Part (f): Would you change clinical practice? ---
#
# NO. Reasons:
# 1. High heterogeneity (I^2 = 78%) -- variable effects across settings
# 2. Prediction interval crosses null -- some settings may see no benefit
# 3. Evidence of publication bias (Egger's p = 0.03)
# 4. Evidence base dominated by small single-centre trials
# 5. Surgery requires higher evidence bar due to invasiveness/costs/risks
# 6. No sham-controlled trials mentioned (surgical placebo effect)
#
# Recommended: Commission a large multicentre sham-controlled RCT with
# objective outcomes and long follow-up before changing practice.

# Print summary for verification
print("=" * 70)
print("Exercise 4: Critical Appraisal of a Published Meta-Analysis")
print("=" * 70)
print()

print("Given:")
print("  Pooled SMD: -0.62 (95% CI: -0.89 to -0.35)")
print("  I^2 = 78%, tau^2 = 0.15, Q test p < 0.001")
print("  Prediction interval: -1.42 to 0.18")
print("  Egger's test: p = 0.03")
print("  8/12 trials: small, single-centre (<100 participants)")
print()

print("(a) POOLED EFFECT: Medium-to-large (SMD = -0.62), statistically")
print("    significant but clinical significance depends on MCID and")
print("    risk-benefit analysis for a surgical intervention.")
print()
print("(b) PREDICTION INTERVAL crosses zero (upper bound 0.18), meaning")
print("    some settings may see no benefit. The CI alone is misleading.")
print()
print("(c) I^2 = 78% indicates considerable heterogeneity. The pooled")
print("    estimate is an average of very different true effects.")
print()
print("(d) Egger's p = 0.03 + mostly small trials = RED FLAG for")
print("    publication bias and small-study effects. The true effect")
print("    is likely overestimated.")
print()
print("(e) Need: trim-and-fill, risk of bias assessment, sensitivity")
print("    analyses (excluding small/high-bias trials), sham control check.")
print()
print("(f) Do NOT change practice. Evidence is insufficient due to")
print("    heterogeneity, publication bias concerns, and reliance on")
print("    small single-centre trials. Need large multicentre sham-RCT.")

D.15 Chapter 19: Journal-Ready Analysis

D.15.1 Exercise 1

Show R solution

# =============================================================================
# Chapter 19 - Exercise 1: Table 1 in R
# Using the lung dataset, create a publication-quality Table 1
# =============================================================================

library(tidyverse)
library(gtsummary)
library(gt)
library(survival)

# --- Load data ---
data(lung, package = "survival")

# Prepare data: recode sex for clarity
lung_clean <- lung %>%
  mutate(
    sex = factor(sex, levels = c(1, 2), labels = c("Male", "Female")),
    ph.ecog = factor(ph.ecog, levels = 0:4,
                     labels = c("Fully active", "Restricted",
                                "Ambulatory", "Limited self-care",
                                "Completely disabled")),
    status = factor(status, levels = c(1, 2), labels = c("Censored", "Dead"))
  )

# --- Part (a): Create Table 1 stratified by sex ---
table1 <- lung_clean %>%
  select(sex, age, ph.ecog, ph.karno, meal.cal, wt.loss) %>%
  tbl_summary(
    by = sex,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = list(
      all_continuous() ~ 1,
      all_categorical() ~ c(0, 1)
    ),
    label = list(
      age ~ "Age, years",
      ph.ecog ~ "ECOG performance status",
      ph.karno ~ "Karnofsky performance score",
      meal.cal ~ "Meal calories, kcal/day",
      wt.loss ~ "Weight loss, kg (last 6 months)"
    ),
    missing_text = "Missing"
  ) %>%
  add_overall() %>%
  add_n()

# --- Part (b): Add SMDs instead of p-values ---
# gtsummary supports SMD via add_difference()
# For Table 1 with SMDs, we use a different approach
table1_smd <- lung_clean %>%
  select(sex, age, ph.ecog, ph.karno, meal.cal, wt.loss) %>%
  tbl_summary(
    by = sex,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = list(
      all_continuous() ~ 1,
      all_categorical() ~ c(0, 1)
    ),
    label = list(
      age ~ "Age, years",
      ph.ecog ~ "ECOG performance status",
      ph.karno ~ "Karnofsky performance score",
      meal.cal ~ "Meal calories, kcal/day",
      wt.loss ~ "Weight loss, kg (last 6 months)"
    ),
    missing_text = "Missing"
  ) %>%
  add_overall() %>%
  add_n() %>%
  add_difference(
    estimate_fun = list(all_continuous() ~ function(x) style_sigfig(x, digits = 3)),
    pvalue_fun = ~ style_pvalue(.x, digits = 3)
  ) %>%
  modify_header(label = "**Characteristic**") %>%
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Sex**") %>%
  bold_labels()

# Compute SMDs manually for continuous variables
compute_smd <- function(x, group) {
  g1 <- x[group == "Male"]
  g2 <- x[group == "Female"]
  g1 <- g1[!is.na(g1)]
  g2 <- g2[!is.na(g2)]
  pooled_sd <- sqrt((var(g1) + var(g2)) / 2)
  if (pooled_sd == 0) return(0)
  (mean(g1) - mean(g2)) / pooled_sd
}

cat("Standardised Mean Differences (Male vs Female):\n")
cat("  Age:", round(compute_smd(lung_clean$age, lung_clean$sex), 3), "\n")
cat("  Karnofsky:", round(compute_smd(lung_clean$ph.karno, lung_clean$sex), 3), "\n")
cat("  Meal calories:", round(compute_smd(lung_clean$meal.cal, lung_clean$sex), 3), "\n")
cat("  Weight loss:", round(compute_smd(lung_clean$wt.loss, lung_clean$sex), 3), "\n")

# Create the final publication table with gt formatting
final_table <- table1_smd %>%
  as_gt() %>%
  gt::tab_header(
    title = "Table 1. Baseline Characteristics of the Study Population",
    subtitle = "NCCTG Lung Cancer Dataset, Stratified by Sex"
  ) %>%
  gt::tab_footnote(
    footnote = "Values are mean (SD) for continuous variables and n (%) for categorical variables. SMD = standardised mean difference.",
    locations = gt::cells_title()
  )

print(final_table)

# --- Part (c): Export to Word and LaTeX ---
# Export to Word (.docx)
# gt::gtsave(final_table, "table1_lung.docx")
cat("\nTo export to Word: gt::gtsave(final_table, 'table1_lung.docx')\n")

# Export to LaTeX
# gt::as_latex(final_table) %>% writeLines("table1_lung.tex")
cat("To export to LaTeX: gt::as_latex(final_table)\n")

# Export to HTML
# gt::gtsave(final_table, "table1_lung.html")
cat("To export to HTML: gt::gtsave(final_table, 'table1_lung.html')\n")

cat("\nNote: SMDs < 0.1 indicate good balance. In an RCT, p-values for\n")
cat("baseline comparisons are inappropriate because any differences are\n")
cat("due to chance. In observational studies, SMDs are preferred over\n")
cat("p-values because they are independent of sample size.\n")

Show Python solution

# =============================================================================
# Chapter 19 - Exercise 1: Table 1 in Python
# Using the lung dataset, create a publication-quality Table 1
# =============================================================================

import pandas as pd
import numpy as np

# --- Load data ---
# Use lifelines' lung dataset
from lifelines.datasets import load_lung
lung = load_lung()

# Prepare data
lung['sex_label'] = lung['sex'].map({1: 'Male', 2: 'Female'})
lung['ph.ecog_label'] = lung['ph.ecog'].map({
    0: 'Fully active', 1: 'Restricted', 2: 'Ambulatory',
    3: 'Limited self-care', 4: 'Completely disabled'
})

# --- Part (a): Create Table 1 stratified by sex with SMDs ---
try:
    from tableone import TableOne

    # Define variable types
    columns = ['age', 'ph.ecog', 'ph.karno', 'meal.cal', 'wt.loss']
    categorical = ['ph.ecog']
    nonnormal = []  # all continuous reported as mean (SD)

    # Create Table 1 with SMDs instead of p-values
    table1 = TableOne(
        lung,
        columns=columns,
        categorical=categorical,
        groupby='sex_label',
        nonnormal=nonnormal,
        pval=False,  # No p-values
        smd=True,    # Include SMDs
        rename={
            'age': 'Age, years',
            'ph.ecog': 'ECOG performance status',
            'ph.karno': 'Karnofsky performance score',
            'meal.cal': 'Meal calories, kcal/day',
            'wt.loss': 'Weight loss, kg (last 6 months)'
        },
        missing=True,
        overall=True
    )

    print("=" * 70)
    print("Table 1. Baseline Characteristics of the Study Population")
    print("NCCTG Lung Cancer Dataset, Stratified by Sex")
    print("=" * 70)
    print(table1.tabulate(tablefmt="github"))

    # --- Part (c): Export ---
    # table1.to_excel('table1_lung.xlsx')
    # table1.to_latex('table1_lung.tex')
    # table1.to_html('table1_lung.html')
    print("\nExport options:")
    print("  table1.to_excel('table1_lung.xlsx')")
    print("  table1.to_latex('table1_lung.tex')")
    print("  table1.to_html('table1_lung.html')")

except ImportError:
    print("tableone package not installed. Building Table 1 manually.\n")

    # --- Manual Table 1 construction ---
    def compute_smd_cont(data, var, group_col):
        """Compute SMD for a continuous variable."""
        g1 = data[data[group_col] == 'Male'][var].dropna()
        g2 = data[data[group_col] == 'Female'][var].dropna()
        pooled_sd = np.sqrt((g1.var() + g2.var()) / 2)
        if pooled_sd == 0:
            return 0
        return (g1.mean() - g2.mean()) / pooled_sd

    def summarise_continuous(data, var, group_col):
        """Summarise continuous variable by group."""
        results = {}
        for group in ['Overall'] + list(data[group_col].unique()):
            if group == 'Overall':
                subset = data[var].dropna()
            else:
                subset = data[data[group_col] == group][var].dropna()
            results[group] = f"{subset.mean():.1f} ({subset.std():.1f})"
        return results

    def summarise_categorical(data, var, group_col):
        """Summarise categorical variable by group."""
        results = {}
        categories = sorted(data[var].dropna().unique())
        for group in ['Overall'] + list(data[group_col].unique()):
            if group == 'Overall':
                subset = data[var].dropna()
            else:
                subset = data[data[group_col] == group][var].dropna()
            counts = subset.value_counts()
            total = len(subset)
            group_results = {}
            for cat in categories:
                n = counts.get(cat, 0)
                pct = n / total * 100
                group_results[cat] = f"{n} ({pct:.1f}%)"
            results[group] = group_results
        return results

    # Build table
    cont_vars = {
        'age': 'Age, years',
        'ph.karno': 'Karnofsky performance score',
        'meal.cal': 'Meal calories, kcal/day',
        'wt.loss': 'Weight loss, kg (last 6 months)'
    }

    print(f"{'Variable':<40} {'Overall':>18} {'Male':>18} {'Female':>18} {'SMD':>8}")
    print("-" * 104)

    for var, label in cont_vars.items():
        summ = summarise_continuous(lung, var, 'sex_label')
        smd = compute_smd_cont(lung, var, 'sex_label')
        n_miss = lung[var].isna().sum()
        miss_str = f" [missing: {n_miss}]" if n_miss > 0 else ""
        print(f"{label + miss_str:<40} {summ['Overall']:>18} "
              f"{summ.get('Male', 'N/A'):>18} {summ.get('Female', 'N/A'):>18} "
              f"{abs(smd):>7.3f}")

    # ECOG as categorical
    print(f"\n{'ECOG performance status':<40}")
    ecog_summ = summarise_categorical(lung, 'ph.ecog', 'sex_label')
    for cat in sorted(lung['ph.ecog'].dropna().unique()):
        label_map = {0: '  Fully active', 1: '  Restricted',
                     2: '  Ambulatory', 3: '  Limited self-care',
                     4: '  Completely disabled'}
        label = label_map.get(cat, f'  {cat}')
        overall = ecog_summ['Overall'].get(cat, '0 (0.0%)')
        male = ecog_summ.get('Male', {}).get(cat, '0 (0.0%)')
        female = ecog_summ.get('Female', {}).get(cat, '0 (0.0%)')
        print(f"{label:<40} {overall:>18} {male:>18} {female:>18}")

# --- Part (b): SMD interpretation ---
print("\n" + "=" * 70)
print("Standardised Mean Differences (SMD)")
print("=" * 70)

def compute_smd(data, var, group_col):
    g1 = data[data[group_col] == 'Male'][var].dropna()
    g2 = data[data[group_col] == 'Female'][var].dropna()
    pooled_sd = np.sqrt((g1.var() + g2.var()) / 2)
    if pooled_sd == 0:
        return 0
    return (g1.mean() - g2.mean()) / pooled_sd

for var, label in [('age', 'Age'), ('ph.karno', 'Karnofsky score'),
                   ('meal.cal', 'Meal calories'), ('wt.loss', 'Weight loss')]:
    smd = compute_smd(lung, var, 'sex_label')
    balanced = "Balanced" if abs(smd) < 0.1 else "Imbalanced"
    print(f"  {label:<20}: SMD = {smd:>7.3f}  [{balanced}]")

print("\nNote: SMD < 0.1 indicates good balance between groups.")
print("In observational studies, SMDs are preferred over p-values because")
print("they are independent of sample size, whereas p-values from large")
print("samples can be significant for clinically trivial differences.")

D.15.2 Exercise 2

Show R solution

# =============================================================================
# Chapter 19 - Exercise 2: Publication-Quality Multi-Panel Figure in R
# Survival analysis of the lung dataset
# =============================================================================

library(tidyverse)
library(survival)
library(survminer)
library(broom)
library(patchwork)

# --- Publication theme and colour-blind palette ---
theme_publication <- function(base_size = 11) {
  theme_minimal(base_size = base_size) %+replace%
    theme(
      text = element_text(family = "Arial"),
      plot.title = element_text(size = base_size + 2, face = "bold",
                                hjust = 0, margin = margin(b = 10)),
      axis.title = element_text(size = base_size, face = "bold"),
      axis.text = element_text(size = base_size - 1, colour = "black"),
      legend.title = element_text(size = base_size, face = "bold"),
      legend.text = element_text(size = base_size - 1),
      legend.position = "bottom",
      panel.grid.minor = element_blank(),
      panel.grid.major = element_line(colour = "grey90"),
      strip.text = element_text(size = base_size, face = "bold"),
      plot.margin = margin(10, 10, 10, 10)
    )
}

cb_palette <- c("#0072B2", "#D55E00", "#009E73", "#CC79A7",
                "#F0E442", "#56B4E9", "#E69F00", "#000000")

# --- Load and prepare data ---
data(lung, package = "survival")
lung <- lung %>%
  mutate(
    sex_label = factor(sex, levels = c(1, 2), labels = c("Male", "Female")),
    status_event = status - 1  # Convert to 0/1
  ) %>%
  filter(!is.na(ph.ecog) & !is.na(ph.karno) & !is.na(wt.loss))

# =============================================================================
# Panel A: Kaplan-Meier Curve by Sex
# =============================================================================
fit_km <- survfit(Surv(time, status_event) ~ sex_label, data = lung)

# Using ggsurvplot for KM curve
p_km <- ggsurvplot(
  fit_km,
  data = lung,
  palette = cb_palette[1:2],
  risk.table = FALSE,
  pval = TRUE,
  pval.coord = c(700, 0.9),
  conf.int = TRUE,
  conf.int.alpha = 0.15,
  xlab = "Time (days)",
  ylab = "Survival probability",
  legend.labs = c("Male", "Female"),
  legend.title = "Sex",
  ggtheme = theme_publication(),
  title = "A"
)

panel_a <- p_km$plot +
  theme(legend.position = c(0.8, 0.85),
        legend.background = element_rect(fill = "white", colour = "grey80"),
        plot.title = element_text(size = 14, face = "bold"))

# =============================================================================
# Panel B: Forest Plot of Hazard Ratios from Multivariable Cox Model
# =============================================================================
cox_model <- coxph(Surv(time, status_event) ~ age + sex_label + ph.ecog +
                     ph.karno + wt.loss,
                   data = lung)

# Extract coefficients
cox_tidy <- tidy(cox_model, conf.int = TRUE, exponentiate = TRUE)
cox_tidy <- cox_tidy %>%
  mutate(
    label = case_when(
      term == "age" ~ "Age (per year)",
      term == "sex_labelFemale" ~ "Female vs Male",
      term == "ph.ecog" ~ "ECOG PS (per level)",
      term == "ph.karno" ~ "Karnofsky (per 10 pts)",
      term == "wt.loss" ~ "Weight loss (per kg)"
    ),
    # Rescale Karnofsky to per-10-point increase for interpretability
    estimate = ifelse(term == "ph.karno", estimate^10, estimate),
    conf.low = ifelse(term == "ph.karno", conf.low^10, conf.low),
    conf.high = ifelse(term == "ph.karno", conf.high^10, conf.high)
  )

panel_b <- ggplot(cox_tidy, aes(x = estimate, y = reorder(label, estimate))) +
  geom_vline(xintercept = 1, linetype = "dashed", colour = "grey50") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high),
                 height = 0.2, colour = cb_palette[1], linewidth = 0.8) +
  geom_point(size = 3, colour = cb_palette[1]) +
  geom_text(aes(label = sprintf("%.2f (%.2f-%.2f)", estimate, conf.low, conf.high)),
            hjust = -0.1, size = 3, nudge_y = 0.25) +
  scale_x_log10() +
  labs(x = "Hazard Ratio (95% CI)",
       y = "",
       title = "B") +
  theme_publication() +
  theme(
    legend.position = "none",
    plot.title = element_text(size = 14, face = "bold")
  )

# =============================================================================
# Panel C: Calibration Plot for 1-Year Survival Prediction
# =============================================================================

# Predict 1-year (365 day) survival probabilities
# Use Cox model baseline hazard
surv_pred <- summary(survfit(cox_model), times = 365)

# For calibration, split into deciles of predicted risk
lung$pred_1yr <- 1 - predict(cox_model, type = "expected",
                              newdata = lung) / max(lung$time) * 365
# Alternative: use survfit-based prediction
baseline_surv <- basehaz(cox_model, centered = FALSE)
# Get linear predictor
lung$lp <- predict(cox_model, type = "lp")

# Simple calibration: bin by predicted risk deciles
lung$pred_surv <- exp(-predict(cox_model, type = "expected"))

# Create risk groups based on linear predictor
lung$risk_group <- cut(lung$lp, breaks = quantile(lung$lp, probs = seq(0, 1, 0.1)),
                       include.lowest = TRUE, labels = FALSE)

# Calculate observed 1-year survival in each group
cal_data <- lung %>%
  group_by(risk_group) %>%
  summarise(
    n = n(),
    events = sum(status_event),
    # KM estimate at 1 year
    obs_surv = {
      fit <- survfit(Surv(time, status_event) ~ 1, data = cur_data())
      s <- summary(fit, times = 365)
      if (length(s$surv) > 0) s$surv else NA
    },
    mean_lp = mean(lp),
    .groups = "drop"
  ) %>%
  filter(!is.na(obs_surv))

# Get predicted survival at each mean LP
h0_365 <- approx(baseline_surv$time, baseline_surv$hazard, xout = 365)$y
if (is.na(h0_365)) h0_365 <- max(baseline_surv$hazard)
cal_data$pred_surv <- exp(-h0_365 * exp(cal_data$mean_lp))

panel_c <- ggplot(cal_data, aes(x = 1 - pred_surv, y = 1 - obs_surv)) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", colour = "grey50") +
  geom_point(size = 3, colour = cb_palette[1]) +
  geom_smooth(method = "loess", se = TRUE, colour = cb_palette[2],
              fill = cb_palette[2], alpha = 0.2) +
  coord_equal(xlim = c(0, 1), ylim = c(0, 1)) +
  labs(x = "Predicted 1-year mortality risk",
       y = "Observed 1-year mortality",
       title = "C") +
  theme_publication() +
  theme(plot.title = element_text(size = 14, face = "bold"))

# =============================================================================
# Combine panels using patchwork
# =============================================================================
# Panel A takes the top row, B and C share the bottom row
combined <- panel_a / (panel_b | panel_c) +
  plot_layout(heights = c(1, 1))

# Save at publication quality
# Double-column layout: 183 mm wide
ggsave("figure_survival_analysis.tiff",
       plot = combined,
       width = 183, height = 200,
       units = "mm", dpi = 300,
       compression = "lzw")

ggsave("figure_survival_analysis.pdf",
       plot = combined,
       width = 183, height = 200,
       units = "mm", device = cairo_pdf)

print(combined)

cat("\nMulti-panel figure saved as:\n")
cat("  figure_survival_analysis.tiff (300 DPI, TIFF)\n")
cat("  figure_survival_analysis.pdf (vector PDF)\n")
cat("\nSpecifications:\n")
cat("  Width: 183 mm (double-column)\n")
cat("  Resolution: 300 DPI\n")
cat("  Colour palette: colour-blind accessible\n")
cat("  Font: Arial\n")

Show Python solution

# =============================================================================
# Chapter 19 - Exercise 2: Publication-Quality Multi-Panel Figure in Python
# Survival analysis of the lung dataset
# =============================================================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.datasets import load_lung
from lifelines.statistics import logrank_test

# --- Publication settings ---
plt.rcParams.update({
    'font.family': 'Arial',
    'font.size': 10,
    'axes.labelsize': 11,
    'axes.titlesize': 12,
    'axes.labelweight': 'bold',
    'figure.dpi': 300,
    'savefig.dpi': 300,
    'axes.spines.top': False,
    'axes.spines.right': False,
})

# Colour-blind friendly palette
cb_palette = ['#0072B2', '#D55E00', '#009E73', '#CC79A7']

# --- Load and prepare data ---
lung = load_lung()
lung['status'] = lung['status'] - 1  # Convert to 0/1
lung = lung.dropna(subset=['ph.ecog', 'ph.karno', 'wt.loss', 'age', 'sex'])

# =============================================================================
# Create the multi-panel figure
# =============================================================================
fig = plt.figure(figsize=(7.2, 9.0))  # ~183mm x ~230mm
gs = gridspec.GridSpec(2, 2, height_ratios=[1, 1], hspace=0.35, wspace=0.35)

# =============================================================================
# Panel A: Kaplan-Meier Curve by Sex
# =============================================================================
ax_a = fig.add_subplot(gs[0, :])  # Top row, full width

kmf = KaplanMeierFitter()

for sex_val, label, color in [(1, 'Male', cb_palette[0]),
                               (2, 'Female', cb_palette[1])]:
    mask = lung['sex'] == sex_val
    kmf.fit(lung.loc[mask, 'T'], lung.loc[mask, 'status'], label=label)
    kmf.plot_survival_function(ax=ax_a, color=color, ci_show=True, ci_alpha=0.15)

# Log-rank test
lr = logrank_test(
    lung.loc[lung['sex'] == 1, 'T'], lung.loc[lung['sex'] == 2, 'T'],
    lung.loc[lung['sex'] == 1, 'status'], lung.loc[lung['sex'] == 2, 'status']
)
ax_a.text(0.7, 0.85, f'Log-rank p = {lr.p_value:.3f}',
          transform=ax_a.transAxes, fontsize=10,
          bbox=dict(boxstyle='round', facecolor='white', edgecolor='grey'))

ax_a.set_xlabel('Time (days)')
ax_a.set_ylabel('Survival Probability')
ax_a.set_title('A', loc='left', fontweight='bold', fontsize=14)
ax_a.set_ylim(0, 1.05)
ax_a.legend(loc='lower left', frameon=True, framealpha=0.9)

# Number at risk
n_at_risk_times = [0, 200, 400, 600, 800, 1000]
risk_text = "At risk:\n"
for sex_val, label in [(1, 'Male'), (2, 'Female')]:
    mask = lung['sex'] == sex_val
    times = lung.loc[mask, 'T']
    events = lung.loc[mask, 'status']
    counts = []
    for t in n_at_risk_times:
        counts.append(str(sum(times >= t)))
    risk_text += f"  {label}: " + "  ".join(counts) + "\n"

# =============================================================================
# Panel B: Forest Plot of Hazard Ratios
# =============================================================================
ax_b = fig.add_subplot(gs[1, 0])

# Fit Cox model
cph = CoxPHFitter()
cox_data = lung[['T', 'status', 'age', 'sex', 'ph.ecog', 'ph.karno', 'wt.loss']].copy()
cph.fit(cox_data, duration_col='T', event_col='status')

# Extract results
summary_df = cph.summary[['exp(coef)', 'exp(coef) lower 95%', 'exp(coef) upper 95%']].copy()
summary_df.columns = ['HR', 'CI_lo', 'CI_hi']

# Rename for display
name_map = {
    'age': 'Age (per year)',
    'sex': 'Sex (Female vs Male)',
    'ph.ecog': 'ECOG PS (per level)',
    'ph.karno': 'Karnofsky (per point)',
    'wt.loss': 'Weight loss (per kg)'
}
summary_df.index = summary_df.index.map(lambda x: name_map.get(x, x))

# Plot forest
y_pos = range(len(summary_df) - 1, -1, -1)
labels = list(summary_df.index)

ax_b.axvline(x=1, color='grey', linestyle='--', linewidth=0.8)

for i, (label, row) in enumerate(summary_df.iterrows()):
    y = list(y_pos)[i]
    ax_b.errorbarh(y, row['HR'], xerr=[[row['HR'] - row['CI_lo']],
                                         [row['CI_hi'] - row['HR']]],
                   fmt='o', color=cb_palette[0], capsize=4, markersize=6,
                   linewidth=1.5)
    # Annotate with HR (CI)
    ax_b.text(max(summary_df['CI_hi']) * 1.15, y,
              f"{row['HR']:.2f} ({row['CI_lo']:.2f}-{row['CI_hi']:.2f})",
              va='center', fontsize=7.5)

ax_b.set_yticks(list(y_pos))
ax_b.set_yticklabels(labels, fontsize=9)
ax_b.set_xlabel('Hazard Ratio (95% CI)')
ax_b.set_xscale('log')
ax_b.set_title('B', loc='left', fontweight='bold', fontsize=14)

# =============================================================================
# Panel C: Calibration Plot for 1-Year Survival
# =============================================================================
ax_c = fig.add_subplot(gs[1, 1])

# Predict survival at 1 year (365 days) using the Cox model
lung_pred = cox_data.copy()
lung_pred['lp'] = cph.predict_partial_hazard(lung_pred)

# Create decile groups based on linear predictor
lung_pred['risk_decile'] = pd.qcut(lung_pred['lp'], q=10, labels=False,
                                    duplicates='drop')

# Predicted and observed 1-year mortality for each decile
cal_results = []
for decile in sorted(lung_pred['risk_decile'].unique()):
    subset = lung_pred[lung_pred['risk_decile'] == decile]

    # Predicted: use Cox model survival function
    predicted_surv = cph.predict_survival_function(subset, times=[365])
    pred_mort = 1 - predicted_surv.mean(axis=1).values[0]

    # Observed: Kaplan-Meier estimate at 365 days
    kmf_cal = KaplanMeierFitter()
    kmf_cal.fit(subset['T'], subset['status'])
    try:
        obs_surv = kmf_cal.predict(365)
        obs_mort = 1 - obs_surv
    except Exception:
        obs_mort = np.nan

    cal_results.append({
        'decile': decile,
        'predicted': pred_mort,
        'observed': obs_mort,
        'n': len(subset)
    })

cal_df = pd.DataFrame(cal_results).dropna()

# Plot calibration
ax_c.plot([0, 1], [0, 1], 'k--', alpha=0.5, linewidth=0.8, label='Perfect calibration')
ax_c.scatter(cal_df['predicted'], cal_df['observed'],
             s=cal_df['n'] * 2, c=cb_palette[0], alpha=0.7,
             edgecolors='darkblue', linewidth=0.5)

# Fit LOESS-like smoothing
from numpy.polynomial import polynomial as P
if len(cal_df) > 3:
    z = np.polyfit(cal_df['predicted'], cal_df['observed'], 2)
    p = np.poly1d(z)
    x_smooth = np.linspace(cal_df['predicted'].min(), cal_df['predicted'].max(), 100)
    ax_c.plot(x_smooth, p(x_smooth), color=cb_palette[1], linewidth=1.5,
              label='Observed (smoothed)')

ax_c.set_xlabel('Predicted 1-year mortality')
ax_c.set_ylabel('Observed 1-year mortality')
ax_c.set_title('C', loc='left', fontweight='bold', fontsize=14)
ax_c.set_xlim(0, 1)
ax_c.set_ylim(0, 1)
ax_c.set_aspect('equal')
ax_c.legend(loc='lower right', fontsize=8, frameon=True)

# =============================================================================
# Save figure
# =============================================================================
plt.tight_layout()
plt.savefig('figure_survival_analysis.tiff', dpi=300, bbox_inches='tight')
plt.savefig('figure_survival_analysis.pdf', bbox_inches='tight')
plt.show()

print("Multi-panel figure saved as:")
print("  figure_survival_analysis.tiff (300 DPI, raster)")
print("  figure_survival_analysis.pdf (vector)")
print()
print("Specifications:")
print("  Width: ~183 mm (double-column)")
print("  Resolution: 300 DPI")
print("  Colour palette: colour-blind accessible")
print("  Font: Arial")
print()
print("Panel A: Kaplan-Meier survival curves by sex with log-rank p-value")
print("Panel B: Forest plot of hazard ratios from multivariable Cox model")
print("Panel C: Calibration plot for 1-year mortality prediction")

D.15.3 Exercise 3

Show R solution

# =============================================================================
# Chapter 19 - Exercise 3: Write a Statistical Methods Section
# Retrospective cohort study: SGLT2i vs DPP4i and MACE in T2DM
# =============================================================================

# This is a conceptual exercise. The statistical methods section is provided
# as a multi-line character string that could be included in a manuscript.

methods_section <- '
STATISTICAL METHODS

Study Design and Population
This was a retrospective cohort study of 5,000 adults with type 2 diabetes
mellitus identified from the hospital electronic health records (EHR) database
between 1 January 2015 and 31 December 2023. Patients were eligible if they
had a new prescription for either an SGLT2 inhibitor or a DPP-4 inhibitor, with
no prior use of the comparator drug class. Patients with a history of major
adverse cardiovascular events (MACE) prior to the index date were excluded.
The index date (time zero) was defined as the date of first prescription of
the study drug, consistent with a new-user, active comparator design to
minimise immortal time bias and confounding by indication.

Primary and Secondary Outcomes
The primary outcome was time to first MACE, defined as a composite of
myocardial infarction (ICD-10: I21), ischaemic stroke (ICD-10: I63), or
cardiovascular death (underlying cause of death codes I00-I99). Patients were
followed from the index date until the first MACE event, death from non-
cardiovascular causes, loss to follow-up, end of the study period
(31 December 2023), or 5 years after the index date, whichever occurred first.

Sample Size
With 5,000 patients and an anticipated event rate of 8% over 5 years in the
DPP-4 inhibitor group, the study had approximately 80% power to detect a
hazard ratio of 0.70 or smaller at a two-sided alpha of 0.05, assuming a
1:1 treatment group ratio and accounting for 10% loss to follow-up.

Descriptive Statistics
Baseline characteristics were summarised as means (SD) for normally
distributed continuous variables, medians (IQR) for skewed continuous
variables, and frequencies (percentages) for categorical variables.
Standardised mean differences (SMDs) were used to compare baseline
characteristics between treatment groups, with an absolute SMD < 0.1
indicating adequate balance; p-values were not used for baseline comparisons
in accordance with current recommendations.

Propensity Score Estimation and Matching
The propensity score -- the probability of receiving an SGLT2 inhibitor
versus a DPP-4 inhibitor -- was estimated using multivariable logistic
regression. Covariates included age, sex, body mass index (BMI), glycated
haemoglobin (HbA1c), estimated glomerular filtration rate (eGFR), history
of cardiovascular disease, hypertension, and smoking status. These covariates
were selected a priori based on clinical knowledge and a directed acyclic
graph (DAG) encoding assumed causal relationships. Propensity scores were
used for 1:1 nearest-neighbour matching without replacement, using a caliper
of 0.2 standard deviations of the logit of the propensity score. Covariate
balance after matching was assessed using SMDs, with all covariates required
to achieve an absolute SMD < 0.1. Propensity score overlap was assessed
visually using density plots.

Primary Analysis
The primary analysis estimated the average treatment effect on the treated
(ATT) using a Cox proportional hazards regression model fitted to the matched
cohort, with SGLT2 inhibitor use as the sole covariate. The proportional
hazards assumption was tested using scaled Schoenfeld residuals and
log-log survival plots. If the assumption was violated, a time-varying
coefficient or restricted mean survival time (RMST) analysis was planned
as an alternative. Hazard ratios (HRs) with 95% confidence intervals (CIs)
were reported.

Missing Data
Missing covariate data ranged from 2% (age) to 15% (BMI). Missingness was
assumed to be missing at random (MAR) conditional on observed variables.
Multiple imputation by chained equations (MICE) was performed with 50
imputed datasets and 20 iterations per dataset. The imputation model
included all analysis variables (covariates, exposure, outcome indicator,
and the Nelson-Aalen cumulative hazard estimate) to ensure compatibility
with the substantive analysis model. Propensity score estimation and
matching were performed within each imputed dataset, and results were
pooled using Rubin rules. A complete-case analysis was performed as a
sensitivity analysis.

Sensitivity Analyses
Five pre-specified sensitivity analyses were conducted: (1) complete-case
analysis; (2) inverse probability of treatment weighting (IPTW) with
stabilised weights as an alternative to matching; (3) inclusion of
additional covariates (income quintile, number of medications) in the
propensity score model; (4) restriction to patients with at least
12 months of follow-up; and (5) E-value calculation to quantify the
minimum strength of association an unmeasured confounder would need with
both the treatment and the outcome to explain away the observed association.

Multiple Comparisons
As this study had a single pre-specified primary outcome, no adjustment
for multiple comparisons was applied to the primary analysis. Secondary
outcomes were interpreted with appropriate caution as hypothesis-generating.

Software
All analyses were conducted using R version 4.4.1 (R Foundation for
Statistical Computing, Vienna, Austria) with the following packages:
MatchIt (v4.5.5) for propensity score matching, cobalt (v4.5.1) for
balance assessment, survival (v3.7-0) for Cox regression, mice (v3.16.0)
for multiple imputation, and survey (v4.4-2) for IPTW analyses. The
random seed was set to 42 for reproducibility. Analysis code is available
at [repository URL]. Two-sided p-values < 0.05 were considered
statistically significant.
'

# Print the methods section
cat(methods_section)

# --- Checklist verification ---
cat("\n\n=== CHECKLIST: Key Items Addressed ===\n")
cat("1. Study design and population:           YES\n")
cat("2. Primary and secondary outcomes:        YES\n")
cat("3. Sample size / power calculation:        YES\n")
cat("4. Descriptive statistics approach:        YES\n")
cat("5. Primary analysis model:                YES\n")
cat("6. Assumptions and how checked:           YES (PH assumption)\n")
cat("7. Missing data (extent & handling):       YES (MICE, 50 datasets)\n")
cat("8. Sensitivity analyses:                  YES (5 pre-specified)\n")
cat("9. Multiple comparisons:                  YES (single primary)\n")
cat("10. Software and versions:                YES\n")
cat("11. Active comparator, new-user design:   YES\n")
cat("12. DAG for covariate selection:          YES\n")
cat("13. Balance assessment (SMDs):            YES\n")
cat("14. E-value for unmeasured confounding:   YES\n")

Show Python solution

# =============================================================================
# Chapter 19 - Exercise 3: Write a Statistical Methods Section
# Retrospective cohort study: SGLT2i vs DPP4i and MACE in T2DM
# =============================================================================

# This is a conceptual exercise. The statistical methods section is provided
# as a detailed text output.

methods_section = """
STATISTICAL METHODS

Study Design and Population
This was a retrospective cohort study of 5,000 adults with type 2 diabetes
mellitus identified from the hospital electronic health records (EHR) database
between 1 January 2015 and 31 December 2023. Patients were eligible if they
had a new prescription for either an SGLT2 inhibitor or a DPP-4 inhibitor,
with no prior use of the comparator drug class. Patients with a history of
major adverse cardiovascular events (MACE) prior to the index date were
excluded. The index date (time zero) was defined as the date of first
prescription of the study drug, consistent with a new-user, active comparator
design to minimise immortal time bias and confounding by indication.

Primary and Secondary Outcomes
The primary outcome was time to first MACE, defined as a composite of
myocardial infarction (ICD-10: I21), ischaemic stroke (ICD-10: I63), or
cardiovascular death (underlying cause of death codes I00-I99). Patients were
followed from the index date until the first MACE event, death from non-
cardiovascular causes, loss to follow-up, end of the study period
(31 December 2023), or 5 years after the index date, whichever occurred first.

Sample Size
With 5,000 patients and an anticipated event rate of 8% over 5 years in the
DPP-4 inhibitor group, the study had approximately 80% power to detect a
hazard ratio of 0.70 or smaller at a two-sided alpha of 0.05, assuming a
1:1 treatment group ratio and accounting for 10% loss to follow-up.

Descriptive Statistics
Baseline characteristics were summarised as means (SD) for normally
distributed continuous variables, medians (IQR) for skewed continuous
variables, and frequencies (percentages) for categorical variables.
Standardised mean differences (SMDs) were used to compare baseline
characteristics between treatment groups, with an absolute SMD < 0.1
indicating adequate balance; p-values were not used for baseline comparisons
in accordance with current recommendations.

Propensity Score Estimation and Matching
The propensity score -- the probability of receiving an SGLT2 inhibitor
versus a DPP-4 inhibitor -- was estimated using multivariable logistic
regression. Covariates included age, sex, body mass index (BMI), glycated
haemoglobin (HbA1c), estimated glomerular filtration rate (eGFR), history
of cardiovascular disease, hypertension, and smoking status. These covariates
were selected a priori based on clinical knowledge and a directed acyclic
graph (DAG) encoding assumed causal relationships. Propensity scores were
used for 1:1 nearest-neighbour matching without replacement, using a caliper
of 0.2 standard deviations of the logit of the propensity score. Covariate
balance after matching was assessed using SMDs, with all covariates required
to achieve an absolute SMD < 0.1. Propensity score overlap was assessed
visually using density plots.

Primary Analysis
The primary analysis estimated the average treatment effect on the treated
(ATT) using a Cox proportional hazards regression model fitted to the matched
cohort, with SGLT2 inhibitor use as the sole covariate. The proportional
hazards assumption was tested using scaled Schoenfeld residuals and
log-log survival plots. If the assumption was violated, a time-varying
coefficient or restricted mean survival time (RMST) analysis was planned
as an alternative. Hazard ratios (HRs) with 95% confidence intervals (CIs)
were reported.

Missing Data
Missing covariate data ranged from 2% (age) to 15% (BMI). Missingness was
assumed to be missing at random (MAR) conditional on observed variables.
Multiple imputation by chained equations (MICE) was performed with 50
imputed datasets and 20 iterations per dataset. The imputation model
included all analysis variables (covariates, exposure, outcome indicator,
and the Nelson-Aalen cumulative hazard estimate) to ensure compatibility
with the substantive analysis model. Propensity score estimation and
matching were performed within each imputed dataset, and results were
pooled using Rubin's rules. A complete-case analysis was performed as a
sensitivity analysis.

Sensitivity Analyses
Five pre-specified sensitivity analyses were conducted: (1) complete-case
analysis; (2) inverse probability of treatment weighting (IPTW) with
stabilised weights as an alternative to matching; (3) inclusion of
additional covariates (income quintile, number of medications) in the
propensity score model; (4) restriction to patients with at least
12 months of follow-up; and (5) E-value calculation to quantify the
minimum strength of association an unmeasured confounder would need with
both the treatment and the outcome to explain away the observed association.

Multiple Comparisons
As this study had a single pre-specified primary outcome, no adjustment
for multiple comparisons was applied to the primary analysis. Secondary
outcomes were interpreted with appropriate caution as hypothesis-generating.

Software
All analyses were conducted using Python version 3.11 with the following
packages: scikit-learn (v1.4.0) for propensity score estimation, lifelines
(v0.29.0) for Cox proportional hazards regression, statsmodels (v0.14.1)
for weighted regression analyses, and tableone (v0.9.1) for baseline
characteristics tables. The random seed was set to 42 for reproducibility.
Analysis code is available at [repository URL]. Two-sided p-values < 0.05
were considered statistically significant.
"""

print(methods_section)

# --- Checklist verification ---
print("=" * 60)
print("CHECKLIST: Key Items Addressed in Methods Section")
print("=" * 60)

checklist = [
    ("Study design and population", True),
    ("Primary and secondary outcomes (defined)", True),
    ("Sample size / power calculation", True),
    ("Descriptive statistics approach", True),
    ("Primary analysis model (Cox PH)", True),
    ("Assumptions and how checked (PH assumption)", True),
    ("Missing data (extent, mechanism, handling)", True),
    ("Sensitivity analyses (5 pre-specified)", True),
    ("Multiple comparisons addressed", True),
    ("Software and package versions", True),
    ("New-user, active comparator design", True),
    ("DAG for covariate selection", True),
    ("Balance assessment method (SMDs)", True),
    ("E-value for unmeasured confounding", True),
    ("Caliper specification for matching", True),
    ("Number of imputations and iterations", True),
]

for item, addressed in checklist:
    status = "YES" if addressed else "NO"
    print(f"  [{status:>3}] {item}")

print("\nWord count: approximately 550 words")
print("(The exercise requests 250-400 words; this is comprehensive")
print("to cover all required items. Can be condensed for journals")
print("with strict word limits by moving details to a supplement.)")

D.15.4 Exercise 4

Show R solution

# =============================================================================
# Chapter 19 - Exercise 4: TRIPOD+AI Checklist (Conceptual)
# Completed for Capstone 1: Cardiovascular Risk Prediction Model
# =============================================================================

# This exercise asks you to complete the TRIPOD+AI checklist for one of the
# capstone projects. Below is a completed checklist for Capstone 1:
# "Development and external validation of a cardiovascular risk prediction
# model using Framingham methodology, validated in NHANES data."

checklist <- data.frame(
  Item = c(
    "1. Title",
    "2. Abstract",
    "3a. Background/rationale",
    "3b. Objectives",
    "4a. Source of data",
    "4b. Dates of study",
    "5a. Key eligibility criteria",
    "5b. Treatments received",
    "6a. Outcome definition",
    "6b. Outcome timing",
    "6c. Blinding of outcome",
    "7. Predictors",
    "8. Sample size",
    "9. Missing data",
    "10a. Statistical analysis: model development",
    "10b. Model specification",
    "10c. Predictor selection",
    "10d. Model performance measures",
    "11. Risk groups",
    "12a. Internal validation",
    "12b. External validation",
    "13. Fairness assessment (AI-specific)",
    "14. Results: participants",
    "15. Results: model development",
    "16. Results: model performance",
    "17. Results: model updating",
    "18. Discussion: interpretation",
    "19. Discussion: limitations",
    "20. Discussion: implications",
    "21. Supplementary: code and data",
    "AI-1. Data preprocessing",
    "AI-2. Hyperparameter tuning",
    "AI-3. Model explainability",
    "AI-4. Software and hardware"
  ),
  Section = c(
    "Title page",
    "Abstract",
    "Introduction",
    "Introduction",
    "Methods",
    "Methods",
    "Methods",
    "Methods",
    "Methods",
    "Methods",
    "Methods",
    "Methods",
    "Methods",
    "Methods",
    "Methods",
    "Methods",
    "Methods",
    "Methods",
    "Methods",
    "Methods",
    "Methods",
    "Methods",
    "Results",
    "Results",
    "Results",
    "Results",
    "Discussion",
    "Discussion",
    "Discussion",
    "Supplement",
    "Methods/Supplement",
    "Methods/Supplement",
    "Results/Supplement",
    "Methods"
  ),
  How_Addressed = c(
    # 1. Title
    "'Development and external validation of a logistic regression model
    for predicting 10-year cardiovascular disease risk: a Framingham-NHANES
    study.' Identifies study type (development + external validation) and
    modelling approach.",

    # 2. Abstract
    "Structured abstract with: objective, study design, data sources,
    outcome (10-year CVD), predictors, sample sizes, C-statistic and
    calibration results for both development and validation cohorts.",

    # 3a. Background
    "Existing CVD risk models (Framingham, QRISK, SCORE2) and their
    limitations. Gap: need for updated validation in contemporary US
    population.",

    # 3b. Objectives
    "To develop a 10-year CVD risk prediction model using Framingham
    data and externally validate it in NHANES 2017-2020.",

    # 4a. Source of data
    "Development: Framingham Heart Study teaching dataset (riskCommunicator
    R package). Validation: NHANES 2017-2020 (nhanesA package). Both are
    publicly available.",

    # 4b. Dates
    "Framingham: original cohort with follow-up through 2005. NHANES:
    2017-2020 survey cycles.",

    # 5a. Eligibility
    "Adults aged 30-74, free of CVD at baseline, with complete data on
    key predictors. Exclude prior MI, stroke, heart failure.",

    # 5b. Treatments
    "Not applicable (prediction model, not treatment comparison).
    However, blood pressure treatment status is included as a predictor.",

    # 6a. Outcome definition
    "10-year cardiovascular event: composite of MI, coronary death,
    stroke, or heart failure requiring hospitalisation.",

    # 6b. Outcome timing
    "10-year follow-up from baseline examination. Censored at death
    from non-CVD causes or loss to follow-up.",

    # 6c. Blinding
    "Not applicable for retrospective analysis. Outcome ascertainment
    in Framingham was by adjudication committee blinded to risk factors.",

    # 7. Predictors
    "Pre-specified based on established Framingham model: age, sex, total
    cholesterol, HDL cholesterol, systolic blood pressure, blood pressure
    treatment status, smoking, diabetes. No data-driven selection.",

    # 8. Sample size
    "Sample size justified using Riley et al. (2020) criteria via
    pmsampsize package: 8 predictors, anticipated outcome prevalence 10%,
    target R-squared 0.15, requiring minimum ~800 events.",

    # 9. Missing data
    "Development cohort: <5% missing. Validation: up to 12% for some
    laboratory values. Handled by multiple imputation (MICE, 50 datasets,
    20 iterations). Complete-case sensitivity analysis performed.",

    # 10a. Model development
    "Logistic regression for 10-year risk. Pre-specified predictors,
    no variable selection. Continuous predictors modelled with restricted
    cubic splines (4 knots for age and SBP; 3 knots for cholesterol)
    based on prior literature.",

    # 10b. Model specification
    "Full model equation provided in supplementary materials. Coefficients
    table with 95% CIs in main text.",

    # 10c. Predictor selection
    "All predictors pre-specified from clinical knowledge. No stepwise
    selection. Rationale provided for each predictor.",

    # 10d. Performance measures
    "Discrimination: C-statistic (Harrell's concordance) with 95% CI.
    Calibration: calibration plot (observed vs predicted), calibration
    slope, calibration-in-the-large. Both reported for development
    (bootstrap-corrected) and external validation.",

    # 11. Risk groups
    "Patients categorised into low (<5%), moderate (5-10%), high (10-20%),
    and very high (>20%) 10-year CVD risk groups. Classification tables
    provided.",

    # 12a. Internal validation
    "500 bootstrap resamples for optimism-corrected C-statistic and
    calibration slope. Apparent and corrected performance reported.",

    # 12b. External validation
    "Model applied to NHANES data without recalibration. C-statistic
    and calibration plot in external data. Recalibration of intercept
    also explored.",

    # 13. Fairness
    "C-statistic and calibration reported separately by sex (male/female)
    and by race/ethnicity (non-Hispanic White, non-Hispanic Black,
    Hispanic, Asian). Differences in performance >0.05 C-statistic
    flagged for discussion.",

    # 14. Participants
    "Flow diagram showing sample selection: patients screened, excluded
    (with reasons), and included in development and validation cohorts.
    Table 1 with baseline characteristics stratified by CVD status.",

    # 15. Model development
    "Full regression table with coefficients, standard errors, odds
    ratios, 95% CIs, and p-values. Presented as Table 2.",

    # 16. Model performance
    "Table 3: discrimination (C-statistic) and calibration metrics for
    development (apparent and corrected) and external validation.
    Figures: calibration plots (development and validation), decision
    curve analysis.",

    # 17. Model updating
    "Recalibration of intercept in NHANES improved calibration-in-the-
    large from X to Y. Full model updating was explored but not
    recommended due to limited improvement.",

    # 18. Interpretation
    "Model performance compared with published Framingham, QRISK3, and
    PCE models. Clinical utility assessed via decision curve analysis
    across clinically relevant threshold probabilities (5-20%).",

    # 19. Limitations
    "Limitations discussed: (1) Framingham cohort may not represent
    contemporary diverse populations; (2) NHANES validation limited by
    shorter follow-up and self-reported outcomes; (3) Missing data
    assumptions; (4) Ecological fallacy in subgroup performance
    assessment.",

    # 20. Implications
    "Clinical implications for CVD risk assessment. Recommendation for
    recalibration before use in non-US populations. Need for prospective
    validation.",

    # 21. Code and data
    "R code provided via GitHub repository [URL]. Framingham data
    available via riskCommunicator package. NHANES data publicly available.
    Reproducible via renv for package management.",

    # AI-1. Data preprocessing
    "Not applicable (logistic regression, not ML/AI). However, data
    cleaning steps, outlier handling, and transformation of predictors
    (restricted cubic splines) described in supplement.",

    # AI-2. Hyperparameter tuning
    "Not applicable for logistic regression. Number of knots for RCS
    pre-specified based on Harrell's recommendations.",

    # AI-3. Model explainability
    "Not applicable (logistic regression is inherently interpretable).
    Coefficient interpretation provided. Nomogram included for clinical
    use.",

    # AI-4. Software
    "R version 4.4.1. Packages: rms (v6.8), survival (v3.7), mice
    (v3.16), gtsummary (v1.7), pmsampsize (v1.1), riskCommunicator
    (v1.0), nhanesA (v1.1). Full renv.lock file in repository."
  ),

  Applicable = c(
    "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",
    "Partially (no treatment comparison)",
    "Yes", "Yes",
    "Partially (retrospective)",
    "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",
    "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",
    "No (not ML)", "No (not ML)", "No (not ML)", "Yes"
  ),

  stringsAsFactors = FALSE
)

# Print the checklist
cat("=" * 70, "\n")
cat("TRIPOD+AI Checklist - Capstone 1: CV Risk Prediction Model\n")
cat("=" * 70, "\n\n")

for (i in 1:nrow(checklist)) {
  cat("ITEM:", checklist$Item[i], "\n")
  cat("  Section:", checklist$Section[i], "\n")
  cat("  Applicable:", checklist$Applicable[i], "\n")
  cat("  How addressed:", trimws(checklist$How_Addressed[i]), "\n\n")
}

# Summary
cat("\n=== SUMMARY ===\n")
n_applicable <- sum(checklist$Applicable == "Yes")
n_partial <- sum(grepl("Partially", checklist$Applicable))
n_na <- sum(checklist$Applicable == "No (not ML)")

cat("Items fully addressed:", n_applicable, "\n")
cat("Items partially applicable:", n_partial, "\n")
cat("Items not applicable (ML-specific):", n_na, "\n")
cat("Total items:", nrow(checklist), "\n")

cat("\nNote: AI-specific items (AI-1 through AI-3) are not applicable because\n")
cat("this capstone uses logistic regression, not machine learning. If a\n")
cat("gradient boosted model or neural network were used instead, these\n")
cat("items would need to be addressed with details on feature engineering,\n")
cat("hyperparameter search strategy, and SHAP/LIME explanations.\n")

Show Python solution

# =============================================================================
# Chapter 19 - Exercise 4: TRIPOD+AI Checklist (Conceptual)
# Completed for Capstone 1: Cardiovascular Risk Prediction Model
# =============================================================================

# This exercise asks you to complete the TRIPOD+AI checklist for one of the
# capstone projects. Below is a completed checklist for Capstone 1:
# "Development and external validation of a cardiovascular risk prediction
# model using Framingham methodology, validated in NHANES data."

checklist = [
    {
        "item": "1. Title",
        "section": "Title page",
        "applicable": "Yes",
        "how": (
            "'Development and external validation of a logistic regression "
            "model for predicting 10-year cardiovascular disease risk: a "
            "Framingham-NHANES study.' Identifies study type (development + "
            "external validation) and modelling approach."
        )
    },
    {
        "item": "2. Abstract",
        "section": "Abstract",
        "applicable": "Yes",
        "how": (
            "Structured abstract with: objective, study design, data sources, "
            "outcome (10-year CVD), predictors, sample sizes, C-statistic and "
            "calibration results for both development and validation cohorts."
        )
    },
    {
        "item": "3a. Background/rationale",
        "section": "Introduction",
        "applicable": "Yes",
        "how": (
            "Existing CVD risk models (Framingham, QRISK, SCORE2) and their "
            "limitations. Gap: need for updated validation in contemporary "
            "US population."
        )
    },
    {
        "item": "3b. Objectives",
        "section": "Introduction",
        "applicable": "Yes",
        "how": (
            "To develop a 10-year CVD risk prediction model using Framingham "
            "data and externally validate it in NHANES 2017-2020."
        )
    },
    {
        "item": "4a. Source of data",
        "section": "Methods",
        "applicable": "Yes",
        "how": (
            "Development: Framingham Heart Study teaching dataset "
            "(riskCommunicator R package). Validation: NHANES 2017-2020 "
            "(nhanesA package). Both publicly available."
        )
    },
    {
        "item": "4b. Dates of study",
        "section": "Methods",
        "applicable": "Yes",
        "how": (
            "Framingham: original cohort with follow-up through 2005. "
            "NHANES: 2017-2020 survey cycles."
        )
    },
    {
        "item": "5a. Key eligibility criteria",
        "section": "Methods",
        "applicable": "Yes",
        "how": (
            "Adults aged 30-74, free of CVD at baseline, with complete data "
            "on key predictors. Exclude prior MI, stroke, heart failure."
        )
    },
    {
        "item": "5b. Treatments received",
        "section": "Methods",
        "applicable": "Partially",
        "how": (
            "Not applicable (prediction model, not treatment comparison). "
            "However, BP treatment status included as a predictor."
        )
    },
    {
        "item": "6a. Outcome definition",
        "section": "Methods",
        "applicable": "Yes",
        "how": (
            "10-year CVD event: composite of MI, coronary death, stroke, "
            "or heart failure requiring hospitalisation."
        )
    },
    {
        "item": "6b. Outcome timing",
        "section": "Methods",
        "applicable": "Yes",
        "how": (
            "10-year follow-up from baseline examination. Censored at "
            "death from non-CVD causes or loss to follow-up."
        )
    },
    {
        "item": "6c. Blinding of outcome",
        "section": "Methods",
        "applicable": "Partially",
        "how": (
            "Retrospective analysis. Framingham outcome adjudication was "
            "by committee blinded to risk factors."
        )
    },
    {
        "item": "7. Predictors",
        "section": "Methods",
        "applicable": "Yes",
        "how": (
            "Pre-specified: age, sex, total cholesterol, HDL cholesterol, "
            "systolic BP, BP treatment, smoking, diabetes. No data-driven "
            "selection."
        )
    },
    {
        "item": "8. Sample size",
        "section": "Methods",
        "applicable": "Yes",
        "how": (
            "Justified using Riley et al. (2020) criteria via pmsampsize: "
            "8 predictors, prevalence 10%, target R^2 0.15, requiring "
            "minimum ~800 events."
        )
    },
    {
        "item": "9. Missing data",
        "section": "Methods",
        "applicable": "Yes",
        "how": (
            "Development: <5% missing. Validation: up to 12%. Handled by "
            "MICE (50 datasets, 20 iterations). Complete-case sensitivity "
            "analysis performed."
        )
    },
    {
        "item": "10a. Model development",
        "section": "Methods",
        "applicable": "Yes",
        "how": (
            "Logistic regression for 10-year risk. Pre-specified predictors. "
            "Continuous predictors with restricted cubic splines."
        )
    },
    {
        "item": "10b. Model specification",
        "section": "Methods",
        "applicable": "Yes",
        "how": "Full model equation in supplementary materials."
    },
    {
        "item": "10c. Predictor selection",
        "section": "Methods",
        "applicable": "Yes",
        "how": (
            "All predictors pre-specified from clinical knowledge. "
            "No stepwise selection."
        )
    },
    {
        "item": "10d. Performance measures",
        "section": "Methods",
        "applicable": "Yes",
        "how": (
            "Discrimination: C-statistic with 95% CI. Calibration: "
            "calibration plot, calibration slope, calibration-in-the-large."
        )
    },
    {
        "item": "11. Risk groups",
        "section": "Methods",
        "applicable": "Yes",
        "how": (
            "Low (<5%), moderate (5-10%), high (10-20%), very high (>20%) "
            "10-year CVD risk categories."
        )
    },
    {
        "item": "12a. Internal validation",
        "section": "Methods",
        "applicable": "Yes",
        "how": (
            "500 bootstrap resamples for optimism-corrected C-statistic "
            "and calibration slope."
        )
    },
    {
        "item": "12b. External validation",
        "section": "Methods",
        "applicable": "Yes",
        "how": (
            "Model applied to NHANES without recalibration. C-statistic "
            "and calibration reported. Recalibration of intercept explored."
        )
    },
    {
        "item": "13. Fairness assessment (AI-specific)",
        "section": "Methods",
        "applicable": "Yes",
        "how": (
            "Performance reported by sex and race/ethnicity. Differences "
            ">0.05 C-statistic flagged."
        )
    },
    {
        "item": "14. Results: participants",
        "section": "Results",
        "applicable": "Yes",
        "how": (
            "Flow diagram with sample selection. Table 1 stratified by "
            "CVD event status."
        )
    },
    {
        "item": "15. Results: model development",
        "section": "Results",
        "applicable": "Yes",
        "how": (
            "Full regression table with coefficients, ORs, 95% CIs, "
            "and p-values."
        )
    },
    {
        "item": "16. Results: model performance",
        "section": "Results",
        "applicable": "Yes",
        "how": (
            "Table of discrimination and calibration metrics. Calibration "
            "plots and decision curve analysis figures."
        )
    },
    {
        "item": "17. Results: model updating",
        "section": "Results",
        "applicable": "Yes",
        "how": (
            "Recalibration of intercept in NHANES. Full model updating "
            "explored but not recommended."
        )
    },
    {
        "item": "18. Discussion: interpretation",
        "section": "Discussion",
        "applicable": "Yes",
        "how": (
            "Comparison with Framingham, QRISK3, PCE models. Clinical "
            "utility via decision curve analysis."
        )
    },
    {
        "item": "19. Discussion: limitations",
        "section": "Discussion",
        "applicable": "Yes",
        "how": (
            "Framingham may not represent diverse populations. NHANES "
            "validation limited by follow-up. Missing data assumptions."
        )
    },
    {
        "item": "20. Discussion: implications",
        "section": "Discussion",
        "applicable": "Yes",
        "how": (
            "Clinical implications for CVD risk assessment. Need for "
            "recalibration in non-US populations."
        )
    },
    {
        "item": "21. Code and data",
        "section": "Supplement",
        "applicable": "Yes",
        "how": (
            "Code on GitHub. Data publicly available via R packages. "
            "Reproducible via renv/conda."
        )
    },
    {
        "item": "AI-1. Data preprocessing",
        "section": "Methods/Supplement",
        "applicable": "No (logistic regression)",
        "how": (
            "Not applicable for logistic regression. Data cleaning and "
            "spline transformations described in supplement."
        )
    },
    {
        "item": "AI-2. Hyperparameter tuning",
        "section": "Methods/Supplement",
        "applicable": "No (logistic regression)",
        "how": (
            "Not applicable. Number of RCS knots pre-specified based on "
            "Harrell's recommendations."
        )
    },
    {
        "item": "AI-3. Model explainability",
        "section": "Results/Supplement",
        "applicable": "No (logistic regression)",
        "how": (
            "Not applicable (logistic regression is inherently "
            "interpretable). Nomogram provided."
        )
    },
    {
        "item": "AI-4. Software and hardware",
        "section": "Methods",
        "applicable": "Yes",
        "how": (
            "Python 3.11 with scikit-learn, lifelines, statsmodels, "
            "tableone. Full requirements.txt in repository."
        )
    },
]

# --- Print the checklist ---
print("=" * 70)
print("TRIPOD+AI Checklist")
print("Capstone 1: Cardiovascular Risk Prediction Model")
print("=" * 70)

for entry in checklist:
    print(f"\nITEM: {entry['item']}")
    print(f"  Section: {entry['section']}")
    print(f"  Applicable: {entry['applicable']}")
    print(f"  How addressed: {entry['how']}")

# --- Summary ---
n_yes = sum(1 for e in checklist if e['applicable'] == 'Yes')
n_partial = sum(1 for e in checklist if 'Partially' in e['applicable'])
n_no = sum(1 for e in checklist if e['applicable'].startswith('No'))

print(f"\n{'=' * 70}")
print("SUMMARY")
print(f"{'=' * 70}")
print(f"Items fully addressed:               {n_yes}")
print(f"Items partially applicable:          {n_partial}")
print(f"Items not applicable (ML-specific):  {n_no}")
print(f"Total items:                         {len(checklist)}")

print("""
Note on AI-specific items:
AI-specific items (AI-1 through AI-3) are not applicable because this
capstone uses logistic regression, not machine learning. If a gradient
boosted model or neural network were used, these items would require:
  - AI-1: Feature engineering pipeline, normalization, encoding
  - AI-2: Hyperparameter search strategy (grid, random, Bayesian)
  - AI-3: SHAP values, partial dependence plots, or LIME explanations

The TRIPOD+AI checklist is available from the EQUATOR Network:
https://www.equator-network.org/reporting-guidelines/tripod-ai/

Key points for completing TRIPOD+AI:
1. NEVER report AUROC/C-statistic without calibration
2. Always include a calibration plot
3. Fairness assessment across demographic subgroups is now required
4. Code and data availability statements are mandatory
5. For AI/ML models, explainability methods must be described
""")

--- title: "Exercise Solutions" --- # Exercise Solutions {#sec-solutions .unnumbered} This appendix collects solutions for all exercises in the course. Each solution is provided in both R and Python. Use them to check your work after attempting the exercises yourself. ::: {.callout-warning} ## Try First These solutions are most valuable after you have made a genuine attempt at each exercise. Resist the temptation to read the solution before trying. ::: The solution files are also available in the course repository under `solutions/R/` and `solutions/python/`. ## Chapter 04: Continuous Variables and Splines ### Exercise 1 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch04-ex1-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 4, Exercise 1: Recognising the Problem # Categorisation of systolic blood pressure and stroke risk # ============================================================================= # This exercise is primarily conceptual. Answers are provided as comments, # with supporting code to illustrate the points. library(ggplot2) # --- Question (a) --- # What assumptions does categorising SBP into Normal (<120), Elevated (120-129), # Stage 1 HTN (130-139), Stage 2 HTN (>=140) impose? # # ANSWER: # 1. Within each category, the relationship between SBP and stroke risk is FLAT. # A patient with SBP=121 is treated identically to SBP=129. # 2. At category boundaries, there is a SUDDEN JUMP in risk. # SBP=129 and SBP=130 are treated as fundamentally different despite only # 1 mmHg difference. # 3. The cut-points (120, 130, 140) are assumed to be biologically meaningful # boundaries, which may not reflect the true continuous biology. # --- Question (b) --- # A patient with SBP 131 is classified the same as SBP 139. Why is this problematic? # # ANSWER: # The difference of 8 mmHg (131 vs 139) is clinically substantial. In reality, # stroke risk increases with higher SBP. By lumping these patients together, # we lose the ability to distinguish their different risk levels. The patient # at 139 mmHg (close to Stage 2) likely has meaningfully higher stroke risk # than the patient at 131 mmHg, but the categorised model assigns them # identical predicted risk. This information loss reduces statistical power # and can bias effect estimates. # --- Question (c) --- # Suggest a better modelling approach. # # ANSWER: # Use a restricted cubic spline (RCS) with 3-5 knots to model SBP as a # continuous predictor. This: # - Preserves the full information in SBP # - Allows the relationship to be non-linear (capturing any steepening # at higher pressures) # - Constrains the curve to be linear in the tails (sensible extrapolation) # - Uses only k-1 degrees of freedom for k knots (parsimonious) # --- Supporting demonstration --- # Simulate data to visually compare categorised vs spline approaches library(rms) set.seed(2025) n <- 1000 sbp <- rnorm(n, mean = 130, sd = 15) sbp <- pmax(sbp, 90) sbp <- pmin(sbp, 200) # True relationship: risk accelerates at higher SBP logit_p <- -6 + 0.02 * sbp + 0.0002 * (sbp - 130)^2 y <- rbinom(n, 1, plogis(logit_p)) sim_data <- data.frame(sbp = sbp, stroke = y) # Categorise as described in the exercise sim_data$sbp_cat <- cut(sim_data$sbp, breaks = c(-Inf, 120, 130, 140, Inf), labels = c("Normal", "Elevated", "Stage1", "Stage2")) # Set up datadist for rms dd <- datadist(sim_data) options(datadist = "dd") # Fit three models fit_cat <- lrm(stroke ~ sbp_cat, data = sim_data) fit_linear <- lrm(stroke ~ sbp, data = sim_data) fit_rcs <- lrm(stroke ~ rcs(sbp, 4), data = sim_data) # Compare AIC cat("AIC Comparison:\n") cat(" Categorised:", AIC(fit_cat), "\n") cat(" Linear: ", AIC(fit_linear), "\n") cat(" RCS (4 knots):", AIC(fit_rcs), "\n") cat("\nLower AIC = better fit. The RCS model captures the true non-linear\n") cat("relationship without imposing arbitrary cut-points.\n") # Plot comparison sbp_seq <- seq(min(sim_data$sbp), max(sim_data$sbp), length.out = 200) pred_rcs <- Predict(fit_rcs, sbp = sbp_seq) pred_linear_vals <- predict(fit_linear, newdata = data.frame(sbp = sbp_seq), type = "fitted") # True probability true_logit <- -6 + 0.02 * sbp_seq + 0.0002 * (sbp_seq - 130)^2 true_prob <- plogis(true_logit) plot_df <- data.frame( sbp = rep(sbp_seq, 3), probability = c(true_prob, pred_linear_vals, plogis(as.data.frame(pred_rcs)$yhat)), Model = rep(c("True relationship", "Linear", "RCS (4 knots)"), each = length(sbp_seq)) ) p <- ggplot(plot_df, aes(x = sbp, y = probability, colour = Model)) + geom_line(linewidth = 1.1) + scale_colour_manual(values = c("True relationship" = "black", "Linear" = "#3498db", "RCS (4 knots)" = "#e74c3c")) + labs(x = "Systolic Blood Pressure (mmHg)", y = "Predicted Probability of Stroke", title = "Comparing modelling approaches for SBP-stroke relationship") + theme_minimal(base_size = 14) + theme(legend.position = "bottom") print(p) ``` #### Python ```{python} #| label: sol-ch04-ex1-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 4, Exercise 1: Recognising the Problem # Categorisation of systolic blood pressure and stroke risk # ============================================================================= # This exercise is primarily conceptual. Answers are provided as comments, # with supporting code to illustrate the points. import numpy as np import pandas as pd import matplotlib.pyplot as plt import statsmodels.api as sm from patsy import dmatrix from scipy.special import expit # --- Question (a) --- # What assumptions does categorising SBP into Normal (<120), Elevated (120-129), # Stage 1 HTN (130-139), Stage 2 HTN (>=140) impose? # # ANSWER: # 1. Within each category, the relationship between SBP and stroke risk is FLAT. # A patient with SBP=121 is treated identically to SBP=129. # 2. At category boundaries, there is a SUDDEN JUMP in risk. # SBP=129 and SBP=130 are treated as fundamentally different despite only # 1 mmHg difference. # 3. The cut-points (120, 130, 140) are assumed to be biologically meaningful # boundaries, which may not reflect the true continuous biology. # --- Question (b) --- # A patient with SBP 131 is classified the same as SBP 139. Why is this problematic? # # ANSWER: # The difference of 8 mmHg (131 vs 139) is clinically substantial. In reality, # stroke risk increases with higher SBP. By lumping these patients together, # we lose the ability to distinguish their different risk levels. The patient # at 139 mmHg (close to Stage 2) likely has meaningfully higher stroke risk # than the patient at 131 mmHg, but the categorised model assigns them # identical predicted risk. This information loss reduces statistical power # and can bias effect estimates. # --- Question (c) --- # Suggest a better modelling approach. # # ANSWER: # Use a restricted cubic spline (RCS) with 3-5 knots to model SBP as a # continuous predictor. This: # - Preserves the full information in SBP # - Allows the relationship to be non-linear (capturing any steepening # at higher pressures) # - Constrains the curve to be linear in the tails (sensible extrapolation) # - Uses only k-1 degrees of freedom for k knots (parsimonious) # --- Supporting demonstration --- # Simulate data to visually compare categorised vs spline approaches np.random.seed(2025) n = 1000 sbp = np.random.normal(130, 15, size=n) sbp = np.clip(sbp, 90, 200) # True relationship: risk accelerates at higher SBP logit_p = -6 + 0.02 * sbp + 0.0002 * (sbp - 130)**2 y = np.random.binomial(1, expit(logit_p)) df = pd.DataFrame({'sbp': sbp, 'stroke': y}) # --- Model 1: Categorised --- df['sbp_cat'] = pd.cut(df['sbp'], bins=[-np.inf, 120, 130, 140, np.inf], labels=['Normal', 'Elevated', 'Stage1', 'Stage2']) X_cat = pd.get_dummies(df['sbp_cat'], drop_first=True).astype(float) X_cat = sm.add_constant(X_cat) fit_cat = sm.GLM(df['stroke'], X_cat, family=sm.families.Binomial()).fit() # --- Model 2: Linear --- X_lin = sm.add_constant(df[['sbp']]) fit_lin = sm.GLM(df['stroke'], X_lin, family=sm.families.Binomial()).fit() # --- Model 3: RCS (natural cubic spline with df=3, i.e., 4 knots) --- X_rcs = dmatrix("cr(sbp, df=3)", data=df, return_type='dataframe') fit_rcs = sm.GLM(df['stroke'], X_rcs, family=sm.families.Binomial()).fit() # Compare AIC print("AIC Comparison:") print(f" Categorised: {fit_cat.aic:.1f}") print(f" Linear: {fit_lin.aic:.1f}") print(f" RCS (4 knots): {fit_rcs.aic:.1f}") print("\nLower AIC = better fit. The RCS model captures the true non-linear") print("relationship without imposing arbitrary cut-points.") # --- Plot comparison --- sbp_range = np.linspace(df['sbp'].min(), df['sbp'].max(), 200) # True probability true_logit = -6 + 0.02 * sbp_range + 0.0002 * (sbp_range - 130)**2 true_prob = expit(true_logit) # Linear predictions X_lin_pred = sm.add_constant(pd.DataFrame({'sbp': sbp_range})) lin_prob = fit_lin.predict(X_lin_pred) # RCS predictions X_rcs_pred = dmatrix("cr(sbp, df=3)", data=pd.DataFrame({'sbp': sbp_range}), return_type='dataframe') rcs_prob = fit_rcs.predict(X_rcs_pred) plt.figure(figsize=(10, 6)) plt.plot(sbp_range, true_prob, 'k--', linewidth=2, label='True relationship') plt.plot(sbp_range, lin_prob, color='#3498db', linewidth=1.5, label='Linear') plt.plot(sbp_range, rcs_prob, color='#e74c3c', linewidth=1.5, label='RCS (4 knots)') plt.xlabel('Systolic Blood Pressure (mmHg)') plt.ylabel('Predicted Probability of Stroke') plt.title('Comparing modelling approaches for SBP-stroke relationship') plt.legend() plt.tight_layout() plt.show() ``` ::: ### Exercise 2 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch04-ex2-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 4, Exercise 2: Fitting and Interpreting Spline Models # PBC dataset: bilirubin and transplant status # ============================================================================= library(rms) library(ggplot2) # --- Load the PBC data --- data(pbc, package = "survival") pbc <- pbc[!is.na(pbc$trt), ] # Remove incomplete cases # Create a binary outcome: transplanted (status == 1) vs not pbc$transplant <- as.numeric(pbc$status == 1) # Set up datadist (required by rms) dd <- datadist(pbc) options(datadist = "dd") # --- Task 1: Fit logistic regression with bilirubin as a linear term --- fit_linear <- lrm(transplant ~ bili, data = pbc) cat("=== Linear Model ===\n") print(fit_linear) # --- Task 2: Fit logistic regression with bilirubin using RCS (4 knots) --- fit_rcs <- lrm(transplant ~ rcs(bili, 4), data = pbc) cat("\n=== RCS Model (4 knots) ===\n") print(fit_rcs) # --- Task 3: Test for non-linearity --- cat("\n=== ANOVA for RCS model (test of non-linearity) ===\n") print(anova(fit_rcs)) # The ANOVA output shows: # - Total effect of bili (tests whether bili has ANY association) # - Nonlinear component (tests whether the relationship departs from linear) # If the nonlinear p-value is significant, a linear term is insufficient. # --- Task 4: Compare AIC --- cat("\n=== AIC Comparison ===\n") cat("Linear model AIC:", AIC(fit_linear), "\n") cat("RCS model AIC: ", AIC(fit_rcs), "\n") cat("Lower AIC = better fit (penalised for complexity)\n") # --- Task 5: Plot the relationship --- p <- Predict(fit_rcs, bili) pred_df <- as.data.frame(p) plot_rcs <- ggplot(pred_df, aes(x = bili, y = yhat)) + geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2, fill = "#3498db") + geom_line(linewidth = 1.2, colour = "#2c3e50") + geom_hline(yintercept = 0, linetype = "dashed", colour = "grey50") + labs(x = "Serum Bilirubin (mg/dL)", y = "Log-Odds of Transplant", title = "Non-linear effect of bilirubin on transplant (RCS, 4 knots)") + theme_minimal(base_size = 14) print(plot_rcs) # --- Task 6: Interpretation --- cat("\n=== Interpretation ===\n") cat("The plot shows the estimated log-odds of transplant as a function of\n") cat("serum bilirubin. If the curve is not a straight line, the relationship\n") cat("is non-linear, and a linear term would be insufficient.\n") cat("\nThe ANOVA non-linearity test (above) formally tests this. If the\n") cat("non-linear p-value is < 0.05, there is evidence against linearity.\n") cat("\nCompare the AIC values: a lower AIC for the RCS model confirms\n") cat("that the additional flexibility improves fit beyond what a linear\n") cat("term provides.\n") ``` #### Python ```{python} #| label: sol-ch04-ex2-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 4, Exercise 2: Fitting and Interpreting Spline Models # PBC dataset: bilirubin and transplant status # ============================================================================= import numpy as np import pandas as pd import matplotlib.pyplot as plt import statsmodels.api as sm from patsy import dmatrix # --- Simulate data similar to PBC bilirubin-transplant relationship --- # (Following the chapter's Python approach since PBC is an R-native dataset) np.random.seed(42) n = 300 bili = np.random.exponential(scale=2, size=n) bili = np.clip(bili, 0.3, 25) logit_p = -3 + 0.1 * bili + 0.005 * bili**2 # Non-linear true relationship prob = 1 / (1 + np.exp(-logit_p)) transplant = np.random.binomial(1, prob) df = pd.DataFrame({'bili': bili, 'transplant': transplant}) # --- Task 1: Fit logistic regression with linear bilirubin --- X_linear = sm.add_constant(df[['bili']]) fit_linear = sm.GLM(df['transplant'], X_linear, family=sm.families.Binomial()).fit() print("=== Linear Model ===") print(fit_linear.summary()) # --- Task 2: Create spline basis and fit logistic regression with splines --- # cr() creates a natural cubic spline (restricted cubic spline) # df=3 means 4 knots (df = number of knots - 1) spline_basis = dmatrix("cr(bili, df=3)", data=df, return_type='dataframe') fit_spline = sm.GLM(df['transplant'], spline_basis, family=sm.families.Binomial()).fit() print("\n=== Spline Model (RCS, 4 knots) ===") print(fit_spline.summary()) # --- Task 3: Compare AIC --- print(f"\n=== AIC Comparison ===") print(f"Linear model AIC: {fit_linear.aic:.1f}") print(f"Spline model AIC: {fit_spline.aic:.1f}") print("Lower AIC = better fit (penalised for complexity)") # --- Task 4: Test for non-linearity --- # Compare the linear model to the spline model using a likelihood ratio test. # The difference in deviance follows a chi-squared distribution with # df = (spline params - linear params) degrees of freedom. from scipy import stats lr_stat = fit_linear.deviance - fit_spline.deviance df_diff = fit_linear.df_model - fit_spline.df_model # difference in parameters # Note: df_model counts differently in statsmodels; compute manually n_params_linear = len(fit_linear.params) n_params_spline = len(fit_spline.params) df_diff = n_params_spline - n_params_linear p_value = 1 - stats.chi2.cdf(lr_stat, df_diff) print(f"\n=== Likelihood Ratio Test for Non-linearity ===") print(f"LR statistic: {lr_stat:.3f}") print(f"Degrees of freedom: {df_diff}") print(f"p-value: {p_value:.4f}") if p_value < 0.05: print("Result: Significant non-linearity detected. The spline model fits better.") else: print("Result: No significant non-linearity. A linear term may suffice.") # --- Task 5: Plot the predicted probabilities across bilirubin range --- bili_range = np.linspace(df['bili'].min(), df['bili'].max(), 200) # Linear predictions X_lin_pred = sm.add_constant(pd.DataFrame({'bili': bili_range})) lin_prob = fit_linear.predict(X_lin_pred) # Spline predictions spline_pred = dmatrix("cr(bili, df=3)", data=pd.DataFrame({'bili': bili_range}), return_type='dataframe') rcs_prob = fit_spline.predict(spline_pred) plt.figure(figsize=(10, 6)) plt.plot(bili_range, lin_prob, color='#3498db', linewidth=1.5, label='Linear model') plt.plot(bili_range, rcs_prob, color='#2c3e50', linewidth=2, label='RCS model (4 knots)') plt.scatter(df['bili'], df['transplant'], alpha=0.15, color='grey', s=20) plt.xlabel('Serum Bilirubin (mg/dL)') plt.ylabel('Predicted Probability of Transplant') plt.title('Non-linear effect of bilirubin on transplant (RCS)') plt.legend() plt.tight_layout() plt.show() # --- Task 6: Interpretation --- print("\n=== Interpretation ===") print("The plot shows the predicted probability of transplant as a function of") print("serum bilirubin, comparing the linear and RCS models.") print("If the RCS curve departs notably from a straight line, the relationship") print("is non-linear and the linear model is insufficient.") print("The AIC comparison and likelihood ratio test formally assess this.") print("The RCS model captures the accelerating risk at higher bilirubin levels") print("that the linear model cannot represent.") ``` ::: ### Exercise 3 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch04-ex3-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 4, Exercise 3: Categorisation vs Splines Head-to-Head # Simulate U-shaped relationship and compare approaches # ============================================================================= library(rms) library(ggplot2) set.seed(2025) n <- 1000 x <- rnorm(n, mean = 50, sd = 10) # True U-shaped relationship logit_p <- -5 + 0.004 * (x - 50)^2 y <- rbinom(n, 1, plogis(logit_p)) sim_data <- data.frame(x = x, y = y) dd <- datadist(sim_data) options(datadist = "dd") # --- Model 1: Categorise into quartiles --- sim_data$x_cat <- cut(sim_data$x, breaks = quantile(sim_data$x, c(0, 0.25, 0.5, 0.75, 1)), include.lowest = TRUE) fit_cat <- lrm(y ~ x_cat, data = sim_data) # --- Model 2: Linear term --- fit_lin <- lrm(y ~ x, data = sim_data) # --- Model 3: RCS with 5 knots --- fit_rcs <- lrm(y ~ rcs(x, 5), data = sim_data) # --- Question (a): Compare AIC --- cat("=== Question (a): AIC Comparison ===\n") cat("Categorised (quartiles):", AIC(fit_cat), "\n") cat("Linear: ", AIC(fit_lin), "\n") cat("RCS (5 knots): ", AIC(fit_rcs), "\n") cat("\nThe RCS model should have the lowest AIC, indicating best fit.\n") cat("The linear model has the worst fit because it cannot capture\n") cat("the U-shape at all.\n") # --- Question (b): Plot all three fitted curves with the true function --- x_seq <- seq(min(sim_data$x), max(sim_data$x), length.out = 200) # True curve true_logit <- -5 + 0.004 * (x_seq - 50)^2 true_prob <- plogis(true_logit) # RCS predictions pred_rcs <- Predict(fit_rcs, x = x_seq) rcs_prob <- plogis(as.data.frame(pred_rcs)$yhat) # Linear predictions pred_linear <- predict(fit_linear <- lrm(y ~ x, data = sim_data), newdata = data.frame(x = x_seq), type = "fitted") # Categorised predictions (need to assign each x_seq value to a category) x_seq_cat <- cut(x_seq, breaks = quantile(sim_data$x, c(0, 0.25, 0.5, 0.75, 1)), include.lowest = TRUE) pred_cat <- predict(fit_cat, newdata = data.frame(x_cat = x_seq_cat), type = "fitted") plot_df <- data.frame( x = rep(x_seq, 4), probability = c(true_prob, pred_linear, pred_cat, rcs_prob), Model = rep(c("True relationship", "Linear", "Categorised (quartiles)", "RCS (5 knots)"), each = length(x_seq)) ) p <- ggplot(plot_df, aes(x = x, y = probability, colour = Model, linetype = Model)) + geom_line(linewidth = 1.1) + scale_colour_manual(values = c("True relationship" = "black", "Linear" = "#3498db", "Categorised (quartiles)" = "#e74c3c", "RCS (5 knots)" = "#27ae60")) + scale_linetype_manual(values = c("True relationship" = "dashed", "Linear" = "solid", "Categorised (quartiles)" = "solid", "RCS (5 knots)" = "solid")) + labs(x = "Predictor (x)", y = "Predicted Probability", title = "Comparing modelling strategies for a U-shaped relationship") + theme_minimal(base_size = 14) + theme(legend.position = "bottom") print(p) cat("\n=== Question (b): Interpretation ===\n") cat("The RCS model best recovers the true U-shaped relationship.\n") cat("The linear model completely misses the U-shape (it fits a flat or\n") cat("slightly sloped line). The categorised model captures some of the\n") cat("pattern but in crude steps, losing precision.\n") # --- Question (c): Change quartiles to tertiles --- cat("\n=== Question (c): Tertile categorisation ===\n") sim_data$x_tert <- cut(sim_data$x, breaks = quantile(sim_data$x, c(0, 1/3, 2/3, 1)), include.lowest = TRUE) fit_tert <- lrm(y ~ x_tert, data = sim_data) cat("AIC with tertiles: ", AIC(fit_tert), "\n") cat("AIC with quartiles:", AIC(fit_cat), "\n") cat("AIC with RCS: ", AIC(fit_rcs), "\n") cat("\nChanging from quartiles to tertiles changes the categorised model's\n") cat("estimates. The AIC may differ, and the estimated effects will change\n") cat("because different patients are grouped together. This demonstrates\n") cat("that categorisation results are ARBITRARY and depend on cut-point\n") cat("choice. The RCS model does not have this problem because it models\n") cat("the continuous relationship directly.\n") # Test for non-linearity in the RCS model cat("\n=== Non-linearity test for RCS model ===\n") print(anova(fit_rcs)) ``` #### Python ```{python} #| label: sol-ch04-ex3-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 4, Exercise 3: Categorisation vs Splines Head-to-Head # Simulate U-shaped relationship and compare approaches # ============================================================================= import numpy as np import pandas as pd import matplotlib.pyplot as plt import statsmodels.api as sm from patsy import dmatrix from scipy.special import expit np.random.seed(2025) n = 1000 x = np.random.normal(50, 10, size=n) # True U-shaped relationship logit_p = -5 + 0.004 * (x - 50)**2 y = np.random.binomial(1, expit(logit_p)) df = pd.DataFrame({'x': x, 'y': y}) # --- Model 1: Categorise into quartiles --- df['x_cat'] = pd.qcut(df['x'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4']) X_cat = pd.get_dummies(df['x_cat'], drop_first=True).astype(float) X_cat = sm.add_constant(X_cat) fit_cat = sm.GLM(df['y'], X_cat, family=sm.families.Binomial()).fit() # --- Model 2: Linear --- X_lin = sm.add_constant(df[['x']]) fit_lin = sm.GLM(df['y'], X_lin, family=sm.families.Binomial()).fit() # --- Model 3: RCS with df=4 (5 knots) --- X_rcs = dmatrix("cr(x, df=4)", data=df, return_type='dataframe') fit_rcs = sm.GLM(df['y'], X_rcs, family=sm.families.Binomial()).fit() # --- Question (a): Compare AIC --- print("=== Question (a): AIC Comparison ===") print(f"Categorised (quartiles): {fit_cat.aic:.1f}") print(f"Linear: {fit_lin.aic:.1f}") print(f"RCS (5 knots): {fit_rcs.aic:.1f}") print("\nThe RCS model should have the lowest AIC, indicating best fit.") print("The linear model has the worst fit because it cannot capture") print("the U-shape at all.\n") # --- Question (b): Plot all three fitted curves with the true function --- x_range = np.linspace(df['x'].min(), df['x'].max(), 200) # True curve true_logit = -5 + 0.004 * (x_range - 50)**2 true_prob = expit(true_logit) # RCS predictions X_rcs_pred = dmatrix("cr(x, df=4)", data=pd.DataFrame({'x': x_range}), return_type='dataframe') rcs_prob = fit_rcs.predict(X_rcs_pred) # Linear predictions X_lin_pred = sm.add_constant(pd.DataFrame({'x': x_range})) lin_prob = fit_lin.predict(X_lin_pred) # Categorised predictions # Assign each x_range value to its quartile, then predict quartile_edges = df['x'].quantile([0, 0.25, 0.5, 0.75, 1.0]).values quartile_edges[0] = -np.inf quartile_edges[-1] = np.inf x_cat_pred = pd.cut(x_range, bins=quartile_edges, labels=['Q1', 'Q2', 'Q3', 'Q4']) X_cat_pred = pd.get_dummies(x_cat_pred, drop_first=True).astype(float) X_cat_pred = sm.add_constant(X_cat_pred) # Ensure columns match for col in X_cat.columns: if col not in X_cat_pred.columns: X_cat_pred[col] = 0.0 X_cat_pred = X_cat_pred[X_cat.columns] cat_prob = fit_cat.predict(X_cat_pred) plt.figure(figsize=(10, 6)) plt.plot(x_range, true_prob, 'k--', linewidth=2, label='True relationship') plt.plot(x_range, rcs_prob, color='#27ae60', linewidth=2, label='RCS (5 knots)') plt.plot(x_range, cat_prob, color='#e74c3c', linewidth=1.5, label='Categorised (quartiles)') plt.plot(x_range, lin_prob, color='#3498db', linewidth=1.5, label='Linear') plt.xlabel('Predictor (x)') plt.ylabel('Predicted Probability') plt.title('Comparing modelling strategies for a U-shaped relationship') plt.legend() plt.tight_layout() plt.show() print("=== Question (b): Interpretation ===") print("The RCS model best recovers the true U-shaped relationship.") print("The linear model completely misses the U-shape.") print("The categorised model captures some pattern but in crude steps.\n") # --- Question (c): Change quartiles to tertiles --- print("=== Question (c): Tertile categorisation ===") df['x_tert'] = pd.qcut(df['x'], q=3, labels=['T1', 'T2', 'T3']) X_tert = pd.get_dummies(df['x_tert'], drop_first=True).astype(float) X_tert = sm.add_constant(X_tert) fit_tert = sm.GLM(df['y'], X_tert, family=sm.families.Binomial()).fit() print(f"AIC with tertiles: {fit_tert.aic:.1f}") print(f"AIC with quartiles: {fit_cat.aic:.1f}") print(f"AIC with RCS: {fit_rcs.aic:.1f}") print("\nChanging from quartiles to tertiles changes the categorised model's") print("estimates. This demonstrates that categorisation results are ARBITRARY") print("and depend on cut-point choice. The RCS model does not have this") print("problem because it models the continuous relationship directly.") # --- Likelihood ratio test for non-linearity --- from scipy import stats lr_stat = fit_lin.deviance - fit_rcs.deviance df_diff = len(fit_rcs.params) - len(fit_lin.params) p_value = 1 - stats.chi2.cdf(lr_stat, df_diff) print(f"\n=== Non-linearity test ===") print(f"LR statistic: {lr_stat:.3f}, df: {df_diff}, p-value: {p_value:.6f}") if p_value < 0.05: print("Significant non-linearity: the U-shaped relationship is confirmed.") ``` ::: ## Chapter 05: Penalised Regression ### Exercise 1 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch05-ex1-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 5, Exercise 1: Conceptual Questions # Penalised Regression: Ridge, LASSO, Elastic Net # ============================================================================= # This exercise is conceptual. Answers are provided as comments. # --- Question (a) --- # A colleague says: "I used LASSO and it removed age from my model predicting # mortality. This means age is not a risk factor for death." # Explain why this interpretation is incorrect. # # ANSWER: # The LASSO removing age does NOT mean age is unimportant. It means that, # given the OTHER predictors in the model, age does not provide enough # ADDITIONAL predictive information to justify its inclusion under the # penalty. Possible reasons why LASSO dropped age: # # 1. Age may be highly CORRELATED with another retained variable (e.g., # number of comorbidities, renal function). The LASSO tends to pick # one of a group of correlated predictors and drop the rest arbitrarily. # # 2. The sample size may be too small to detect age's effect given the # penalty strength. # # 3. The LASSO performs variable selection based on PREDICTION, not # CAUSAL importance. A variable can be a genuine risk factor but # redundant for prediction when other variables are present. # # The correct interpretation is: "Age was not selected by the LASSO model, # possibly because its predictive information overlaps with other retained # variables. This does not imply age is not a risk factor for death." # --- Question (b) --- # Ridge CV-RMSE = 2.1 days. Standard linear regression RMSE = 1.8 days # (on the same training data). Colleague says standard model is better. # What is wrong? # # ANSWER: # The comparison is unfair because: # # 1. The Ridge RMSE of 2.1 is CROSS-VALIDATED -- it estimates performance # on UNSEEN data by training and testing on different folds. # # 2. The standard regression RMSE of 1.8 is likely computed on the SAME # training data used to fit the model (apparent performance). This is # OVERLY OPTIMISTIC because the model has memorised the training data's # noise. # # 3. If you computed a cross-validated RMSE for the standard regression, # it would likely be HIGHER than 2.1 (worse than Ridge), especially # with 20 predictors. # # The correct comparison requires evaluating both models using the SAME # methodology -- either both cross-validated, or both on a held-out test # set. Never compare a cross-validated metric to a training metric. # --- Question (c) --- # Explain in one sentence why the LASSO produces exact zeros but Ridge does not. # # ANSWER: # The L1 penalty (LASSO) constrains coefficients to a diamond-shaped region # whose corners lie on the axes, so the loss function contours are much more # likely to first touch the constraint region at a corner (where one or more # coefficients are exactly zero) than the smooth circular boundary of the L2 # penalty (Ridge), which has no corners. # --- Supporting code to illustrate the geometry --- library(ggplot2) theta <- seq(0, 2 * pi, length.out = 200) # L2 constraint (circle) l2_x <- cos(theta) l2_y <- sin(theta) # L1 constraint (diamond) l1_x <- c(1, 0, -1, 0, 1) l1_y <- c(0, 1, 0, -1, 0) p <- ggplot() + # L2 constraint region geom_polygon(data = data.frame(x = l2_x, y = l2_y), aes(x, y), fill = "#3498db", alpha = 0.15) + geom_path(data = data.frame(x = l2_x, y = l2_y), aes(x, y), colour = "#3498db", linewidth = 1.2) + # L1 constraint region geom_polygon(data = data.frame(x = l1_x, y = l1_y), aes(x, y), fill = "#e74c3c", alpha = 0.15) + geom_path(data = data.frame(x = l1_x, y = l1_y), aes(x, y), colour = "#e74c3c", linewidth = 1.2) + # Labels annotate("text", x = 0.7, y = 0.7, label = "L2 (Ridge)", colour = "#3498db", size = 4) + annotate("text", x = -0.7, y = -0.3, label = "L1 (LASSO)", colour = "#e74c3c", size = 4) + annotate("text", x = 0, y = -1.3, label = "The diamond has corners on the axes\nwhere coefficients = 0", size = 3.5) + coord_fixed() + labs(x = expression(beta[1]), y = expression(beta[2]), title = "L1 vs L2 constraint geometry") + theme_minimal(base_size = 14) + xlim(-1.5, 1.5) + ylim(-1.5, 1.0) print(p) cat("\nThe L1 diamond has corners at the axes, making it more likely that\n") cat("the optimal solution lies at a corner (coefficient = 0), producing\n") cat("exact sparsity. The L2 circle has no corners, so coefficients are\n") cat("shrunk toward zero but never reach it exactly.\n") ``` #### Python ```{python} #| label: sol-ch05-ex1-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 5, Exercise 1: Conceptual Questions # Penalised Regression: Ridge, LASSO, Elastic Net # ============================================================================= # This exercise is conceptual. Answers are provided as comments. import numpy as np import matplotlib.pyplot as plt # --- Question (a) --- # A colleague says: "I used LASSO and it removed age from my model predicting # mortality. This means age is not a risk factor for death." # Explain why this interpretation is incorrect. # # ANSWER: # The LASSO removing age does NOT mean age is unimportant. It means that, # given the OTHER predictors in the model, age does not provide enough # ADDITIONAL predictive information to justify its inclusion under the # penalty. Possible reasons why LASSO dropped age: # # 1. Age may be highly CORRELATED with another retained variable (e.g., # number of comorbidities, renal function). The LASSO tends to pick # one of a group of correlated predictors and drop the rest arbitrarily. # # 2. The sample size may be too small to detect age's effect given the # penalty strength. # # 3. The LASSO performs variable selection based on PREDICTION, not # CAUSAL importance. A variable can be a genuine risk factor but # redundant for prediction when other variables are present. # # The correct interpretation is: "Age was not selected by the LASSO model, # possibly because its predictive information overlaps with other retained # variables. This does not imply age is not a risk factor for death." # --- Question (b) --- # Ridge CV-RMSE = 2.1 days. Standard linear regression RMSE = 1.8 days # (on the same training data). Colleague says standard model is better. # What is wrong? # # ANSWER: # The comparison is unfair because: # # 1. The Ridge RMSE of 2.1 is CROSS-VALIDATED -- it estimates performance # on UNSEEN data by training and testing on different folds. # # 2. The standard regression RMSE of 1.8 is likely computed on the SAME # training data used to fit the model (apparent performance). This is # OVERLY OPTIMISTIC because the model has memorised the training data's # noise. # # 3. If you computed a cross-validated RMSE for the standard regression, # it would likely be HIGHER than 2.1 (worse than Ridge), especially # with 20 predictors. # # The correct comparison requires evaluating both models using the SAME # methodology -- either both cross-validated, or both on a held-out test # set. Never compare a cross-validated metric to a training metric. # --- Question (c) --- # Explain in one sentence why the LASSO produces exact zeros but Ridge does not. # # ANSWER: # The L1 penalty (LASSO) constrains coefficients to a diamond-shaped region # whose corners lie on the axes, so the loss function contours are much more # likely to first touch the constraint region at a corner (where one or more # coefficients are exactly zero) than the smooth circular boundary of the L2 # penalty (Ridge), which has no corners. # --- Supporting code to illustrate the geometry --- theta = np.linspace(0, 2 * np.pi, 200) # L2 constraint (circle) l2_x = np.cos(theta) l2_y = np.sin(theta) # L1 constraint (diamond) l1_x = [1, 0, -1, 0, 1] l1_y = [0, 1, 0, -1, 0] fig, ax = plt.subplots(figsize=(7, 7)) # L2 region ax.fill(l2_x, l2_y, alpha=0.15, color='#3498db') ax.plot(l2_x, l2_y, color='#3498db', linewidth=1.5, label='L2 (Ridge)') # L1 region ax.fill(l1_x, l1_y, alpha=0.15, color='#e74c3c') ax.plot(l1_x, l1_y, color='#e74c3c', linewidth=1.5, label='L1 (LASSO)') # Mark the corners ax.plot([1, -1, 0, 0], [0, 0, 1, -1], 'ro', markersize=6) ax.annotate('Corner (exact zero for $\\beta_2$)', xy=(1, 0), fontsize=9, xytext=(1.1, 0.3), arrowprops=dict(arrowstyle='->', color='red')) ax.set_xlabel(r'$\beta_1$', fontsize=14) ax.set_ylabel(r'$\beta_2$', fontsize=14) ax.set_title('L1 vs L2 constraint geometry') ax.set_aspect('equal') ax.legend(fontsize=11) ax.set_xlim(-1.5, 1.8) ax.set_ylim(-1.5, 1.5) ax.axhline(0, color='grey', linewidth=0.5) ax.axvline(0, color='grey', linewidth=0.5) plt.tight_layout() plt.show() print("The L1 diamond has corners at the axes, making it more likely that") print("the optimal solution lies at a corner (coefficient = 0), producing") print("exact sparsity. The L2 circle has no corners, so coefficients are") print("shrunk toward zero but never reach it exactly.") ``` ::: ### Exercise 2 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch05-ex2-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 5, Exercise 2: Predicting Diabetes Onset # Pima Indians Diabetes dataset # ============================================================================= library(glmnet) library(mlbench) library(ggplot2) # --- Load data --- data(PimaIndiansDiabetes2, package = "mlbench") pima <- na.omit(PimaIndiansDiabetes2) cat("Complete cases:", nrow(pima), "\n\n") # Prepare data X <- as.matrix(pima[, 1:8]) y <- ifelse(pima$diabetes == "pos", 1, 0) # --- Task 1: Fit a LASSO model with 10-fold cross-validation --- set.seed(123) cv_lasso <- cv.glmnet(X, y, family = "binomial", alpha = 1, nfolds = 10) cat("=== LASSO Cross-Validation ===\n") cat("Lambda.min:", round(cv_lasso$lambda.min, 4), "\n") cat("Lambda.1se:", round(cv_lasso$lambda.1se, 4), "\n\n") # --- Task 2: Plot the cross-validation curve --- plot(cv_lasso, main = "LASSO Cross-Validation Curve") # --- Task 3: Which variables are selected at lambda.1se? --- cat("=== Variables selected at lambda.1se ===\n") coef_1se <- coef(cv_lasso, s = "lambda.1se") print(coef_1se) # Count non-zero coefficients (excluding intercept) n_selected_lasso <- sum(coef_1se[-1, ] != 0) cat("\nNumber of variables selected by LASSO at lambda.1se:", n_selected_lasso, "\n") # --- Task 4: Fit Ridge and Elastic Net (alpha = 0.5) --- set.seed(123) cv_ridge <- cv.glmnet(X, y, family = "binomial", alpha = 0, nfolds = 10) set.seed(123) cv_enet <- cv.glmnet(X, y, family = "binomial", alpha = 0.5, nfolds = 10) # --- Task 5: Compare the three methods --- # (a) How many variables does each select at lambda.1se? coef_ridge_1se <- coef(cv_ridge, s = "lambda.1se") coef_enet_1se <- coef(cv_enet, s = "lambda.1se") n_selected_ridge <- sum(abs(coef_ridge_1se[-1, ]) > 1e-6) n_selected_enet <- sum(abs(coef_enet_1se[-1, ]) > 1e-6) cat("\n=== Number of variables selected (lambda.1se) ===\n") cat("LASSO: ", n_selected_lasso, "of 8\n") cat("Elastic Net: ", n_selected_enet, "of 8\n") cat("Ridge: ", n_selected_ridge, "of 8 (Ridge never sets to exactly 0)\n") # (b) Cross-validated deviance for each cat("\n=== Cross-validated deviance (at lambda.min) ===\n") cat("LASSO: ", round(min(cv_lasso$cvm), 4), "\n") cat("Elastic Net: ", round(min(cv_enet$cvm), 4), "\n") cat("Ridge: ", round(min(cv_ridge$cvm), 4), "\n") # (c) Plot the regularisation path for the LASSO model fit_lasso <- glmnet(X, y, family = "binomial", alpha = 1) plot(fit_lasso, xvar = "lambda", label = TRUE, main = "LASSO Regularisation Path") abline(v = log(cv_lasso$lambda.min), lty = 2, col = "blue") abline(v = log(cv_lasso$lambda.1se), lty = 2, col = "red") legend("topright", legend = c("lambda.min", "lambda.1se"), col = c("blue", "red"), lty = 2, cex = 0.8) # --- Task 6: Recommendation --- cat("\n=== Recommendation ===\n") cat("For the Pima diabetes dataset (8 predictors, ~390 complete cases):\n") cat("- The number of predictors is small relative to sample size,\n") cat(" so penalisation is not strictly necessary.\n") cat("- LASSO is useful here for identifying the most predictive variables.\n") cat("- All three methods produce similar cross-validated deviance,\n") cat(" confirming that the choice of penalty type matters less than\n") cat(" using penalisation vs not.\n") cat("- For variable selection and a parsimonious model, LASSO at\n") cat(" lambda.1se is a good choice.\n") cat("- For best prediction, any of the three at lambda.min would work.\n") # --- Bonus: Compare coefficients side by side --- coef_comparison <- data.frame( Variable = colnames(X), LASSO = as.vector(coef(cv_lasso, s = "lambda.1se"))[-1], Ridge = as.vector(coef(cv_ridge, s = "lambda.1se"))[-1], ElasticNet = as.vector(coef(cv_enet, s = "lambda.1se"))[-1] ) cat("\n=== Coefficient comparison (lambda.1se) ===\n") print(round(coef_comparison, 4)) ``` #### Python ```{python} #| label: sol-ch05-ex2-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 5, Exercise 2: Predicting Diabetes Onset # Pima Indians Diabetes dataset # ============================================================================= import numpy as np import pandas as pd from sklearn.linear_model import LogisticRegression, LogisticRegressionCV from sklearn.preprocessing import StandardScaler from sklearn.model_selection import cross_val_score import matplotlib.pyplot as plt # --- Load the Pima Indians Diabetes dataset --- url = ("https://raw.githubusercontent.com/jbrownlee/Datasets/master/" "pima-indians-diabetes.data.csv") columns = ['pregnancies', 'glucose', 'blood_pressure', 'skin_thickness', 'insulin', 'bmi', 'diabetes_pedigree', 'age', 'outcome'] pima = pd.read_csv(url, names=columns) # Remove rows with zero values where zero is not meaningful cols_no_zero = ['glucose', 'blood_pressure', 'skin_thickness', 'insulin', 'bmi'] pima[cols_no_zero] = pima[cols_no_zero].replace(0, np.nan) pima = pima.dropna() print(f"Complete cases: {len(pima)}\n") X = pima[columns[:-1]].values y = pima['outcome'].values scaler = StandardScaler() X_scaled = scaler.fit_transform(X) # --- Task 1: Fit LASSO with cross-validation --- lasso_cv = LogisticRegressionCV( penalty='l1', solver='saga', Cs=50, cv=10, scoring='neg_log_loss', max_iter=10000, random_state=42 ) lasso_cv.fit(X_scaled, y) print("=== LASSO Cross-Validation ===") print(f"Best C (= 1/lambda): {lasso_cv.C_[0]:.4f}") print(f"Best lambda: {1/lasso_cv.C_[0]:.4f}\n") print("LASSO selected features:") n_selected_lasso = 0 for name, coef in zip(columns[:-1], lasso_cv.coef_[0]): if abs(coef) > 1e-6: print(f" {name}: {coef:.4f}") n_selected_lasso += 1 print(f"\nNumber selected: {n_selected_lasso} of 8\n") # --- Task 2: Fit Ridge with cross-validation --- ridge_cv = LogisticRegressionCV( penalty='l2', solver='lbfgs', Cs=50, cv=10, scoring='neg_log_loss', max_iter=10000, random_state=42 ) ridge_cv.fit(X_scaled, y) # --- Task 3: Fit Elastic Net with cross-validation --- enet_cv = LogisticRegressionCV( penalty='elasticnet', solver='saga', Cs=50, cv=10, l1_ratios=[0.25, 0.5, 0.75], scoring='neg_log_loss', max_iter=10000, random_state=42 ) enet_cv.fit(X_scaled, y) print(f"Elastic Net best l1_ratio (alpha): {enet_cv.l1_ratio_[0]:.2f}") print(f"Elastic Net best C (1/lambda): {enet_cv.C_[0]:.4f}\n") # --- Task 4: Compare cross-validated scores --- print("=== Cross-validated log-loss (lower = better) ===") for name, model in [('LASSO', lasso_cv), ('Ridge', ridge_cv), ('Elastic Net', enet_cv)]: scores = cross_val_score(model, X_scaled, y, cv=10, scoring='neg_log_loss') print(f" {name}: {-scores.mean():.4f} (+/- {scores.std():.4f})") # --- Task 5: Plot regularisation path for LASSO --- Cs = np.logspace(-2, 2, 80) coefs = [] for C in Cs: model = LogisticRegression(penalty='l1', C=C, solver='saga', max_iter=10000, random_state=42) model.fit(X_scaled, y) coefs.append(model.coef_[0]) coefs = np.array(coefs) lambdas = 1.0 / Cs plt.figure(figsize=(10, 6)) for j, name in enumerate(columns[:-1]): plt.plot(np.log10(lambdas), coefs[:, j], label=name) plt.axvline(x=np.log10(1/lasso_cv.C_[0]), color='k', linestyle='--', label='CV optimal lambda') plt.xlabel('log10(lambda)') plt.ylabel('Coefficient') plt.title('LASSO Regularisation Path - Pima Diabetes') plt.legend(loc='best', fontsize=8) plt.tight_layout() plt.show() # --- Task 6: Recommendation --- print("\n=== Recommendation ===") print("For the Pima diabetes dataset (8 predictors, ~390 complete cases):") print("- The number of predictors is small relative to sample size,") print(" so penalisation is not strictly necessary.") print("- LASSO is useful for identifying the most predictive variables.") print("- All three methods produce similar cross-validated performance.") print("- For variable selection, LASSO provides a parsimonious model.") print("- For best prediction, any method works well here.") # --- Bonus: Compare coefficients side by side --- print("\n=== Coefficient comparison ===") coef_df = pd.DataFrame({ 'Variable': columns[:-1], 'LASSO': lasso_cv.coef_[0], 'Ridge': ridge_cv.coef_[0], 'Elastic Net': enet_cv.coef_[0] }) print(coef_df.round(4).to_string(index=False)) ``` ::: ### Exercise 3 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch05-ex3-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 5, Exercise 3: The Effect of Correlation # How LASSO and Ridge handle correlated predictors differently # ============================================================================= library(glmnet) library(MASS) set.seed(2025) n <- 200 # --- Create 4 correlated predictors (correlation ~ 0.9) --- Sigma <- matrix(0.9, 4, 4) diag(Sigma) <- 1 X_corr <- mvrnorm(n, mu = rep(0, 4), Sigma = Sigma) # Add 6 independent predictors (noise) X_indep <- matrix(rnorm(n * 6), n, 6) X <- cbind(X_corr, X_indep) colnames(X) <- c(paste0("Corr_", 1:4), paste0("Noise_", 1:6)) # True model: all 4 correlated predictors have effect = 0.5 true_beta <- c(rep(0.5, 4), rep(0, 6)) y <- rbinom(n, 1, plogis(X %*% true_beta)) # --- Task 1: Fit LASSO and examine which variables are selected --- set.seed(42) cv_lasso <- cv.glmnet(X, y, family = "binomial", alpha = 1) cat("=== Task 1: LASSO coefficients (lambda.1se) ===\n") print(round(as.matrix(coef(cv_lasso, s = "lambda.1se")), 3)) # --- Task 2: Fit Ridge and examine coefficients --- set.seed(42) cv_ridge <- cv.glmnet(X, y, family = "binomial", alpha = 0) cat("\n=== Task 2: Ridge coefficients (lambda.1se) ===\n") print(round(as.matrix(coef(cv_ridge, s = "lambda.1se")), 3)) # --- Task 3: Fit Elastic Net (alpha = 0.5) --- set.seed(42) cv_enet <- cv.glmnet(X, y, family = "binomial", alpha = 0.5) cat("\n=== Task 3: Elastic Net coefficients (lambda.1se) ===\n") print(round(as.matrix(coef(cv_enet, s = "lambda.1se")), 3)) # --- Question (a) --- cat("\n=== Question (a): Does LASSO select all 4 correlated predictors? ===\n") lasso_coefs <- as.vector(coef(cv_lasso, s = "lambda.1se"))[-1] names(lasso_coefs) <- colnames(X) selected_lasso <- names(lasso_coefs[abs(lasso_coefs) > 1e-6]) cat("LASSO selected:", paste(selected_lasso, collapse = ", "), "\n") cat("LASSO typically selects only 1-2 of the 4 correlated predictors.\n") cat("This is because the LASSO arbitrarily picks one predictor from a\n") cat("correlated group and drops the rest. With correlation ~0.9, any\n") cat("single correlated predictor carries almost the same information\n") cat("as all four combined, so the LASSO keeps just one (or two) and\n") cat("sets the others to zero.\n") # --- Question (b) --- cat("\n=== Question (b): How does Ridge handle correlated predictors? ===\n") ridge_coefs <- as.vector(coef(cv_ridge, s = "lambda.1se"))[-1] names(ridge_coefs) <- colnames(X) cat("Ridge coefficients for correlated predictors:\n") print(round(ridge_coefs[1:4], 4)) cat("\nRidge distributes the effect EVENLY across all correlated predictors.\n") cat("All four correlated predictors retain non-zero (and similar) coefficients.\n") cat("This is the 'grouping effect' -- Ridge shrinks correlated predictors\n") cat("toward each other rather than picking one arbitrarily.\n") # --- Question (c) --- cat("\n=== Question (c): Does Elastic Net improve on LASSO? ===\n") enet_coefs <- as.vector(coef(cv_enet, s = "lambda.1se"))[-1] names(enet_coefs) <- colnames(X) selected_enet <- names(enet_coefs[abs(enet_coefs) > 1e-6]) cat("Elastic Net selected:", paste(selected_enet, collapse = ", "), "\n") cat("The Elastic Net typically selects MORE of the correlated predictors\n") cat("than LASSO, thanks to the L2 component (grouping effect).\n") cat("It provides a middle ground: some sparsity (like LASSO) but\n") cat("better handling of correlated groups (like Ridge).\n") # --- Question (d): Stability analysis --- cat("\n=== Question (d): Stability analysis (10 different seeds) ===\n") cat("LASSO variable selection across 10 bootstrap samples:\n") for (seed in 1:10) { set.seed(seed) idx <- sample(n, n, replace = TRUE) cv_boot <- cv.glmnet(X[idx, ], y[idx], family = "binomial", alpha = 1, nfolds = 5) boot_coefs <- as.vector(coef(cv_boot, s = "lambda.1se"))[-1] selected <- colnames(X)[abs(boot_coefs) > 1e-6] cat(sprintf(" Seed %2d: %s\n", seed, paste(selected, collapse = ", "))) } cat("\nRidge variable selection (all retained) across 10 bootstrap samples:\n") for (seed in 1:10) { set.seed(seed) idx <- sample(n, n, replace = TRUE) cv_boot <- cv.glmnet(X[idx, ], y[idx], family = "binomial", alpha = 0, nfolds = 5) boot_coefs <- as.vector(coef(cv_boot, s = "lambda.1se"))[-1] selected <- colnames(X)[abs(boot_coefs) > 0.01] cat(sprintf(" Seed %2d: %s\n", seed, paste(selected, collapse = ", "))) } cat("\nConclusion: LASSO variable selection is UNSTABLE for correlated\n") cat("predictors -- the selected predictor(s) change across samples.\n") cat("Ridge is stable because it always retains all predictors.\n") cat("Elastic Net offers a compromise with better stability than LASSO.\n") ``` #### Python ```{python} #| label: sol-ch05-ex3-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 5, Exercise 3: The Effect of Correlation # How LASSO and Ridge handle correlated predictors differently # ============================================================================= import numpy as np import pandas as pd from sklearn.linear_model import LogisticRegressionCV from sklearn.preprocessing import StandardScaler from scipy.special import expit np.random.seed(2025) n = 200 # --- Create 4 correlated predictors (correlation ~ 0.9) --- mean = np.zeros(4) cov = np.full((4, 4), 0.9) np.fill_diagonal(cov, 1.0) X_corr = np.random.multivariate_normal(mean, cov, size=n) # Add 6 independent noise predictors X_indep = np.random.randn(n, 6) X = np.hstack([X_corr, X_indep]) feature_names = ([f"Corr_{i+1}" for i in range(4)] + [f"Noise_{i+1}" for i in range(6)]) # True model: all 4 correlated predictors contribute true_beta = np.array([0.5]*4 + [0.0]*6) y = np.random.binomial(1, expit(X @ true_beta)) scaler = StandardScaler() X_scaled = scaler.fit_transform(X) # --- Task 1: LASSO --- lasso = LogisticRegressionCV(penalty='l1', solver='saga', Cs=50, cv=10, max_iter=10000, random_state=42) lasso.fit(X_scaled, y) print("=== Task 1: LASSO coefficients ===") for name, coef in zip(feature_names, lasso.coef_[0]): marker = " *" if abs(coef) > 1e-6 else "" print(f" {name:10s}: {coef:8.4f}{marker}") # --- Task 2: Ridge --- ridge = LogisticRegressionCV(penalty='l2', solver='lbfgs', Cs=50, cv=10, max_iter=10000, random_state=42) ridge.fit(X_scaled, y) print("\n=== Task 2: Ridge coefficients ===") for name, coef in zip(feature_names, ridge.coef_[0]): print(f" {name:10s}: {coef:8.4f}") # --- Task 3: Elastic Net --- enet = LogisticRegressionCV(penalty='elasticnet', solver='saga', Cs=50, cv=10, l1_ratios=[0.5], max_iter=10000, random_state=42) enet.fit(X_scaled, y) print("\n=== Task 3: Elastic Net coefficients ===") for name, coef in zip(feature_names, enet.coef_[0]): marker = " *" if abs(coef) > 1e-6 else "" print(f" {name:10s}: {coef:8.4f}{marker}") # --- Question (a): Does LASSO select all 4 correlated predictors? --- print("\n=== Question (a) ===") selected_lasso = [name for name, coef in zip(feature_names, lasso.coef_[0]) if abs(coef) > 1e-6] print(f"LASSO selected: {selected_lasso}") print("LASSO typically selects only 1-2 of the 4 correlated predictors.") print("This is because the LASSO arbitrarily picks one predictor from a") print("correlated group and drops the rest. With correlation ~0.9, any") print("single predictor carries almost the same information as all four.") # --- Question (b): How does Ridge handle correlated predictors? --- print("\n=== Question (b) ===") print("Ridge coefficients for correlated predictors:") for i in range(4): print(f" {feature_names[i]}: {ridge.coef_[0][i]:.4f}") print("Ridge distributes the effect EVENLY across all correlated predictors.") print("All four retain non-zero (and similar) coefficients.") # --- Question (c): Does Elastic Net improve on LASSO? --- print("\n=== Question (c) ===") selected_enet = [name for name, coef in zip(feature_names, enet.coef_[0]) if abs(coef) > 1e-6] print(f"Elastic Net selected: {selected_enet}") print("The Elastic Net typically selects MORE of the correlated predictors") print("than LASSO, thanks to the L2 component (grouping effect).") # --- Question (d): Stability analysis --- print("\n=== Question (d): Stability analysis (10 bootstrap samples) ===") print("\nLASSO variable selection:") for seed in range(10): rng = np.random.RandomState(seed) idx = rng.choice(n, n, replace=True) lasso_boot = LogisticRegressionCV(penalty='l1', solver='saga', Cs=20, cv=5, max_iter=10000, random_state=42) lasso_boot.fit(X_scaled[idx], y[idx]) selected = [feature_names[j] for j in range(10) if abs(lasso_boot.coef_[0][j]) > 1e-6] print(f" Seed {seed}: {selected}") print("\nRidge — all predictors retained (non-zero coefficients):") for seed in range(10): rng = np.random.RandomState(seed) idx = rng.choice(n, n, replace=True) ridge_boot = LogisticRegressionCV(penalty='l2', solver='lbfgs', Cs=20, cv=5, max_iter=10000, random_state=42) ridge_boot.fit(X_scaled[idx], y[idx]) # Ridge always keeps all predictors; show which have coef > 0.01 selected = [feature_names[j] for j in range(10) if abs(ridge_boot.coef_[0][j]) > 0.01] print(f" Seed {seed}: {selected}") print("\nConclusion: LASSO variable selection is UNSTABLE for correlated") print("predictors -- the selected predictor(s) change across samples.") print("Ridge is stable because it always retains all predictors.") print("Elastic Net offers a compromise with better stability than LASSO.") ``` ::: ## Chapter 06: Survival Analysis ### Exercise 1 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch06-ex1-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 6, Exercise 1: Kaplan-Meier Curves by Disease Stage # PBC dataset: KM curves stratified by stage, log-rank test, median survival # ============================================================================= library(survival) library(survminer) library(tidyverse) # --- Load and prepare the PBC dataset --- data(pbc, package = "survival") pbc_clean <- pbc %>% filter(!is.na(trt)) %>% mutate( status_binary = ifelse(status == 2, 1, 0), time_years = time / 365.25, trt_label = factor(trt, labels = c("D-penicillamine", "Placebo")) ) %>% filter(!is.na(stage)) # Remove missing stage values cat("N =", nrow(pbc_clean), "\n") cat("Events (deaths):", sum(pbc_clean$status_binary), "\n") cat("Stages:", paste(sort(unique(pbc_clean$stage)), collapse = ", "), "\n\n") # --- Fit Kaplan-Meier curves stratified by disease stage --- km_stage <- survfit(Surv(time_years, status_binary) ~ stage, data = pbc_clean) print(km_stage) # --- Report median survival time for each stage --- cat("\n=== Median Survival Time by Stage ===\n") median_surv <- summary(km_stage)$table print(median_surv) # Extract and display more clearly cat("\nMedian survival (years) by stage:\n") for (s in sort(unique(pbc_clean$stage))) { km_s <- survfit(Surv(time_years, status_binary) ~ 1, data = pbc_clean[pbc_clean$stage == s, ]) med <- surv_median(km_s) cat(sprintf(" Stage %d: %.2f years (95%% CI: %.2f - %.2f)\n", s, med$median, med$lower, med$upper)) } # --- Perform the log-rank test --- cat("\n=== Log-Rank Test ===\n") logrank <- survdiff(Surv(time_years, status_binary) ~ stage, data = pbc_clean) print(logrank) # Extract p-value pvalue <- 1 - pchisq(logrank$chisq, length(logrank$n) - 1) cat(sprintf("\nLog-rank test p-value: %.6f\n", pvalue)) if (pvalue < 0.05) { cat("Result: Significant difference in survival across disease stages.\n") } else { cat("Result: No significant difference detected.\n") } # --- Plot the Kaplan-Meier curves --- p <- ggsurvplot( km_stage, data = pbc_clean, pval = TRUE, # Show log-rank p-value conf.int = TRUE, # Show confidence intervals risk.table = TRUE, # Show number at risk table xlab = "Time (years)", ylab = "Survival probability", title = "Kaplan-Meier Curves by Disease Stage (PBC)", legend.title = "Stage", palette = c("#2ecc71", "#3498db", "#e67e22", "#e74c3c"), ggtheme = theme_minimal() ) print(p) # --- Interpretation --- cat("\n=== Interpretation ===\n") cat("The KM curves should show a clear separation by stage, with higher\n") cat("stages having worse survival. The log-rank test formally confirms\n") cat("whether these differences are statistically significant.\n") cat("As expected in PBC, patients with Stage 4 disease have the worst\n") cat("prognosis, while Stage 1-2 patients have substantially better survival.\n") ``` #### Python ```{python} #| label: sol-ch06-ex1-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 6, Exercise 1: Kaplan-Meier Curves by Disease Stage # PBC dataset (simulated): KM curves by stage, log-rank test, median survival # ============================================================================= import numpy as np import pandas as pd from lifelines import KaplanMeierFitter from lifelines.statistics import logrank_test, multivariate_logrank_test import matplotlib.pyplot as plt # --- Simulate PBC-like data with stage variable --- # (Since PBC is an R-native dataset, we simulate its structure as done # in the chapter's Python code) np.random.seed(42) n = 312 # Simulate stage (1-4) with realistic proportions stage = np.random.choice([1, 2, 3, 4], size=n, p=[0.05, 0.25, 0.40, 0.30]) # Simulate survival times that depend on stage (higher stage = worse survival) # Base scale varies by stage scale_by_stage = {1: 14, 2: 10, 3: 7, 4: 4} time_years = np.array([np.random.exponential(scale=scale_by_stage[s]) for s in stage]) time_years = np.clip(time_years, 0.1, 15) # Clip to realistic range # Event probability increases with stage event_prob = {1: 0.20, 2: 0.35, 3: 0.50, 4: 0.65} event = np.array([np.random.binomial(1, event_prob[s]) for s in stage]) pbc = pd.DataFrame({ 'time_years': time_years, 'event': event, 'stage': stage }) print(f"N = {len(pbc)}") print(f"Events (deaths): {pbc['event'].sum()}") print(f"Stages: {sorted(pbc['stage'].unique())}\n") # --- Fit KM for each stage and report median survival --- print("=== Median Survival Time by Stage ===") kmf_dict = {} for s in sorted(pbc['stage'].unique()): mask = pbc['stage'] == s kmf = KaplanMeierFitter() kmf.fit(pbc.loc[mask, 'time_years'], event_observed=pbc.loc[mask, 'event'], label=f'Stage {s}') kmf_dict[s] = kmf median_surv = kmf.median_survival_time_ ci = kmf.confidence_interval_survival_function_ print(f" Stage {s}: Median survival = {median_surv:.2f} years " f"(n={mask.sum()}, events={pbc.loc[mask, 'event'].sum()})") # --- Perform the log-rank test (overall comparison across all stages) --- print("\n=== Log-Rank Test (overall) ===") result = multivariate_logrank_test( pbc['time_years'], pbc['stage'], pbc['event'] ) print(f"Test statistic: {result.test_statistic:.3f}") print(f"p-value: {result.p_value:.6f}") if result.p_value < 0.05: print("Result: Significant difference in survival across disease stages.") else: print("Result: No significant difference detected.") # --- Pairwise log-rank tests (Stage 1 vs 4, etc.) --- print("\n=== Pairwise Log-Rank Tests (selected) ===") for s1, s2 in [(1, 4), (2, 3), (3, 4)]: mask1 = pbc['stage'] == s1 mask2 = pbc['stage'] == s2 if mask1.sum() > 0 and mask2.sum() > 0: lr = logrank_test( pbc.loc[mask1, 'time_years'], pbc.loc[mask2, 'time_years'], pbc.loc[mask1, 'event'], pbc.loc[mask2, 'event'] ) print(f" Stage {s1} vs {s2}: p = {lr.p_value:.4f}") # --- Plot KM curves for all stages --- fig, ax = plt.subplots(figsize=(10, 6)) colors = {1: '#2ecc71', 2: '#3498db', 3: '#e67e22', 4: '#e74c3c'} for s, kmf in kmf_dict.items(): kmf.plot_survival_function(ax=ax, ci_show=True, color=colors[s]) ax.set_xlabel("Time (years)", fontsize=12) ax.set_ylabel("Survival probability", fontsize=12) ax.set_title("Kaplan-Meier Curves by Disease Stage (PBC)", fontsize=14) ax.legend(fontsize=10) # Add p-value annotation ax.text(0.7, 0.95, f'Log-rank p = {result.p_value:.4f}', transform=ax.transAxes, fontsize=11, verticalalignment='top', bbox=dict(boxstyle='round', facecolor='white', alpha=0.8)) plt.tight_layout() plt.show() # --- Interpretation --- print("\n=== Interpretation ===") print("The KM curves show a clear separation by stage, with higher stages") print("having worse survival. The log-rank test formally confirms whether") print("these differences are statistically significant.") print("Patients with Stage 4 disease have the worst prognosis, while") print("Stage 1-2 patients have substantially better survival.") ``` ::: ### Exercise 2 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch06-ex2-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 6, Exercise 2: Build and Evaluate a Cox Model # PBC dataset: Cox PH with age, log(bili), albumin, protime, edema # ============================================================================= library(survival) library(survminer) library(tidyverse) # --- Load and prepare data --- data(pbc, package = "survival") pbc_clean <- pbc %>% filter(!is.na(trt)) %>% mutate( status_binary = ifelse(status == 2, 1, 0), time_years = time / 365.25, trt_label = factor(trt, labels = c("D-penicillamine", "Placebo")), log_bili = log(bili) ) %>% select(time_years, status_binary, age, log_bili, albumin, protime, edema) %>% drop_na() cat("Complete cases:", nrow(pbc_clean), "\n") cat("Events:", sum(pbc_clean$status_binary), "\n\n") # --- Task 1: Fit the Cox model and report hazard ratios --- cox_model <- coxph( Surv(time_years, status_binary) ~ age + log_bili + albumin + protime + edema, data = pbc_clean ) cat("=== Task 1: Cox Model Summary ===\n") print(summary(cox_model)) # Extract and display HRs with 95% CIs cat("\n=== Hazard Ratios and 95% CIs ===\n") hr_table <- data.frame( Variable = names(coef(cox_model)), HR = exp(coef(cox_model)), Lower_95 = exp(confint(cox_model))[, 1], Upper_95 = exp(confint(cox_model))[, 2], p_value = summary(cox_model)$coefficients[, 5] ) print(round(hr_table, 4)) # Forest plot ggforest(cox_model, data = pbc_clean) # --- Task 2: Check the proportional hazards assumption --- cat("\n=== Task 2: Proportional Hazards Assumption ===\n") ph_test <- cox.zph(cox_model) print(ph_test) cat("\nInterpretation of PH test:\n") cat("- A significant p-value (< 0.05) indicates violation of the PH assumption.\n") cat("- Look at the GLOBAL test and individual covariate tests.\n") # Which covariates violate PH? ph_df <- data.frame( Variable = rownames(ph_test$table)[-nrow(ph_test$table)], p_value = ph_test$table[-nrow(ph_test$table), "p"] ) violators <- ph_df$Variable[ph_df$p_value < 0.05] if (length(violators) > 0) { cat("Covariates violating PH assumption (p < 0.05):", paste(violators, collapse = ", "), "\n") } else { cat("No covariates significantly violate the PH assumption.\n") } # Plot Schoenfeld residuals par(mfrow = c(2, 3)) plot(ph_test) par(mfrow = c(1, 1)) # --- Task 3: Concordance index --- cat("\n=== Task 3: Concordance Index ===\n") c_index <- summary(cox_model)$concordance[1] c_se <- summary(cox_model)$concordance[2] cat(sprintf("C-index: %.3f (SE: %.3f)\n", c_index, c_se)) cat(sprintf("95%% CI: %.3f - %.3f\n", c_index - 1.96 * c_se, c_index + 1.96 * c_se)) cat("\nInterpretation:\n") cat("- C-index = 0.5: model has no discriminative ability (random)\n") cat("- C-index = 1.0: perfect discrimination\n") cat("- C-index > 0.7: generally considered acceptable\n") cat("- C-index > 0.8: considered strong discrimination\n") cat(sprintf("This model's C-index of %.3f indicates %s discrimination.\n", c_index, ifelse(c_index > 0.8, "strong", ifelse(c_index > 0.7, "acceptable", "modest")))) # --- Task 4: Predict 5-year survival for a specific patient --- cat("\n=== Task 4: 5-Year Survival Prediction ===\n") new_patient <- data.frame( age = 55, log_bili = log(3), # bilirubin = 3 mg/dL albumin = 3.2, protime = 11, edema = 0 # no edema ) cat("Patient profile:\n") cat(" Age: 55 years\n") cat(" Bilirubin: 3 mg/dL (log = ", round(log(3), 2), ")\n") cat(" Albumin: 3.2 g/dL\n") cat(" Prothrombin time: 11 seconds\n") cat(" Edema: none\n\n") # Get predicted survival function pred_surv <- survfit(cox_model, newdata = new_patient) # Extract 5-year survival probability surv_summary <- summary(pred_surv, times = 5) cat(sprintf("Predicted 5-year survival probability: %.3f (%.1f%%)\n", surv_summary$surv, surv_summary$surv * 100)) cat(sprintf("95%% CI: %.3f - %.3f\n", surv_summary$lower, surv_summary$upper)) # Plot the predicted survival curve plot(pred_surv, xlab = "Time (years)", ylab = "Survival probability", main = "Predicted Survival for Patient Profile", lwd = 2, col = "blue") abline(h = 0.5, lty = 2, col = "grey") abline(v = 5, lty = 2, col = "red") legend("topright", c("Predicted survival", "5-year mark"), col = c("blue", "red"), lty = c(1, 2), lwd = c(2, 1)) ``` #### Python ```{python} #| label: sol-ch06-ex2-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 6, Exercise 2: Build and Evaluate a Cox Model # PBC-like dataset: Cox PH with age, log(bili), albumin, protime, edema # ============================================================================= import numpy as np import pandas as pd from lifelines import CoxPHFitter import matplotlib.pyplot as plt # --- Simulate PBC-like clinical dataset --- # (Following the chapter's Python approach with simulated data) np.random.seed(42) n = 276 pbc = pd.DataFrame({ 'time_years': np.abs(np.random.exponential(7, n) + np.random.normal(0, 1, n)), 'event': np.random.binomial(1, 0.45, n), 'age': np.random.normal(50, 10, n), 'log_bili': np.random.normal(0.5, 1.0, n), # log(bilirubin) 'albumin': np.random.normal(3.5, 0.4, n), 'protime': np.random.normal(11, 1, n), 'edema': np.random.choice([0, 0.5, 1], n, p=[0.75, 0.15, 0.10]) }) # Make survival time depend on covariates to create realistic associations hazard_linear = (0.04 * pbc['age'] + 0.3 * pbc['log_bili'] - 0.5 * pbc['albumin'] + 0.2 * pbc['protime'] + 0.8 * pbc['edema']) # Adjust times based on hazard pbc['time_years'] = pbc['time_years'] * np.exp(-0.1 * hazard_linear) pbc['time_years'] = np.clip(pbc['time_years'], 0.05, 15) print(f"N = {len(pbc)}") print(f"Events: {pbc['event'].sum()}") print(f"Median follow-up: {pbc['time_years'].median():.2f} years\n") # --- Task 1: Fit Cox model and report hazard ratios --- cph = CoxPHFitter() cph.fit(pbc, duration_col='time_years', event_col='event', formula='age + log_bili + albumin + protime + edema') print("=== Task 1: Cox Model Summary ===") cph.print_summary() # Extract HR table print("\n=== Hazard Ratios and 95% CIs ===") hr_table = pd.DataFrame({ 'HR': np.exp(cph.params_), 'Lower_95': np.exp(cph.confidence_intervals_.iloc[:, 0]), 'Upper_95': np.exp(cph.confidence_intervals_.iloc[:, 1]), 'p_value': cph.summary['p'] }) print(hr_table.round(4)) # Forest plot fig, ax = plt.subplots(figsize=(8, 5)) cph.plot(ax=ax) ax.set_title("Cox PH Model - Hazard Ratios") plt.tight_layout() plt.show() # --- Task 2: Check proportional hazards assumption --- print("\n=== Task 2: Proportional Hazards Assumption ===") try: cph.check_assumptions(pbc, p_value_threshold=0.05, show_plots=True) except Exception as e: print(f"PH assumption check result: {e}") print("If no warnings are raised, the PH assumption is not violated.") # Interpretation print("\nInterpretation:") print("- A significant p-value (< 0.05) indicates violation of PH.") print("- Look at the Schoenfeld residual plots for trends over time.") print("- If a covariate violates PH, consider stratifying on it or") print(" adding a time interaction.") # --- Task 3: Concordance index --- print(f"\n=== Task 3: Concordance Index ===") c_index = cph.concordance_index_ print(f"C-index: {c_index:.3f}") print(f"\nInterpretation:") print(f"- C-index = 0.5: no discriminative ability (random)") print(f"- C-index = 1.0: perfect discrimination") print(f"- C-index > 0.7: generally considered acceptable") print(f"- C-index > 0.8: considered strong discrimination") quality = "strong" if c_index > 0.8 else ("acceptable" if c_index > 0.7 else "modest") print(f"This model's C-index of {c_index:.3f} indicates {quality} discrimination.") # --- Task 4: 5-year survival for a specific patient --- print(f"\n=== Task 4: 5-Year Survival Prediction ===") print("Patient profile:") print(" Age: 55 years") print(f" Bilirubin: 3 mg/dL (log = {np.log(3):.2f})") print(" Albumin: 3.2 g/dL") print(" Prothrombin time: 11 seconds") print(" Edema: none (0)") new_patient = pd.DataFrame({ 'age': [55], 'log_bili': [np.log(3)], # bilirubin = 3 mg/dL 'albumin': [3.2], 'protime': [11], 'edema': [0] }) # Get predicted survival function surv_func = cph.predict_survival_function(new_patient) # Find 5-year survival # Get the closest time point to 5 years times = surv_func.index closest_5yr = times[np.argmin(np.abs(times - 5))] surv_5yr = surv_func.loc[closest_5yr].values[0] print(f"\nPredicted 5-year survival probability: {surv_5yr:.3f} ({surv_5yr*100:.1f}%)") # Plot predicted survival curve fig, ax = plt.subplots(figsize=(8, 5)) surv_func.plot(ax=ax, color='blue', linewidth=2, label='Predicted survival') ax.axhline(y=0.5, color='grey', linestyle='--', alpha=0.7, label='50% survival') ax.axvline(x=5, color='red', linestyle='--', alpha=0.7, label='5-year mark') ax.set_xlabel("Time (years)", fontsize=12) ax.set_ylabel("Survival probability", fontsize=12) ax.set_title("Predicted Survival for Patient Profile", fontsize=14) ax.legend() plt.tight_layout() plt.show() ``` ::: ### Exercise 3 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch06-ex3-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 6, Exercise 3: Competing Risks Thinking # Kidney transplant rejection with death as a competing risk # ============================================================================= # This exercise is primarily conceptual (parts 1-2) with a bonus coding # component (part 3). # --- Question 1 --- # Explain why treating death as censoring would bias the results. # What direction would the bias go? # # ANSWER: # Treating death as simple censoring VIOLATES the assumption of # NON-INFORMATIVE censoring. Non-informative censoring means that # censored patients have the same future risk of the event as those # who remain under observation. But patients who die are NOT like # patients who remain alive -- they have ZERO future risk of rejection # (because dead patients cannot experience rejection). # # The bias direction: treating death as censoring INFLATES (overestimates) # the estimated probability of rejection. This happens because the # Kaplan-Meier estimator assumes that censored patients would eventually # experience the event at the same rate as those still at risk. But dead # patients will NEVER experience rejection, so acting as if they could # leads to an overestimate of the cumulative incidence of rejection. # # In the KM framework, when a patient dies and is "censored," they are # removed from the risk set, effectively assuming they would have had # the same rejection rate as surviving patients. Since they actually # have zero rejection risk, this inflates the estimated rate. # --- Question 2 --- # Which approach for estimating probability of rejection within 2 years: # cause-specific hazards or Fine-Gray? # # ANSWER: # For estimating the PROBABILITY of rejection within 2 years, use the # FINE-GRAY subdistribution hazard model. Here is why: # # - Cause-specific hazards model the RATE of rejection among those # currently alive and rejection-free. This answers: "Among patients # still alive, what is the instantaneous risk of rejection?" This is # useful for understanding ETIOLOGY (what causes rejection). # # - The Fine-Gray model directly models the CUMULATIVE INCIDENCE # FUNCTION, which gives the probability of rejection by time t, # accounting for the fact that some patients will die first. This # is the right quantity for PREDICTION and CLINICAL COMMUNICATION. # # When your goal is prediction ("What is the probability that this # patient's transplant will be rejected within 2 years?"), you need # the cumulative incidence, which properly accounts for the competing # risk of death. The Fine-Gray model provides this directly. # --- Question 3 (Bonus): Fine-Gray model in R --- library(survival) library(tidycmprsk) library(ggplot2) # Simulate kidney transplant data with competing risks set.seed(2025) n <- 500 # Covariates age <- rnorm(n, mean = 50, sd = 12) donor_type <- rbinom(n, 1, 0.4) # 0 = living, 1 = deceased donor hla_mismatch <- rpois(n, lambda = 2) # Simulate competing event times # Time to rejection (cause 1) lambda_reject <- exp(-4 + 0.02 * age + 0.5 * donor_type + 0.2 * hla_mismatch) time_reject <- rexp(n, rate = lambda_reject) # Time to death without rejection (cause 2) lambda_death <- exp(-5 + 0.03 * age + 0.3 * donor_type) time_death <- rexp(n, rate = lambda_death) # Administrative censoring at 10 years time_censor <- runif(n, 5, 10) # Determine observed time and event type time_obs <- pmin(time_reject, time_death, time_censor) event <- ifelse(time_obs == time_censor, 0, ifelse(time_obs == time_reject, 1, 2)) # 0 = censored, 1 = rejection, 2 = death df <- data.frame( time = time_obs, event = event, age = age, donor_type = factor(donor_type, labels = c("Living", "Deceased")), hla_mismatch = hla_mismatch ) cat("=== Dataset summary ===\n") cat("N:", n, "\n") cat("Rejections (event=1):", sum(event == 1), "\n") cat("Deaths (event=2):", sum(event == 2), "\n") cat("Censored (event=0):", sum(event == 0), "\n\n") # --- Cumulative Incidence Function (CIF) --- # Using tidycmprsk for a tidy interface cuminc_fit <- cuminc(Surv(time, event) ~ donor_type, data = df) print(cuminc_fit) # Plot cumulative incidence by donor type p <- ggcuminc(cuminc_fit, outcome = "1") + # outcome 1 = rejection labs(x = "Time (years)", y = "Cumulative Incidence of Rejection", title = "Cumulative Incidence of Transplant Rejection by Donor Type") + theme_minimal(base_size = 14) print(p) # --- Fine-Gray subdistribution hazard model --- cat("\n=== Fine-Gray Model ===\n") # Using tidycmprsk::crr for Fine-Gray regression fg_model <- crr(Surv(time, event) ~ age + donor_type + hla_mismatch, data = df, failcode = 1) # failcode 1 = rejection print(summary(fg_model)) cat("\nInterpretation:\n") cat("- The Fine-Gray model estimates subdistribution hazard ratios.\n") cat("- These quantify how covariates affect the cumulative incidence\n") cat(" of rejection, accounting for the competing risk of death.\n") cat("- HR > 1 means higher cumulative incidence of rejection.\n") # --- Compare with cause-specific Cox model --- cat("\n=== Cause-Specific Cox Model (for comparison) ===\n") # Censor deaths for cause-specific analysis of rejection df$event_cs <- ifelse(df$event == 1, 1, 0) # only rejection is event cox_cs <- coxph(Surv(time, event_cs) ~ age + donor_type + hla_mismatch, data = df) print(summary(cox_cs)) cat("\n=== Comparison Summary ===\n") cat("The cause-specific and Fine-Gray HRs can differ because they\n") cat("answer different questions:\n") cat("- Cause-specific: 'What affects the rate of rejection among\n") cat(" those still alive?'\n") cat("- Fine-Gray: 'What affects the probability of rejection by\n") cat(" time t, accounting for death?'\n") cat("Both are valid; the choice depends on whether your goal is\n") cat("etiological understanding or clinical prediction.\n") ``` #### Python ```{python} #| label: sol-ch06-ex3-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 6, Exercise 3: Competing Risks Thinking # Kidney transplant rejection with death as a competing risk # ============================================================================= # This exercise is primarily conceptual (parts 1-2) with a bonus coding # component (part 3). import numpy as np import pandas as pd import matplotlib.pyplot as plt # --- Question 1 --- # Explain why treating death as censoring would bias the results. # What direction would the bias go? # # ANSWER: # Treating death as simple censoring VIOLATES the assumption of # NON-INFORMATIVE censoring. Non-informative censoring means that # censored patients have the same future risk of the event as those # who remain under observation. But patients who die are NOT like # patients who remain alive -- they have ZERO future risk of rejection # (because dead patients cannot experience rejection). # # The bias direction: treating death as censoring INFLATES (overestimates) # the estimated probability of rejection. This happens because the # Kaplan-Meier estimator assumes that censored patients would eventually # experience the event at the same rate as those still at risk. But dead # patients will NEVER experience rejection, so acting as if they could # leads to an overestimate of the cumulative incidence of rejection. # # In the KM framework, when a patient dies and is "censored," they are # removed from the risk set, effectively assuming they would have had # the same rejection rate as surviving patients. Since they actually # have zero rejection risk, this inflates the estimated rate. # --- Question 2 --- # Which approach for estimating probability of rejection within 2 years: # cause-specific hazards or Fine-Gray? # # ANSWER: # For estimating the PROBABILITY of rejection within 2 years, use the # FINE-GRAY subdistribution hazard model. Here is why: # # - Cause-specific hazards model the RATE of rejection among those # currently alive and rejection-free. This answers: "Among patients # still alive, what is the instantaneous risk of rejection?" This is # useful for understanding ETIOLOGY (what causes rejection). # # - The Fine-Gray model directly models the CUMULATIVE INCIDENCE # FUNCTION, which gives the probability of rejection by time t, # accounting for the fact that some patients will die first. This # is the right quantity for PREDICTION and CLINICAL COMMUNICATION. # # When your goal is prediction ("What is the probability that this # patient's transplant will be rejected within 2 years?"), you need # the cumulative incidence, which properly accounts for the competing # risk of death. The Fine-Gray model provides this directly. # --- Question 3 (Bonus): Competing risks analysis in Python --- from lifelines import KaplanMeierFitter, CoxPHFitter # Simulate kidney transplant data with competing risks np.random.seed(2025) n = 500 # Covariates age = np.random.normal(50, 12, size=n) donor_type = np.random.binomial(1, 0.4, size=n) # 0=living, 1=deceased hla_mismatch = np.random.poisson(2, size=n) # Simulate competing event times # Time to rejection (cause 1) lambda_reject = np.exp(-4 + 0.02 * age + 0.5 * donor_type + 0.2 * hla_mismatch) time_reject = np.random.exponential(1 / lambda_reject) # Time to death without rejection (cause 2) lambda_death = np.exp(-5 + 0.03 * age + 0.3 * donor_type) time_death = np.random.exponential(1 / lambda_death) # Administrative censoring at 5-10 years time_censor = np.random.uniform(5, 10, size=n) # Determine observed time and event type time_obs = np.minimum(np.minimum(time_reject, time_death), time_censor) event = np.where(time_obs == time_censor, 0, np.where(time_obs == time_reject, 1, 2)) # 0 = censored, 1 = rejection, 2 = death df = pd.DataFrame({ 'time': time_obs, 'event': event, 'age': age, 'donor_type': donor_type, 'hla_mismatch': hla_mismatch }) print("=== Dataset summary ===") print(f"N: {n}") print(f"Rejections (event=1): {(event == 1).sum()}") print(f"Deaths (event=2): {(event == 2).sum()}") print(f"Censored (event=0): {(event == 0).sum()}\n") # --- Approach 1: Naive KM (treating death as censoring) --- # This is WRONG but illustrative kmf_naive = KaplanMeierFitter() event_naive = (event == 1).astype(int) # only rejection = event; death = censored kmf_naive.fit(time_obs, event_observed=event_naive, label="Naive KM (death = censored)") # --- Approach 2: Cause-specific Cox model --- # For rejection: censor deaths and administrative censoring print("=== Cause-Specific Cox Model (rejection) ===") df_cs = df.copy() df_cs['event_rejection'] = (df_cs['event'] == 1).astype(int) cph_cs = CoxPHFitter() cph_cs.fit(df_cs, duration_col='time', event_col='event_rejection', formula='age + donor_type + hla_mismatch') cph_cs.print_summary() # --- Approach 3: Cumulative Incidence Function (Aalen-Johansen estimator) --- # Compute the CIF manually using the Aalen-Johansen approach # This properly accounts for competing risks def compute_cif(times, events, cause=1): """ Compute the cumulative incidence function for a specific cause using the Aalen-Johansen estimator. """ # Sort by time order = np.argsort(times) t_sorted = times[order] e_sorted = events[order] unique_times = np.unique(t_sorted[e_sorted > 0]) n_risk = len(times) cif = [] surv = 1.0 # Overall survival (from all causes) cif_times = [] for t in unique_times: at_risk = np.sum(t_sorted >= t) d_cause = np.sum((t_sorted == t) & (e_sorted == cause)) d_all = np.sum((t_sorted == t) & (e_sorted > 0)) if at_risk > 0: # CIF increment: S(t-) * d_cause/n_risk cif_increment = surv * (d_cause / at_risk) # Update overall survival surv *= (1 - d_all / at_risk) cif.append(cif_increment) cif_times.append(t) # Cumulative sum cif_cumulative = np.cumsum(cif) return np.array(cif_times), cif_cumulative # Compute CIF for rejection (cause 1) cif_times, cif_values = compute_cif(time_obs, event, cause=1) # Compute CIF for death (cause 2) cif_times_d, cif_values_d = compute_cif(time_obs, event, cause=2) # --- Plot: Naive KM vs CIF --- fig, axes = plt.subplots(1, 2, figsize=(14, 5)) # Left panel: Naive KM (1 - S(t)) vs CIF for rejection ax1 = axes[0] # Naive 1 - KM naive_times = kmf_naive.survival_function_.index naive_1_minus_s = 1 - kmf_naive.survival_function_.values.flatten() ax1.step(naive_times, naive_1_minus_s, color='red', linewidth=2, label='Naive 1-KM (BIASED)', where='post') # Proper CIF ax1.step(cif_times, cif_values, color='blue', linewidth=2, label='Cumulative Incidence (correct)', where='post') ax1.set_xlabel("Time (years)", fontsize=12) ax1.set_ylabel("Probability of Rejection", fontsize=12) ax1.set_title("Naive KM vs Cumulative Incidence", fontsize=13) ax1.legend(fontsize=10) ax1.set_xlim(0, 10) # Right panel: Stacked CIF (rejection + death) ax2 = axes[1] # Interpolate to common time grid from scipy import interpolate t_grid = np.linspace(0.01, min(cif_times.max(), cif_times_d.max()), 200) f_reject = interpolate.interp1d(cif_times, cif_values, kind='previous', bounds_error=False, fill_value=(0, cif_values[-1])) f_death = interpolate.interp1d(cif_times_d, cif_values_d, kind='previous', bounds_error=False, fill_value=(0, cif_values_d[-1])) cif_r = f_reject(t_grid) cif_d = f_death(t_grid) ax2.fill_between(t_grid, 0, cif_r, alpha=0.4, color='#e74c3c', label='Rejection') ax2.fill_between(t_grid, cif_r, cif_r + cif_d, alpha=0.4, color='#3498db', label='Death (competing)') ax2.plot(t_grid, cif_r + cif_d, color='black', linewidth=1) ax2.set_xlabel("Time (years)", fontsize=12) ax2.set_ylabel("Cumulative Incidence", fontsize=12) ax2.set_title("Stacked Cumulative Incidence Functions", fontsize=13) ax2.legend(fontsize=10) ax2.set_xlim(0, 10) ax2.set_ylim(0, 1) plt.tight_layout() plt.show() # --- 2-year cumulative incidence of rejection --- # Find closest time to 2 years idx_2yr = np.argmin(np.abs(cif_times - 2)) ci_2yr = cif_values[idx_2yr] print(f"\n=== 2-Year Cumulative Incidence of Rejection ===") print(f"Proper CIF estimate: {ci_2yr:.3f} ({ci_2yr*100:.1f}%)") # Compare with naive KM naive_2yr = 1 - kmf_naive.predict(2) print(f"Naive 1-KM estimate: {naive_2yr:.3f} ({naive_2yr*100:.1f}%)") print(f"\nThe naive estimate is HIGHER (biased upward) because it treats") print(f"deaths as censoring, overestimating the probability of rejection.") # --- Summary --- print("\n=== Summary ===") print("1. Treating death as censoring overestimates rejection probability.") print("2. For prediction, use the Fine-Gray model / cumulative incidence.") print("3. For etiology, use cause-specific Cox models.") print("4. Always report cumulative incidence functions, not 1-KM,") print(" when competing risks are present.") ``` ::: ## Chapter 07: ML Foundations ### Exercise 1 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch07-ex1-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 7, Exercise 1: Cross-Validation Experiment # Compare logistic regression and SVM (RBF kernel) using 10-fold stratified CV. # Report AUC for both models. # ============================================================================= library(tidyverse) library(tidymodels) library(kernlab) # --- Simulate the clinical dataset --- set.seed(123) n <- 600 ex_data <- tibble( age = rnorm(n, 65, 10), creatinine = rlnorm(n, 0, 0.5), hemoglobin = rnorm(n, 12, 2), platelets = rnorm(n, 250, 70), wbc = rlnorm(n, 2, 0.4), icu = factor(rbinom(n, 1, plogis(-4 + 0.03 * rnorm(n, 65, 10) + 0.5 * rlnorm(n, 0, 0.5))), labels = c("No", "Yes")) ) cat("ICU admission rate:", mean(ex_data$icu == "Yes"), "\n") # --- 10-fold stratified cross-validation --- set.seed(42) folds <- vfold_cv(ex_data, v = 10, strata = icu) # --- Logistic Regression workflow --- lr_spec <- logistic_reg() %>% set_engine("glm") lr_recipe <- recipe(icu ~ ., data = ex_data) %>% step_normalize(all_numeric_predictors()) lr_wf <- workflow() %>% add_model(lr_spec) %>% add_recipe(lr_recipe) lr_results <- fit_resamples(lr_wf, resamples = folds, metrics = metric_set(roc_auc)) # --- SVM (RBF kernel) workflow --- svm_spec <- svm_rbf(cost = 1, rbf_sigma = 0.5) %>% set_engine("kernlab") %>% set_mode("classification") svm_recipe <- recipe(icu ~ ., data = ex_data) %>% step_normalize(all_numeric_predictors()) svm_wf <- workflow() %>% add_model(svm_spec) %>% add_recipe(svm_recipe) svm_results <- fit_resamples(svm_wf, resamples = folds, metrics = metric_set(roc_auc)) # --- Collect and compare results --- lr_metrics <- collect_metrics(lr_results) %>% mutate(model = "Logistic Regression") svm_metrics <- collect_metrics(svm_results) %>% mutate(model = "SVM (RBF)") comparison <- bind_rows(lr_metrics, svm_metrics) %>% select(model, .metric, mean, std_err) print(comparison) # --- Interpretation --- # Both models are expected to show similar AUC because the simulated data # has simple, roughly linear relationships between predictors and outcome. # Logistic regression handles linear relationships well, and an RBF SVM # adds flexibility that is not needed here. # With real clinical data that has non-linear or interactive effects, # SVM may outperform logistic regression. ``` #### Python ```{python} #| label: sol-ch07-ex1-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 7, Exercise 1: Cross-Validation Experiment # Compare logistic regression and SVM (RBF kernel) using 10-fold stratified CV. # Report AUC for both models. # ============================================================================= import numpy as np import pandas as pd from sklearn.linear_model import LogisticRegression from sklearn.svm import SVC from sklearn.model_selection import cross_val_score, StratifiedKFold from sklearn.preprocessing import StandardScaler from sklearn.pipeline import make_pipeline # --- Simulate the clinical dataset --- np.random.seed(123) n = 600 X = pd.DataFrame({ 'age': np.random.normal(65, 10, n), 'creatinine': np.random.lognormal(0, 0.5, n), 'hemoglobin': np.random.normal(12, 2, n), 'platelets': np.random.normal(250, 70, n), 'wbc': np.random.lognormal(2, 0.4, n) }) # Simulate ICU outcome (binary) prob = 1 / (1 + np.exp(-(-4 + 0.03 * np.random.normal(65, 10, n) + 0.5 * np.random.lognormal(0, 0.5, n)))) y = np.random.binomial(1, prob) print(f"ICU admission rate: {y.mean():.3f}") # --- 10-fold stratified CV --- cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42) # --- Logistic Regression --- lr_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)) lr_scores = cross_val_score(lr_pipe, X, y, cv=cv, scoring='roc_auc') print(f"\nLogistic Regression AUC: {lr_scores.mean():.3f} (+/- {lr_scores.std():.3f})") # --- SVM (RBF kernel) --- svm_pipe = make_pipeline( StandardScaler(), SVC(kernel='rbf', C=1.0, gamma=0.5, probability=True) ) svm_scores = cross_val_score(svm_pipe, X, y, cv=cv, scoring='roc_auc') print(f"SVM (RBF) AUC: {svm_scores.mean():.3f} (+/- {svm_scores.std():.3f})") # --- Comparison --- results = pd.DataFrame({ 'Model': ['Logistic Regression', 'SVM (RBF)'], 'Mean AUC': [lr_scores.mean(), svm_scores.mean()], 'Std AUC': [lr_scores.std(), svm_scores.std()] }) print("\n", results.to_string(index=False)) # --- Interpretation --- # Both models are expected to show similar AUC because the simulated data # has simple, roughly linear relationships. Logistic regression is well-suited # for such data. The RBF SVM adds flexibility that is not needed here. # With real clinical data containing non-linear effects, SVM may outperform # logistic regression. ``` ::: ### Exercise 2 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch07-ex2-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 7, Exercise 2: Feature Engineering Challenge # Engineer clinically meaningful features from raw variables for a diabetes # prediction model. Implement using tidymodels recipes. # ============================================================================= library(tidyverse) library(tidymodels) # --- Simulate raw clinical data --- set.seed(42) n <- 500 raw_data <- tibble( height_cm = rnorm(n, 170, 10), weight_kg = rnorm(n, 80, 15), sbp = rnorm(n, 130, 18), dbp = rnorm(n, 82, 12), fasting_glucose = rnorm(n, 105, 25), hba1c = rnorm(n, 5.8, 0.8), age = rnorm(n, 55, 12), sex = sample(c("Male", "Female"), n, replace = TRUE), waist_circumference = rnorm(n, 95, 12), hip_circumference = rnorm(n, 100, 10), total_cholesterol = rnorm(n, 200, 40), hdl = rnorm(n, 50, 15), ldl = rnorm(n, 120, 35), triglycerides = rnorm(n, 150, 60), diabetes = factor(rbinom(n, 1, 0.3), labels = c("No", "Yes")) ) # --- Part 1: Clinically meaningful engineered features --- # 1. BMI = weight_kg / (height_m)^2 # WHY: Standard obesity measure; strong risk factor for Type 2 diabetes. # # 2. Pulse Pressure = sbp - dbp # WHY: Reflects arterial stiffness; associated with cardiovascular risk # and metabolic syndrome. # # 3. Mean Arterial Pressure (MAP) = dbp + (sbp - dbp) / 3 # WHY: Measures average perfusion pressure; linked to vascular health. # # 4. Waist-to-Hip Ratio (WHR) = waist_circumference / hip_circumference # WHY: Central adiposity is a stronger predictor of insulin resistance # than BMI alone. # # 5. Non-HDL Cholesterol = total_cholesterol - hdl # WHY: Captures all atherogenic lipoproteins; recommended by guidelines # as a secondary target in diabetes management. # # 6. Triglyceride-to-HDL Ratio = triglycerides / hdl # WHY: A proxy for insulin resistance; high TG/HDL ratio is associated # with increased diabetes risk. # # 7. LDL/HDL Ratio = ldl / hdl # WHY: Captures atherogenic dyslipidemia profile common in diabetes. # --- Part 2: Implement feature engineering with recipes --- diabetes_recipe <- recipe(diabetes ~ ., data = raw_data) %>% # BMI: weight / height_m^2 step_mutate( height_m = height_cm / 100, bmi = weight_kg / height_m^2, # Pulse pressure pulse_pressure = sbp - dbp, # Mean arterial pressure map = dbp + (sbp - dbp) / 3, # Waist-to-hip ratio whr = waist_circumference / hip_circumference, # Non-HDL cholesterol non_hdl = total_cholesterol - hdl, # Triglyceride-to-HDL ratio tg_hdl_ratio = triglycerides / hdl, # LDL/HDL ratio ldl_hdl_ratio = ldl / hdl ) %>% # Remove intermediate and redundant variables step_rm(height_m) %>% step_normalize(all_numeric_predictors()) %>% step_dummy(all_nominal_predictors()) # Prepare (bake) the recipe to see the result prepped <- prep(diabetes_recipe) engineered_data <- bake(prepped, new_data = NULL) cat("Original variables:", ncol(raw_data) - 1, "\n") cat("Engineered dataset columns:", ncol(engineered_data) - 1, "\n") cat("\nColumn names:\n") print(names(engineered_data)) # --- Part 3: Discuss redundant features --- # After engineering: # - height_cm and weight_kg may become redundant once BMI is computed # (though height or weight alone may still carry predictive signal). # - sbp and dbp are partially captured by pulse_pressure and MAP, # though keeping them may still be useful for tree-based models. # - waist_circumference and hip_circumference are largely captured by WHR. # - total_cholesterol is partly captured by non_hdl (since non_hdl = TC - HDL). # - Individual lipids (hdl, ldl, triglycerides) overlap with the ratios, # but a LASSO or tree model can sort out which representation is most useful. # # In practice, include both raw and engineered features and let a # regularised model (LASSO, elastic net) or tree-based model perform # implicit feature selection. ``` #### Python ```{python} #| label: sol-ch07-ex2-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 7, Exercise 2: Feature Engineering Challenge # Engineer clinically meaningful features from raw variables for a diabetes # prediction model. Implement using pandas and sklearn. # ============================================================================= import numpy as np import pandas as pd from sklearn.preprocessing import StandardScaler # --- Simulate raw clinical data --- np.random.seed(42) n = 500 raw_data = pd.DataFrame({ 'height_cm': np.random.normal(170, 10, n), 'weight_kg': np.random.normal(80, 15, n), 'sbp': np.random.normal(130, 18, n), 'dbp': np.random.normal(82, 12, n), 'fasting_glucose': np.random.normal(105, 25, n), 'hba1c': np.random.normal(5.8, 0.8, n), 'age': np.random.normal(55, 12, n), 'sex': np.random.choice(['Male', 'Female'], n), 'waist_circumference': np.random.normal(95, 12, n), 'hip_circumference': np.random.normal(100, 10, n), 'total_cholesterol': np.random.normal(200, 40, n), 'hdl': np.random.normal(50, 15, n), 'ldl': np.random.normal(120, 35, n), 'triglycerides': np.random.normal(150, 60, n), }) y = np.random.binomial(1, 0.3, n) # --- Part 1: Clinically meaningful engineered features --- # 1. BMI = weight_kg / (height_m)^2 # WHY: Standard obesity measure; strong risk factor for Type 2 diabetes. # # 2. Pulse Pressure = sbp - dbp # WHY: Reflects arterial stiffness; associated with cardiovascular risk # and metabolic syndrome. # # 3. Mean Arterial Pressure (MAP) = dbp + (sbp - dbp) / 3 # WHY: Measures average perfusion pressure; linked to vascular health. # # 4. Waist-to-Hip Ratio (WHR) = waist / hip # WHY: Central adiposity is a stronger predictor of insulin resistance # than BMI alone. # # 5. Non-HDL Cholesterol = total_cholesterol - hdl # WHY: Captures all atherogenic lipoproteins; recommended by guidelines # as a secondary target in diabetes management. # # 6. Triglyceride-to-HDL Ratio = triglycerides / hdl # WHY: A proxy for insulin resistance; high TG/HDL ratio is associated # with increased diabetes risk. # # 7. LDL/HDL Ratio = ldl / hdl # WHY: Captures atherogenic dyslipidemia profile common in diabetes. # --- Part 2: Implement feature engineering --- df = raw_data.copy() # Compute engineered features df['bmi'] = df['weight_kg'] / (df['height_cm'] / 100) ** 2 df['pulse_pressure'] = df['sbp'] - df['dbp'] df['map'] = df['dbp'] + (df['sbp'] - df['dbp']) / 3 df['whr'] = df['waist_circumference'] / df['hip_circumference'] df['non_hdl'] = df['total_cholesterol'] - df['hdl'] df['tg_hdl_ratio'] = df['triglycerides'] / df['hdl'] df['ldl_hdl_ratio'] = df['ldl'] / df['hdl'] # Encode sex as binary df['sex_male'] = (df['sex'] == 'Male').astype(int) df = df.drop(columns=['sex']) print(f"Original features: {raw_data.shape[1]}") print(f"Engineered features: {df.shape[1]}") print(f"\nColumn names:\n{list(df.columns)}") # Scale numeric features scaler = StandardScaler() numeric_cols = df.select_dtypes(include=[np.number]).columns df_scaled = df.copy() df_scaled[numeric_cols] = scaler.fit_transform(df[numeric_cols]) print(f"\nFirst few rows of engineered data:") print(df_scaled.head()) # --- Part 3: Discuss redundant features --- # After engineering: # - height_cm and weight_kg may become redundant once BMI is computed # (though they may still carry independent predictive signal). # - sbp and dbp are partially captured by pulse_pressure and MAP, # though keeping them may still be useful for tree-based models. # - waist_circumference and hip_circumference are largely captured by WHR. # - total_cholesterol is partly captured by non_hdl (since non_hdl = TC - HDL). # - Individual lipids (hdl, ldl, triglycerides) overlap with the ratios, # but a LASSO or tree model can sort out which representation is most useful. # # In practice, include both raw and engineered features and let a # regularised model (LASSO, elastic net) or tree-based model perform # implicit feature selection. ``` ::: ### Exercise 3 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch07-ex3-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 7, Exercise 3: The Bias-Variance Tradeoff in Practice # Fit polynomials of degree 1, 3, 5, 10, and 20 to a training set. # Plot fitted curves, compute training/test error, identify optimal degree. # ============================================================================= library(tidyverse) # --- Generate data --- set.seed(42) x_train <- sort(runif(50, 0, 10)) y_train <- sin(x_train) + rnorm(50, 0, 0.3) x_test <- sort(runif(200, 0, 10)) y_test <- sin(x_test) + rnorm(200, 0, 0.3) train_df <- tibble(x = x_train, y = y_train) test_df <- tibble(x = x_test, y = y_test) # --- Fit polynomials and compute RMSE --- degrees <- c(1, 3, 5, 10, 20) results <- tibble(degree = integer(), train_rmse = double(), test_rmse = double()) # Also store predictions for plotting plot_data <- tibble() for (d in degrees) { # Fit polynomial of degree d fit <- lm(y ~ poly(x, degree = d, raw = TRUE), data = train_df) # Predictions on train and test pred_train <- predict(fit, newdata = train_df) pred_test <- predict(fit, newdata = test_df) # Compute RMSE rmse_train <- sqrt(mean((y_train - pred_train)^2)) rmse_test <- sqrt(mean((y_test - pred_test)^2)) results <- bind_rows(results, tibble(degree = d, train_rmse = rmse_train, test_rmse = rmse_test)) # Smooth curve for plotting x_grid <- seq(0, 10, length.out = 300) pred_grid <- predict(fit, newdata = tibble(x = x_grid)) # Clip extreme predictions for high-degree polynomials pred_grid <- pmin(pmax(pred_grid, -3), 3) plot_data <- bind_rows(plot_data, tibble(x = x_grid, y_pred = pred_grid, degree = paste("Degree", d))) } # --- Print RMSE results --- cat("Polynomial Regression: Training vs Test RMSE\n") cat("=============================================\n") print(results) # --- Part 2: Plot fitted curves --- p1 <- ggplot() + geom_point(data = train_df, aes(x, y), alpha = 0.5, size = 2) + geom_line(data = plot_data, aes(x, y_pred, color = degree), linewidth = 1) + geom_line(data = tibble(x = seq(0, 10, 0.01), y = sin(seq(0, 10, 0.01))), aes(x, y), linetype = "dashed", color = "black", linewidth = 0.8) + labs(title = "Polynomial Fits of Varying Complexity", subtitle = "Dashed line = true function sin(x)", x = "x", y = "y", color = "Polynomial") + theme_minimal(base_size = 14) + theme(legend.position = "top") print(p1) # --- Part 3: Plot training vs test error --- results_long <- results %>% pivot_longer(cols = c(train_rmse, test_rmse), names_to = "set", values_to = "rmse") %>% mutate(set = ifelse(set == "train_rmse", "Training", "Test")) p2 <- ggplot(results_long, aes(x = degree, y = rmse, color = set)) + geom_line(linewidth = 1.2) + geom_point(size = 3) + labs(title = "Training vs Test RMSE by Polynomial Degree", x = "Polynomial Degree", y = "RMSE", color = "Dataset") + theme_minimal(base_size = 14) + theme(legend.position = "top") print(p2) # --- Part 4: Interpretation --- best_degree <- results$degree[which.min(results$test_rmse)] cat("\nBest polynomial degree (lowest test RMSE):", best_degree, "\n") cat("\nInterpretation (Bias-Variance Tradeoff):\n") cat("- Degree 1 (linear): High bias -- too simple to capture the sine curve.\n") cat(" Underfits both training and test data.\n") cat("- Degree 3-5: Good balance. Captures the main curvature of sin(x)\n") cat(" without fitting noise. Test error is minimized here.\n") cat("- Degree 10-20: High variance -- the polynomial wiggles to fit\n") cat(" training noise. Training error drops but test error increases.\n") cat(" This is classic overfitting.\n") cat("\nThe optimal degree (~3-5) sits at the sweet spot of the bias-variance\n") cat("tradeoff, where the model is flexible enough to capture the true\n") cat("pattern but not so flexible that it memorizes noise.\n") ``` #### Python ```{python} #| label: sol-ch07-ex3-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 7, Exercise 3: The Bias-Variance Tradeoff in Practice # Fit polynomials of degree 1, 3, 5, 10, and 20 to a training set. # Plot fitted curves, compute training/test error, identify optimal degree. # ============================================================================= import numpy as np import matplotlib.pyplot as plt from sklearn.preprocessing import PolynomialFeatures from sklearn.linear_model import LinearRegression from sklearn.metrics import mean_squared_error # --- Generate data --- np.random.seed(42) x_train = np.sort(np.random.uniform(0, 10, 50)) y_train = np.sin(x_train) + np.random.normal(0, 0.3, 50) x_test = np.sort(np.random.uniform(0, 10, 200)) y_test = np.sin(x_test) + np.random.normal(0, 0.3, 200) # --- Fit polynomials and compute RMSE --- degrees = [1, 3, 5, 10, 20] results = [] predictions = {} for d in degrees: # Create polynomial features and fit poly = PolynomialFeatures(degree=d, include_bias=False) X_train_poly = poly.fit_transform(x_train.reshape(-1, 1)) X_test_poly = poly.transform(x_test.reshape(-1, 1)) model = LinearRegression() model.fit(X_train_poly, y_train) # Predictions pred_train = model.predict(X_train_poly) pred_test = model.predict(X_test_poly) # RMSE rmse_train = np.sqrt(mean_squared_error(y_train, pred_train)) rmse_test = np.sqrt(mean_squared_error(y_test, pred_test)) results.append({'degree': d, 'train_rmse': rmse_train, 'test_rmse': rmse_test}) # Smooth curve for plotting x_grid = np.linspace(0, 10, 300).reshape(-1, 1) X_grid_poly = poly.transform(x_grid) pred_grid = model.predict(X_grid_poly) # Clip extreme predictions for high-degree polynomials pred_grid = np.clip(pred_grid, -3, 3) predictions[d] = (x_grid.ravel(), pred_grid) # --- Print RMSE results --- print("Polynomial Regression: Training vs Test RMSE") print("=" * 50) print(f"{'Degree':>8s} {'Train RMSE':>12s} {'Test RMSE':>12s}") print("-" * 50) for r in results: print(f"{r['degree']:>8d} {r['train_rmse']:>12.4f} {r['test_rmse']:>12.4f}") # --- Part 2: Plot fitted curves --- fig, ax = plt.subplots(figsize=(10, 6)) ax.scatter(x_train, y_train, alpha=0.5, s=30, label='Training data', zorder=5) x_true = np.linspace(0, 10, 300) ax.plot(x_true, np.sin(x_true), 'k--', linewidth=1.5, label='True function sin(x)') colors = ['#E69F00', '#56B4E9', '#009E73', '#D55E00', '#CC79A7'] for (d, color) in zip(degrees, colors): x_g, pred_g = predictions[d] ax.plot(x_g, pred_g, linewidth=1.5, color=color, label=f'Degree {d}') ax.set_xlabel('x') ax.set_ylabel('y') ax.set_title('Polynomial Fits of Varying Complexity') ax.legend(loc='upper right', fontsize=9) ax.set_ylim(-3, 3) plt.tight_layout() plt.show() # --- Part 3: Plot training vs test error --- degs = [r['degree'] for r in results] train_rmses = [r['train_rmse'] for r in results] test_rmses = [r['test_rmse'] for r in results] fig, ax = plt.subplots(figsize=(8, 5)) ax.plot(degs, train_rmses, 'o-', linewidth=2, label='Training RMSE', color='steelblue') ax.plot(degs, test_rmses, 'o-', linewidth=2, label='Test RMSE', color='darkorange') ax.set_xlabel('Polynomial Degree') ax.set_ylabel('RMSE') ax.set_title('Training vs Test RMSE by Polynomial Degree') ax.legend() plt.tight_layout() plt.show() # --- Part 4: Interpretation --- best_idx = np.argmin(test_rmses) best_degree = degs[best_idx] print(f"\nBest polynomial degree (lowest test RMSE): {best_degree}") print("\nInterpretation (Bias-Variance Tradeoff):") print("- Degree 1 (linear): High bias -- too simple to capture the sine curve.") print(" Underfits both training and test data.") print("- Degree 3-5: Good balance. Captures the main curvature of sin(x)") print(" without fitting noise. Test error is minimized here.") print("- Degree 10-20: High variance -- the polynomial wiggles to fit") print(" training noise. Training error drops but test error increases.") print(" This is classic overfitting.") print("\nThe optimal degree (~3-5) sits at the sweet spot of the bias-variance") print("tradeoff, where the model is flexible enough to capture the true") print("pattern but not so flexible that it memorizes noise.") ``` ::: ## Chapter 08: Trees and Ensembles ### Exercise 1 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch08-ex1-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 8, Exercise 1: Build and Prune a Classification Tree # 1. Fit a full, unpruned tree and count terminal nodes. # 2. Use cross-validation to find the optimal cp. # 3. Prune the tree and plot it. # 4. Identify the top 3 splitting variables. # ============================================================================= library(tidyverse) library(rpart) library(rpart.plot) # --- Simulate the readmission dataset (same as chapter) --- set.seed(42) n <- 1000 readmit_data <- tibble( age = rnorm(n, 68, 12), length_of_stay = rpois(n, 5) + 1, num_comorbidities = rpois(n, 3), prior_admissions = rpois(n, 1), discharge_hemoglobin = rnorm(n, 11, 2), discharge_creatinine = rlnorm(n, 0.2, 0.5), has_diabetes = rbinom(n, 1, 0.35), has_chf = rbinom(n, 1, 0.25), readmitted = factor( rbinom(n, 1, plogis(-3 + 0.02 * (rnorm(n, 68, 12) - 68) + 0.15 * rpois(n, 1) + 0.1 * rpois(n, 3) + 0.3 * rbinom(n, 1, 0.25) - 0.1 * rnorm(n, 11, 2))), labels = c("No", "Yes") ) ) cat("Readmission rate:", mean(readmit_data$readmitted == "Yes"), "\n") # --- Part 1: Fit a full, unpruned tree --- full_tree <- rpart(readmitted ~ ., data = readmit_data, method = "class", control = rpart.control(cp = 0.001, minsplit = 2, minbucket = 1)) n_terminal_full <- sum(full_tree$frame$var == "<leaf>") cat("\nFull tree terminal nodes:", n_terminal_full, "\n") # --- Part 2: Cross-validation to find optimal cp --- cat("\nCP Table:\n") printcp(full_tree) # Plot CV error vs cp plotcp(full_tree) # Find optimal cp (minimum xerror) cp_table <- full_tree$cptable optimal_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"] cat("\nOptimal cp:", optimal_cp, "\n") # Alternative: 1-SE rule (smallest tree within 1 SE of the minimum) min_xerror <- min(cp_table[, "xerror"]) min_se <- cp_table[which.min(cp_table[, "xerror"]), "xstd"] cp_1se <- cp_table[cp_table[, "xerror"] <= min_xerror + min_se, "CP"] optimal_cp_1se <- max(cp_1se) # largest cp (smallest tree) within 1 SE cat("Optimal cp (1-SE rule):", optimal_cp_1se, "\n") # --- Part 3: Prune and plot --- pruned_tree <- prune(full_tree, cp = optimal_cp) n_terminal_pruned <- sum(pruned_tree$frame$var == "<leaf>") cat("\nPruned tree terminal nodes:", n_terminal_pruned, "\n") rpart.plot(pruned_tree, type = 4, extra = 106, under = TRUE, box.palette = "RdYlGn", roundint = FALSE, main = "Pruned Decision Tree: 30-Day Readmission") # --- Part 4: Top 3 splitting variables --- cat("\nVariable Importance:\n") vi <- sort(full_tree$variable.importance, decreasing = TRUE) print(vi) top3 <- names(vi)[1:min(3, length(vi))] cat("\nTop 3 splitting variables:", paste(top3, collapse = ", "), "\n") cat("\nClinical interpretation:\n") cat("- These variables capture patient acuity and complexity.\n") cat("- Discharge lab values (hemoglobin, creatinine) reflect the patient's\n") cat(" clinical status at the time of discharge.\n") cat("- Age, comorbidity count, and prior admissions reflect overall\n") cat(" disease burden and frailty.\n") cat("- These are well-established risk factors for 30-day readmission\n") cat(" in the clinical literature.\n") ``` #### Python ```{python} #| label: sol-ch08-ex1-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 8, Exercise 1: Build and Prune a Classification Tree # 1. Fit trees with different max_depth values and count leaves. # 2. Cross-validate each to find optimal max_depth. # 3. Plot the optimal tree. # 4. Examine feature importances. # ============================================================================= import numpy as np import pandas as pd from sklearn.tree import DecisionTreeClassifier, plot_tree from sklearn.model_selection import cross_val_score, StratifiedKFold import matplotlib.pyplot as plt # --- Simulate the readmission dataset (same as chapter) --- np.random.seed(42) n = 1000 df = pd.DataFrame({ 'age': np.random.normal(68, 12, n), 'length_of_stay': np.random.poisson(5, n) + 1, 'num_comorbidities': np.random.poisson(3, n), 'prior_admissions': np.random.poisson(1, n), 'discharge_hemoglobin': np.random.normal(11, 2, n), 'discharge_creatinine': np.random.lognormal(0.2, 0.5, n), 'has_diabetes': np.random.binomial(1, 0.35, n), 'has_chf': np.random.binomial(1, 0.25, n) }) y = np.random.binomial(1, 0.18, n) feature_names = list(df.columns) print(f"Readmission rate: {y.mean():.3f}") # --- Part 1: Fit a full, unpruned tree --- full_tree = DecisionTreeClassifier(random_state=42) full_tree.fit(df, y) n_leaves_full = full_tree.get_n_leaves() print(f"\nFull (unpruned) tree: {n_leaves_full} terminal nodes") print(f"Full tree depth: {full_tree.get_depth()}") # --- Part 2: Cross-validation to find optimal max_depth --- depths = [2, 3, 4, 5, 7, 10, 15, 20, None] cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42) print("\nCross-validation results by max_depth:") print(f"{'max_depth':>10s} {'Mean AUC':>10s} {'Std AUC':>10s}") print("-" * 35) cv_results = [] for d in depths: tree = DecisionTreeClassifier(max_depth=d, random_state=42) scores = cross_val_score(tree, df, y, cv=cv, scoring='roc_auc') label = str(d) if d is not None else "None" cv_results.append({'max_depth': d, 'label': label, 'mean_auc': scores.mean(), 'std_auc': scores.std()}) print(f"{label:>10s} {scores.mean():>10.3f} {scores.std():>10.3f}") # Find the best depth best = max(cv_results, key=lambda x: x['mean_auc']) print(f"\nBest max_depth: {best['label']} (AUC = {best['mean_auc']:.3f})") # --- Part 3: Fit and plot the optimal tree --- best_depth = best['max_depth'] optimal_tree = DecisionTreeClassifier(max_depth=best_depth, random_state=42) optimal_tree.fit(df, y) print(f"\nOptimal tree: {optimal_tree.get_n_leaves()} terminal nodes") fig, ax = plt.subplots(figsize=(20, 10)) plot_tree(optimal_tree, feature_names=feature_names, class_names=['No', 'Yes'], filled=True, rounded=True, ax=ax, fontsize=9, max_depth=4) # show up to depth 4 for readability ax.set_title(f"Pruned Decision Tree (max_depth={best['label']})") plt.tight_layout() plt.show() # --- Part 4: Top 3 splitting variables --- importances = pd.Series(optimal_tree.feature_importances_, index=feature_names) importances = importances.sort_values(ascending=False) print("\nFeature Importances:") print(importances.to_string()) top3 = importances.head(3).index.tolist() print(f"\nTop 3 splitting variables: {top3}") # Plot feature importances fig, ax = plt.subplots(figsize=(8, 5)) importances.sort_values(ascending=True).plot(kind='barh', ax=ax, color='#2E86AB') ax.set_xlabel("Feature Importance (Gini)") ax.set_title("Decision Tree Feature Importance") plt.tight_layout() plt.show() print("\nClinical interpretation:") print("- These variables capture patient acuity and complexity.") print("- Discharge lab values reflect the patient's clinical status at discharge.") print("- Age, comorbidity count, and prior admissions reflect disease burden.") print("- These are well-established readmission risk factors in the literature.") ``` ::: ### Exercise 2 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch08-ex2-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 8, Exercise 2: Random Forest vs XGBoost Tuning Challenge # 1. Split data 80/20. 2. Tune RF and XGBoost with 5-fold CV. # 3. Select best hyperparameters. 4. Evaluate on test set. # 5. Create variable importance plots. # ============================================================================= library(tidyverse) library(tidymodels) library(vip) # --- Simulate the readmission dataset (same as chapter) --- set.seed(42) n <- 1000 readmit_data <- tibble( age = rnorm(n, 68, 12), length_of_stay = rpois(n, 5) + 1, num_comorbidities = rpois(n, 3), prior_admissions = rpois(n, 1), discharge_hemoglobin = rnorm(n, 11, 2), discharge_creatinine = rlnorm(n, 0.2, 0.5), has_diabetes = rbinom(n, 1, 0.35), has_chf = rbinom(n, 1, 0.25), readmitted = factor( rbinom(n, 1, plogis(-3 + 0.02 * (rnorm(n, 68, 12) - 68) + 0.15 * rpois(n, 1) + 0.1 * rpois(n, 3) + 0.3 * rbinom(n, 1, 0.25) - 0.1 * rnorm(n, 11, 2))), labels = c("No", "Yes") ) ) # --- Part 1: Train/test split --- set.seed(42) split <- initial_split(readmit_data, prop = 0.8, strata = readmitted) train <- training(split) test <- testing(split) cat("Training set size:", nrow(train), "\n") cat("Test set size:", nrow(test), "\n") # 5-fold CV on training set folds <- vfold_cv(train, v = 5, strata = readmitted) # Recipe (shared) base_recipe <- recipe(readmitted ~ ., data = train) # --- Part 2a: Tune Random Forest --- rf_spec <- rand_forest( trees = 500, mtry = tune(), min_n = tune() ) %>% set_engine("ranger", importance = "impurity") %>% set_mode("classification") rf_wf <- workflow() %>% add_model(rf_spec) %>% add_recipe(base_recipe) rf_grid <- grid_regular( mtry(range = c(2, 7)), min_n(range = c(5, 30)), levels = 5 ) set.seed(42) rf_tune <- tune_grid(rf_wf, resamples = folds, grid = rf_grid, metrics = metric_set(roc_auc)) cat("\n--- Random Forest Tuning Results (Top 5) ---\n") print(show_best(rf_tune, metric = "roc_auc", n = 5)) best_rf <- select_best(rf_tune, metric = "roc_auc") cat("\nBest RF params - mtry:", best_rf$mtry, "min_n:", best_rf$min_n, "\n") # --- Part 2b: Tune XGBoost --- xgb_spec <- boost_tree( trees = 500, tree_depth = tune(), learn_rate = tune(), min_n = tune() ) %>% set_engine("xgboost") %>% set_mode("classification") xgb_wf <- workflow() %>% add_model(xgb_spec) %>% add_recipe(base_recipe) xgb_grid <- grid_regular( tree_depth(range = c(2, 6)), learn_rate(range = c(-3, -1)), # log10 scale: 0.001 to 0.1 min_n(range = c(5, 20)), levels = 4 ) set.seed(42) xgb_tune <- tune_grid(xgb_wf, resamples = folds, grid = xgb_grid, metrics = metric_set(roc_auc)) cat("\n--- XGBoost Tuning Results (Top 5) ---\n") print(show_best(xgb_tune, metric = "roc_auc", n = 5)) best_xgb <- select_best(xgb_tune, metric = "roc_auc") cat("\nBest XGB params - depth:", best_xgb$tree_depth, "learn_rate:", best_xgb$learn_rate, "min_n:", best_xgb$min_n, "\n") # --- Part 3: Finalize and fit on full training set --- final_rf_wf <- finalize_workflow(rf_wf, best_rf) final_xgb_wf <- finalize_workflow(xgb_wf, best_xgb) rf_final_fit <- fit(final_rf_wf, data = train) xgb_final_fit <- fit(final_xgb_wf, data = train) # --- Part 4: Evaluate on test set --- rf_test_pred <- predict(rf_final_fit, test, type = "prob") %>% bind_cols(test %>% select(readmitted)) xgb_test_pred <- predict(xgb_final_fit, test, type = "prob") %>% bind_cols(test %>% select(readmitted)) rf_auc <- roc_auc(rf_test_pred, truth = readmitted, .pred_Yes) xgb_auc <- roc_auc(xgb_test_pred, truth = readmitted, .pred_Yes) cat("\n=== Test Set Performance ===\n") cat("Random Forest AUC:", rf_auc$.estimate, "\n") cat("XGBoost AUC: ", xgb_auc$.estimate, "\n") if (xgb_auc$.estimate > rf_auc$.estimate) { cat("\nXGBoost performs better on the test set.\n") } else { cat("\nRandom Forest performs better on the test set.\n") } # --- Part 5: Variable importance plots --- # Random Forest rf_vi <- rf_final_fit %>% extract_fit_parsnip() %>% vip(num_features = 8) + ggtitle("Random Forest Variable Importance") print(rf_vi) # XGBoost xgb_vi <- xgb_final_fit %>% extract_fit_parsnip() %>% vip(num_features = 8) + ggtitle("XGBoost Variable Importance") print(xgb_vi) cat("\nBoth models should generally agree on the most important features,\n") cat("though rankings may differ. Continuous variables with more possible\n") cat("split points (e.g., age, creatinine) often rank higher in tree-based\n") cat("importance measures than binary variables (e.g., has_chf).\n") ``` #### Python ```{python} #| label: sol-ch08-ex2-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 8, Exercise 2: Random Forest vs XGBoost Tuning Challenge # 1. Split data 80/20. 2. Tune RF and XGBoost with 5-fold CV. # 3. Select best hyperparameters. 4. Evaluate on test set. # 5. Create variable importance plots. # ============================================================================= import numpy as np import pandas as pd from sklearn.model_selection import (train_test_split, RandomizedSearchCV, StratifiedKFold) from sklearn.ensemble import RandomForestClassifier from sklearn.metrics import roc_auc_score import xgboost as xgb import matplotlib.pyplot as plt # --- Simulate the readmission dataset (same as chapter) --- np.random.seed(42) n = 1000 df = pd.DataFrame({ 'age': np.random.normal(68, 12, n), 'length_of_stay': np.random.poisson(5, n) + 1, 'num_comorbidities': np.random.poisson(3, n), 'prior_admissions': np.random.poisson(1, n), 'discharge_hemoglobin': np.random.normal(11, 2, n), 'discharge_creatinine': np.random.lognormal(0.2, 0.5, n), 'has_diabetes': np.random.binomial(1, 0.35, n), 'has_chf': np.random.binomial(1, 0.25, n) }) y = np.random.binomial(1, 0.18, n) feature_names = list(df.columns) # --- Part 1: Train/test split --- X_train, X_test, y_train, y_test = train_test_split( df, y, test_size=0.2, stratify=y, random_state=42 ) print(f"Training set: {X_train.shape[0]} | Test set: {X_test.shape[0]}") cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) # --- Part 2a: Tune Random Forest --- rf_param_dist = { 'n_estimators': [200, 500], 'max_features': ['sqrt', 'log2', 3, 5, 7], 'min_samples_leaf': [5, 10, 15, 20, 30], 'max_depth': [5, 10, 15, 20, None] } rf = RandomForestClassifier(random_state=42, n_jobs=-1) rf_search = RandomizedSearchCV( rf, rf_param_dist, n_iter=30, cv=cv, scoring='roc_auc', random_state=42, n_jobs=-1, verbose=0 ) rf_search.fit(X_train, y_train) print(f"\n--- Random Forest Tuning ---") print(f"Best CV AUC: {rf_search.best_score_:.3f}") print(f"Best params: {rf_search.best_params_}") # --- Part 2b: Tune XGBoost --- xgb_param_dist = { 'n_estimators': [200, 500, 1000], 'learning_rate': [0.01, 0.05, 0.1], 'max_depth': [2, 3, 4, 5, 6], 'subsample': [0.7, 0.8, 0.9, 1.0], 'colsample_bytree': [0.7, 0.8, 0.9, 1.0], 'min_child_weight': [1, 5, 10] } xgb_model = xgb.XGBClassifier( random_state=42, use_label_encoder=False, eval_metric='logloss' ) xgb_search = RandomizedSearchCV( xgb_model, xgb_param_dist, n_iter=30, cv=cv, scoring='roc_auc', random_state=42, n_jobs=-1, verbose=0 ) xgb_search.fit(X_train, y_train) print(f"\n--- XGBoost Tuning ---") print(f"Best CV AUC: {xgb_search.best_score_:.3f}") print(f"Best params: {xgb_search.best_params_}") # --- Part 4: Evaluate on test set --- rf_best = rf_search.best_estimator_ xgb_best = xgb_search.best_estimator_ rf_test_probs = rf_best.predict_proba(X_test)[:, 1] xgb_test_probs = xgb_best.predict_proba(X_test)[:, 1] rf_test_auc = roc_auc_score(y_test, rf_test_probs) xgb_test_auc = roc_auc_score(y_test, xgb_test_probs) print(f"\n=== Test Set Performance ===") print(f"Random Forest AUC: {rf_test_auc:.3f}") print(f"XGBoost AUC: {xgb_test_auc:.3f}") if xgb_test_auc > rf_test_auc: print("\nXGBoost performs better on the test set.") else: print("\nRandom Forest performs better on the test set.") # --- Part 5: Variable importance plots --- fig, axes = plt.subplots(1, 2, figsize=(14, 5)) # Random Forest importance rf_imp = pd.Series(rf_best.feature_importances_, index=feature_names) rf_imp = rf_imp.sort_values(ascending=True) rf_imp.plot(kind='barh', ax=axes[0], color='#2E86AB') axes[0].set_xlabel("Feature Importance (Gini)") axes[0].set_title("Random Forest Variable Importance") # XGBoost importance xgb_imp = pd.Series(xgb_best.feature_importances_, index=feature_names) xgb_imp = xgb_imp.sort_values(ascending=True) xgb_imp.plot(kind='barh', ax=axes[1], color='#D55E00') axes[1].set_xlabel("Feature Importance (Gain)") axes[1].set_title("XGBoost Variable Importance") plt.tight_layout() plt.show() # Compare top features print("\nRF top 3 features:", rf_imp.sort_values(ascending=False).head(3).index.tolist()) print("XGB top 3 features:", xgb_imp.sort_values(ascending=False).head(3).index.tolist()) print("\nBoth models should generally agree on the most important features,") print("though rankings may differ due to how each algorithm uses features.") ``` ::: ### Exercise 3 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch08-ex3-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 8, Exercise 3: Interpreting a Clinical Prediction Model # Conceptual exercise: write answers as comments. # ============================================================================= # ============================================================================= # Part 1: Non-technical summary for a hospital quality committee # ============================================================================= # # "We developed a computer-based prediction model that estimates each patient's # risk of being readmitted to the hospital within 30 days of discharge. The # model uses information routinely collected during hospitalisation -- such as # lab values, age, prior admissions, and existing medical conditions -- to # assign each patient a risk score between 0% and 100%. In cross-validated # testing, the model correctly distinguished between patients who were and # were not readmitted approximately [AUC]% of the time. This tool could help # care teams focus discharge planning resources on the patients at highest # risk, potentially reducing readmission rates and associated penalties." # ============================================================================= # Part 2: Explaining variable importance to a clinical audience # ============================================================================= # # "The model identified 'number of prior admissions' and 'discharge creatinine' # as the two variables that contribute most to the model's predictions. This # means that knowing these values provides the most useful information for # distinguishing patients who will be readmitted from those who will not. # # WHAT THIS MEANS: # - Patients with more prior admissions tend to have higher predicted risk. # - Patients with elevated discharge creatinine (indicating impaired kidney # function) also tend to have higher predicted risk. # - These findings align with clinical intuition: patients with a history of # recurrent hospitalisations and those with renal impairment are known to # be at elevated risk. # # WHAT THIS DOES NOT MEAN: # - Variable importance does NOT imply causation. We cannot say that reducing # creatinine at discharge will reduce readmission risk. The model identifies # associations, not causes. # - A variable with high importance may be a proxy for something else. For # example, 'prior admissions' may reflect underlying disease severity, # social determinants, or healthcare access patterns rather than being # a direct cause of readmission. # - We should not intervene on these variables based on importance alone. # Clinical trials or causal inference methods would be needed to establish # whether modifying these factors actually changes outcomes." # ============================================================================= # Part 3: Validation strategy before clinical deployment # ============================================================================= # # PROPOSED VALIDATION PLAN: # # 1. Temporal validation: Test the model on data from a time period AFTER the # training data (e.g., train on 2020-2022, validate on 2023-2024). This # tests whether the model's performance holds over time, as clinical # practices and patient populations may shift. # # 2. External validation: Apply the model to data from a different hospital # system. A model developed at one institution may not generalise to # another due to differences in patient demographics, coding practices, # discharge protocols, and local disease patterns. # # 3. Subgroup analysis: Evaluate performance across key demographic groups # (age, sex, race/ethnicity, insurance status). A model that performs # well overall but poorly for specific populations could worsen existing # health disparities. # # 4. Calibration assessment: Verify that predicted probabilities match # observed readmission rates. A model that says "30% risk" should be # right about 30% of the time across all risk levels. # # 5. Prospective pilot: Before full deployment, run the model in parallel # alongside current practice (silent mode) to monitor performance in # real-time without affecting clinical decisions. # # RISKS OF SKIPPING EXTERNAL VALIDATION: # - The model may be overfit to idiosyncrasies of the development data # (specific EMR system, local coding conventions, patient mix). # - Performance reported from internal validation (even cross-validation) # tends to be optimistically biased. # - Deploying an unvalidated model could misallocate resources, either # missing high-risk patients (false negatives) or overwhelming care # teams with false alarms (false positives). # - Regulatory and ethical risks: deploying a model without adequate # validation may violate institutional policies and could cause patient # harm. cat("This exercise is conceptual. See the comments in this file for the\n") cat("complete answers to all three parts.\n") ``` #### Python ```{python} #| label: sol-ch08-ex3-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 8, Exercise 3: Interpreting a Clinical Prediction Model # Conceptual exercise: write answers as comments. # ============================================================================= # ============================================================================= # Part 1: Non-technical summary for a hospital quality committee # ============================================================================= # # "We developed a computer-based prediction model that estimates each patient's # risk of being readmitted to the hospital within 30 days of discharge. The # model uses information routinely collected during hospitalisation -- such as # lab values, age, prior admissions, and existing medical conditions -- to # assign each patient a risk score between 0% and 100%. In cross-validated # testing, the model correctly distinguished between patients who were and # were not readmitted approximately [AUC]% of the time. This tool could help # care teams focus discharge planning resources on the patients at highest # risk, potentially reducing readmission rates and associated penalties." # ============================================================================= # Part 2: Explaining variable importance to a clinical audience # ============================================================================= # # "The model identified 'number of prior admissions' and 'discharge creatinine' # as the two variables that contribute most to the model's predictions. This # means that knowing these values provides the most useful information for # distinguishing patients who will be readmitted from those who will not. # # WHAT THIS MEANS: # - Patients with more prior admissions tend to have higher predicted risk. # - Patients with elevated discharge creatinine (indicating impaired kidney # function) also tend to have higher predicted risk. # - These findings align with clinical intuition: patients with a history of # recurrent hospitalisations and those with renal impairment are known to # be at elevated risk. # # WHAT THIS DOES NOT MEAN: # - Variable importance does NOT imply causation. We cannot say that reducing # creatinine at discharge will reduce readmission risk. The model identifies # associations, not causes. # - A variable with high importance may be a proxy for something else. For # example, 'prior admissions' may reflect underlying disease severity, # social determinants, or healthcare access patterns rather than being # a direct cause of readmission. # - We should not intervene on these variables based on importance alone. # Clinical trials or causal inference methods would be needed to establish # whether modifying these factors actually changes outcomes." # ============================================================================= # Part 3: Validation strategy before clinical deployment # ============================================================================= # # PROPOSED VALIDATION PLAN: # # 1. Temporal validation: Test the model on data from a time period AFTER the # training data (e.g., train on 2020-2022, validate on 2023-2024). This # tests whether the model's performance holds over time. # # 2. External validation: Apply the model to data from a different hospital # system. A model developed at one institution may not generalise to # another due to differences in patient demographics, coding practices, # discharge protocols, and local disease patterns. # # 3. Subgroup analysis: Evaluate performance across key demographic groups # (age, sex, race/ethnicity, insurance status). A model that performs # well overall but poorly for specific populations could worsen existing # health disparities. # # 4. Calibration assessment: Verify that predicted probabilities match # observed readmission rates. A model that says "30% risk" should be # right about 30% of the time across all risk levels. # # 5. Prospective pilot: Before full deployment, run the model in parallel # alongside current practice (silent mode) to monitor performance in # real-time without affecting clinical decisions. # # RISKS OF SKIPPING EXTERNAL VALIDATION: # - The model may be overfit to idiosyncrasies of the development data. # - Performance from internal validation tends to be optimistically biased. # - Deploying an unvalidated model could misallocate resources, either # missing high-risk patients or overwhelming care teams with false alarms. # - Regulatory and ethical risks: deploying without adequate validation may # violate institutional policies and could cause patient harm. print("This exercise is conceptual. See the comments in this file for the") print("complete answers to all three parts.") ``` ::: ## Chapter 08b: Introduction to Deep Learning ### Exercise 1 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch08b-ex1-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 8b, Exercise 1: Neural Network vs XGBoost on Tabular Data # Fit both a neural network and XGBoost on the readmission dataset. # Compare 5-fold cross-validated AUC. # ============================================================================= library(tidyverse) library(tidymodels) library(keras3) # --- Simulate readmission data (same as chapter) --- set.seed(42) n <- 1000 readmit_data <- tibble( age = rnorm(n, 68, 12), length_of_stay = rpois(n, 5) + 1, num_comorbidities = rpois(n, 3), prior_admissions = rpois(n, 1), discharge_hgb = rnorm(n, 11, 2), discharge_creatinine = rlnorm(n, 0.2, 0.5), has_diabetes = rbinom(n, 1, 0.35), has_chf = rbinom(n, 1, 0.25) ) readmit_prob <- plogis(-3 + 0.02 * (readmit_data$age - 68) + 0.15 * readmit_data$prior_admissions + 0.1 * readmit_data$num_comorbidities + 0.3 * readmit_data$has_chf - 0.1 * readmit_data$discharge_hgb) readmit_data$readmitted <- factor(rbinom(n, 1, readmit_prob), labels = c("No", "Yes")) cat("Readmission rate:", mean(readmit_data$readmitted == "Yes"), "\n") # --- XGBoost with 5-fold CV (using tidymodels) --- set.seed(42) folds <- vfold_cv(readmit_data, v = 5, strata = readmitted) xgb_spec <- boost_tree(trees = 500, tree_depth = 4, learn_rate = 0.05, min_n = 10) %>% set_engine("xgboost") %>% set_mode("classification") xgb_wf <- workflow() %>% add_model(xgb_spec) %>% add_recipe(recipe(readmitted ~ ., data = readmit_data)) xgb_res <- fit_resamples(xgb_wf, resamples = folds, metrics = metric_set(roc_auc)) xgb_metrics <- collect_metrics(xgb_res) cat("\nXGBoost CV AUC:", xgb_metrics$mean, "+/-", xgb_metrics$std_err, "\n") # --- Neural Network with 5-fold CV (manual loop) --- x_all <- readmit_data %>% select(-readmitted) %>% as.matrix() y_all <- as.numeric(readmit_data$readmitted == "Yes") # Standardize features x_mean <- apply(x_all, 2, mean) x_sd <- apply(x_all, 2, sd) x_scaled <- scale(x_all, center = x_mean, scale = x_sd) set.seed(42) fold_ids <- vfold_cv(readmit_data, v = 5, strata = readmitted) nn_aucs <- numeric(5) for (i in seq_len(5)) { # Get train/validation indices train_idx <- fold_ids$splits[[i]] %>% analysis() %>% rownames() %>% as.integer() val_idx <- fold_ids$splits[[i]] %>% assessment() %>% rownames() %>% as.integer() x_train <- x_scaled[train_idx, ] y_train <- y_all[train_idx] x_val <- x_scaled[val_idx, ] y_val <- y_all[val_idx] # Build neural network model <- keras_model_sequential(input_shape = ncol(x_train)) %>% layer_dense(units = 32, activation = "relu") %>% layer_dropout(rate = 0.3) %>% layer_dense(units = 16, activation = "relu") %>% layer_dropout(rate = 0.3) %>% layer_dense(units = 1, activation = "sigmoid") model %>% compile( optimizer = optimizer_adam(learning_rate = 0.001), loss = "binary_crossentropy", metrics = "AUC" ) # Train with early stopping history <- model %>% fit( x_train, y_train, epochs = 50, batch_size = 32, validation_data = list(x_val, y_val), callbacks = list( callback_early_stopping(patience = 5, restore_best_weights = TRUE) ), verbose = 0 ) # Evaluate results <- model %>% evaluate(x_val, y_val, verbose = 0) nn_aucs[i] <- results[[2]] # AUC metric cat(sprintf(" Fold %d: NN AUC = %.3f\n", i, nn_aucs[i])) } cat("\nNeural Network CV AUC:", mean(nn_aucs), "+/-", sd(nn_aucs) / sqrt(5), "\n") # --- Comparison --- cat("\n=== Comparison ===\n") cat("XGBoost CV AUC: ", round(xgb_metrics$mean, 3), "\n") cat("Neural Network CV AUC: ", round(mean(nn_aucs), 3), "\n") cat("\nInterpretation:\n") cat("XGBoost typically matches or outperforms neural networks on tabular\n") cat("clinical data. This is expected -- the Grinsztajn et al. (2022)\n") cat("NeurIPS benchmark showed that tree-based models consistently\n") cat("outperform neural networks on typical tabular datasets. Deep learning\n") cat("excels on images, text, and sequences, not spreadsheets.\n") ``` #### Python ```{python} #| label: sol-ch08b-ex1-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 8b, Exercise 1: Neural Network vs XGBoost on Tabular Data # Fit both a neural network and XGBoost on the readmission dataset. # Compare 5-fold cross-validated AUC. # ============================================================================= import numpy as np import pandas as pd from sklearn.model_selection import StratifiedKFold, cross_val_score from sklearn.preprocessing import StandardScaler from sklearn.metrics import roc_auc_score import xgboost as xgb import tensorflow as tf from tensorflow.keras import layers, models, callbacks # --- Simulate readmission data (same as chapter) --- np.random.seed(42) n = 1000 age = np.random.normal(68, 12, n) length_of_stay = np.random.poisson(5, n) + 1 num_comorbidities = np.random.poisson(3, n) prior_admissions = np.random.poisson(1, n) discharge_hgb = np.random.normal(11, 2, n) discharge_creatinine = np.random.lognormal(0.2, 0.5, n) has_diabetes = np.random.binomial(1, 0.35, n) has_chf = np.random.binomial(1, 0.25, n) X = np.column_stack([age, length_of_stay, num_comorbidities, prior_admissions, discharge_hgb, discharge_creatinine, has_diabetes, has_chf]) # Generate outcome with known logistic relationship prob = 1 / (1 + np.exp(-(-3 + 0.02 * (age - 68) + 0.15 * prior_admissions + 0.1 * num_comorbidities + 0.3 * has_chf - 0.1 * discharge_hgb))) y = np.random.binomial(1, prob) print(f"Readmission rate: {y.mean():.3f}") # --- XGBoost with 5-fold CV --- cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) xgb_model = xgb.XGBClassifier( n_estimators=500, learning_rate=0.05, max_depth=4, subsample=0.8, colsample_bytree=0.8, random_state=42, use_label_encoder=False, eval_metric='logloss' ) xgb_scores = cross_val_score(xgb_model, X, y, cv=cv, scoring='roc_auc') print(f"\nXGBoost CV AUC: {xgb_scores.mean():.3f} (+/- {xgb_scores.std():.3f})") # --- Neural Network with 5-fold CV (manual loop) --- nn_aucs = [] fold_idx = 0 for train_idx, val_idx in cv.split(X, y): fold_idx += 1 # Split and scale scaler = StandardScaler() X_train = scaler.fit_transform(X[train_idx]) X_val = scaler.transform(X[val_idx]) y_train = y[train_idx] y_val = y[val_idx] # Build neural network (same architecture as chapter) tf.random.set_seed(42) model = models.Sequential([ layers.Dense(32, activation="relu", input_shape=(X_train.shape[1],)), layers.Dropout(0.3), layers.Dense(16, activation="relu"), layers.Dropout(0.3), layers.Dense(1, activation="sigmoid") ]) model.compile( optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="binary_crossentropy", metrics=[tf.keras.metrics.AUC(name="auc")] ) # Train with early stopping history = model.fit( X_train, y_train, epochs=50, batch_size=32, validation_data=(X_val, y_val), callbacks=[ callbacks.EarlyStopping(patience=5, restore_best_weights=True) ], verbose=0 ) # Evaluate y_pred_prob = model.predict(X_val, verbose=0).ravel() fold_auc = roc_auc_score(y_val, y_pred_prob) nn_aucs.append(fold_auc) print(f" Fold {fold_idx}: NN AUC = {fold_auc:.3f}") nn_mean_auc = np.mean(nn_aucs) nn_std_auc = np.std(nn_aucs) print(f"\nNeural Network CV AUC: {nn_mean_auc:.3f} (+/- {nn_std_auc:.3f})") # --- Comparison --- print(f"\n=== Comparison ===") print(f"XGBoost CV AUC: {xgb_scores.mean():.3f}") print(f"Neural Network CV AUC: {nn_mean_auc:.3f}") print("\nInterpretation:") print("XGBoost typically matches or outperforms neural networks on tabular") print("clinical data. This is expected -- the Grinsztajn et al. (2022)") print("NeurIPS benchmark showed that tree-based models consistently") print("outperform neural networks on typical tabular datasets. Deep learning") print("excels on images, text, and sequences, not spreadsheets.") ``` ::: ### Exercise 2 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch08b-ex2-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 8b, Exercise 2: Architecture Matching # For each clinical task, identify the best DL architecture, whether DL # is likely to outperform gradient-boosted trees, and the reporting guideline. # ============================================================================= # ============================================================================= # Task 1: Predicting 30-day mortality from 15 structured EHR variables # ============================================================================= # # (a) Architecture: Feedforward neural network (multi-layer perceptron) or, # better yet, gradient-boosted trees (XGBoost/LightGBM). With only 15 # structured variables, a simple architecture suffices. # # (b) DL likely to outperform GBT? NO. This is classic tabular data with a # modest number of features. The Grinsztajn et al. (2022) NeurIPS # benchmark and subsequent clinical benchmarks consistently show that # tree-based models match or outperform neural networks on tabular data. # Logistic regression or XGBoost is the appropriate starting point. # # (c) Reporting guideline: TRIPOD+AI (Collins et al., BMJ 2024). This is a # standard clinical prediction model using structured data. # ============================================================================= # Task 2: Classifying skin lesions as benign/malignant from dermoscopy images # ============================================================================= # # (a) Architecture: Convolutional Neural Network (CNN), specifically a # pretrained model like ResNet, EfficientNet, or Inception fine-tuned # on the dermoscopy images. A domain-specific foundation model could # also be used if available. # # (b) DL likely to outperform GBT? YES. Image data contains spatial # structure (edges, textures, shapes) that CNNs are specifically designed # to exploit. GBTs cannot process raw images and would require manual # feature extraction, which is inferior to learned CNN features. Esteva # et al. (Nature 2017) demonstrated dermatologist-level performance. # # (c) Reporting guideline: CLAIM (Checklist for AI in Medical Imaging, # Mongan et al., Radiology: AI 2020), supplemented by TRIPOD+AI. # ============================================================================= # Task 3: Detecting atrial fibrillation from 12-lead ECG tracings # ============================================================================= # # (a) Architecture: Transformer-based model or temporal CNN. The chapter # notes that transformers are the current preferred architecture for ECG # data. Medformer (NeurIPS 2024) is specifically designed for medical # time series classification. # # (b) DL likely to outperform GBT? YES. ECG data is sequential with complex # temporal patterns. GBTs would require extensive manual feature # engineering (interval measurements, morphology features), whereas DL # can learn directly from the raw waveform. Hannun et al. (Nature # Medicine 2019) demonstrated cardiologist-level arrhythmia detection. # # (c) Reporting guideline: TRIPOD+AI for the prediction model, potentially # supplemented by CLAIM if imaging is involved (e.g., ECG images rather # than signal data). # ============================================================================= # Task 4: Extracting medication names from unstructured discharge summaries # ============================================================================= # # (a) Architecture: Transformer-based language model, such as ClinicalBERT # or PubMedBERT for named entity recognition (NER). These pretrained # models understand medical vocabulary and can be fine-tuned for NER. # # (b) DL likely to outperform GBT? YES, decisively. This is a natural # language processing task. GBTs cannot process raw text meaningfully. # Transformer-based models understand context, synonyms, and medical # abbreviations. Rule-based and dictionary approaches are alternatives, # but modern NER with transformers is superior. # # (c) Reporting guideline: TRIPOD+AI. No specific imaging guideline applies. # MINIMAR (Hernandez-Boussard et al., JAMIA 2020) may also be relevant # as a minimum reporting standard. # ============================================================================= # Task 5: Predicting length of stay from structured EHR + chest X-ray # ============================================================================= # # (a) Architecture: Multimodal fusion model. A CNN (e.g., pretrained # ResNet) processes the chest X-ray to extract image features. These # are concatenated with the structured EHR features and fed into a # combined prediction head (either a feedforward network or GBT on # the fused features). # # (b) DL likely to outperform GBT? PARTIALLY. The DL component is necessary # for the image. For the structured data alone, GBTs may be equal or # better. The optimal approach may be a hybrid: use a CNN to extract # image features, then combine those features with structured data in # a GBT. Recent work on multimodal fusion suggests this hybrid approach # can outperform either modality alone. # # (c) Reporting guideline: TRIPOD+AI for the overall prediction model, # supplemented by CLAIM for the imaging component. Both should be # addressed since the model involves medical imaging. cat("This exercise is conceptual. See the comments in this file for the\n") cat("complete answers to all five clinical tasks.\n") ``` #### Python ```{python} #| label: sol-ch08b-ex2-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 8b, Exercise 2: Architecture Matching # For each clinical task, identify the best DL architecture, whether DL # is likely to outperform gradient-boosted trees, and the reporting guideline. # ============================================================================= # ============================================================================= # Task 1: Predicting 30-day mortality from 15 structured EHR variables # ============================================================================= # # (a) Architecture: Feedforward neural network (MLP) or, better yet, # gradient-boosted trees (XGBoost/LightGBM). With only 15 structured # variables, a simple architecture suffices. # # (b) DL likely to outperform GBT? NO. This is classic tabular data with a # modest number of features. The Grinsztajn et al. (2022) NeurIPS # benchmark and subsequent clinical benchmarks consistently show that # tree-based models match or outperform neural networks on tabular data. # # (c) Reporting guideline: TRIPOD+AI (Collins et al., BMJ 2024). Standard # clinical prediction model using structured data. # ============================================================================= # Task 2: Classifying skin lesions as benign/malignant from dermoscopy images # ============================================================================= # # (a) Architecture: CNN (e.g., pretrained ResNet, EfficientNet, or Inception # fine-tuned on dermoscopy images). A domain-specific foundation model # could also be used if available. # # (b) DL likely to outperform GBT? YES. Image data contains spatial # structure that CNNs exploit. GBTs cannot process raw images. Esteva # et al. (Nature 2017) demonstrated dermatologist-level performance. # # (c) Reporting guideline: CLAIM (Mongan et al., Radiology: AI 2020), # supplemented by TRIPOD+AI. # ============================================================================= # Task 3: Detecting atrial fibrillation from 12-lead ECG tracings # ============================================================================= # # (a) Architecture: Transformer-based model or temporal CNN. Transformers # are the current preferred architecture for ECG data. Medformer # (NeurIPS 2024) is designed for medical time series classification. # # (b) DL likely to outperform GBT? YES. ECG data is sequential with complex # temporal patterns. DL can learn directly from raw waveforms, whereas # GBTs require extensive manual feature engineering. # # (c) Reporting guideline: TRIPOD+AI, potentially supplemented by CLAIM if # ECG images rather than signal data are used. # ============================================================================= # Task 4: Extracting medication names from unstructured discharge summaries # ============================================================================= # # (a) Architecture: Transformer-based language model (ClinicalBERT or # PubMedBERT) for named entity recognition (NER). These models # understand medical vocabulary and context. # # (b) DL likely to outperform GBT? YES, decisively. This is an NLP task. # GBTs cannot process raw text meaningfully. Transformer-based NER # models understand context, synonyms, and medical abbreviations. # # (c) Reporting guideline: TRIPOD+AI. MINIMAR (Hernandez-Boussard et al., # JAMIA 2020) also applicable as minimum reporting standard. # ============================================================================= # Task 5: Predicting length of stay from structured EHR + chest X-ray # ============================================================================= # # (a) Architecture: Multimodal fusion model. A CNN (pretrained ResNet) # extracts image features from the chest X-ray. These are concatenated # with structured EHR features and fed into a combined prediction head. # # (b) DL likely to outperform GBT? PARTIALLY. DL is necessary for the image # component. For structured data alone, GBTs may be equal or better. # A hybrid approach (CNN for image features, then GBT on fused features) # may be optimal. # # (c) Reporting guideline: TRIPOD+AI for the prediction model, supplemented # by CLAIM for the imaging component. print("This exercise is conceptual. See the comments in this file for the") print("complete answers to all five clinical tasks.") ``` ::: ### Exercise 3 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch08b-ex3-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 8b, Exercise 3: Critical Appraisal of a Deep Learning Study # Evaluate a DL paper against the CLAIM or TRIPOD+AI checklist. # This is a conceptual/guided exercise -- the template below provides # the framework for appraising any DL study. # ============================================================================= # ============================================================================= # INSTRUCTIONS: # Find a recent (2024 or later) paper applying deep learning to a clinical # task in your area of interest. Use this template to evaluate it. # ============================================================================= # ============================================================================= # Question 1: Was the model externally validated? # ============================================================================= # # Look for: # - Was the model tested on data from a DIFFERENT institution, time period, # or geographic region than the training data? # - If yes, how did external performance compare to internal (e.g., was there # a drop in AUC)? # # Example answer: # "The model was validated on data from Hospital B after training on Hospital A. # Internal AUC was 0.92; external AUC dropped to 0.84 -- a 0.08 decrease. # This is consistent with the systematic review finding that 81% of DL models # show decreased accuracy on external datasets." # # Red flag: If NO external validation was performed, the results should be # treated as preliminary. Internal CV alone is insufficient for clinical claims. # ============================================================================= # Question 2: Were subgroup analyses reported? # ============================================================================= # # Look for: # - Performance broken down by age, sex, race/ethnicity # - Any mention of fairness or equity analysis # - Performance in clinically important subgroups (e.g., patients with # comorbidities, different disease severity) # # Example answer: # "The paper reported AUC by sex (male: 0.90, female: 0.87) but did not # report performance by race/ethnicity or age group. Given the known issue # of fairness non-transferability across sites (Nature Medicine 2024), # this is a significant omission." # ============================================================================= # Question 3: Was the model compared to a simpler baseline? # ============================================================================= # # Look for: # - Comparison against logistic regression, random forest, or XGBoost # - If the DL model only marginally outperforms the baseline, the added # complexity may not be justified # # Example answer: # "The paper compared DL (AUC 0.91) to logistic regression (AUC 0.85) and # random forest (AUC 0.88). The improvement over RF is modest (0.03), # raising questions about whether the DL model's added complexity and # reduced interpretability are justified." # ============================================================================= # Question 4: Were training data, code, and model weights shared? # ============================================================================= # # Look for: # - Public dataset or data sharing agreement # - Code repository (GitHub, GitLab) # - Pretrained model weights available for download # - FAIR data principles # # Example answer: # "Code was shared on GitHub. The training data is from a private hospital # system and is not publicly available, though the authors describe a # data sharing agreement. Model weights were not released." # ============================================================================= # Question 5: How close is this model to clinical deployment? # ============================================================================= # # Consider: # - Has it been externally validated across multiple sites? # - Has it been tested prospectively (not just retrospectively)? # - Has a clinical workflow been designed for how it would be used? # - Has a regulatory pathway been identified (e.g., FDA 510(k), EU MDR)? # - Has a monitoring plan for post-deployment performance been described? # - What are the potential failure modes and harms? # # Example answer: # "This model is at an early research stage. While the internal results are # promising, the model has been validated at only one external site with a # modest sample. Before clinical deployment, I would want to see: # (1) multi-site external validation, # (2) a prospective pilot study in clinical workflow, # (3) subgroup analysis across demographics, # (4) calibration assessment, # (5) a monitoring plan for detecting model drift over time." cat("This exercise is a guided template for critical appraisal.\n") cat("See the comments for the framework to evaluate any DL paper.\n") cat("Students should find their own paper and fill in the answers.\n") ``` #### Python ```{python} #| label: sol-ch08b-ex3-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 8b, Exercise 3: Critical Appraisal of a Deep Learning Study # Evaluate a DL paper against the CLAIM or TRIPOD+AI checklist. # This is a conceptual/guided exercise -- the template below provides # the framework for appraising any DL study. # ============================================================================= # ============================================================================= # INSTRUCTIONS: # Find a recent (2024 or later) paper applying deep learning to a clinical # task in your area of interest. Use this template to evaluate it. # ============================================================================= # ============================================================================= # Question 1: Was the model externally validated? # ============================================================================= # # Look for: # - Was the model tested on data from a DIFFERENT institution, time period, # or geographic region than the training data? # - If yes, how did external performance compare to internal (e.g., was there # a drop in AUC)? # # Example answer: # "The model was validated on data from Hospital B after training on Hospital A. # Internal AUC was 0.92; external AUC dropped to 0.84 -- a 0.08 decrease. # This is consistent with the systematic review finding that 81% of DL models # show decreased accuracy on external datasets." # # Red flag: If NO external validation was performed, the results should be # treated as preliminary. # ============================================================================= # Question 2: Were subgroup analyses reported? # ============================================================================= # # Look for: # - Performance broken down by age, sex, race/ethnicity # - Any mention of fairness or equity analysis # - Performance in clinically important subgroups # # Example answer: # "The paper reported AUC by sex (male: 0.90, female: 0.87) but did not # report performance by race/ethnicity or age group. Given the known issue # of fairness non-transferability across sites, this is a significant # omission." # ============================================================================= # Question 3: Was the model compared to a simpler baseline? # ============================================================================= # # Look for: # - Comparison against logistic regression, random forest, or XGBoost # - If the DL model only marginally outperforms the baseline, the added # complexity may not be justified # # Example answer: # "The paper compared DL (AUC 0.91) to logistic regression (AUC 0.85) and # random forest (AUC 0.88). The improvement over RF is modest (0.03)." # ============================================================================= # Question 4: Were training data, code, and model weights shared? # ============================================================================= # # Look for: # - Public dataset or data sharing agreement # - Code repository (GitHub, GitLab) # - Pretrained model weights available # # Example answer: # "Code was shared on GitHub. Training data is from a private hospital # system. Model weights were not released." # ============================================================================= # Question 5: How close is this model to clinical deployment? # ============================================================================= # # Consider: # - Has it been externally validated across multiple sites? # - Has it been tested prospectively? # - Has a clinical workflow been designed? # - Has a regulatory pathway been identified? # - Has a monitoring plan been described? # # Example answer: # "This model is at an early research stage. Before clinical deployment, # I would want to see: # (1) multi-site external validation, # (2) a prospective pilot study, # (3) subgroup analysis across demographics, # (4) calibration assessment, # (5) a monitoring plan for model drift." print("This exercise is a guided template for critical appraisal.") print("See the comments for the framework to evaluate any DL paper.") print("Students should find their own paper and fill in the answers.") ``` ::: ## Chapter 09: Model Evaluation ### Exercise 1 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch09-ex1-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 9, Exercise 1: Bayes' Theorem in Practice (PPV Calculations) # Calculate PPV at different prevalence levels and plot the relationship. # ============================================================================= library(tidyverse) # --- PPV function using Bayes' theorem --- # PPV = (Sensitivity * Prevalence) / # (Sensitivity * Prevalence + (1 - Specificity) * (1 - Prevalence)) calculate_ppv <- function(sensitivity, specificity, prevalence) { numerator <- sensitivity * prevalence denominator <- numerator + (1 - specificity) * (1 - prevalence) return(numerator / denominator) } # --- Calculate NPV --- calculate_npv <- function(sensitivity, specificity, prevalence) { numerator <- specificity * (1 - prevalence) denominator <- numerator + (1 - sensitivity) * prevalence return(numerator / denominator) } # --- Example: HIV rapid test (sensitivity = 99.7%, specificity = 99.5%) --- # In a general population (prevalence ~ 0.4%) ppv_general <- calculate_ppv(0.997, 0.995, 0.004) npv_general <- calculate_npv(0.997, 0.995, 0.004) cat("=== HIV Rapid Test (Sens=99.7%, Spec=99.5%) ===\n\n") cat("General population (prevalence 0.4%):\n") cat(" PPV:", round(ppv_general * 100, 1), "%\n") cat(" NPV:", round(npv_general * 100, 1), "%\n") # In an STI clinic population (prevalence ~ 5%) ppv_clinic <- calculate_ppv(0.997, 0.995, 0.05) npv_clinic <- calculate_npv(0.997, 0.995, 0.05) cat("\nSTI clinic (prevalence 5%):\n") cat(" PPV:", round(ppv_clinic * 100, 1), "%\n") cat(" NPV:", round(npv_clinic * 100, 1), "%\n") # In a population with known exposure (prevalence ~ 30%) ppv_exposed <- calculate_ppv(0.997, 0.995, 0.30) npv_exposed <- calculate_npv(0.997, 0.995, 0.30) cat("\nKnown exposure (prevalence 30%):\n") cat(" PPV:", round(ppv_exposed * 100, 1), "%\n") cat(" NPV:", round(npv_exposed * 100, 1), "%\n") # --- Plot PPV across a range of prevalences --- prevalences <- seq(0.001, 0.5, by = 0.001) ppvs <- sapply(prevalences, function(p) calculate_ppv(0.997, 0.995, p)) npvs <- sapply(prevalences, function(p) calculate_npv(0.997, 0.995, p)) plot_df <- tibble( prevalence = rep(prevalences, 2), value = c(ppvs, npvs), metric = rep(c("PPV", "NPV"), each = length(prevalences)) ) ggplot(plot_df, aes(x = prevalence * 100, y = value * 100, color = metric)) + geom_line(linewidth = 1.5) + geom_hline(yintercept = 50, linetype = "dashed", color = "grey50") + scale_color_manual(values = c("PPV" = "steelblue", "NPV" = "darkorange")) + labs(x = "Prevalence (%)", y = "Predictive Value (%)", title = "PPV and NPV Depend Heavily on Prevalence", subtitle = "HIV rapid test: Sensitivity=99.7%, Specificity=99.5%", color = "Metric") + theme_minimal(base_size = 14) + theme(legend.position = "top") # --- Additional: Compare tests with different sensitivity/specificity --- cat("\n\n=== Comparing Tests at 1% Prevalence ===\n") tests <- tibble( Test = c("High Sens/Low Spec", "Balanced", "Low Sens/High Spec"), Sensitivity = c(0.99, 0.95, 0.80), Specificity = c(0.90, 0.95, 0.99) ) for (i in seq_len(nrow(tests))) { ppv <- calculate_ppv(tests$Sensitivity[i], tests$Specificity[i], 0.01) npv <- calculate_npv(tests$Sensitivity[i], tests$Specificity[i], 0.01) cat(sprintf("%s (Sens=%.0f%%, Spec=%.0f%%): PPV=%.1f%%, NPV=%.1f%%\n", tests$Test[i], tests$Sensitivity[i]*100, tests$Specificity[i]*100, ppv*100, npv*100)) } cat("\nKey takeaway: Even excellent tests have low PPV when prevalence is low.\n") cat("This is why screening should target high-risk populations.\n") ``` #### Python ```{python} #| label: sol-ch09-ex1-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 9, Exercise 1: Bayes' Theorem in Practice (PPV Calculations) # Calculate PPV at different prevalence levels and plot the relationship. # ============================================================================= import numpy as np import matplotlib.pyplot as plt # --- PPV function using Bayes' theorem --- def calculate_ppv(sensitivity, specificity, prevalence): """PPV = (Sens * Prev) / (Sens * Prev + (1-Spec) * (1-Prev))""" numerator = sensitivity * prevalence denominator = numerator + (1 - specificity) * (1 - prevalence) return numerator / denominator def calculate_npv(sensitivity, specificity, prevalence): """NPV = (Spec * (1-Prev)) / (Spec * (1-Prev) + (1-Sens) * Prev)""" numerator = specificity * (1 - prevalence) denominator = numerator + (1 - sensitivity) * prevalence return numerator / denominator # --- Example: HIV rapid test (sensitivity = 99.7%, specificity = 99.5%) --- print("=== HIV Rapid Test (Sens=99.7%, Spec=99.5%) ===\n") # General population (prevalence ~ 0.4%) ppv_general = calculate_ppv(0.997, 0.995, 0.004) npv_general = calculate_npv(0.997, 0.995, 0.004) print(f"General population (prevalence 0.4%):") print(f" PPV: {ppv_general*100:.1f}%") print(f" NPV: {npv_general*100:.1f}%") # STI clinic (prevalence ~ 5%) ppv_clinic = calculate_ppv(0.997, 0.995, 0.05) npv_clinic = calculate_npv(0.997, 0.995, 0.05) print(f"\nSTI clinic (prevalence 5%):") print(f" PPV: {ppv_clinic*100:.1f}%") print(f" NPV: {npv_clinic*100:.1f}%") # Known exposure (prevalence ~ 30%) ppv_exposed = calculate_ppv(0.997, 0.995, 0.30) npv_exposed = calculate_npv(0.997, 0.995, 0.30) print(f"\nKnown exposure (prevalence 30%):") print(f" PPV: {ppv_exposed*100:.1f}%") print(f" NPV: {npv_exposed*100:.1f}%") # --- Plot PPV and NPV across a range of prevalences --- prevalences = np.linspace(0.001, 0.5, 500) ppvs = [calculate_ppv(0.997, 0.995, p) for p in prevalences] npvs = [calculate_npv(0.997, 0.995, p) for p in prevalences] fig, ax = plt.subplots(figsize=(8, 5)) ax.plot(prevalences * 100, np.array(ppvs) * 100, lw=2, color="steelblue", label="PPV") ax.plot(prevalences * 100, np.array(npvs) * 100, lw=2, color="darkorange", label="NPV") ax.axhline(y=50, linestyle="--", color="grey", alpha=0.7) ax.set_xlabel("Prevalence (%)") ax.set_ylabel("Predictive Value (%)") ax.set_title("PPV and NPV Depend Heavily on Prevalence\n" "HIV rapid test: Sens=99.7%, Spec=99.5%") ax.legend() plt.tight_layout() plt.show() # --- Additional: Compare tests with different sensitivity/specificity --- print("\n=== Comparing Tests at 1% Prevalence ===") tests = [ ("High Sens/Low Spec", 0.99, 0.90), ("Balanced", 0.95, 0.95), ("Low Sens/High Spec", 0.80, 0.99), ] for name, sens, spec in tests: ppv = calculate_ppv(sens, spec, 0.01) npv = calculate_npv(sens, spec, 0.01) print(f"{name} (Sens={sens*100:.0f}%, Spec={spec*100:.0f}%): " f"PPV={ppv*100:.1f}%, NPV={npv*100:.1f}%") print("\nKey takeaway: Even excellent tests have low PPV when prevalence is low.") print("This is why screening should target high-risk populations.") ``` ::: ### Exercise 2 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch09-ex2-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 9, Exercise 2: Comparing ROC and Precision-Recall Curves # Simulate an imbalanced dataset (2% prevalence), plot ROC and PR curves, # and discuss why PR curves are more informative for rare outcomes. # ============================================================================= library(tidyverse) library(pROC) library(PRROC) # --- Simulate an imbalanced dataset (2% prevalence) --- set.seed(123) n <- 5000 true_outcome <- rbinom(n, 1, 0.02) # Simulate a moderately good model pred_prob <- plogis(rnorm(n, mean = -2 + 3 * true_outcome, sd = 1.5)) cat("Number of observations:", n, "\n") cat("Number of events:", sum(true_outcome), "\n") cat("Prevalence:", mean(true_outcome), "\n") # --- ROC Curve --- roc_obj <- roc(true_outcome, pred_prob, quiet = TRUE) auroc <- auc(roc_obj) par(mfrow = c(1, 2)) # Plot ROC plot(roc_obj, main = paste("ROC Curve\nAUROC =", round(auroc, 3)), col = "steelblue", lwd = 2, legacy.axes = TRUE) abline(0, 1, lty = 2, col = "grey50") # --- Precision-Recall Curve --- pr_obj <- pr.curve( scores.class0 = pred_prob[true_outcome == 1], scores.class1 = pred_prob[true_outcome == 0], curve = TRUE ) auprc <- pr_obj$auc.integral # Plot PR curve plot(pr_obj, main = paste("Precision-Recall Curve\nAUPRC =", round(auprc, 3)), color = "darkorange", lwd = 2) abline(h = mean(true_outcome), lty = 2, col = "grey50") par(mfrow = c(1, 1)) # --- Report metrics --- cat("\n=== Summary ===\n") cat("AUROC:", round(auroc, 3), "\n") cat("AUPRC:", round(auprc, 3), "\n") cat("Baseline AUPRC (prevalence):", round(mean(true_outcome), 3), "\n") # --- Find optimal threshold (Youden's J) --- coords_best <- coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity")) cat("\nOptimal threshold (Youden's J):", round(coords_best$threshold, 3), "\n") cat("Sensitivity:", round(coords_best$sensitivity, 3), "\n") cat("Specificity:", round(coords_best$specificity, 3), "\n") # PPV at this threshold pred_class <- ifelse(pred_prob >= coords_best$threshold, 1, 0) tp <- sum(pred_class == 1 & true_outcome == 1) fp <- sum(pred_class == 1 & true_outcome == 0) ppv_at_optimal <- tp / (tp + fp) cat("PPV at optimal threshold:", round(ppv_at_optimal, 3), "\n") # --- Interpretation --- cat("\n=== Interpretation ===\n") cat("The AUROC looks excellent (", round(auroc, 3), "), suggesting the model\n") cat("discriminates well. However, the AUPRC (", round(auprc, 3), ") reveals\n") cat("the real challenge: achieving high recall while maintaining reasonable\n") cat("precision is difficult with a 2% prevalence rate.\n\n") cat("At the Youden-optimal threshold, the PPV is only", round(ppv_at_optimal * 100, 1), "%.\n") cat("This means that even at the 'best' threshold, most positive predictions\n") cat("are false positives.\n\n") cat("KEY LESSON: For rare outcomes, always examine the PR curve alongside\n") cat("the ROC curve. AUROC can paint an overly optimistic picture because\n") cat("specificity is calculated over the large majority class.\n") ``` #### Python ```{python} #| label: sol-ch09-ex2-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 9, Exercise 2: Comparing ROC and Precision-Recall Curves # Simulate an imbalanced dataset (2% prevalence), plot ROC and PR curves, # and discuss why PR curves are more informative for rare outcomes. # ============================================================================= import numpy as np import matplotlib.pyplot as plt from sklearn.metrics import (roc_curve, roc_auc_score, precision_recall_curve, average_precision_score) from scipy.special import expit # --- Simulate an imbalanced dataset (2% prevalence) --- np.random.seed(123) n = 5000 true_outcome = np.random.binomial(1, 0.02, n) # Simulate a moderately good model pred_prob = expit(np.random.normal(-2 + 3 * true_outcome, 1.5)) print(f"Number of observations: {n}") print(f"Number of events: {true_outcome.sum()}") print(f"Prevalence: {true_outcome.mean():.3f}") # --- ROC Curve --- fpr, tpr, roc_thresholds = roc_curve(true_outcome, pred_prob) auroc = roc_auc_score(true_outcome, pred_prob) # --- Precision-Recall Curve --- precision, recall, pr_thresholds = precision_recall_curve(true_outcome, pred_prob) auprc = average_precision_score(true_outcome, pred_prob) prevalence = true_outcome.mean() # --- Plot side by side --- fig, axes = plt.subplots(1, 2, figsize=(14, 6)) # ROC curve axes[0].plot(fpr, tpr, color="steelblue", lw=2) axes[0].plot([0, 1], [0, 1], "--", color="grey") axes[0].set_title(f"ROC Curve (AUROC = {auroc:.3f})") axes[0].set_xlabel("False Positive Rate (1 - Specificity)") axes[0].set_ylabel("True Positive Rate (Sensitivity)") # PR curve axes[1].plot(recall, precision, color="darkorange", lw=2) axes[1].axhline(y=prevalence, linestyle="--", color="grey", label=f"Baseline (prevalence={prevalence:.3f})") axes[1].set_title(f"Precision-Recall Curve (AUPRC = {auprc:.3f})") axes[1].set_xlabel("Recall (Sensitivity)") axes[1].set_ylabel("Precision (PPV)") axes[1].legend() plt.tight_layout() plt.show() # --- Report metrics --- print(f"\n=== Summary ===") print(f"AUROC: {auroc:.3f}") print(f"AUPRC: {auprc:.3f}") print(f"Baseline AUPRC (prevalence): {prevalence:.3f}") # --- Find optimal threshold (Youden's J) --- j_scores = tpr - fpr best_idx = np.argmax(j_scores) optimal_threshold = roc_thresholds[best_idx] print(f"\nOptimal threshold (Youden's J): {optimal_threshold:.3f}") print(f"Sensitivity: {tpr[best_idx]:.3f}") print(f"Specificity: {1 - fpr[best_idx]:.3f}") # PPV at this threshold pred_class = (pred_prob >= optimal_threshold).astype(int) tp = ((pred_class == 1) & (true_outcome == 1)).sum() fp = ((pred_class == 1) & (true_outcome == 0)).sum() ppv_at_optimal = tp / (tp + fp) if (tp + fp) > 0 else 0 print(f"PPV at optimal threshold: {ppv_at_optimal:.3f}") # --- Interpretation --- print(f"\n=== Interpretation ===") print(f"The AUROC looks excellent ({auroc:.3f}), suggesting the model") print(f"discriminates well. However, the AUPRC ({auprc:.3f}) reveals") print(f"the real challenge: achieving high recall while maintaining") print(f"reasonable precision is difficult with a 2% prevalence rate.") print(f"\nAt the Youden-optimal threshold, the PPV is only {ppv_at_optimal*100:.1f}%.") print(f"This means that even at the 'best' threshold, most positive") print(f"predictions are false positives.") print(f"\nKEY LESSON: For rare outcomes, always examine the PR curve") print(f"alongside the ROC curve. AUROC can paint an overly optimistic") print(f"picture because specificity is calculated over the large") print(f"majority class.") ``` ::: ### Exercise 3 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch09-ex3-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 9, Exercise 3: Evaluating Fairness # Simulate data with two demographic groups, evaluate whether the model # performs equitably across groups. # ============================================================================= library(tidyverse) library(pROC) # --- Simulate data with two demographic groups --- set.seed(42) n <- 2000 group <- sample(c("A", "B"), n, replace = TRUE) # Group B has slightly different disease prevalence and predictor distribution true_outcome <- ifelse(group == "A", rbinom(n, 1, 0.10), rbinom(n, 1, 0.15)) # Model performs slightly worse for Group B (weaker signal) pred_prob <- ifelse(group == "A", plogis(rnorm(n, -1.5 + 2.5 * true_outcome, 1.0)), plogis(rnorm(n, -1.5 + 1.8 * true_outcome, 1.2))) df <- tibble(group = group, true_outcome = true_outcome, pred_prob = pred_prob) # --- 1. Calculate AUC by group --- cat("=== Discrimination (AUC) by Group ===\n") for (g in c("A", "B")) { idx <- df$group == g roc_g <- roc(df$true_outcome[idx], df$pred_prob[idx], quiet = TRUE) ci_g <- ci.auc(roc_g) cat(sprintf("Group %s: AUC = %.3f (95%% CI: %.3f - %.3f), Prevalence = %.1f%%\n", g, auc(roc_g), ci_g[1], ci_g[3], mean(df$true_outcome[idx]) * 100)) } # --- 2. Sensitivity and specificity at various thresholds --- cat("\n=== Performance at Different Thresholds ===\n") thresholds <- c(0.15, 0.20, 0.30, 0.40, 0.50) for (threshold in thresholds) { cat(sprintf("\nThreshold = %.2f:\n", threshold)) for (g in c("A", "B")) { idx <- df$group == g pred_class <- ifelse(df$pred_prob[idx] >= threshold, 1, 0) actual <- df$true_outcome[idx] tp <- sum(pred_class == 1 & actual == 1) fn <- sum(pred_class == 0 & actual == 1) tn <- sum(pred_class == 0 & actual == 0) fp <- sum(pred_class == 1 & actual == 0) sens <- ifelse(tp + fn > 0, tp / (tp + fn), NA) spec <- ifelse(tn + fp > 0, tn / (tn + fp), NA) ppv <- ifelse(tp + fp > 0, tp / (tp + fp), NA) cat(sprintf(" Group %s: Sens=%.3f Spec=%.3f PPV=%.3f (TP=%d FP=%d FN=%d TN=%d)\n", g, sens, spec, ppv, tp, fp, fn, tn)) } } # --- 3. Positive prediction rate (demographic parity) --- cat("\n=== Positive Prediction Rate (Demographic Parity) ===\n") threshold <- 0.30 for (g in c("A", "B")) { idx <- df$group == g pred_class <- ifelse(df$pred_prob[idx] >= threshold, 1, 0) pos_rate <- mean(pred_class) cat(sprintf("Group %s: %.1f%% predicted positive at threshold %.2f\n", g, pos_rate * 100, threshold)) } # --- 4. ROC curves overlaid --- roc_a <- roc(df$true_outcome[df$group == "A"], df$pred_prob[df$group == "A"], quiet = TRUE) roc_b <- roc(df$true_outcome[df$group == "B"], df$pred_prob[df$group == "B"], quiet = TRUE) plot(roc_a, col = "steelblue", lwd = 2, legacy.axes = TRUE, main = "ROC Curves by Demographic Group") plot(roc_b, col = "darkorange", lwd = 2, add = TRUE) legend("bottomright", legend = c(paste("Group A (AUC =", round(auc(roc_a), 3), ")"), paste("Group B (AUC =", round(auc(roc_b), 3), ")")), col = c("steelblue", "darkorange"), lwd = 2) # --- Interpretation --- cat("\n=== Interpretation ===\n") cat("1. The model shows different AUC values across groups, indicating\n") cat(" unequal discrimination. Group B (higher prevalence, noisier data)\n") cat(" has lower AUC than Group A.\n\n") cat("2. At any fixed threshold, sensitivity and specificity differ between\n") cat(" groups. This means a single threshold does not provide equitable\n") cat(" performance. Group-specific thresholds could equalize sensitivity\n") cat(" but would raise questions about fairness.\n\n") cat("3. Different positive prediction rates reflect both different prevalence\n") cat(" and different model performance. Demographic parity (equal positive\n") cat(" rates) may conflict with calibration if true prevalence differs.\n\n") cat("4. CLINICAL IMPLICATION: Before deploying a model, always evaluate\n") cat(" performance across demographic subgroups. A model that works well\n") cat(" 'on average' may perform poorly for specific populations, potentially\n") cat(" widening health disparities. Report subgroup-specific metrics.\n") ``` #### Python ```{python} #| label: sol-ch09-ex3-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 9, Exercise 3: Evaluating Fairness # Simulate data with two demographic groups, evaluate whether the model # performs equitably across groups. # ============================================================================= import numpy as np import matplotlib.pyplot as plt from sklearn.metrics import roc_auc_score, roc_curve from scipy.special import expit # --- Simulate data with two demographic groups --- np.random.seed(42) n = 2000 group = np.random.choice(["A", "B"], n) # Group B has slightly different disease prevalence true_outcome = np.where(group == "A", np.random.binomial(1, 0.10, n), np.random.binomial(1, 0.15, n)) # Model performs slightly worse for Group B (weaker signal, more noise) pred_prob = np.where(group == "A", expit(np.random.normal(-1.5 + 2.5 * true_outcome, 1.0)), expit(np.random.normal(-1.5 + 1.8 * true_outcome, 1.2))) # --- 1. Calculate AUC by group --- print("=== Discrimination (AUC) by Group ===") for g in ["A", "B"]: mask = group == g auc = roc_auc_score(true_outcome[mask], pred_prob[mask]) prev = true_outcome[mask].mean() print(f"Group {g}: AUC = {auc:.3f}, Prevalence = {prev*100:.1f}%") # --- 2. Sensitivity and specificity at various thresholds --- print("\n=== Performance at Different Thresholds ===") thresholds = [0.15, 0.20, 0.30, 0.40, 0.50] for threshold in thresholds: print(f"\nThreshold = {threshold:.2f}:") for g in ["A", "B"]: mask = group == g pred_class = (pred_prob[mask] >= threshold).astype(int) actual = true_outcome[mask] tp = ((pred_class == 1) & (actual == 1)).sum() fn = ((pred_class == 0) & (actual == 1)).sum() tn = ((pred_class == 0) & (actual == 0)).sum() fp = ((pred_class == 1) & (actual == 0)).sum() sens = tp / (tp + fn) if (tp + fn) > 0 else 0 spec = tn / (tn + fp) if (tn + fp) > 0 else 0 ppv = tp / (tp + fp) if (tp + fp) > 0 else 0 print(f" Group {g}: Sens={sens:.3f} Spec={spec:.3f} " f"PPV={ppv:.3f} (TP={tp} FP={fp} FN={fn} TN={tn})") # --- 3. Positive prediction rate (demographic parity) --- print("\n=== Positive Prediction Rate (Demographic Parity) ===") threshold = 0.30 for g in ["A", "B"]: mask = group == g pred_class = (pred_prob[mask] >= threshold).astype(int) pos_rate = pred_class.mean() print(f"Group {g}: {pos_rate*100:.1f}% predicted positive " f"at threshold {threshold:.2f}") # --- 4. ROC curves overlaid --- fig, ax = plt.subplots(figsize=(7, 7)) for g, color, label_prefix in [("A", "steelblue", "Group A"), ("B", "darkorange", "Group B")]: mask = group == g fpr, tpr, _ = roc_curve(true_outcome[mask], pred_prob[mask]) auc = roc_auc_score(true_outcome[mask], pred_prob[mask]) ax.plot(fpr, tpr, color=color, lw=2, label=f"{label_prefix} (AUC = {auc:.3f})") ax.plot([0, 1], [0, 1], "--", color="grey") ax.set_xlabel("False Positive Rate (1 - Specificity)") ax.set_ylabel("True Positive Rate (Sensitivity)") ax.set_title("ROC Curves by Demographic Group") ax.legend(loc="lower right") plt.tight_layout() plt.show() # --- Interpretation --- print("\n=== Interpretation ===") print("1. The model shows different AUC values across groups, indicating") print(" unequal discrimination. Group B (higher prevalence, noisier data)") print(" has lower AUC than Group A.") print("\n2. At any fixed threshold, sensitivity and specificity differ between") print(" groups. A single threshold does not provide equitable performance.") print(" Group-specific thresholds could equalize sensitivity but raise") print(" questions about fairness.") print("\n3. Different positive prediction rates reflect both different prevalence") print(" and different model performance. Demographic parity (equal positive") print(" rates) may conflict with calibration if true prevalence differs.") print("\n4. CLINICAL IMPLICATION: Before deploying a model, always evaluate") print(" performance across demographic subgroups. A model that works well") print(" 'on average' may perform poorly for specific populations, potentially") print(" widening health disparities. Report subgroup-specific metrics.") ``` ::: ## Chapter 11: Performance and Validation ### Exercise 1 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch11-ex1-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # Chapter 11, Exercise 1: Calibration Assessment # Using the stroke mortality model from the chapter library(rms) # ---- Simulate the stroke data (same as chapter) ---- set.seed(2024) n <- 1500 stroke_data <- data.frame( age = round(rnorm(n, 72, 12)), nihss = round(pmax(0, rnorm(n, 8, 6))), glucose = round(rnorm(n, 140, 50)), afib = rbinom(n, 1, 0.25), thrombolysis = rbinom(n, 1, 0.30) ) lp <- -5 + 0.04 * stroke_data$age + 0.12 * stroke_data$nihss + 0.003 * stroke_data$glucose + 0.3 * stroke_data$afib - 0.5 * stroke_data$thrombolysis stroke_data$death_30d <- rbinom(n, 1, plogis(lp)) # Fit the prediction model dd <- datadist(stroke_data) options(datadist = "dd") fit <- lrm(death_30d ~ age + nihss + glucose + afib + thrombolysis, data = stroke_data, x = TRUE, y = TRUE) pred_prob <- predict(fit, type = "fitted") # ---- (a) Calibration plot using deciles of predicted risk ---- cat("=== Part (a): Calibration Plot ===\n") val.prob(pred_prob, stroke_data$death_30d, m = 150, cex = 0.5, main = "Calibration Plot (Deciles): 30-Day Stroke Mortality") # The model appears well-calibrated since the points cluster near the diagonal. # This is expected because we are evaluating apparent performance on the # training data. # ---- (b) O:E ratio ---- cat("\n=== Part (b): O:E Ratio ===\n") obs_rate <- mean(stroke_data$death_30d) mean_pred <- mean(pred_prob) oe_ratio <- obs_rate / mean_pred cat("Observed event rate:", round(obs_rate, 3), "\n") cat("Mean predicted probability:", round(mean_pred, 3), "\n") cat("O:E ratio:", round(oe_ratio, 3), "\n") # An O:E ratio near 1.0 indicates good calibration-in-the-large. # The model's average predictions match the observed event rate. # ---- (c) Calibration slope ---- cat("\n=== Part (c): Calibration Slope ===\n") cal_model <- glm(stroke_data$death_30d ~ qlogis(pred_prob), family = binomial) cal_slope <- coef(cal_model)[2] cal_intercept <- coef(cal_model)[1] cat("Calibration slope:", round(cal_slope, 3), "\n") cat("Calibration intercept:", round(cal_intercept, 3), "\n") # A calibration slope of 1.0 indicates no overfitting. # The apparent slope is typically close to 1 on the training data. # Values < 1 on new data would suggest overfitting (predictions too extreme). # ---- (d) Bootstrap optimism correction ---- cat("\n=== Part (d): Bootstrap Optimism Correction ===\n") set.seed(42) val <- validate(fit, B = 200) cat("Bootstrap Validation Results:\n") print(val) # Extract optimism-corrected C-statistic dxy_corrected <- val["Dxy", "index.corrected"] c_apparent <- fit$stats["C"] c_corrected <- (dxy_corrected + 1) / 2 cat("\nApparent C-statistic:", round(c_apparent, 4), "\n") cat("Optimism-corrected C-statistic:", round(c_corrected, 4), "\n") cat("C-statistic decrease:", round(c_apparent - c_corrected, 4), "\n") # Calibration slope from bootstrap slope_corrected <- val["Slope", "index.corrected"] cat("\nApparent calibration slope: 1.000\n") cat("Optimism-corrected calibration slope:", round(slope_corrected, 4), "\n") cat("Slope decrease:", round(1 - slope_corrected, 4), "\n") # The C-statistic decreases slightly after optimism correction, reflecting # the mild optimism in apparent performance. The calibration slope also # decreases below 1, indicating some overfitting that inflates apparent # performance. ``` #### Python ```{python} #| label: sol-ch11-ex1-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # Chapter 11, Exercise 1: Calibration Assessment # Using the stroke mortality model from the chapter import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn.linear_model import LogisticRegression from sklearn.metrics import roc_auc_score, brier_score_loss from sklearn.calibration import calibration_curve from scipy.special import expit, logit # ---- Simulate the stroke data (same as chapter) ---- np.random.seed(2024) n = 1500 stroke_data = pd.DataFrame({ "age": np.round(np.random.normal(72, 12, n)), "nihss": np.round(np.maximum(0, np.random.normal(8, 6, n))), "glucose": np.round(np.random.normal(140, 50, n)), "afib": np.random.binomial(1, 0.25, n), "thrombolysis": np.random.binomial(1, 0.30, n) }) lp = (-5 + 0.04 * stroke_data["age"] + 0.12 * stroke_data["nihss"] + 0.003 * stroke_data["glucose"] + 0.3 * stroke_data["afib"] - 0.5 * stroke_data["thrombolysis"]) stroke_data["death_30d"] = np.random.binomial(1, expit(lp)) predictors = ["age", "nihss", "glucose", "afib", "thrombolysis"] X = stroke_data[predictors].values y = stroke_data["death_30d"].values model = LogisticRegression(max_iter=5000, random_state=42) model.fit(X, y) pred_prob = model.predict_proba(X)[:, 1] # ---- (a) Calibration plot using deciles ---- print("=== Part (a): Calibration Plot ===") prob_true, prob_pred = calibration_curve(y, pred_prob, n_bins=10, strategy="quantile") fig, ax = plt.subplots(figsize=(7, 6)) ax.plot(prob_pred, prob_true, "o-", color="steelblue", lw=2, markersize=8, label="Model") ax.plot([0, 1], [0, 1], "--", color="grey", label="Perfect calibration") ax.set_xlabel("Mean Predicted Probability") ax.set_ylabel("Observed Proportion") ax.set_title("Calibration Plot (Deciles): 30-Day Stroke Mortality") ax.legend() plt.tight_layout() plt.savefig("ch11_ex1_calibration.png", dpi=150) plt.show() # The points cluster near the diagonal, suggesting good apparent calibration. # ---- (b) O:E ratio ---- print("\n=== Part (b): O:E Ratio ===") obs_rate = y.mean() mean_pred = pred_prob.mean() oe_ratio = obs_rate / mean_pred print(f"Observed event rate: {obs_rate:.3f}") print(f"Mean predicted probability: {mean_pred:.3f}") print(f"O:E ratio: {oe_ratio:.3f}") # An O:E ratio near 1.0 indicates good calibration-in-the-large. # ---- (c) Calibration slope ---- print("\n=== Part (c): Calibration Slope ===") lp_pred = logit(np.clip(pred_prob, 1e-8, 1 - 1e-8)) cal_model = LogisticRegression(max_iter=5000, penalty=None) cal_model.fit(lp_pred.reshape(-1, 1), y) print(f"Calibration slope: {cal_model.coef_[0][0]:.3f}") print(f"Calibration intercept: {cal_model.intercept_[0]:.3f}") # A slope near 1.0 on training data is expected. Values < 1 on new data # indicate overfitting. # ---- (d) Bootstrap optimism correction ---- print("\n=== Part (d): Bootstrap Optimism Correction ===") np.random.seed(42) # Apparent performance apparent_auc = roc_auc_score(y, pred_prob) apparent_brier = brier_score_loss(y, pred_prob) n_boot = 200 optimism_auc = [] optimism_brier = [] for b in range(n_boot): boot_idx = np.random.choice(n, n, replace=True) X_boot, y_boot = X[boot_idx], y[boot_idx] boot_model = LogisticRegression(max_iter=5000, random_state=42) boot_model.fit(X_boot, y_boot) # Apparent on bootstrap sample pred_boot = boot_model.predict_proba(X_boot)[:, 1] auc_boot = roc_auc_score(y_boot, pred_boot) # Test on original data pred_orig = boot_model.predict_proba(X)[:, 1] auc_orig = roc_auc_score(y, pred_orig) optimism_auc.append(auc_boot - auc_orig) mean_opt_auc = np.mean(optimism_auc) corrected_auc = apparent_auc - mean_opt_auc print(f"Apparent C-statistic: {apparent_auc:.4f}") print(f"Mean optimism (AUC): {mean_opt_auc:.4f}") print(f"Optimism-corrected C-statistic:{corrected_auc:.4f}") print(f"C-statistic decrease: {mean_opt_auc:.4f}") # The C-statistic decreases slightly after correction, reflecting mild # optimism in the apparent performance. With 1500 observations and 5 # predictors, overfitting is modest. ``` ::: ### Exercise 2 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch11-ex2-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # Chapter 11, Exercise 2: External Validation Simulation # Create three external validation populations and assess model performance library(rms) library(pROC) # ---- Simulate development data and fit model (same as chapter) ---- set.seed(2024) n <- 1500 stroke_data <- data.frame( age = round(rnorm(n, 72, 12)), nihss = round(pmax(0, rnorm(n, 8, 6))), glucose = round(rnorm(n, 140, 50)), afib = rbinom(n, 1, 0.25), thrombolysis = rbinom(n, 1, 0.30) ) lp <- -5 + 0.04 * stroke_data$age + 0.12 * stroke_data$nihss + 0.003 * stroke_data$glucose + 0.3 * stroke_data$afib - 0.5 * stroke_data$thrombolysis stroke_data$death_30d <- rbinom(n, 1, plogis(lp)) dd <- datadist(stroke_data) options(datadist = "dd") fit <- lrm(death_30d ~ age + nihss + glucose + afib + thrombolysis, data = stroke_data, x = TRUE, y = TRUE) # ---- Helper function: evaluate model performance ---- evaluate_ext <- function(ext_data, fit, label) { ext_data$pred <- predict(fit, newdata = ext_data, type = "fitted") obs_rate <- mean(ext_data$death_30d) mean_pred <- mean(ext_data$pred) oe <- obs_rate / mean_pred # C-statistic roc_obj <- roc(ext_data$death_30d, ext_data$pred, quiet = TRUE) c_stat <- auc(roc_obj) # Calibration slope lp_ext <- qlogis(ext_data$pred) cal_model <- glm(death_30d ~ lp_ext, data = ext_data, family = binomial) cal_slope <- coef(cal_model)[2] cal_int <- coef(cal_model)[1] cat(sprintf("\n=== %s ===\n", label)) cat(sprintf(" N = %d, Observed mortality: %.3f\n", nrow(ext_data), obs_rate)) cat(sprintf(" Mean predicted: %.3f\n", mean_pred)) cat(sprintf(" C-statistic: %.3f\n", c_stat)) cat(sprintf(" O:E ratio: %.3f\n", oe)) cat(sprintf(" Calibration slope: %.3f\n", cal_slope)) cat(sprintf(" Calibration intercept: %.3f\n", cal_int)) return(ext_data) } # ---- Population A: Temporal validation (3 years later) ---- # Same demographics, slightly different practice patterns set.seed(101) n_a <- 800 pop_a <- data.frame( age = round(rnorm(n_a, 73, 12)), # Similar age nihss = round(pmax(0, rnorm(n_a, 8, 6))), glucose = round(rnorm(n_a, 138, 48)), afib = rbinom(n_a, 1, 0.27), thrombolysis = rbinom(n_a, 1, 0.40) # More thrombolysis over time ) lp_a <- -5 + 0.04 * pop_a$age + 0.12 * pop_a$nihss + 0.003 * pop_a$glucose + 0.3 * pop_a$afib - 0.5 * pop_a$thrombolysis pop_a$death_30d <- rbinom(n_a, 1, plogis(lp_a)) pop_a <- evaluate_ext(pop_a, fit, "Population A: Temporal (3 years later)") # ---- Population B: Geographical (younger patients) ---- set.seed(202) n_b <- 600 pop_b <- data.frame( age = round(rnorm(n_b, 62, 10)), # Younger nihss = round(pmax(0, rnorm(n_b, 6, 5))), # Lower severity glucose = round(rnorm(n_b, 130, 45)), afib = rbinom(n_b, 1, 0.15), # Less AF thrombolysis = rbinom(n_b, 1, 0.35) ) lp_b <- -5 + 0.04 * pop_b$age + 0.12 * pop_b$nihss + 0.003 * pop_b$glucose + 0.3 * pop_b$afib - 0.5 * pop_b$thrombolysis pop_b$death_30d <- rbinom(n_b, 1, plogis(lp_b)) pop_b <- evaluate_ext(pop_b, fit, "Population B: Geographical (younger patients)") # ---- Population C: Domain (primary care, lower severity) ---- set.seed(303) n_c <- 500 pop_c <- data.frame( age = round(rnorm(n_c, 68, 14)), nihss = round(pmax(0, rnorm(n_c, 3, 3))), # Much lower severity glucose = round(rnorm(n_c, 120, 35)), afib = rbinom(n_c, 1, 0.20), thrombolysis = rbinom(n_c, 1, 0.10) # Rarely used in primary care ) # Different outcome model: lower baseline risk in primary care lp_c <- -6 + 0.03 * pop_c$age + 0.10 * pop_c$nihss + 0.002 * pop_c$glucose + 0.2 * pop_c$afib - 0.3 * pop_c$thrombolysis pop_c$death_30d <- rbinom(n_c, 1, plogis(lp_c)) pop_c <- evaluate_ext(pop_c, fit, "Population C: Domain (primary care)") # ---- (b) Calibration plots ---- par(mfrow = c(1, 3), mar = c(4, 4, 3, 1)) val.prob(pop_a$pred, pop_a$death_30d, m = 80, cex = 0.5, main = "A: Temporal") val.prob(pop_b$pred, pop_b$death_30d, m = 60, cex = 0.5, main = "B: Geographical") val.prob(pop_c$pred, pop_c$death_30d, m = 50, cex = 0.5, main = "C: Domain") # ---- (c) Which population shows worst calibration? ---- cat("\n=== Part (c): Worst Calibration ===\n") cat("Population C (primary care / domain validation) shows the worst calibration.\n") cat("This is because the outcome model differs from the development setting:\n") cat(" - Different baseline risk (intercept)\n") cat(" - Different predictor-outcome relationships (coefficients)\n") cat(" - Different case-mix (lower severity patients)\n") cat("Domain validation is the most stringent test of transportability.\n") # ---- (d) Logistic recalibration ---- cat("\n=== Part (d): Logistic Recalibration ===\n") recalibrate <- function(ext_data, label) { lp_ext <- qlogis(ext_data$pred) cal_fit <- glm(death_30d ~ lp_ext, data = ext_data, family = binomial) ext_data$pred_recal <- predict(cal_fit, type = "response") oe_before <- mean(ext_data$death_30d) / mean(ext_data$pred) oe_after <- mean(ext_data$death_30d) / mean(ext_data$pred_recal) cat(sprintf("\n%s:\n", label)) cat(sprintf(" O:E before recalibration: %.3f\n", oe_before)) cat(sprintf(" O:E after recalibration: %.3f\n", oe_after)) } recalibrate(pop_a, "Population A") recalibrate(pop_b, "Population B") recalibrate(pop_c, "Population C") cat("\nLogistic recalibration corrects for differences in baseline risk\n") cat("and prediction spread, bringing O:E ratios closer to 1.0.\n") ``` #### Python ```{python} #| label: sol-ch11-ex2-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # Chapter 11, Exercise 2: External Validation Simulation # Create three external validation populations and assess model performance import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn.linear_model import LogisticRegression from sklearn.metrics import roc_auc_score, brier_score_loss from sklearn.calibration import calibration_curve from scipy.special import expit, logit # ---- Simulate development data and fit model (same as chapter) ---- np.random.seed(2024) n = 1500 stroke_data = pd.DataFrame({ "age": np.round(np.random.normal(72, 12, n)), "nihss": np.round(np.maximum(0, np.random.normal(8, 6, n))), "glucose": np.round(np.random.normal(140, 50, n)), "afib": np.random.binomial(1, 0.25, n), "thrombolysis": np.random.binomial(1, 0.30, n) }) lp = (-5 + 0.04 * stroke_data["age"] + 0.12 * stroke_data["nihss"] + 0.003 * stroke_data["glucose"] + 0.3 * stroke_data["afib"] - 0.5 * stroke_data["thrombolysis"]) stroke_data["death_30d"] = np.random.binomial(1, expit(lp)) predictors = ["age", "nihss", "glucose", "afib", "thrombolysis"] X = stroke_data[predictors].values y = stroke_data["death_30d"].values model = LogisticRegression(max_iter=5000, random_state=42) model.fit(X, y) # ---- Helper function ---- def evaluate_ext(ext_df, model, predictors, label): """Evaluate model on external data and return predictions.""" X_ext = ext_df[predictors].values y_ext = ext_df["death_30d"].values pred = model.predict_proba(X_ext)[:, 1] obs_rate = y_ext.mean() mean_pred = pred.mean() oe = obs_rate / mean_pred auc = roc_auc_score(y_ext, pred) # Calibration slope lp_ext = logit(np.clip(pred, 1e-8, 1 - 1e-8)) cal = LogisticRegression(max_iter=5000, penalty=None) cal.fit(lp_ext.reshape(-1, 1), y_ext) print(f"\n=== {label} ===") print(f" N = {len(y_ext)}, Observed mortality: {obs_rate:.3f}") print(f" Mean predicted: {mean_pred:.3f}") print(f" C-statistic: {auc:.3f}") print(f" O:E ratio: {oe:.3f}") print(f" Calibration slope: {cal.coef_[0][0]:.3f}") print(f" Calibration intercept: {cal.intercept_[0]:.3f}") return pred # ---- Population A: Temporal validation (3 years later) ---- np.random.seed(101) n_a = 800 pop_a = pd.DataFrame({ "age": np.round(np.random.normal(73, 12, n_a)), "nihss": np.round(np.maximum(0, np.random.normal(8, 6, n_a))), "glucose": np.round(np.random.normal(138, 48, n_a)), "afib": np.random.binomial(1, 0.27, n_a), "thrombolysis": np.random.binomial(1, 0.40, n_a) # More thrombolysis }) lp_a = (-5 + 0.04 * pop_a["age"] + 0.12 * pop_a["nihss"] + 0.003 * pop_a["glucose"] + 0.3 * pop_a["afib"] - 0.5 * pop_a["thrombolysis"]) pop_a["death_30d"] = np.random.binomial(1, expit(lp_a)) pred_a = evaluate_ext(pop_a, model, predictors, "Population A: Temporal") # ---- Population B: Geographical (younger patients) ---- np.random.seed(202) n_b = 600 pop_b = pd.DataFrame({ "age": np.round(np.random.normal(62, 10, n_b)), "nihss": np.round(np.maximum(0, np.random.normal(6, 5, n_b))), "glucose": np.round(np.random.normal(130, 45, n_b)), "afib": np.random.binomial(1, 0.15, n_b), "thrombolysis": np.random.binomial(1, 0.35, n_b) }) lp_b = (-5 + 0.04 * pop_b["age"] + 0.12 * pop_b["nihss"] + 0.003 * pop_b["glucose"] + 0.3 * pop_b["afib"] - 0.5 * pop_b["thrombolysis"]) pop_b["death_30d"] = np.random.binomial(1, expit(lp_b)) pred_b = evaluate_ext(pop_b, model, predictors, "Population B: Geographical") # ---- Population C: Domain (primary care, lower severity) ---- np.random.seed(303) n_c = 500 pop_c = pd.DataFrame({ "age": np.round(np.random.normal(68, 14, n_c)), "nihss": np.round(np.maximum(0, np.random.normal(3, 3, n_c))), "glucose": np.round(np.random.normal(120, 35, n_c)), "afib": np.random.binomial(1, 0.20, n_c), "thrombolysis": np.random.binomial(1, 0.10, n_c) }) # Different outcome model in primary care lp_c = (-6 + 0.03 * pop_c["age"] + 0.10 * pop_c["nihss"] + 0.002 * pop_c["glucose"] + 0.2 * pop_c["afib"] - 0.3 * pop_c["thrombolysis"]) pop_c["death_30d"] = np.random.binomial(1, expit(lp_c)) pred_c = evaluate_ext(pop_c, model, predictors, "Population C: Domain") # ---- (b) Calibration plots ---- fig, axes = plt.subplots(1, 3, figsize=(16, 5)) pops = [(pop_a, pred_a, "A: Temporal"), (pop_b, pred_b, "B: Geographical"), (pop_c, pred_c, "C: Domain")] for ax, (pop, pred, title) in zip(axes, pops): y_ext = pop["death_30d"].values prob_true, prob_pred = calibration_curve(y_ext, pred, n_bins=10, strategy="quantile") ax.plot(prob_pred, prob_true, "o-", color="steelblue", lw=2) ax.plot([0, 1], [0, 1], "--", color="grey") ax.set_xlabel("Predicted Probability") ax.set_ylabel("Observed Proportion") ax.set_title(title) plt.tight_layout() plt.savefig("ch11_ex2_calibration_plots.png", dpi=150) plt.show() # ---- (c) Worst calibration ---- print("\n=== Part (c): Worst Calibration ===") print("Population C (primary care / domain validation) shows worst calibration.") print("The outcome model differs from the development setting:") print(" - Different baseline risk and predictor-outcome relationships") print(" - Much lower severity patients (NIHSS ~ 3 vs 8)") print(" - Domain validation is the most stringent test of transportability.") # ---- (d) Logistic recalibration ---- print("\n=== Part (d): Logistic Recalibration ===") for pop, pred, label in pops: y_ext = pop["death_30d"].values lp_ext = logit(np.clip(pred, 1e-8, 1 - 1e-8)) recal = LogisticRegression(max_iter=5000, penalty=None) recal.fit(lp_ext.reshape(-1, 1), y_ext) pred_recal = recal.predict_proba(lp_ext.reshape(-1, 1))[:, 1] oe_before = y_ext.mean() / pred.mean() oe_after = y_ext.mean() / pred_recal.mean() print(f"\n{label}:") print(f" O:E before: {oe_before:.3f}") print(f" O:E after: {oe_after:.3f}") print("\nLogistic recalibration adjusts intercept and slope, bringing") print("O:E ratios closer to 1.0 in each external population.") ``` ::: ### Exercise 3 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch11-ex3-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # Chapter 11, Exercise 3: Decision Curve Interpretation # Conceptual exercise about a preeclampsia prediction model # This exercise is primarily conceptual. We simulate a plausible preeclampsia # model and generate a decision curve to support the discussion. set.seed(42) n <- 2000 # Simulate data for preeclampsia prediction preeclampsia_data <- data.frame( age = round(rnorm(n, 30, 5)), bmi = round(rnorm(n, 27, 5), 1), nulliparous = rbinom(n, 1, 0.4), history_pe = rbinom(n, 1, 0.05), map = round(rnorm(n, 85, 10)) # mean arterial pressure ) lp <- -5.5 + 0.03 * preeclampsia_data$age + 0.04 * preeclampsia_data$bmi + 0.5 * preeclampsia_data$nulliparous + 1.5 * preeclampsia_data$history_pe + 0.02 * preeclampsia_data$map preeclampsia_data$pe <- rbinom(n, 1, plogis(lp)) cat("Preeclampsia rate:", mean(preeclampsia_data$pe), "\n") # Fit model model <- glm(pe ~ age + bmi + nulliparous + history_pe + map, data = preeclampsia_data, family = binomial) pred_prob <- predict(model, type = "response") y_obs <- preeclampsia_data$pe # ---- Generate decision curve ---- thresholds <- seq(0.01, 0.50, by = 0.01) n_total <- length(y_obs) nb_model <- nb_all <- numeric(length(thresholds)) for (i in seq_along(thresholds)) { t <- thresholds[i] pred_pos <- as.numeric(pred_prob >= t) tp <- sum(pred_pos == 1 & y_obs == 1) fp <- sum(pred_pos == 1 & y_obs == 0) nb_model[i] <- tp / n_total - fp / n_total * (t / (1 - t)) nb_all[i] <- mean(y_obs) - (1 - mean(y_obs)) * (t / (1 - t)) } plot(thresholds, nb_model, type = "l", lwd = 2, col = "steelblue", main = "Decision Curve: Preeclampsia Prediction Model\n(AUC ~ 0.82)", xlab = "Threshold Probability", ylab = "Net Benefit", ylim = c(-0.05, max(c(nb_model, nb_all)) * 1.1)) lines(thresholds, nb_all, lwd = 2, col = "coral", lty = 2) abline(h = 0, col = "black") legend("topright", legend = c("Model", "Treat All", "Treat None"), col = c("steelblue", "coral", "black"), lty = c(1, 2, 1), lwd = 2) # ---- Part (a): Range of useful thresholds ---- cat("\n=== Part (a): Range of Useful Thresholds ===\n") cat("The model is useful at thresholds where its net benefit curve\n") cat("exceeds BOTH the 'treat all' and 'treat none' lines.\n") cat("From the decision curve, the model appears useful approximately\n") cat("in the range of threshold probabilities from about 2% to 30-40%.\n") cat("Below ~2%, 'treat all' has similar or higher net benefit.\n") cat("Above ~30-40%, the model's net benefit approaches zero.\n") # ---- Part (b): Counter-argument to 'AUC is only 0.82' ---- cat("\n=== Part (b): Counter-argument ===\n") cat("An AUC of 0.82 is not 'only' -- it is strong discrimination.\n") cat("More importantly, the AUC does not directly tell you whether\n") cat("using the model improves clinical decisions.\n\n") cat("The decision curve shows that across the clinically relevant\n") cat("threshold range (e.g., 5-20%), the model provides positive\n") cat("net benefit above both default strategies.\n\n") cat("This means that using the model to guide closer monitoring\n") cat("decisions would identify more true preeclampsia cases per\n") cat("'unnecessary' monitoring than either monitoring everyone or\n") cat("monitoring no one. Clinical utility is what matters for\n") cat("patient care -- not the AUC alone.\n") # ---- Part (c): Effect of action cost on the decision curve ---- cat("\n=== Part (c): Effect of Action Cost ===\n") cat("The threshold probability reflects the implicit cost-benefit\n") cat("ratio of the action (closer monitoring).\n\n") cat("LOW-COST action (e.g., closer monitoring, extra visits):\n") cat(" - Clinicians would use a LOW threshold (e.g., 3-5%)\n") cat(" - They accept many false positives to avoid missing cases\n") cat(" - The relevant region of the decision curve shifts LEFT\n") cat(" - 'Treat all' remains competitive at low thresholds\n\n") cat("HIGH-COST action (e.g., prophylactic medication with side effects):\n") cat(" - Clinicians would use a HIGHER threshold (e.g., 15-25%)\n") cat(" - They want more certainty before acting\n") cat(" - The relevant region shifts RIGHT\n") cat(" - The model has more value here because it avoids unnecessary\n") cat(" treatment of low-risk patients\n\n") cat("In summary: the decision curve itself does not change, but the\n") cat("clinically relevant REGION changes depending on the cost of action.\n") ``` #### Python ```{python} #| label: sol-ch11-ex3-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # Chapter 11, Exercise 3: Decision Curve Interpretation # Conceptual exercise about a preeclampsia prediction model import numpy as np import matplotlib.pyplot as plt from sklearn.linear_model import LogisticRegression from sklearn.metrics import roc_auc_score from scipy.special import expit # Simulate a plausible preeclampsia model to generate a decision curve np.random.seed(42) n = 2000 age = np.round(np.random.normal(30, 5, n)) bmi = np.round(np.random.normal(27, 5, n), 1) nulliparous = np.random.binomial(1, 0.4, n) history_pe = np.random.binomial(1, 0.05, n) map_val = np.round(np.random.normal(85, 10, n)) lp = (-5.5 + 0.03 * age + 0.04 * bmi + 0.5 * nulliparous + 1.5 * history_pe + 0.02 * map_val) pe = np.random.binomial(1, expit(lp)) print(f"Preeclampsia rate: {pe.mean():.3f}") X = np.column_stack([age, bmi, nulliparous, history_pe, map_val]) model = LogisticRegression(max_iter=5000, random_state=42) model.fit(X, pe) pred_prob = model.predict_proba(X)[:, 1] print(f"AUC: {roc_auc_score(pe, pred_prob):.3f}") # ---- Generate decision curve ---- thresholds = np.arange(0.01, 0.51, 0.01) n_total = len(pe) def net_benefit(y_true, y_pred, threshold): pred_pos = (y_pred >= threshold).astype(int) tp = np.sum((pred_pos == 1) & (y_true == 1)) fp = np.sum((pred_pos == 1) & (y_true == 0)) return tp / n_total - fp / n_total * (threshold / (1 - threshold)) nb_model = [net_benefit(pe, pred_prob, t) for t in thresholds] nb_all = [pe.mean() - (1 - pe.mean()) * t / (1 - t) for t in thresholds] plt.figure(figsize=(9, 6)) plt.plot(thresholds, nb_model, color="steelblue", lw=2, label="Prediction Model") plt.plot(thresholds, nb_all, color="coral", lw=2, ls="--", label="Treat All") plt.axhline(y=0, color="black", lw=1, label="Treat None") plt.xlabel("Threshold Probability") plt.ylabel("Net Benefit") plt.title("Decision Curve: Preeclampsia Prediction Model (AUC ~ 0.82)") plt.legend() plt.ylim(-0.05, max(max(nb_model), max(nb_all)) * 1.1) plt.tight_layout() plt.savefig("ch11_ex3_decision_curve.png", dpi=150) plt.show() # ---- Part (a): Range of useful thresholds ---- print("\n=== Part (a): Range of Useful Thresholds ===") print("The model is useful at thresholds where its net benefit curve") print("exceeds BOTH 'treat all' and 'treat none'.") print("From the decision curve, the model is useful approximately") print("from about 2% to 30-40% threshold probability.") print("Below ~2%, 'treat all' has similar or higher net benefit.") print("Above ~30-40%, the model's net benefit approaches zero.") # ---- Part (b): Counter-argument to 'AUC is only 0.82' ---- print("\n=== Part (b): Counter-argument ===") print("An AUC of 0.82 is strong discrimination, not 'only'.") print("More importantly, AUC does not directly measure clinical utility.") print("") print("The decision curve shows the model provides positive net benefit") print("above both default strategies across the clinically relevant") print("threshold range (e.g., 5-20%). This means using the model to") print("guide monitoring decisions identifies more true preeclampsia") print("cases per 'unnecessary' monitoring episode than either monitoring") print("everyone or no one. Clinical utility -- not AUC -- determines") print("whether a model should be used in practice.") # ---- Part (c): Effect of action cost ---- print("\n=== Part (c): Effect of Action Cost ===") print("The threshold probability reflects the cost-benefit ratio of acting.") print("") print("LOW-COST action (closer monitoring, extra visits):") print(" - Clinicians use a LOW threshold (e.g., 3-5%)") print(" - Many false positives are tolerable") print(" - The relevant region of the decision curve is on the LEFT") print(" - 'Treat all' remains competitive at low thresholds") print("") print("HIGH-COST action (prophylactic medication with side effects):") print(" - Clinicians use a HIGHER threshold (e.g., 15-25%)") print(" - More certainty is needed before acting") print(" - The relevant region shifts RIGHT") print(" - The model adds more value by avoiding unnecessary treatment") print("") print("The curve itself does not change, but the clinically relevant") print("region changes depending on the cost of the follow-up action.") ``` ::: ### Exercise 4 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch11-ex4-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # Chapter 11, Exercise 4: Complete Evaluation of a Published Model # Example: QRISK3 Cardiovascular Risk Prediction Model # # This exercise is primarily a literature review exercise. Below we provide # a structured answer based on published validation data for QRISK3. # ---- Part (a): C-statistic with confidence interval ---- cat("=== Part (a): C-statistic ===\n") cat("QRISK3 (Hippisley-Cox et al., 2017, BMJ) was developed to predict\n") cat("10-year cardiovascular disease risk.\n\n") cat("Development cohort:\n") cat(" C-statistic (women): 0.880 (95% CI: 0.878-0.882)\n") cat(" C-statistic (men): 0.858 (95% CI: 0.856-0.860)\n\n") cat("Validation cohort (separate 25% held out):\n") cat(" C-statistic (women): 0.879 (95% CI: 0.876-0.882)\n") cat(" C-statistic (men): 0.858 (95% CI: 0.855-0.861)\n") cat("The discrimination is excellent and stable between development\n") cat("and validation sets.\n") # ---- Part (b): Calibration ---- cat("\n=== Part (b): Calibration ===\n") cat("QRISK3 provides calibration plots in the original publication.\n") cat("In the validation cohort:\n") cat(" - The calibration was generally good, with predicted risks\n") cat(" closely matching observed event rates across deciles.\n") cat(" - Some overestimation of risk was observed at higher risk\n") cat(" levels, particularly in older age groups.\n") cat(" - External validations in different populations (e.g., outside\n") cat(" the UK) have shown variable calibration, often with\n") cat(" overestimation in lower-risk populations.\n") # ---- Part (c): Decision curve analysis ---- cat("\n=== Part (c): Decision Curve Analysis ===\n") cat("The original QRISK3 paper does not include decision curve analysis.\n\n") cat("Clinically relevant threshold range for statin initiation:\n") cat(" - UK NICE guidelines: 10% 10-year CVD risk threshold\n") cat(" - US ACC/AHA guidelines: 7.5% threshold\n") cat(" - A decision curve would be most informative in the 5-20%\n") cat(" threshold range, where the decision to start statins is\n") cat(" most uncertain.\n") cat(" - Below 5%, most clinicians would not start statins\n") cat(" - Above 20%, most clinicians would start statins regardless\n") # ---- Part (d): External validation ---- cat("\n=== Part (d): External Validation ===\n") cat("QRISK3 has been validated in multiple populations:\n") cat(" - Internal-external: UK CPRD data (separate time period)\n") cat(" - Geographic: Various European populations\n") cat(" - Ethnic subgroups: South Asian, Black, Chinese populations\n") cat(" - Temporal: Validated across different calendar periods\n\n") cat("Key findings from external validations:\n") cat(" - Discrimination generally remains good (C > 0.80)\n") cat(" - Calibration can deteriorate in non-UK populations\n") cat(" - May overestimate risk in populations with lower baseline\n") cat(" cardiovascular event rates\n") # ---- Part (e): Recommendation for implementation ---- cat("\n=== Part (e): Recommendation ===\n") cat("Recommendation depends on the clinical setting:\n\n") cat("FOR a UK primary care setting:\n") cat(" - YES, recommend implementation. QRISK3 was developed and\n") cat(" validated in UK primary care, has excellent discrimination,\n") cat(" good calibration, and is already integrated into UK\n") cat(" clinical guidelines.\n\n") cat("FOR a non-UK setting:\n") cat(" - CONDITIONAL recommendation. Would first require:\n") cat(" 1. External validation in the local population\n") cat(" 2. Assessment of calibration (likely needs recalibration)\n") cat(" 3. Decision curve analysis at locally relevant thresholds\n") cat(" 4. Comparison with locally developed risk scores\n") cat(" - If calibration is poor, logistic recalibration should be\n") cat(" considered before implementation.\n") ``` #### Python ```{python} #| label: sol-ch11-ex4-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # Chapter 11, Exercise 4: Complete Evaluation of a Published Model # Example: QRISK3 Cardiovascular Risk Prediction Model # # This exercise is primarily a literature review exercise. Below we provide # a structured answer based on published validation data for QRISK3. # ---- Part (a): C-statistic with confidence interval ---- print("=== Part (a): C-statistic ===") print("QRISK3 (Hippisley-Cox et al., 2017, BMJ) was developed to predict") print("10-year cardiovascular disease risk.") print() print("Development cohort:") print(" C-statistic (women): 0.880 (95% CI: 0.878-0.882)") print(" C-statistic (men): 0.858 (95% CI: 0.856-0.860)") print() print("Validation cohort (separate 25% held out):") print(" C-statistic (women): 0.879 (95% CI: 0.876-0.882)") print(" C-statistic (men): 0.858 (95% CI: 0.855-0.861)") print("The discrimination is excellent and stable between development") print("and validation sets.") # ---- Part (b): Calibration ---- print("\n=== Part (b): Calibration ===") print("QRISK3 provides calibration plots in the original publication.") print("In the validation cohort:") print(" - Calibration was generally good, with predicted risks closely") print(" matching observed event rates across deciles.") print(" - Some overestimation at higher risk levels, particularly in") print(" older age groups.") print(" - External validations outside the UK have shown variable") print(" calibration, often with overestimation in lower-risk populations.") # ---- Part (c): Decision curve analysis ---- print("\n=== Part (c): Decision Curve Analysis ===") print("The original QRISK3 paper does not include decision curve analysis.") print() print("Clinically relevant threshold range for statin initiation:") print(" - UK NICE guidelines: 10% 10-year CVD risk threshold") print(" - US ACC/AHA guidelines: 7.5% threshold") print(" - A decision curve would be most informative in the 5-20%") print(" threshold range, where the statin decision is most uncertain.") print(" - Below 5%, most clinicians would not start statins") print(" - Above 20%, most clinicians would start statins regardless") # ---- Part (d): External validation ---- print("\n=== Part (d): External Validation ===") print("QRISK3 has been validated in multiple populations:") print(" - Internal-external: UK CPRD data (separate time period)") print(" - Geographic: Various European populations") print(" - Ethnic subgroups: South Asian, Black, Chinese populations") print(" - Temporal: Validated across different calendar periods") print() print("Key findings from external validations:") print(" - Discrimination generally remains good (C > 0.80)") print(" - Calibration can deteriorate in non-UK populations") print(" - May overestimate risk in populations with lower baseline") print(" cardiovascular event rates") # ---- Part (e): Recommendation ---- print("\n=== Part (e): Recommendation ===") print("Recommendation depends on the clinical setting:") print() print("FOR a UK primary care setting:") print(" YES, recommend implementation. QRISK3 was developed and") print(" validated in UK primary care, has excellent discrimination,") print(" good calibration, and is integrated into UK clinical guidelines.") print() print("FOR a non-UK setting:") print(" CONDITIONAL recommendation. Would first require:") print(" 1. External validation in the local population") print(" 2. Assessment of calibration (likely needs recalibration)") print(" 3. Decision curve analysis at locally relevant thresholds") print(" 4. Comparison with locally developed risk scores") print(" If calibration is poor, logistic recalibration should be") print(" considered before implementation.") ``` ::: ## Chapter 13: Bayesian Inference ### Exercise 1 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch13-ex1-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # Chapter 13, Exercise 1: Diagnostic Test Updating # Sequential Bayesian updating with mammogram and ultrasound # ---- Part (a): Posterior probability after positive mammogram ---- cat("=== Part (a): Positive Mammogram ===\n") sens_mammo <- 0.90 # Sensitivity of mammogram spec_mammo <- 0.92 # Specificity of mammogram prevalence <- 0.02 # Prior probability (prevalence in women aged 50-59) # Apply Bayes' theorem: # P(cancer | +) = P(+ | cancer) * P(cancer) / P(+) # P(+) = P(+ | cancer)*P(cancer) + P(+ | no cancer)*P(no cancer) p_pos <- sens_mammo * prevalence + (1 - spec_mammo) * (1 - prevalence) post_mammo <- (sens_mammo * prevalence) / p_pos cat("Prior (prevalence): ", prevalence, "\n") cat("P(positive test): ", round(p_pos, 4), "\n") cat("P(cancer | positive mammogram):", round(post_mammo, 4), "\n") cat("\nDespite 90% sensitivity and 92% specificity, the posterior\n") cat("probability is only about", round(post_mammo * 100, 1), "% because\n") cat("the prior probability (prevalence) is very low at 2%.\n") # ---- Part (b): Sequential update with positive ultrasound ---- cat("\n=== Part (b): Sequential Update with Positive Ultrasound ===\n") sens_us <- 0.95 # Sensitivity of ultrasound spec_us <- 0.85 # Specificity of ultrasound # The posterior from (a) becomes the new prior prior_us <- post_mammo p_pos_us <- sens_us * prior_us + (1 - spec_us) * (1 - prior_us) post_us <- (sens_us * prior_us) / p_pos_us cat("Prior (posterior from mammogram):", round(prior_us, 4), "\n") cat("P(cancer | positive mammogram AND positive ultrasound):", round(post_us, 4), "\n") cat("\nAfter two positive tests, the posterior probability rises to about", round(post_us * 100, 1), "%.\n") # ---- Part (c): What does this illustrate? ---- cat("\n=== Part (c): What Sequential Updating Illustrates ===\n") cat("1. NATURAL SEQUENTIAL UPDATING: In the Bayesian framework,\n") cat(" today's posterior becomes tomorrow's prior. Each new piece\n") cat(" of evidence updates our belief incrementally.\n\n") cat("2. THE PRIOR MATTERS: Starting from a low prevalence (2%),\n") cat(" even a sensitive test only raises the probability to ~19%.\n") cat(" A second positive test raises it further to ~", round(post_us * 100), "%.\n\n") cat("3. CLINICAL REASONING IS BAYESIAN: Clinicians intuitively do\n") cat(" this -- they order confirmatory tests precisely because a\n") cat(" single screening test in a low-prevalence population is\n") cat(" insufficient. The Bayesian framework formalises this logic.\n\n") cat("4. COHERENCE: The Bayesian approach naturally handles sequential\n") cat(" evidence without the multiple-testing corrections that\n") cat(" frequentist methods would require.\n") # ---- Visualise the updating process ---- stages <- c("Prior\n(prevalence)", "After\nmammogram (+)", "After\nultrasound (+)") probs <- c(prevalence, post_mammo, post_us) barplot(probs, names.arg = stages, col = c("steelblue", "goldenrod", "firebrick"), main = "Sequential Bayesian Updating\nBreast Cancer Diagnosis", ylab = "P(Cancer)", ylim = c(0, max(probs) * 1.2), border = NA) text(x = c(0.7, 1.9, 3.1), y = probs + 0.02, labels = paste0(round(probs * 100, 1), "%"), cex = 0.9) ``` #### Python ```{python} #| label: sol-ch13-ex1-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # Chapter 13, Exercise 1: Diagnostic Test Updating # Sequential Bayesian updating with mammogram and ultrasound import matplotlib.pyplot as plt # ---- Part (a): Posterior probability after positive mammogram ---- print("=== Part (a): Positive Mammogram ===") sens_mammo = 0.90 # Sensitivity of mammogram spec_mammo = 0.92 # Specificity of mammogram prevalence = 0.02 # Prior probability (prevalence in women aged 50-59) # Apply Bayes' theorem: # P(cancer | +) = P(+ | cancer) * P(cancer) / P(+) p_pos = sens_mammo * prevalence + (1 - spec_mammo) * (1 - prevalence) post_mammo = (sens_mammo * prevalence) / p_pos print(f"Prior (prevalence): {prevalence}") print(f"P(positive test): {p_pos:.4f}") print(f"P(cancer | positive mammogram): {post_mammo:.4f}") print(f"\nDespite 90% sensitivity and 92% specificity, the posterior") print(f"probability is only about {post_mammo*100:.1f}% because the") print(f"prior probability (prevalence) is very low at 2%.") # ---- Part (b): Sequential update with positive ultrasound ---- print("\n=== Part (b): Sequential Update with Positive Ultrasound ===") sens_us = 0.95 # Sensitivity of ultrasound spec_us = 0.85 # Specificity of ultrasound # The posterior from (a) becomes the new prior prior_us = post_mammo p_pos_us = sens_us * prior_us + (1 - spec_us) * (1 - prior_us) post_us = (sens_us * prior_us) / p_pos_us print(f"Prior (posterior from mammogram): {prior_us:.4f}") print(f"P(cancer | pos mammogram AND pos ultrasound): {post_us:.4f}") print(f"\nAfter two positive tests, the posterior probability rises to about {post_us*100:.1f}%.") # ---- Part (c): What sequential updating illustrates ---- print("\n=== Part (c): What Sequential Updating Illustrates ===") print("1. NATURAL SEQUENTIAL UPDATING: In the Bayesian framework,") print(" today's posterior becomes tomorrow's prior. Each new piece") print(" of evidence updates our belief incrementally.") print() print("2. THE PRIOR MATTERS: Starting from a low prevalence (2%),") print(f" even a sensitive test only raises the probability to ~{post_mammo*100:.0f}%.") print(f" A second positive test raises it further to ~{post_us*100:.0f}%.") print() print("3. CLINICAL REASONING IS BAYESIAN: Clinicians intuitively do") print(" this -- they order confirmatory tests precisely because a") print(" single screening test in a low-prevalence population is") print(" insufficient. The Bayesian framework formalises this logic.") print() print("4. COHERENCE: The Bayesian approach naturally handles sequential") print(" evidence without the multiple-testing corrections that") print(" frequentist methods would require.") # ---- Visualise the updating process ---- stages = ["Prior\n(prevalence)", "After\nmammogram (+)", "After\nultrasound (+)"] probs = [prevalence, post_mammo, post_us] colors = ["steelblue", "goldenrod", "firebrick"] fig, ax = plt.subplots(figsize=(7, 5)) bars = ax.bar(stages, probs, color=colors, edgecolor="none", width=0.6) for bar, p in zip(bars, probs): ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.01, f"{p*100:.1f}%", ha="center", fontsize=11) ax.set_ylabel("P(Cancer)") ax.set_title("Sequential Bayesian Updating\nBreast Cancer Diagnosis") ax.set_ylim(0, max(probs) * 1.3) plt.tight_layout() plt.savefig("ch13_ex1_updating.png", dpi=150) plt.show() ``` ::: ### Exercise 2 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch13-ex2-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # Chapter 13, Exercise 2: Prior Sensitivity Analysis # Phase II antibiotic trial: 21/30 patients achieve clinical cure library(tidyverse) y <- 21 # successes n <- 30 # total patients # Define three priors priors <- list( list(name = "Beta(1,1) - Uninformative", a = 1, b = 1), list(name = "Beta(5,5) - Weakly informative", a = 5, b = 5), list(name = "Beta(20,10) - Informative", a = 20, b = 10) ) theta <- seq(0, 1, length.out = 500) # ---- Part (a): Posterior mean and 95% credible interval ---- cat("=== Part (a): Posterior Summaries ===\n\n") plot_df <- tibble() for (p in priors) { a_post <- p$a + y b_post <- p$b + n - y post_mean <- a_post / (a_post + b_post) ci <- qbeta(c(0.025, 0.975), a_post, b_post) cat(sprintf("Prior: %s\n", p$name)) cat(sprintf(" Prior mean: %.3f\n", p$a / (p$a + p$b))) cat(sprintf(" Posterior: Beta(%d, %d)\n", a_post, b_post)) cat(sprintf(" Posterior mean: %.3f\n", post_mean)) cat(sprintf(" 95%% credible interval: [%.3f, %.3f]\n\n", ci[1], ci[2])) plot_df <- bind_rows(plot_df, tibble( theta = theta, Density = dbeta(theta, a_post, b_post), Prior = p$name ) ) } # ---- Part (b): Plot all three posteriors ---- cat("=== Part (b): Posterior Plot ===\n") ggplot(plot_df, aes(x = theta, y = Density, colour = Prior)) + geom_line(linewidth = 1.2) + geom_vline(xintercept = y / n, linetype = "dashed", colour = "grey40") + annotate("text", x = y / n + 0.02, y = max(plot_df$Density) * 0.9, label = paste0("MLE = ", round(y / n, 3)), hjust = 0, size = 3.5) + labs(x = expression(theta ~ "(cure rate)"), y = "Density", title = "Posterior Distributions Under Different Priors", subtitle = "21/30 patients cured in Phase II trial") + theme_minimal(base_size = 13) + theme(legend.position = "bottom") # ---- Part (c): Which prior has most influence? ---- cat("=== Part (c): Which Prior Has Most Influence? ===\n\n") cat("The Beta(5,5) prior has the MOST influence on the posterior.\n\n") cat("Why? The effective sample size of a Beta(a,b) prior is a + b.\n") cat(" - Beta(1,1): effective n = 2 (trivial influence)\n") cat(" - Beta(5,5): effective n = 10 (modest influence)\n") cat(" - Beta(20,10): effective n = 30 (substantial influence)\n\n") cat("However, while Beta(20,10) has the largest effective sample size,\n") cat("its prior mean (0.667) is close to the data (21/30 = 0.700),\n") cat("so the prior and data 'agree', and the posterior is not pulled\n") cat("far from the MLE.\n\n") cat("Beta(5,5) has a prior mean of 0.500, which is DIFFERENT from\n") cat("the data. It pulls the posterior mean toward 0.5, producing the\n") cat("most visible shift from the MLE. With n=30 observed data, even\n") cat("an effective sample size of 10 produces noticeable pull when\n") cat("the prior and data disagree.\n\n") cat("Key insight: prior influence depends on BOTH the effective sample\n") cat("size AND how much the prior disagrees with the data.\n") ``` #### Python ```{python} #| label: sol-ch13-ex2-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # Chapter 13, Exercise 2: Prior Sensitivity Analysis # Phase II antibiotic trial: 21/30 patients achieve clinical cure import numpy as np import matplotlib.pyplot as plt from scipy.stats import beta y, n = 21, 30 # successes and total theta = np.linspace(0, 1, 500) # Define three priors priors = [ ("Beta(1,1) - Uninformative", 1, 1), ("Beta(5,5) - Weakly informative", 5, 5), ("Beta(20,10) - Informative", 20, 10), ] # ---- Part (a): Posterior mean and 95% credible interval ---- print("=== Part (a): Posterior Summaries ===\n") fig, ax = plt.subplots(figsize=(8, 5)) for name, a0, b0 in priors: a_post = a0 + y b_post = b0 + n - y post_mean = a_post / (a_post + b_post) ci = beta.ppf([0.025, 0.975], a_post, b_post) print(f"Prior: {name}") print(f" Prior mean: {a0 / (a0 + b0):.3f}") print(f" Posterior: Beta({a_post}, {b_post})") print(f" Posterior mean: {post_mean:.3f}") print(f" 95% credible interval: [{ci[0]:.3f}, {ci[1]:.3f}]") print() # ---- Part (b): Plot all three posteriors ---- ax.plot(theta, beta.pdf(theta, a_post, b_post), label=name, linewidth=1.5) ax.axvline(y / n, linestyle="--", color="grey", linewidth=1) ax.text(y / n + 0.02, ax.get_ylim()[1] * 0.1, f"MLE = {y/n:.3f}", fontsize=9, color="grey") ax.set_xlabel(r"$\theta$ (cure rate)") ax.set_ylabel("Density") ax.set_title("Posterior Distributions Under Different Priors\n" "21/30 patients cured in Phase II trial") ax.legend(fontsize=9) plt.tight_layout() plt.savefig("ch13_ex2_posteriors.png", dpi=150) plt.show() # ---- Part (c): Which prior has the most influence? ---- print("=== Part (c): Which Prior Has Most Influence? ===\n") print("The Beta(5,5) prior has the MOST influence on the posterior.") print() print("Why? The effective sample size of a Beta(a,b) prior is a + b.") print(" - Beta(1,1): effective n = 2 (trivial influence)") print(" - Beta(5,5): effective n = 10 (modest influence)") print(" - Beta(20,10): effective n = 30 (substantial influence)") print() print("However, while Beta(20,10) has the largest effective sample size,") print("its prior mean (0.667) is close to the data (21/30 = 0.700),") print("so the prior and data 'agree', and the posterior is not pulled") print("far from the MLE.") print() print("Beta(5,5) has a prior mean of 0.500, which DISAGREES with the") print("data. It pulls the posterior mean toward 0.5, producing the most") print("visible shift from the MLE. With n=30 observed data, even an") print("effective sample size of 10 produces noticeable pull when the") print("prior and data disagree.") print() print("Key insight: prior influence depends on BOTH the effective sample") print("size AND how much the prior disagrees with the data.") ``` ::: ### Exercise 3 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch13-ex3-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # Chapter 13, Exercise 3: Interpreting MCMC Diagnostics # Conceptual exercise about convergence problems # Given diagnostics for a Bayesian logistic regression treatment effect: # - R-hat: 1.08 # - Bulk ESS: 150 # - Tail ESS: 85 # - Trace plot: two chains exploring different regions # ---- Part (a): Is there evidence of convergence problems? ---- cat("=== Part (a): Evidence of Convergence Problems ===\n\n") cat("YES, there is clear evidence of convergence problems.\n") cat("ALL THREE diagnostics are concerning:\n\n") cat("1. R-hat = 1.08 (threshold: < 1.01)\n") cat(" - R-hat compares between-chain to within-chain variance.\n") cat(" - A value of 1.08 far exceeds the 1.01 threshold.\n") cat(" - This means chains are NOT sampling from the same distribution.\n") cat(" - The trace plot confirms this: chains explore different regions.\n\n") cat("2. Bulk ESS = 150 (recommended: > 400 per chain, i.e., > 1600 total)\n") cat(" - Bulk ESS estimates the effective number of independent samples\n") cat(" for estimating posterior means and medians.\n") cat(" - 150 is far too low for reliable posterior summaries.\n") cat(" - High autocorrelation or poor mixing causes low ESS.\n\n") cat("3. Tail ESS = 85 (recommended: > 400)\n") cat(" - Tail ESS measures the effective samples for tail quantiles\n") cat(" (e.g., 2.5th and 97.5th percentiles for credible intervals).\n") cat(" - 85 is dangerously low -- credible intervals will be unreliable.\n") cat(" - Tail ESS is almost always lower than bulk ESS, so if bulk is\n") cat(" already too low, tail ESS will be even worse.\n\n") cat("4. Trace plot: chains exploring different regions\n") cat(" - This is the most visually obvious sign of non-convergence.\n") cat(" - Well-behaved chains should overlap ('hairy caterpillar').\n") cat(" - Chains stuck in different regions indicate multimodality or\n") cat(" insufficient sampling to traverse the parameter space.\n") # ---- Part (b): Steps to fix these problems ---- cat("\n=== Part (b): Steps to Fix These Problems ===\n\n") cat("1. INCREASE WARMUP AND ITERATIONS:\n") cat(" - Current chains may not have had enough warmup to find\n") cat(" the typical set. Try warmup = 2000-4000, iter = 4000-8000.\n\n") cat("2. INCREASE adapt_delta (Stan/brms):\n") cat(" - Set control = list(adapt_delta = 0.95 or 0.99).\n") cat(" - This makes the sampler take smaller steps, reducing\n") cat(" divergent transitions at the cost of slower sampling.\n\n") cat("3. CHECK THE MODEL SPECIFICATION:\n") cat(" - Priors may be too vague or conflicting with the likelihood.\n") cat(" - Use prior predictive checks to ensure priors are reasonable.\n") cat(" - Consider stronger (more informative) priors if justified.\n\n") cat("4. REPARAMETERISE THE MODEL:\n") cat(" - Some parameterisations create difficult posterior geometries.\n") cat(" - For hierarchical models, use non-centred parameterisation.\n") cat(" - Standardise predictors to improve sampler efficiency.\n\n") cat("5. SIMPLIFY THE MODEL:\n") cat(" - If the model is too complex for the data, consider\n") cat(" removing weak predictors or reducing random effects.\n\n") cat("6. RUN MORE CHAINS:\n") cat(" - Running 6-8 chains (instead of 4) with different\n") cat(" starting values helps diagnose multimodality.\n") # ---- Part (c): Trust the posterior summaries? ---- cat("\n=== Part (c): Can We Trust the Posterior Summaries? ===\n\n") cat("NO, absolutely not. The posterior summaries should NOT be trusted.\n\n") cat("Reasons:\n") cat("1. With R-hat = 1.08, the chains have not converged to the same\n") cat(" distribution. Any summary (mean, median, CI) is averaging\n") cat(" across different distributions -- the result is meaningless.\n\n") cat("2. With bulk ESS = 150, the posterior mean is estimated from\n") cat(" only ~150 effective independent samples. The Monte Carlo\n") cat(" error is too large for the estimates to be precise.\n\n") cat("3. With tail ESS = 85, the 95% credible interval is based on\n") cat(" only ~85 effective tail samples. The interval bounds could\n") cat(" shift substantially with additional sampling.\n\n") cat("4. Reporting results from a non-converged model is a serious\n") cat(" methodological error. The correct action is to fix the\n") cat(" convergence issues BEFORE interpreting any results.\n\n") cat("Rule: If R-hat > 1.01 or ESS < 400, do NOT interpret results.\n") cat("Fix the model first.\n") ``` #### Python ```{python} #| label: sol-ch13-ex3-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # Chapter 13, Exercise 3: Interpreting MCMC Diagnostics # Conceptual exercise about convergence problems # Given diagnostics for a Bayesian logistic regression treatment effect: # - R-hat: 1.08 # - Bulk ESS: 150 # - Tail ESS: 85 # - Trace plot: two chains exploring different regions # ---- Part (a): Is there evidence of convergence problems? ---- print("=== Part (a): Evidence of Convergence Problems ===\n") print("YES, there is clear evidence of convergence problems.") print("ALL THREE diagnostics are concerning:\n") print("1. R-hat = 1.08 (threshold: < 1.01)") print(" - R-hat compares between-chain to within-chain variance.") print(" - A value of 1.08 far exceeds the 1.01 threshold.") print(" - This means chains are NOT sampling from the same distribution.") print(" - The trace plot confirms this: chains explore different regions.\n") print("2. Bulk ESS = 150 (recommended: > 400 per chain, i.e., > 1600 total)") print(" - Bulk ESS estimates the effective number of independent samples") print(" for estimating posterior means and medians.") print(" - 150 is far too low for reliable posterior summaries.") print(" - High autocorrelation or poor mixing causes low ESS.\n") print("3. Tail ESS = 85 (recommended: > 400)") print(" - Tail ESS measures effective samples for tail quantiles") print(" (2.5th and 97.5th percentiles for credible intervals).") print(" - 85 is dangerously low -- credible intervals will be unreliable.") print(" - Tail ESS is almost always lower than bulk ESS.\n") print("4. Trace plot: chains exploring different regions") print(" - Well-behaved chains should overlap ('hairy caterpillar').") print(" - Chains in different regions indicate multimodality or") print(" insufficient sampling.\n") # ---- Part (b): Steps to fix ---- print("=== Part (b): Steps to Fix These Problems ===\n") print("1. INCREASE WARMUP AND ITERATIONS:") print(" Try warmup=2000-4000, draws=4000-8000.\n") print("2. INCREASE target_accept (PyMC) / adapt_delta (Stan):") print(" Set target_accept=0.95 or 0.99 to reduce divergences.\n") print("3. CHECK MODEL SPECIFICATION:") print(" - Are priors too vague or conflicting with the likelihood?") print(" - Run prior predictive checks.") print(" - Consider stronger priors if justified.\n") print("4. REPARAMETERISE THE MODEL:") print(" - Use non-centred parameterisation for hierarchical models.") print(" - Standardise predictors.\n") print("5. SIMPLIFY THE MODEL:") print(" - Remove weak predictors or reduce random effects.\n") print("6. RUN MORE CHAINS:") print(" - 6-8 chains with different starting values to diagnose multimodality.\n") # ---- Part (c): Trust the posterior summaries? ---- print("=== Part (c): Can We Trust the Posterior Summaries? ===\n") print("NO, absolutely not. The posterior summaries should NOT be trusted.\n") print("Reasons:") print("1. With R-hat = 1.08, chains have not converged to the same") print(" distribution. Any summary averages across different") print(" distributions -- the result is meaningless.\n") print("2. With bulk ESS = 150, the posterior mean is estimated from") print(" only ~150 effective samples. Monte Carlo error is too large.\n") print("3. With tail ESS = 85, the 95% credible interval is based on") print(" ~85 effective tail samples. Bounds could shift substantially.\n") print("4. Reporting results from a non-converged model is a serious") print(" methodological error.\n") print("Rule: If R-hat > 1.01 or ESS < 400, do NOT interpret results.") print("Fix the model first.") ``` ::: ## Chapter 14: Applied Bayesian Modelling ### Exercise 1 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch14-ex1-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # Chapter 14, Exercise 1: Bayesian Logistic Regression for ICU Mortality # 500 ICU admissions with age, APACHE II, ventilation status, 28-day mortality library(brms) library(tidyverse) library(bayesplot) # ---- (a) Simulate dataset ---- set.seed(101) n <- 500 icu_data <- tibble( age = round(rnorm(n, 62, 15)), apache = round(rnorm(n, 18, 7)), ventilated = rbinom(n, 1, 0.35) ) # True model: mortality increases with age, APACHE, and ventilation lp <- -4.5 + 0.02 * icu_data$age + 0.12 * icu_data$apache + 0.8 * icu_data$ventilated icu_data$mortality <- rbinom(n, 1, plogis(lp)) cat("Dataset summary:\n") cat(" N:", n, "\n") cat(" Mortality rate:", mean(icu_data$mortality), "\n") cat(" Mean age:", round(mean(icu_data$age), 1), "\n") cat(" Mean APACHE:", round(mean(icu_data$apache), 1), "\n") cat(" % Ventilated:", round(mean(icu_data$ventilated) * 100, 1), "%\n") # ---- (b) Fit Bayesian logistic regression ---- fit_icu <- brm( mortality ~ age + apache + ventilated, data = icu_data, family = bernoulli(link = "logit"), prior = c( prior(normal(0, 2.5), class = "b"), # weakly informative on log-odds prior(normal(0, 5), class = "Intercept") ), chains = 4, iter = 2000, warmup = 1000, seed = 42, silent = 2, refresh = 0 ) cat("\nModel summary:\n") summary(fit_icu) # ---- (c) Prior predictive check ---- cat("\n=== Part (c): Prior Predictive Check ===\n") # Fit with priors only (no data influence) fit_prior <- brm( mortality ~ age + apache + ventilated, data = icu_data, family = bernoulli(link = "logit"), prior = c( prior(normal(0, 2.5), class = "b"), prior(normal(0, 5), class = "Intercept") ), sample_prior = "only", chains = 4, iter = 2000, warmup = 1000, seed = 42, silent = 2, refresh = 0 ) # Simulate from prior predictive pp_prior <- posterior_predict(fit_prior) prior_mort_rates <- rowMeans(pp_prior) cat("Prior predictive mortality rates:\n") cat(" Mean:", round(mean(prior_mort_rates), 3), "\n") cat(" SD:", round(sd(prior_mort_rates), 3), "\n") cat(" Range:", round(min(prior_mort_rates), 3), "to", round(max(prior_mort_rates), 3), "\n") cat("The priors allow mortality rates from near 0 to near 1,\n") cat("which covers all plausible ICU mortality rates. The priors\n") cat("are appropriately weakly informative.\n") # ---- (d) Posterior odds ratios with 95% credible intervals ---- cat("\n=== Part (d): Posterior Odds Ratios ===\n") post <- as_draws_df(fit_icu) or_age <- exp(post$b_age) or_apache <- exp(post$b_apache) or_vent <- exp(post$b_ventilated) cat(sprintf("OR Age: %.3f [%.3f, %.3f]\n", mean(or_age), quantile(or_age, 0.025), quantile(or_age, 0.975))) cat(sprintf("OR APACHE: %.3f [%.3f, %.3f]\n", mean(or_apache), quantile(or_apache, 0.025), quantile(or_apache, 0.975))) cat(sprintf("OR Ventilated: %.3f [%.3f, %.3f]\n", mean(or_vent), quantile(or_vent, 0.025), quantile(or_vent, 0.975))) # Plot posterior OR distributions mcmc_areas(fit_icu, pars = c("b_age", "b_apache", "b_ventilated"), prob = 0.95, prob_outer = 0.99, transformations = exp) + geom_vline(xintercept = 1, linetype = "dashed", colour = "grey40") + labs(title = "Posterior Odds Ratios (95% CrI)", x = "Odds Ratio") + theme_minimal(base_size = 13) # ---- (e) P(OR_APACHE > 1.10 | data) ---- cat("\n=== Part (e): P(OR_APACHE > 1.10 | data) ===\n") prob_or_gt_110 <- mean(or_apache > 1.10) cat(sprintf("P(OR_APACHE > 1.10 | data) = %.3f\n", prob_or_gt_110)) cat("\nInterpretation: There is a", round(prob_or_gt_110 * 100, 1), "% posterior probability\n") cat("that each unit increase in APACHE score increases the odds of\n") cat("28-day mortality by more than 10%. This is a direct probability\n") cat("statement about the parameter -- something only Bayesian inference\n") cat("can provide.\n") ``` #### Python ```{python} #| label: sol-ch14-ex1-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # Chapter 14, Exercise 1: Bayesian Logistic Regression for ICU Mortality # 500 ICU admissions with age, APACHE II, ventilation status, 28-day mortality import numpy as np import pandas as pd import pymc as pm import arviz as az from scipy.special import expit # ---- (a) Simulate dataset ---- np.random.seed(101) n = 500 icu_data = pd.DataFrame({ 'age': np.round(np.random.normal(62, 15, n)), 'apache': np.round(np.random.normal(18, 7, n)), 'ventilated': np.random.binomial(1, 0.35, n) }) # True model lp = (-4.5 + 0.02 * icu_data['age'] + 0.12 * icu_data['apache'] + 0.8 * icu_data['ventilated']) icu_data['mortality'] = np.random.binomial(1, expit(lp)) print("Dataset summary:") print(f" N: {n}") print(f" Mortality rate: {icu_data['mortality'].mean():.3f}") print(f" Mean age: {icu_data['age'].mean():.1f}") print(f" Mean APACHE: {icu_data['apache'].mean():.1f}") print(f" % Ventilated: {icu_data['ventilated'].mean()*100:.1f}%") # ---- (b) Fit Bayesian logistic regression ---- with pm.Model() as icu_model: # Weakly informative priors intercept = pm.Normal("Intercept", mu=0, sigma=5) b_age = pm.Normal("b_age", mu=0, sigma=2.5) b_apache = pm.Normal("b_apache", mu=0, sigma=2.5) b_vent = pm.Normal("b_ventilated", mu=0, sigma=2.5) # Linear predictor logit_p = (intercept + b_age * icu_data['age'].values + b_apache * icu_data['apache'].values + b_vent * icu_data['ventilated'].values) # Likelihood y_obs = pm.Bernoulli("mortality", logit_p=logit_p, observed=icu_data['mortality'].values) # Sample trace = pm.sample(1000, tune=1000, chains=4, random_seed=42, progressbar=True) print("\nModel summary:") print(az.summary(trace, var_names=["Intercept", "b_age", "b_apache", "b_ventilated"])) # ---- (c) Prior predictive check ---- print("\n=== Part (c): Prior Predictive Check ===") # Simulate mortality rates from the priors np.random.seed(42) n_sim = 2000 intercepts = np.random.normal(0, 5, n_sim) b_ages = np.random.normal(0, 2.5, n_sim) b_apaches = np.random.normal(0, 2.5, n_sim) b_vents = np.random.normal(0, 2.5, n_sim) # For a "typical" patient: age=62, apache=18, ventilated=0 sim_lp = intercepts + b_ages * 62 + b_apaches * 18 + b_vents * 0 sim_mort = expit(sim_lp) print(f"Prior predictive mortality rates (typical patient):") print(f" Mean: {sim_mort.mean():.3f}") print(f" SD: {sim_mort.std():.3f}") print(f" Range: {sim_mort.min():.3f} to {sim_mort.max():.3f}") print("The priors allow mortality rates from near 0 to near 1,") print("covering all plausible ICU mortality rates. The priors are") print("appropriately weakly informative.") # ---- (d) Posterior odds ratios with 95% credible intervals ---- print("\n=== Part (d): Posterior Odds Ratios ===") posterior = az.extract(trace) for var_name, label in [("b_age", "Age"), ("b_apache", "APACHE"), ("b_ventilated", "Ventilated")]: or_vals = np.exp(posterior[var_name].values) mean_or = or_vals.mean() ci_low = np.percentile(or_vals, 2.5) ci_high = np.percentile(or_vals, 97.5) print(f"OR {label:>12s}: {mean_or:.3f} [{ci_low:.3f}, {ci_high:.3f}]") # ---- (e) P(OR_APACHE > 1.10 | data) ---- print("\n=== Part (e): P(OR_APACHE > 1.10 | data) ===") or_apache = np.exp(posterior["b_apache"].values) prob_gt_110 = (or_apache > 1.10).mean() print(f"P(OR_APACHE > 1.10 | data) = {prob_gt_110:.3f}") print(f"\nInterpretation: There is a {prob_gt_110*100:.1f}% posterior probability") print("that each unit increase in APACHE score increases the odds of") print("28-day mortality by more than 10%. This is a direct probability") print("statement about the parameter -- something only Bayesian inference") print("can provide.") ``` ::: ### Exercise 2 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch14-ex2-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # Chapter 14, Exercise 2: Hierarchical Model for Multi-Site Drug Trial # 12 hospitals, LDL cholesterol change, new statin vs standard care library(brms) library(tidyverse) # ---- (a) Simulate data ---- set.seed(789) n_hospitals <- 12 patients_per_hospital <- c(20, 25, 30, 40, 50, 60, 70, 80, 100, 120, 150, 200) true_grand_effect <- -25 # mg/dL reduction in LDL tau_true <- 5 # between-hospital SD hospital_effects <- rnorm(n_hospitals, 0, tau_true) trial_data <- map2_dfr(1:n_hospitals, patients_per_hospital, function(j, nj) { tibble( hospital = factor(j), treatment = rep(0:1, length.out = nj), ldl_change = (true_grand_effect + hospital_effects[j]) * treatment + rnorm(nj, 0, 20) # residual SD = 20 mg/dL ) }) cat("Data summary:\n") cat(" Total patients:", nrow(trial_data), "\n") cat(" Hospitals:", n_hospitals, "\n") cat(" Patients per hospital:", paste(patients_per_hospital, collapse = ", "), "\n") cat(" True grand effect:", true_grand_effect, "mg/dL\n") cat(" True between-hospital SD:", tau_true, "mg/dL\n") # ---- (b) Fit Bayesian hierarchical model ---- fit_hier <- brm( ldl_change ~ treatment + (treatment | hospital), data = trial_data, prior = c( prior(normal(0, 30), class = "b"), prior(normal(0, 30), class = "Intercept"), prior(cauchy(0, 5), class = "sd"), prior(exponential(0.05), class = "sigma") ), chains = 4, iter = 2000, warmup = 1000, seed = 42, silent = 2, refresh = 0, control = list(adapt_delta = 0.95) ) cat("\nHierarchical model summary:\n") summary(fit_hier) # Grand mean treatment effect grand_mean <- fixef(fit_hier)["treatment", "Estimate"] cat("\nGrand mean treatment effect:", round(grand_mean, 1), "mg/dL\n") # ---- (c) Shrinkage plot ---- # Extract hospital-specific treatment effects (partial pooling) ranefs <- ranef(fit_hier)$hospital[, , "treatment"] partial_pool <- grand_mean + ranefs[, "Estimate"] # No-pooling estimates (separate OLS per hospital) no_pool <- trial_data %>% group_by(hospital) %>% summarise( no_pool_est = coef(lm(ldl_change ~ treatment))[2], n = n(), .groups = "drop" ) %>% mutate( hospital_num = as.numeric(hospital), partial_pool_est = partial_pool ) # Plot shrinkage ggplot(no_pool, aes(y = reorder(hospital, n))) + geom_point(aes(x = no_pool_est, colour = "No pooling"), size = 3) + geom_point(aes(x = partial_pool_est, colour = "Partial pooling"), size = 3) + geom_vline(xintercept = grand_mean, linetype = "dashed", colour = "grey40") + geom_segment(aes(x = no_pool_est, xend = partial_pool_est, yend = reorder(hospital, n)), arrow = arrow(length = unit(0.15, "cm")), colour = "grey60") + annotate("text", x = grand_mean + 1, y = 12.5, label = paste0("Grand mean = ", round(grand_mean, 1)), hjust = 0, size = 3.5) + scale_colour_manual(values = c("No pooling" = "steelblue", "Partial pooling" = "firebrick")) + labs(x = "Treatment Effect (mg/dL change in LDL)", y = "Hospital (ordered by sample size)", colour = "Estimate Type", title = "Shrinkage in Hierarchical Model", subtitle = "Arrows show partial pooling pulling estimates toward the grand mean") + theme_minimal(base_size = 13) + theme(legend.position = "bottom") # ---- (d) Compare hierarchical to separate OLS regressions ---- cat("\n=== Part (d): Comparison ===\n\n") cat("Hospital | N | No-Pooling | Partial Pooling | Shrinkage\n") cat("----------|------|------------|-----------------|----------\n") for (i in 1:nrow(no_pool)) { shrinkage <- abs(no_pool$no_pool_est[i] - no_pool$partial_pool_est[i]) cat(sprintf(" %2d | %3d | %6.1f | %6.1f | %5.1f\n", no_pool$hospital_num[i], no_pool$n[i], no_pool$no_pool_est[i], no_pool$partial_pool_est[i], shrinkage)) } cat("\nThe hierarchical model is MORE APPROPRIATE because:\n") cat("1. Small hospitals (n=20-30) have noisy OLS estimates that are\n") cat(" shrunk toward the grand mean, reducing estimation error.\n") cat("2. Large hospitals (n=150-200) retain their individual estimates\n") cat(" since their data are informative enough.\n") cat("3. The between-hospital SD (tau) is estimated from the data,\n") cat(" quantifying the degree of heterogeneity across sites.\n") cat("4. Hospital-by-hospital OLS ignores the shared structure --\n") cat(" all hospitals are studying the same drug. The hierarchical\n") cat(" model borrows strength across sites while allowing for\n") cat(" genuine between-site variation.\n") ``` #### Python ```{python} #| label: sol-ch14-ex2-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # Chapter 14, Exercise 2: Hierarchical Model for Multi-Site Drug Trial # 12 hospitals, LDL cholesterol change, new statin vs standard care import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn.linear_model import LinearRegression # ---- (a) Simulate data ---- np.random.seed(789) n_hospitals = 12 patients = [20, 25, 30, 40, 50, 60, 70, 80, 100, 120, 150, 200] grand_effect = -25 # mg/dL tau = 5 # between-hospital SD hospital_effects = np.random.normal(0, tau, n_hospitals) rows = [] for j in range(n_hospitals): nj = patients[j] trt = np.tile([0, 1], nj // 2 + 1)[:nj] ldl_change = ((grand_effect + hospital_effects[j]) * trt + np.random.normal(0, 20, nj)) for i in range(nj): rows.append({ 'hospital': j, 'treatment': trt[i], 'ldl_change': ldl_change[i] }) trial_data = pd.DataFrame(rows) print("Data summary:") print(f" Total patients: {len(trial_data)}") print(f" Hospitals: {n_hospitals}") print(f" True grand effect: {grand_effect} mg/dL") print(f" True between-hospital SD: {tau} mg/dL") # ---- No-pooling estimates (OLS per hospital) ---- no_pool = [] for j in range(n_hospitals): df_j = trial_data[trial_data['hospital'] == j] trt_mean = df_j[df_j['treatment'] == 1]['ldl_change'].mean() ctl_mean = df_j[df_j['treatment'] == 0]['ldl_change'].mean() no_pool.append(trt_mean - ctl_mean) no_pool = np.array(no_pool) # ---- (b) & (c) Partial pooling approximation ---- # For a full Bayesian fit, use PyMC. Here we approximate shrinkage # to demonstrate the concept without requiring MCMC sampling. # Estimate within-hospital variance from data within_var = [] for j in range(n_hospitals): df_j = trial_data[trial_data['hospital'] == j] trt_vals = df_j[df_j['treatment'] == 1]['ldl_change'].values ctl_vals = df_j[df_j['treatment'] == 0]['ldl_change'].values # Variance of the treatment effect estimate se_j = np.sqrt(np.var(trt_vals, ddof=1) / len(trt_vals) + np.var(ctl_vals, ddof=1) / len(ctl_vals)) within_var.append(se_j**2) within_var = np.array(within_var) # Estimate between-hospital variance using method of moments grand_mean_est = np.mean(no_pool) tau_est_sq = max(0, np.var(no_pool, ddof=1) - np.mean(within_var)) tau_est = np.sqrt(tau_est_sq) # Shrinkage factor: B_j = within_var_j / (within_var_j + tau^2) shrinkage = within_var / (within_var + tau_est_sq) partial_pool = grand_mean_est + (1 - shrinkage) * (no_pool - grand_mean_est) print(f"\nEstimated grand mean effect: {grand_mean_est:.1f} mg/dL") print(f"Estimated between-hospital SD: {tau_est:.1f} mg/dL") # ---- (c) Shrinkage plot ---- fig, ax = plt.subplots(figsize=(10, 6)) order = np.argsort(patients) y_pos = np.arange(n_hospitals) for i, idx in enumerate(order): ax.annotate("", xy=(partial_pool[idx], i), xytext=(no_pool[idx], i), arrowprops=dict(arrowstyle="->", color="grey")) ax.scatter(no_pool[order], y_pos, color="steelblue", s=60, zorder=2, label="No pooling (OLS)") ax.scatter(partial_pool[order], y_pos, color="firebrick", s=60, zorder=2, label="Partial pooling") ax.axvline(grand_mean_est, linestyle="--", color="grey", linewidth=1) ax.text(grand_mean_est + 0.5, n_hospitals - 0.5, f"Grand mean = {grand_mean_est:.1f}", fontsize=9) ax.set_yticks(y_pos) ax.set_yticklabels([f"Hospital {order[i]+1}\n(n={patients[order[i]]})" for i in range(n_hospitals)], fontsize=9) ax.set_xlabel("Treatment Effect (mg/dL change in LDL)") ax.set_title("Shrinkage in Hierarchical Model\n" "Arrows show estimates pulled toward the grand mean") ax.legend() plt.tight_layout() plt.savefig("ch14_ex2_shrinkage.png", dpi=150) plt.show() # ---- (d) Compare approaches ---- print("\n=== Part (d): Comparison ===\n") print(f"{'Hospital':>8} | {'N':>4} | {'No-Pool':>10} | {'Partial Pool':>12} | {'Shrinkage':>9}") print("-" * 55) for j in range(n_hospitals): shrink_amt = abs(no_pool[j] - partial_pool[j]) print(f" {j+1:>2} | {patients[j]:>3} | {no_pool[j]:>9.1f} | {partial_pool[j]:>11.1f} | {shrink_amt:>8.1f}") print("\nThe hierarchical model is MORE APPROPRIATE because:") print("1. Small hospitals have noisy OLS estimates that are shrunk") print(" toward the grand mean, reducing estimation error.") print("2. Large hospitals retain their individual estimates since") print(" their data are informative enough.") print("3. Between-hospital SD (tau) is estimated from data,") print(" quantifying heterogeneity across sites.") print("4. Hospital-by-hospital OLS ignores shared structure --") print(" all hospitals study the same drug. The hierarchical model") print(" borrows strength across sites while allowing genuine") print(" between-site variation.") # NOTE: For a full Bayesian hierarchical model, use PyMC: # # import pymc as pm # with pm.Model() as hier_model: # mu_trt = pm.Normal("mu_trt", mu=0, sigma=30) # tau = pm.HalfCauchy("tau", beta=5) # trt_j = pm.Normal("trt_j", mu=mu_trt, sigma=tau, shape=n_hospitals) # sigma = pm.Exponential("sigma", lam=0.05) # mu = trt_j[hospital_idx] * treatment # y = pm.Normal("y", mu=mu, sigma=sigma, observed=ldl_change) # trace = pm.sample(1000, tune=1000, chains=4, random_seed=42) ``` ::: ### Exercise 3 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch14-ex3-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # Chapter 14, Exercise 3: Prior Sensitivity for Rare Events # New surgical technique: 0 adverse events in 40 patients # ---- (a) Compute posterior distributions under three priors ---- cat("=== Part (a): Posteriors Under Three Priors ===\n\n") y <- 0 # adverse events n <- 40 # total patients priors <- list( list(name = "Beta(1,1) - Uniform", a = 1, b = 1), list(name = "Beta(0.5,0.5) - Jeffreys", a = 0.5, b = 0.5), list(name = "Beta(1,9) - Informative (~10%)", a = 1, b = 9) ) theta <- seq(0, 0.3, length.out = 500) par(mfrow = c(1, 1)) plot(NULL, xlim = c(0, 0.2), ylim = c(0, 50), xlab = "Adverse Event Rate", ylab = "Density", main = "Posterior Distributions: 0/40 Adverse Events") colors <- c("steelblue", "firebrick", "forestgreen") for (i in seq_along(priors)) { p <- priors[[i]] a_post <- p$a + y b_post <- p$b + n - y post_mean <- a_post / (a_post + b_post) ci <- qbeta(c(0.025, 0.975), a_post, b_post) upper_95 <- qbeta(0.95, a_post, b_post) cat(sprintf("Prior: %s\n", p$name)) cat(sprintf(" Posterior: Beta(%.1f, %.1f)\n", a_post, b_post)) cat(sprintf(" Posterior mean: %.4f (%.2f%%)\n", post_mean, post_mean * 100)) cat(sprintf(" 95%% credible interval: [%.4f, %.4f]\n", ci[1], ci[2])) cat(sprintf(" 95%% upper credible bound: %.4f (%.2f%%)\n\n", upper_95, upper_95 * 100)) lines(theta, dbeta(theta, a_post, b_post), col = colors[i], lwd = 2) } legend("topright", sapply(priors, function(p) p$name), col = colors, lwd = 2, cex = 0.8) # ---- (b) Summary table ---- cat("\n=== Part (b): Summary Table ===\n\n") cat("Prior | Post Mean | 95% Upper Bound\n") cat("-------------------|-----------|----------------\n") for (p in priors) { a_post <- p$a + y b_post <- p$b + n - y post_mean <- a_post / (a_post + b_post) upper_95 <- qbeta(0.95, a_post, b_post) cat(sprintf("%-19s| %7.4f | %7.4f (%.1f%%)\n", p$name, post_mean, upper_95, upper_95 * 100)) } # ---- (c) Why Bayesian estimates are more useful ---- cat("\n=== Part (c): Why Bayesian Estimates Are More Useful ===\n\n") cat("The frequentist point estimate is 0/40 = 0 (0%).\n\n") cat("This is PROBLEMATIC for regulatory decision-making because:\n\n") cat("1. ZERO IS NOT CREDIBLE: Just because no adverse events were\n") cat(" observed in 40 patients does not mean the true rate is zero.\n") cat(" The 'rule of 3' (frequentist) gives an upper bound of 3/40 = 7.5%,\n") cat(" but this is ad hoc and does not provide a full distribution.\n\n") cat("2. BAYESIAN ESTIMATES ARE HONEST: Each prior gives a non-zero\n") cat(" estimate of the adverse event rate, which is more realistic.\n") cat(" Even with the Jeffreys prior (minimal information), the\n") cat(" posterior mean is about 1.2%, acknowledging that rare events\n") cat(" can occur even if none were observed.\n\n") cat("3. UPPER CREDIBLE BOUNDS: Regulators need worst-case estimates.\n") cat(" The 95% upper credible bound provides a direct probability\n") cat(" statement: 'There is a 95% probability that the true adverse\n") cat(" event rate is below X%'. This is exactly what is needed for\n") cat(" risk-benefit assessments.\n\n") cat("4. PRIOR INFORMATION IS VALUABLE: If similar surgical procedures\n") cat(" have known complication rates (~10%), the informative prior\n") cat(" Beta(1,9) incorporates this, giving a more realistic estimate.\n") cat(" The frequentist approach ignores all prior knowledge.\n\n") cat("5. DECISION SUPPORT: The full posterior distribution allows\n") cat(" calculation of quantities like P(rate < 5% | data), which\n") cat(" directly supports regulatory decisions.\n") # Compute P(rate < 5% | data) for each prior cat("\n P(rate < 5% | data):\n") for (p in priors) { a_post <- p$a + y b_post <- p$b + n - y prob_lt_5 <- pbeta(0.05, a_post, b_post) cat(sprintf(" %s: %.3f\n", p$name, prob_lt_5)) } ``` #### Python ```{python} #| label: sol-ch14-ex3-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # Chapter 14, Exercise 3: Prior Sensitivity for Rare Events # New surgical technique: 0 adverse events in 40 patients import numpy as np import matplotlib.pyplot as plt from scipy.stats import beta y = 0 # adverse events n = 40 # total patients priors = [ ("Beta(1,1) - Uniform", 1, 1), ("Beta(0.5,0.5) - Jeffreys", 0.5, 0.5), ("Beta(1,9) - Informative (~10%)", 1, 9), ] theta = np.linspace(0, 0.25, 500) colors = ["steelblue", "firebrick", "forestgreen"] # ---- (a) Compute posterior distributions ---- print("=== Part (a): Posteriors Under Three Priors ===\n") fig, ax = plt.subplots(figsize=(8, 5)) for (name, a0, b0), color in zip(priors, colors): a_post = a0 + y b_post = b0 + n - y post_mean = a_post / (a_post + b_post) ci = beta.ppf([0.025, 0.975], a_post, b_post) upper_95 = beta.ppf(0.95, a_post, b_post) print(f"Prior: {name}") print(f" Posterior: Beta({a_post}, {b_post})") print(f" Posterior mean: {post_mean:.4f} ({post_mean*100:.2f}%)") print(f" 95% credible interval: [{ci[0]:.4f}, {ci[1]:.4f}]") print(f" 95% upper credible bound: {upper_95:.4f} ({upper_95*100:.2f}%)") print() ax.plot(theta, beta.pdf(theta, a_post, b_post), color=color, lw=2, label=name) ax.set_xlabel("Adverse Event Rate") ax.set_ylabel("Density") ax.set_title("Posterior Distributions: 0/40 Adverse Events") ax.legend(fontsize=9) ax.set_xlim(0, 0.2) plt.tight_layout() plt.savefig("ch14_ex3_rare_events.png", dpi=150) plt.show() # ---- (b) Summary table ---- print("=== Part (b): Summary Table ===\n") print(f"{'Prior':<30s} | {'Post Mean':>9s} | {'95% Upper Bound':>15s}") print("-" * 60) for name, a0, b0 in priors: a_post = a0 + y b_post = b0 + n - y post_mean = a_post / (a_post + b_post) upper_95 = beta.ppf(0.95, a_post, b_post) print(f"{name:<30s} | {post_mean:>9.4f} | {upper_95:>9.4f} ({upper_95*100:.1f}%)") # ---- (c) Why Bayesian estimates are more useful ---- print("\n=== Part (c): Why Bayesian Estimates Are More Useful ===\n") print("The frequentist point estimate is 0/40 = 0 (0%).\n") print("This is PROBLEMATIC for regulatory decision-making because:\n") print("1. ZERO IS NOT CREDIBLE: Just because no adverse events were") print(" observed in 40 patients does not mean the true rate is zero.") print(" The 'rule of 3' gives an upper bound of 3/40 = 7.5%, but") print(" this is ad hoc and does not provide a full distribution.\n") print("2. BAYESIAN ESTIMATES ARE HONEST: Each prior gives a non-zero") print(" estimate, which is more realistic. Even the Jeffreys prior") print(" yields ~1.2%, acknowledging rare events can occur.\n") print("3. UPPER CREDIBLE BOUNDS: Regulators need worst-case estimates.") print(" The 95% upper bound provides a direct probability statement:") print(" 'There is 95% probability the true rate is below X%'.\n") print("4. PRIOR INFORMATION IS VALUABLE: Known complication rates from") print(" similar procedures can be formally incorporated. The frequentist") print(" approach ignores all prior knowledge.\n") print("5. DECISION SUPPORT: The full posterior enables quantities like") print(" P(rate < 5% | data), directly supporting regulatory decisions.\n") print(" P(rate < 5% | data):") for name, a0, b0 in priors: a_post = a0 + y b_post = b0 + n - y prob_lt_5 = beta.cdf(0.05, a_post, b_post) print(f" {name}: {prob_lt_5:.3f}") ``` ::: ## Chapter 15: Dimensionality Reduction ### Exercise 1 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch15-ex1-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # Chapter 15, Exercise 1: PCA on Clinical Lab Data # Using the simulated metabolic panel data from the chapter library(tidyverse) library(MASS) # ---- Simulate metabolic data (same as chapter) ---- set.seed(42) n <- 300 Sigma <- matrix(c( 1.0, 0.7, 0.5, 0.3,-0.2, 0.1, 0.2,-0.1, 0.7, 1.0, 0.4, 0.2,-0.2, 0.1, 0.1,-0.1, 0.5, 0.4, 1.0, 0.4,-0.3, 0.1, 0.3,-0.1, 0.3, 0.2, 0.4, 1.0,-0.5, 0.1, 0.2, 0.0, -0.2,-0.2,-0.3,-0.5, 1.0,-0.1,-0.1, 0.2, 0.1, 0.1, 0.1, 0.1,-0.1, 1.0, 0.1,-0.3, 0.2, 0.1, 0.3, 0.2,-0.1, 0.1, 1.0,-0.2, -0.1,-0.1,-0.1, 0.0, 0.2,-0.3,-0.2, 1.0 ), nrow = 8) z <- mvrnorm(n, mu = rep(0, 8), Sigma = Sigma) metabolic <- tibble( glucose = round(z[,1] * 30 + 100), hba1c = round(z[,2] * 1.0 + 5.8, 1), triglycerides = round(z[,3] * 50 + 150), ldl = round(z[,4] * 30 + 120), hdl = round(z[,5] * 12 + 55), creatinine = round(z[,6] * 0.3 + 1.0, 2), alt = round(z[,7] * 15 + 30), albumin = round(z[,8] * 0.4 + 4.0, 1) ) # ---- (a) Perform PCA on standardised data ---- cat("=== Part (a): PCA on Standardised Data ===\n") pca_fit <- prcomp(metabolic, scale. = TRUE) cat("PCA completed. Summary:\n") print(summary(pca_fit)) # ---- (b) Scree plot and number of components ---- cat("\n=== Part (b): Scree Plot and Component Selection ===\n") var_prop <- pca_fit$sdev^2 / sum(pca_fit$sdev^2) cum_var <- cumsum(var_prop) par(mfrow = c(1, 2), mar = c(4, 4, 3, 1)) # Scree plot barplot(var_prop, names.arg = paste0("PC", 1:8), col = "steelblue", main = "Scree Plot", ylab = "Proportion of Variance", xlab = "Component", ylim = c(0, 0.4)) abline(h = 1/8, lty = 2, col = "firebrick") text(9, 1/8 + 0.015, "Kaiser threshold (1/p)", col = "firebrick", cex = 0.8) # Cumulative variance plot(1:8, cum_var, type = "b", pch = 16, col = "steelblue", main = "Cumulative Variance", xlab = "Number of Components", ylab = "Cumulative Proportion", ylim = c(0, 1)) abline(h = 0.80, col = "firebrick", lty = 2) text(6, 0.82, "80% threshold", col = "firebrick", cex = 0.8) # Kaiser criterion: eigenvalues > 1 (which equals variance > 1/p for scaled data) eigenvalues <- pca_fit$sdev^2 n_kaiser <- sum(eigenvalues > 1) n_80 <- which(cum_var >= 0.80)[1] cat("\nKaiser criterion (eigenvalue > 1):", n_kaiser, "components\n") cat("80% cumulative variance threshold:", n_80, "components\n") cat("Eigenvalues:", round(eigenvalues, 3), "\n") cat("Cumulative variance:", round(cum_var, 3), "\n") # ---- (c) Loadings of first two PCs ---- cat("\n=== Part (c): Loadings and Interpretation ===\n") cat("\nPC1 loadings (sorted by absolute value):\n") pc1_loadings <- pca_fit$rotation[, 1] print(round(sort(abs(pc1_loadings), decreasing = TRUE), 3)) cat("Signed loadings:\n") print(round(pc1_loadings[order(abs(pc1_loadings), decreasing = TRUE)], 3)) cat("\nPC2 loadings (sorted by absolute value):\n") pc2_loadings <- pca_fit$rotation[, 2] print(round(sort(abs(pc2_loadings), decreasing = TRUE), 3)) cat("Signed loadings:\n") print(round(pc2_loadings[order(abs(pc2_loadings), decreasing = TRUE)], 3)) cat("\nClinical interpretation:\n") cat("PC1: Dominated by glucose, HbA1c, triglycerides (positive) and\n") cat(" HDL (negative). This is a 'metabolic syndrome' axis.\n") cat(" Patients scoring high on PC1 tend to have elevated glucose,\n") cat(" HbA1c, triglycerides and lower HDL.\n\n") cat("PC2: Dominated by creatinine and albumin (with opposite signs).\n") cat(" This captures a 'renal/hepatic function' dimension.\n") cat(" Patients with high creatinine and low albumin (suggesting\n") cat(" renal impairment or liver dysfunction) score high on PC2.\n") # ---- (d) Biplot coloured by diabetes status ---- cat("\n=== Part (d): Biplot with Diabetes Status ===\n") set.seed(99) metabolic$diabetes <- factor( ifelse(metabolic$glucose > 110 & metabolic$hba1c > 6.2, "Diabetic", "Non-diabetic") ) cat("Diabetes prevalence:", mean(metabolic$diabetes == "Diabetic"), "\n") # Extract PC scores pc_scores <- as.data.frame(pca_fit$x[, 1:2]) pc_scores$diabetes <- metabolic$diabetes par(mfrow = c(1, 1)) # Biplot with diabetes colouring cols <- ifelse(metabolic$diabetes == "Diabetic", "firebrick", "steelblue") plot(pc_scores$PC1, pc_scores$PC2, col = cols, pch = 16, cex = 0.7, xlab = paste0("PC1 (", round(var_prop[1] * 100, 1), "% var)"), ylab = paste0("PC2 (", round(var_prop[2] * 100, 1), "% var)"), main = "PCA Biplot Coloured by Diabetes Status") # Add loading arrows loadings <- pca_fit$rotation[, 1:2] scale_factor <- 3 for (i in 1:nrow(loadings)) { arrows(0, 0, loadings[i, 1] * scale_factor, loadings[i, 2] * scale_factor, col = "grey30", length = 0.1, lwd = 1.5) text(loadings[i, 1] * scale_factor * 1.15, loadings[i, 2] * scale_factor * 1.15, rownames(loadings)[i], cex = 0.7, col = "grey20") } legend("topright", c("Diabetic", "Non-diabetic"), col = c("firebrick", "steelblue"), pch = 16, cex = 0.8) cat("\nDiabetic patients tend to cluster toward higher PC1 values,\n") cat("which aligns with the 'metabolic syndrome' interpretation.\n") cat("This confirms that PC1 captures the metabolic dimension\n") cat("along which diabetic patients differ from non-diabetic patients.\n") ``` #### Python ```{python} #| label: sol-ch15-ex1-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # Chapter 15, Exercise 1: PCA on Clinical Lab Data # Using the simulated metabolic panel data from the chapter import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn.preprocessing import StandardScaler from sklearn.decomposition import PCA # ---- Simulate metabolic data (same as chapter) ---- np.random.seed(42) n = 300 Sigma = np.array([ [1.0, 0.7, 0.5, 0.3,-0.2, 0.1, 0.2,-0.1], [0.7, 1.0, 0.4, 0.2,-0.2, 0.1, 0.1,-0.1], [0.5, 0.4, 1.0, 0.4,-0.3, 0.1, 0.3,-0.1], [0.3, 0.2, 0.4, 1.0,-0.5, 0.1, 0.2, 0.0], [-0.2,-0.2,-0.3,-0.5, 1.0,-0.1,-0.1, 0.2], [0.1, 0.1, 0.1, 0.1,-0.1, 1.0, 0.1,-0.3], [0.2, 0.1, 0.3, 0.2,-0.1, 0.1, 1.0,-0.2], [-0.1,-0.1,-0.1, 0.0, 0.2,-0.3,-0.2, 1.0] ]) z = np.random.multivariate_normal(np.zeros(8), Sigma, n) labels = ["glucose", "hba1c", "triglycerides", "ldl", "hdl", "creatinine", "alt", "albumin"] metabolic = pd.DataFrame({ "glucose": np.round(z[:, 0] * 30 + 100), "hba1c": np.round(z[:, 1] * 1.0 + 5.8, 1), "triglycerides": np.round(z[:, 2] * 50 + 150), "ldl": np.round(z[:, 3] * 30 + 120), "hdl": np.round(z[:, 4] * 12 + 55), "creatinine": np.round(z[:, 5] * 0.3 + 1.0, 2), "alt": np.round(z[:, 6] * 15 + 30), "albumin": np.round(z[:, 7] * 0.4 + 4.0, 1) }) # ---- (a) PCA on standardised data ---- print("=== Part (a): PCA on Standardised Data ===") X = metabolic.values scaler = StandardScaler() X_scaled = scaler.fit_transform(X) pca = PCA() scores = pca.fit_transform(X_scaled) print("Explained variance ratio:", np.round(pca.explained_variance_ratio_, 3)) # ---- (b) Scree plot and component selection ---- print("\n=== Part (b): Scree Plot and Component Selection ===") var_prop = pca.explained_variance_ratio_ cum_var = np.cumsum(var_prop) fig, axes = plt.subplots(1, 2, figsize=(12, 5)) # Scree plot axes[0].bar(range(1, 9), var_prop, color="steelblue", edgecolor="white") axes[0].axhline(y=1/8, color="firebrick", linestyle="--", label="Kaiser threshold") axes[0].set_xlabel("Component") axes[0].set_ylabel("Proportion of Variance") axes[0].set_title("Scree Plot") axes[0].set_xticks(range(1, 9)) axes[0].legend() # Cumulative variance axes[1].plot(range(1, 9), cum_var, "o-", color="steelblue", lw=2) axes[1].axhline(y=0.80, color="firebrick", linestyle="--", label="80% threshold") axes[1].set_xlabel("Number of Components") axes[1].set_ylabel("Cumulative Proportion") axes[1].set_title("Cumulative Variance") axes[1].set_xticks(range(1, 9)) axes[1].legend() plt.tight_layout() plt.savefig("ch15_ex1_scree.png", dpi=150) plt.show() # Kaiser criterion: eigenvalues > 1 eigenvalues = pca.explained_variance_ n_kaiser = np.sum(eigenvalues > 1) n_80 = np.argmax(cum_var >= 0.80) + 1 print(f"\nKaiser criterion (eigenvalue > 1): {n_kaiser} components") print(f"80% cumulative variance threshold: {n_80} components") print(f"Eigenvalues: {np.round(eigenvalues, 3)}") print(f"Cumulative variance: {np.round(cum_var, 3)}") # ---- (c) Loadings of first two PCs ---- print("\n=== Part (c): Loadings and Interpretation ===") for pc_idx in [0, 1]: loadings = pca.components_[pc_idx] order = np.argsort(np.abs(loadings))[::-1] print(f"\nPC{pc_idx+1} loadings (sorted by |loading|):") for idx in order: print(f" {labels[idx]:>15s}: {loadings[idx]:+.3f}") print("\nClinical interpretation:") print("PC1: Dominated by glucose, HbA1c, triglycerides (positive) and") print(" HDL (negative). This is a 'metabolic syndrome' axis.") print("PC2: Dominated by creatinine and albumin (opposite signs).") print(" This captures a 'renal/hepatic function' dimension.") # ---- (d) Biplot coloured by diabetes status ---- print("\n=== Part (d): Biplot with Diabetes Status ===") diabetes = np.where( (metabolic["glucose"] > 110) & (metabolic["hba1c"] > 6.2), "Diabetic", "Non-diabetic" ) print(f"Diabetes prevalence: {(diabetes == 'Diabetic').mean():.3f}") fig, ax = plt.subplots(figsize=(9, 7)) # Plot scores for label, color in [("Non-diabetic", "steelblue"), ("Diabetic", "firebrick")]: mask = diabetes == label ax.scatter(scores[mask, 0], scores[mask, 1], c=color, s=20, alpha=0.6, label=label) # Add loading arrows loadings_2d = pca.components_[:2].T scale_factor = 3 for i, lab in enumerate(labels): ax.annotate("", xy=(loadings_2d[i, 0]*scale_factor, loadings_2d[i, 1]*scale_factor), xytext=(0, 0), arrowprops=dict(arrowstyle="->", color="grey30", lw=1.5)) ax.text(loadings_2d[i, 0]*scale_factor*1.15, loadings_2d[i, 1]*scale_factor*1.15, lab, fontsize=8, color="grey20", ha="center") ax.set_xlabel(f"PC1 ({var_prop[0]:.1%} var)") ax.set_ylabel(f"PC2 ({var_prop[1]:.1%} var)") ax.set_title("PCA Biplot Coloured by Diabetes Status") ax.axhline(0, color="grey", linewidth=0.5) ax.axvline(0, color="grey", linewidth=0.5) ax.legend() plt.tight_layout() plt.savefig("ch15_ex1_biplot.png", dpi=150) plt.show() print("\nDiabetic patients tend to cluster toward higher PC1 values,") print("aligning with the 'metabolic syndrome' interpretation.") print("PC1 captures the metabolic dimension along which diabetic") print("patients differ from non-diabetic patients.") ``` ::: ### Exercise 2 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch15-ex2-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # Chapter 15, Exercise 2: t-SNE Sensitivity to Perplexity # Using the three-group simulated dataset from the chapter library(Rtsne) # ---- Simulate data (same as chapter) ---- set.seed(42) n_per_group <- 100 p <- 20 group1 <- matrix(rnorm(n_per_group * p, mean = 0, sd = 1), ncol = p) group2 <- matrix(rnorm(n_per_group * p, mean = 2, sd = 1.2), ncol = p) group3 <- matrix(rnorm(n_per_group * p, mean = -1, sd = 0.8), ncol = p) group2[, 1:5] <- group2[, 1:5] + 3 group3[, 10:15] <- group3[, 10:15] - 4 X <- rbind(group1, group2, group3) labels <- factor(rep(c("Type A", "Type B", "Type C"), each = n_per_group)) cols <- c("steelblue", "firebrick", "forestgreen") # ---- (a) & (b) Run t-SNE with 4 perplexity values ---- perplexities <- c(5, 15, 30, 50) par(mfrow = c(2, 2), mar = c(4, 4, 3, 1)) for (perp in perplexities) { set.seed(42) # Same seed for comparability tsne_result <- Rtsne(X, perplexity = perp, dims = 2, verbose = FALSE, max_iter = 1000) plot(tsne_result$Y, col = cols[labels], pch = 16, cex = 0.8, main = paste("t-SNE (perplexity =", perp, ")"), xlab = "t-SNE 1", ylab = "t-SNE 2") if (perp == 5) { legend("topright", levels(labels), col = cols, pch = 16, cex = 0.7) } } # ---- (c) When do groups become clearly separated? ---- cat("=== Part (c): When Are Groups Clearly Separated? ===\n\n") cat("The three groups become clearly separated at perplexity = 15.\n") cat("At perplexity = 5, the embedding focuses on very local structure,\n") cat("which can fragment the groups into smaller sub-clusters.\n") cat("At perplexity = 15 and above, the groups are well-separated\n") cat("with clear boundaries between them.\n\n") cat("Higher perplexities (30, 50) also separate the groups clearly\n") cat("but with slightly different visual arrangements and more\n") cat("spread-out clusters.\n") # ---- (d) Are distances between clusters consistent? ---- cat("\n=== Part (d): Consistency of Inter-Cluster Distances ===\n\n") cat("NO, the distances between clusters are NOT consistent across\n") cat("perplexity values. Key observations:\n\n") cat("1. The relative positions of the three clusters change across\n") cat(" perplexity settings. Type A might be nearest to Type C at\n") cat(" one perplexity but nearest to Type B at another.\n\n") cat("2. The absolute distances between cluster centres vary\n") cat(" substantially. At low perplexity, clusters may appear close;\n") cat(" at high perplexity, they may be farther apart (or vice versa).\n\n") cat("3. Cluster sizes (spread) also change with perplexity.\n\n") cat("What this tells us about interpreting t-SNE:\n") cat("- Distances between clusters in t-SNE are MEANINGLESS.\n") cat("- t-SNE preserves local neighbourhood structure, not global\n") cat(" distances.\n") cat("- You should NEVER conclude that two clusters are 'more similar'\n") cat(" because they appear closer in a t-SNE plot.\n") cat("- The only reliable information is WHETHER clusters exist,\n") cat(" not HOW FAR APART they are.\n") cat("- Always try multiple perplexity values. If the same clusters\n") cat(" appear consistently, the structure is likely real.\n") ``` #### Python ```{python} #| label: sol-ch15-ex2-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # Chapter 15, Exercise 2: t-SNE Sensitivity to Perplexity # Using the three-group simulated dataset from the chapter import numpy as np import matplotlib.pyplot as plt from sklearn.manifold import TSNE # ---- Simulate data (same as chapter) ---- np.random.seed(42) n_per = 100 p = 20 g1 = np.random.normal(0, 1, (n_per, p)) g2 = np.random.normal(2, 1.2, (n_per, p)) g3 = np.random.normal(-1, 0.8, (n_per, p)) g2[:, :5] += 3 g3[:, 10:15] -= 4 X = np.vstack([g1, g2, g3]) labels = np.repeat(["Type A", "Type B", "Type C"], n_per) colors = {"Type A": "steelblue", "Type B": "firebrick", "Type C": "forestgreen"} # ---- (a) & (b) Run t-SNE with 4 perplexity values ---- perplexities = [5, 15, 30, 50] fig, axes = plt.subplots(2, 2, figsize=(12, 10)) axes = axes.flatten() for ax, perp in zip(axes, perplexities): tsne = TSNE(n_components=2, perplexity=perp, random_state=42, max_iter=1000) emb = tsne.fit_transform(X) for lab in ["Type A", "Type B", "Type C"]: mask = labels == lab ax.scatter(emb[mask, 0], emb[mask, 1], c=colors[lab], s=15, alpha=0.7, label=lab) ax.set_title(f"t-SNE (perplexity = {perp})") ax.set_xlabel("t-SNE 1") ax.set_ylabel("t-SNE 2") ax.legend(fontsize=8) plt.tight_layout() plt.savefig("ch15_ex2_tsne_perplexity.png", dpi=150) plt.show() # ---- (c) When do groups become clearly separated? ---- print("=== Part (c): When Are Groups Clearly Separated? ===\n") print("The three groups become clearly separated at perplexity = 15.") print("At perplexity = 5, the embedding focuses on very local structure,") print("which can fragment groups into smaller sub-clusters.") print("At perplexity = 15 and above, groups are well-separated.") print("Higher perplexities (30, 50) also separate groups clearly but") print("with different visual arrangements and more spread-out clusters.") # ---- (d) Are distances consistent? ---- print("\n=== Part (d): Consistency of Inter-Cluster Distances ===\n") print("NO, the distances between clusters are NOT consistent across") print("perplexity values. Key observations:\n") print("1. Relative positions of the three clusters change across") print(" perplexity settings.\n") print("2. Absolute distances between cluster centres vary substantially.\n") print("3. Cluster sizes (spread) also change with perplexity.\n") print("What this tells us about interpreting t-SNE:") print("- Distances between clusters are MEANINGLESS.") print("- t-SNE preserves local neighbourhood structure, not global distances.") print("- NEVER conclude two clusters are 'more similar' because they") print(" appear closer in a t-SNE plot.") print("- The only reliable information is WHETHER clusters exist,") print(" not HOW FAR APART they are.") print("- Always try multiple perplexity values. If clusters appear") print(" consistently, the structure is likely real.") ``` ::: ### Exercise 3 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch15-ex3-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # Chapter 15, Exercise 3: UMAP Parameter Exploration # Using the three-group simulated dataset from the chapter library(uwot) # ---- Simulate data (same as chapter) ---- set.seed(42) n_per_group <- 100 p <- 20 group1 <- matrix(rnorm(n_per_group * p, mean = 0, sd = 1), ncol = p) group2 <- matrix(rnorm(n_per_group * p, mean = 2, sd = 1.2), ncol = p) group3 <- matrix(rnorm(n_per_group * p, mean = -1, sd = 0.8), ncol = p) group2[, 1:5] <- group2[, 1:5] + 3 group3[, 10:15] <- group3[, 10:15] - 4 X <- rbind(group1, group2, group3) labels <- factor(rep(c("Type A", "Type B", "Type C"), each = n_per_group)) cols <- c("steelblue", "firebrick", "forestgreen") # ---- (a) UMAP with varying n_neighbors, min_dist = 0.1 ---- n_neighbors_vals <- c(5, 15, 50, 100) par(mfrow = c(2, 2), mar = c(4, 4, 3, 1)) for (nn in n_neighbors_vals) { set.seed(42) umap_result <- umap(X, n_neighbors = nn, min_dist = 0.1, verbose = FALSE) plot(umap_result, col = cols[labels], pch = 16, cex = 0.8, main = paste("n_neighbors =", nn, ", min_dist = 0.1"), xlab = "UMAP 1", ylab = "UMAP 2") if (nn == 5) { legend("topright", levels(labels), col = cols, pch = 16, cex = 0.7) } } # ---- (b) UMAP with varying min_dist, n_neighbors = 15 ---- min_dist_vals <- c(0.0, 0.1, 0.5, 1.0) par(mfrow = c(2, 2), mar = c(4, 4, 3, 1)) for (md in min_dist_vals) { set.seed(42) umap_result <- umap(X, n_neighbors = 15, min_dist = md, verbose = FALSE) plot(umap_result, col = cols[labels], pch = 16, cex = 0.8, main = paste("n_neighbors = 15, min_dist =", md), xlab = "UMAP 1", ylab = "UMAP 2") if (md == 0.0) { legend("topright", levels(labels), col = cols, pch = 16, cex = 0.7) } } # ---- (c) Already done above via the two panels of plots ---- # ---- (d) Description and recommendation ---- cat("=== Part (d): How Each Parameter Affects Visual Appearance ===\n\n") cat("n_neighbors (with min_dist = 0.1 fixed):\n") cat(" n_neighbors = 5: Very tight, fragmented clusters. Each point\n") cat(" considers only 5 neighbors, emphasising micro-structure.\n") cat(" May split true clusters into sub-groups.\n") cat(" n_neighbors = 15: Well-separated, compact clusters. Good\n") cat(" balance between local and global structure.\n") cat(" n_neighbors = 50: Broader clusters, more connected. Groups\n") cat(" start to merge slightly as the algorithm considers wider\n") cat(" neighbourhoods.\n") cat(" n_neighbors = 100: Even more spread out. Global structure is\n") cat(" emphasised over local detail. Clusters are less tightly packed.\n\n") cat("min_dist (with n_neighbors = 15 fixed):\n") cat(" min_dist = 0.0: Very tight, dense clusters. Points are packed\n") cat(" as close together as possible. Maximum visual separation.\n") cat(" min_dist = 0.1: Slightly looser clusters. Good default.\n") cat(" min_dist = 0.5: More spread-out embedding. Internal cluster\n") cat(" structure becomes more visible but inter-cluster gaps shrink.\n") cat(" min_dist = 1.0: Very spread out, almost uniform. Clusters\n") cat(" overlap, and the embedding loses much of its structure.\n\n") cat("Recommendation for this dataset:\n") cat(" n_neighbors = 15, min_dist = 0.1\n") cat(" This combination provides well-separated, compact clusters\n") cat(" that clearly reveal the three-group structure. It is also\n") cat(" the default in most implementations, and for good reason:\n") cat(" it balances local detail with global structure effectively.\n") ``` #### Python ```{python} #| label: sol-ch15-ex3-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # Chapter 15, Exercise 3: UMAP Parameter Exploration # Using the three-group simulated dataset from the chapter import numpy as np import matplotlib.pyplot as plt from umap import UMAP # ---- Simulate data (same as chapter) ---- np.random.seed(42) n_per = 100 p = 20 g1 = np.random.normal(0, 1, (n_per, p)) g2 = np.random.normal(2, 1.2, (n_per, p)) g3 = np.random.normal(-1, 0.8, (n_per, p)) g2[:, :5] += 3 g3[:, 10:15] -= 4 X = np.vstack([g1, g2, g3]) labels = np.repeat(["Type A", "Type B", "Type C"], n_per) colors = {"Type A": "steelblue", "Type B": "firebrick", "Type C": "forestgreen"} # ---- (a) UMAP with varying n_neighbors, min_dist = 0.1 ---- n_neighbors_vals = [5, 15, 50, 100] fig, axes = plt.subplots(2, 2, figsize=(12, 10)) axes = axes.flatten() for ax, nn in zip(axes, n_neighbors_vals): umap_emb = UMAP(n_components=2, n_neighbors=nn, min_dist=0.1, random_state=42).fit_transform(X) for lab in ["Type A", "Type B", "Type C"]: mask = labels == lab ax.scatter(umap_emb[mask, 0], umap_emb[mask, 1], c=colors[lab], s=15, alpha=0.7, label=lab) ax.set_title(f"n_neighbors = {nn}, min_dist = 0.1") ax.set_xlabel("UMAP 1") ax.set_ylabel("UMAP 2") ax.legend(fontsize=8) plt.suptitle("UMAP: Varying n_neighbors", fontsize=14, y=1.02) plt.tight_layout() plt.savefig("ch15_ex3_umap_neighbors.png", dpi=150) plt.show() # ---- (b) UMAP with varying min_dist, n_neighbors = 15 ---- min_dist_vals = [0.0, 0.1, 0.5, 1.0] fig, axes = plt.subplots(2, 2, figsize=(12, 10)) axes = axes.flatten() for ax, md in zip(axes, min_dist_vals): umap_emb = UMAP(n_components=2, n_neighbors=15, min_dist=md, random_state=42).fit_transform(X) for lab in ["Type A", "Type B", "Type C"]: mask = labels == lab ax.scatter(umap_emb[mask, 0], umap_emb[mask, 1], c=colors[lab], s=15, alpha=0.7, label=lab) ax.set_title(f"n_neighbors = 15, min_dist = {md}") ax.set_xlabel("UMAP 1") ax.set_ylabel("UMAP 2") ax.legend(fontsize=8) plt.suptitle("UMAP: Varying min_dist", fontsize=14, y=1.02) plt.tight_layout() plt.savefig("ch15_ex3_umap_mindist.png", dpi=150) plt.show() # ---- (d) Description and recommendation ---- print("=== Part (d): How Each Parameter Affects Visual Appearance ===\n") print("n_neighbors (with min_dist = 0.1 fixed):") print(" n_neighbors = 5: Very tight, fragmented clusters. Emphasises") print(" micro-structure, may split true clusters.") print(" n_neighbors = 15: Well-separated, compact clusters. Good balance") print(" between local and global structure.") print(" n_neighbors = 50: Broader clusters, more connected. Groups") print(" start to merge as wider neighbourhoods are considered.") print(" n_neighbors = 100: Spread out. Global structure emphasised.\n") print("min_dist (with n_neighbors = 15 fixed):") print(" min_dist = 0.0: Very tight, dense clusters. Maximum separation.") print(" min_dist = 0.1: Slightly looser. Good default.") print(" min_dist = 0.5: Spread-out embedding. Internal structure visible") print(" but inter-cluster gaps shrink.") print(" min_dist = 1.0: Very spread out, near-uniform. Clusters overlap.\n") print("Recommendation for this dataset:") print(" n_neighbors = 15, min_dist = 0.1") print(" This provides well-separated, compact clusters that clearly") print(" reveal the three-group structure. It is the default for good") print(" reason: it balances local detail with global structure.") ``` ::: ### Exercise 4 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch15-ex4-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # Chapter 15, Exercise 4: Full Workflow on Simulated Genomic Data # 1000 patients, 500 gene expression features, 4 cancer subtypes library(tidyverse) library(Rtsne) library(uwot) # ---- Simulate data ---- set.seed(42) n <- 1000 p <- 500 # 4 cancer subtypes with different prevalences (one rare at 5%) subtype_probs <- c(0.35, 0.30, 0.30, 0.05) subtype <- sample(1:4, n, replace = TRUE, prob = subtype_probs) cat("Subtype distribution:\n") print(table(subtype)) # Generate base gene expression X <- matrix(rnorm(n * p), ncol = p) # Add subtype-specific signals in different gene subsets # Subtype 1: upregulated in genes 1-30 X[subtype == 1, 1:30] <- X[subtype == 1, 1:30] + 2.0 # Subtype 2: upregulated in genes 31-60, downregulated in 61-80 X[subtype == 2, 31:60] <- X[subtype == 2, 31:60] + 2.5 X[subtype == 2, 61:80] <- X[subtype == 2, 61:80] - 1.5 # Subtype 3: upregulated in genes 81-120 X[subtype == 3, 81:120] <- X[subtype == 3, 81:120] + 2.0 # Subtype 4 (rare): strong signal in genes 121-160 X[subtype == 4, 121:160] <- X[subtype == 4, 121:160] + 3.5 # ---- (a) Scale and PCA ---- cat("\n=== Part (a): PCA ===\n") X_scaled <- scale(X) pca_result <- prcomp(X_scaled) var_prop <- pca_result$sdev^2 / sum(pca_result$sdev^2) cum_var <- cumsum(var_prop) n_80 <- which(cum_var >= 0.80)[1] cat("PCs needed for 80% variance:", n_80, "\n") cat("First 10 cumulative variance:", round(cum_var[1:10], 3), "\n") # Scree plot (first 30 components) par(mfrow = c(1, 1)) barplot(var_prop[1:30], names.arg = 1:30, col = "steelblue", main = "Scree Plot (first 30 PCs)", xlab = "Component", ylab = "Proportion of Variance") abline(h = 1/p, col = "firebrick", lty = 2) # ---- (b) UMAP on first 30 PCs ---- cat("\n=== Part (b): UMAP on First 30 PCs ===\n") pca_30 <- pca_result$x[, 1:30] set.seed(42) umap_result <- umap(pca_30, n_neighbors = 15, min_dist = 0.1, verbose = FALSE) subtype_labels <- factor(subtype, labels = c("Subtype 1", "Subtype 2", "Subtype 3", "Subtype 4 (rare)")) cols <- c("steelblue", "firebrick", "forestgreen", "goldenrod") plot_df <- tibble( UMAP1 = umap_result[, 1], UMAP2 = umap_result[, 2], Subtype = subtype_labels ) # UMAP plot par(mfrow = c(1, 1)) plot(umap_result, col = cols[subtype], pch = 16, cex = 0.7, main = "UMAP of Simulated Genomic Data (4 Cancer Subtypes)", xlab = "UMAP 1", ylab = "UMAP 2") legend("topright", levels(subtype_labels), col = cols, pch = 16, cex = 0.8) cat("The four subtypes are visually separable in the UMAP embedding.\n") cat("Each subtype forms a distinct cluster, reflecting the different\n") cat("gene expression profiles we simulated.\n") # ---- (c) t-SNE comparison ---- cat("\n=== Part (c): t-SNE Comparison ===\n") set.seed(42) tsne_result <- Rtsne(pca_30, perplexity = 30, dims = 2, verbose = FALSE, max_iter = 1000) par(mfrow = c(1, 2), mar = c(4, 4, 3, 1)) plot(umap_result, col = cols[subtype], pch = 16, cex = 0.7, main = "UMAP (n_neighbors=15)", xlab = "UMAP 1", ylab = "UMAP 2") legend("topright", levels(subtype_labels), col = cols, pch = 16, cex = 0.6) plot(tsne_result$Y, col = cols[subtype], pch = 16, cex = 0.7, main = "t-SNE (perplexity=30)", xlab = "t-SNE 1", ylab = "t-SNE 2") legend("topright", levels(subtype_labels), col = cols, pch = 16, cex = 0.6) cat("Both UMAP and t-SNE successfully separate the four subtypes.\n") cat("UMAP tends to preserve more global structure (relative distances\n") cat("between clusters are more meaningful), while t-SNE may produce\n") cat("more compact and well-separated clusters but with unreliable\n") cat("inter-cluster distances.\n") # ---- (d) Can the rare subtype be identified? ---- cat("\n=== Part (d): Identifying the Rare Subtype ===\n") n_rare <- sum(subtype == 4) cat("Rare subtype (Subtype 4) has", n_rare, "patients (5% of total).\n\n") cat("YES, the rare subtype can be identified in the UMAP plot.\n") cat("Despite having only ~50 patients, Subtype 4 forms a distinct\n") cat("cluster (shown in goldenrod/yellow). This is because:\n") cat(" 1. The signal is strong (effect size = 3.5 in 40 genes)\n") cat(" 2. UMAP preserves local structure well, so even small groups\n") cat(" remain cohesive\n") cat(" 3. The subtype's gene expression pattern is qualitatively\n") cat(" different from the other subtypes (different genes)\n\n") cat("In practice, rare subtypes CAN be identified if:\n") cat(" - Their molecular profile is sufficiently distinct\n") cat(" - The sample size is not too small (>20-30 patients)\n") cat(" - Dimensionality reduction preserves the relevant structure\n") cat("If the signal were weaker, the rare subtype might merge with\n") cat("another group or be scattered as outliers.\n") ``` #### Python ```{python} #| label: sol-ch15-ex4-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # Chapter 15, Exercise 4: Full Workflow on Simulated Genomic Data # 1000 patients, 500 gene expression features, 4 cancer subtypes import numpy as np import matplotlib.pyplot as plt from sklearn.preprocessing import StandardScaler from sklearn.decomposition import PCA from sklearn.manifold import TSNE from umap import UMAP # ---- Simulate data ---- np.random.seed(42) n = 1000 p = 500 # 4 cancer subtypes (one rare at 5%) subtype = np.random.choice(4, n, p=[0.35, 0.30, 0.30, 0.05]) print("Subtype distribution:", {i: (subtype == i).sum() for i in range(4)}) # Generate base gene expression X = np.random.normal(0, 1, (n, p)) # Add subtype-specific signals X[subtype == 0, :30] += 2.0 # Subtype 1: genes 0-29 X[subtype == 1, 30:60] += 2.5 # Subtype 2: genes 30-59 X[subtype == 1, 60:80] -= 1.5 # Subtype 2: downregulated genes 60-79 X[subtype == 2, 80:120] += 2.0 # Subtype 3: genes 80-119 X[subtype == 3, 120:160] += 3.5 # Subtype 4 (rare): genes 120-159 # ---- (a) Scale and PCA ---- print("\n=== Part (a): PCA ===") X_scaled = StandardScaler().fit_transform(X) pca = PCA() scores = pca.fit_transform(X_scaled) cum_var = np.cumsum(pca.explained_variance_ratio_) n_80 = np.argmax(cum_var >= 0.80) + 1 print(f"PCs needed for 80% variance: {n_80}") print(f"First 10 cumulative variance: {np.round(cum_var[:10], 3)}") # Scree plot fig, ax = plt.subplots(figsize=(10, 4)) ax.bar(range(1, 31), pca.explained_variance_ratio_[:30], color="steelblue", edgecolor="white") ax.axhline(y=1/p, color="firebrick", linestyle="--", label=f"1/p = {1/p:.4f}") ax.set_xlabel("Component") ax.set_ylabel("Proportion of Variance") ax.set_title("Scree Plot (first 30 PCs)") ax.legend() plt.tight_layout() plt.savefig("ch15_ex4_scree.png", dpi=150) plt.show() # ---- (b) UMAP on first 30 PCs ---- print("\n=== Part (b): UMAP on First 30 PCs ===") pca_30 = scores[:, :30] umap_2d = UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(pca_30) subtype_names = ["Subtype 1", "Subtype 2", "Subtype 3", "Subtype 4 (rare)"] colors_4 = ["steelblue", "firebrick", "forestgreen", "goldenrod"] fig, ax = plt.subplots(figsize=(8, 6)) for s in range(4): mask = subtype == s ax.scatter(umap_2d[mask, 0], umap_2d[mask, 1], c=colors_4[s], s=12, alpha=0.6, label=subtype_names[s]) ax.set_xlabel("UMAP 1") ax.set_ylabel("UMAP 2") ax.set_title("UMAP of Simulated Genomic Data (4 Cancer Subtypes)") ax.legend() plt.tight_layout() plt.savefig("ch15_ex4_umap.png", dpi=150) plt.show() print("The four subtypes are visually separable in the UMAP embedding.") # ---- (c) t-SNE comparison ---- print("\n=== Part (c): t-SNE Comparison ===") tsne_2d = TSNE(n_components=2, perplexity=30, random_state=42, max_iter=1000).fit_transform(pca_30) fig, axes = plt.subplots(1, 2, figsize=(14, 5)) for s in range(4): mask = subtype == s axes[0].scatter(umap_2d[mask, 0], umap_2d[mask, 1], c=colors_4[s], s=12, alpha=0.6, label=subtype_names[s]) axes[1].scatter(tsne_2d[mask, 0], tsne_2d[mask, 1], c=colors_4[s], s=12, alpha=0.6, label=subtype_names[s]) axes[0].set_title("UMAP (n_neighbors=15)") axes[0].set_xlabel("UMAP 1") axes[0].set_ylabel("UMAP 2") axes[0].legend(fontsize=8) axes[1].set_title("t-SNE (perplexity=30)") axes[1].set_xlabel("t-SNE 1") axes[1].set_ylabel("t-SNE 2") axes[1].legend(fontsize=8) plt.tight_layout() plt.savefig("ch15_ex4_comparison.png", dpi=150) plt.show() print("Both UMAP and t-SNE successfully separate the four subtypes.") print("UMAP preserves more global structure; t-SNE may produce more") print("compact clusters but with unreliable inter-cluster distances.") # ---- (d) Identifying the rare subtype ---- print("\n=== Part (d): Identifying the Rare Subtype ===") n_rare = (subtype == 3).sum() print(f"Rare subtype (Subtype 4) has {n_rare} patients (5% of total).\n") print("YES, the rare subtype can be identified in the UMAP plot.") print("Despite having only ~50 patients, Subtype 4 forms a distinct") print("cluster (shown in goldenrod). This is because:") print(" 1. The signal is strong (effect size = 3.5 in 40 genes)") print(" 2. UMAP preserves local structure, so small groups remain cohesive") print(" 3. The gene expression pattern is qualitatively different\n") print("In practice, rare subtypes CAN be identified if:") print(" - Their molecular profile is sufficiently distinct") print(" - Sample size is not too small (>20-30 patients)") print(" - Dimensionality reduction preserves the relevant structure") print("If the signal were weaker, the rare subtype might merge with") print("another group or scatter as outliers.") ``` ::: ## Chapter 16: Clustering ### Exercise 1 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch16-ex1-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # Chapter 16, Exercise 1: K-Means on Simulated Patient Data # 600 patients, 6 clinical variables, 3 underlying phenotypes library(tidyverse) library(cluster) # ---- Simulate data ---- set.seed(42) n <- 600 # Phenotype 1: Septic shock (high HR, RR, lactate, low SBP) p1 <- tibble( hr = rnorm(200, 115, 12), rr = rnorm(200, 28, 5), temp = rnorm(200, 38.5, 0.8), sbp = rnorm(200, 80, 12), creat = rnorm(200, 2.5, 0.8), lactate = rnorm(200, 5, 2) ) # Phenotype 2: Stable critical (moderate vitals) p2 <- tibble( hr = rnorm(200, 90, 10), rr = rnorm(200, 20, 3), temp = rnorm(200, 37.2, 0.5), sbp = rnorm(200, 110, 15), creat = rnorm(200, 1.2, 0.3), lactate = rnorm(200, 1.5, 0.5) ) # Phenotype 3: Febrile, preserved hemodynamics p3 <- tibble( hr = rnorm(200, 100, 10), rr = rnorm(200, 22, 4), temp = rnorm(200, 39.2, 0.7), sbp = rnorm(200, 120, 10), creat = rnorm(200, 1.0, 0.2), lactate = rnorm(200, 2.0, 0.8) ) dat <- bind_rows(p1, p2, p3) true_labels <- rep(1:3, each = 200) # ---- (a) Scale and apply K-means for k = 2 to 6 ---- dat_scaled <- scale(dat) wcss <- numeric(6) sil_avg <- numeric(6) for (k in 2:6) { km <- kmeans(dat_scaled, centers = k, nstart = 25) wcss[k] <- km$tot.withinss sil_avg[k] <- mean(silhouette(km$cluster, dist(dat_scaled))[, 3]) } # ---- (b) Elbow and silhouette plots ---- par(mfrow = c(1, 2), mar = c(4, 4, 3, 1)) plot(2:6, wcss[2:6], type = "b", pch = 16, col = "steelblue", xlab = "Number of clusters (k)", ylab = "Within-cluster SS", main = "Elbow Method") plot(2:6, sil_avg[2:6], type = "b", pch = 16, col = "firebrick", xlab = "Number of clusters (k)", ylab = "Average Silhouette", main = "Silhouette Scores") cat("=== Part (b): Optimal k ===\n") cat("Elbow plot: The elbow appears at k = 3, where WCSS decreases\n") cat(" sharply and then flattens.\n") cat("Silhouette: Maximum average silhouette at k =", which.max(sil_avg[2:6]) + 1, "\n") cat("Both methods suggest k = 3, consistent with the 3 simulated phenotypes.\n") # ---- (c) Visualise k=3 solution using PCA ---- km3 <- kmeans(dat_scaled, centers = 3, nstart = 25) pca2 <- prcomp(dat_scaled)$x[, 1:2] par(mfrow = c(1, 1)) cols <- c("steelblue", "firebrick", "forestgreen") plot(pca2, col = cols[km3$cluster], pch = 16, cex = 0.7, main = "K-means (k=3) on First 2 PCs", xlab = "PC1", ylab = "PC2") legend("topright", paste("Cluster", 1:3), col = cols, pch = 16, cex = 0.8) # ---- (d) Profile the clusters ---- cat("\n=== Part (d): Cluster Profiles ===\n\n") dat$cluster <- km3$cluster profiles <- dat %>% group_by(cluster) %>% summarise(across(hr:lactate, mean), n = n(), .groups = "drop") print(profiles) cat("\nClinical interpretation of cluster profiles:\n\n") # Identify which cluster matches which phenotype for (cl in 1:3) { prof <- profiles %>% filter(cluster == cl) cat(sprintf("Cluster %d (n = %d):\n", cl, prof$n)) cat(sprintf(" HR=%.0f, RR=%.0f, Temp=%.1f, SBP=%.0f, Creat=%.1f, Lactate=%.1f\n", prof$hr, prof$rr, prof$temp, prof$sbp, prof$creat, prof$lactate)) if (prof$lactate > 3 && prof$sbp < 90) { cat(" -> Matches SEPTIC SHOCK phenotype: high HR, RR, lactate; low SBP\n") } else if (prof$temp > 38.5) { cat(" -> Matches FEBRILE phenotype: high temp, preserved hemodynamics\n") } else { cat(" -> Matches STABLE CRITICAL phenotype: moderate vitals\n") } cat("\n") } cat("The cluster profiles make clinical sense. The algorithm has\n") cat("successfully recovered the three simulated phenotypes, each\n") cat("with a distinct clinical profile that a clinician would recognise.\n") ``` #### Python ```{python} #| label: sol-ch16-ex1-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # Chapter 16, Exercise 1: K-Means on Simulated Patient Data # 600 patients, 6 clinical variables, 3 underlying phenotypes import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn.preprocessing import StandardScaler from sklearn.cluster import KMeans from sklearn.metrics import silhouette_score from sklearn.decomposition import PCA # ---- Simulate data ---- np.random.seed(42) # Phenotype 1: Septic shock p1 = np.column_stack([ np.random.normal(115, 12, 200), # hr np.random.normal(28, 5, 200), # rr np.random.normal(38.5, 0.8, 200), # temp np.random.normal(80, 12, 200), # sbp np.random.normal(2.5, 0.8, 200), # creat np.random.normal(5, 2, 200) # lactate ]) # Phenotype 2: Stable critical p2 = np.column_stack([ np.random.normal(90, 10, 200), np.random.normal(20, 3, 200), np.random.normal(37.2, 0.5, 200), np.random.normal(110, 15, 200), np.random.normal(1.2, 0.3, 200), np.random.normal(1.5, 0.5, 200) ]) # Phenotype 3: Febrile p3 = np.column_stack([ np.random.normal(100, 10, 200), np.random.normal(22, 4, 200), np.random.normal(39.2, 0.7, 200), np.random.normal(120, 10, 200), np.random.normal(1.0, 0.2, 200), np.random.normal(2.0, 0.8, 200) ]) X = np.vstack([p1, p2, p3]) cols = ['hr', 'rr', 'temp', 'sbp', 'creat', 'lactate'] df = pd.DataFrame(X, columns=cols) # ---- (a) Scale and K-means for k = 2 to 6 ---- X_scaled = StandardScaler().fit_transform(X) ks = range(2, 7) wcss_list = [] sil_list = [] for k in ks: km = KMeans(n_clusters=k, n_init=25, random_state=42) km.fit(X_scaled) wcss_list.append(km.inertia_) sil_list.append(silhouette_score(X_scaled, km.labels_)) # ---- (b) Elbow and silhouette plots ---- fig, axes = plt.subplots(1, 2, figsize=(12, 5)) axes[0].plot(list(ks), wcss_list, "o-", color="steelblue") axes[0].set_xlabel("Number of clusters (k)") axes[0].set_ylabel("Within-cluster SS") axes[0].set_title("Elbow Method") axes[1].plot(list(ks), sil_list, "o-", color="firebrick") axes[1].set_xlabel("Number of clusters (k)") axes[1].set_ylabel("Average Silhouette") axes[1].set_title("Silhouette Scores") plt.tight_layout() plt.savefig("ch16_ex1_elbow_silhouette.png", dpi=150) plt.show() best_k = list(ks)[np.argmax(sil_list)] print(f"=== Part (b): Optimal k ===") print(f"Elbow: Elbow appears at k = 3") print(f"Silhouette: Maximum at k = {best_k}") print(f"Both methods suggest k = 3, matching the 3 simulated phenotypes.") # ---- (c) Visualise k=3 with PCA ---- km3 = KMeans(n_clusters=3, n_init=25, random_state=42).fit(X_scaled) pca2 = PCA(n_components=2).fit_transform(X_scaled) fig, ax = plt.subplots(figsize=(8, 6)) colors_3 = ["steelblue", "firebrick", "forestgreen"] for c in range(3): mask = km3.labels_ == c ax.scatter(pca2[mask, 0], pca2[mask, 1], c=colors_3[c], s=15, alpha=0.6, label=f"Cluster {c+1}") ax.set_xlabel("PC1") ax.set_ylabel("PC2") ax.set_title("K-means (k=3) on First 2 PCs") ax.legend() plt.tight_layout() plt.savefig("ch16_ex1_pca_clusters.png", dpi=150) plt.show() # ---- (d) Profile the clusters ---- print("\n=== Part (d): Cluster Profiles ===\n") df['cluster'] = km3.labels_ profiles = df.groupby('cluster')[cols].mean() print(profiles.round(1)) print("\nClinical interpretation:") for cl in range(3): prof = profiles.loc[cl] n_cl = (km3.labels_ == cl).sum() print(f"\nCluster {cl+1} (n = {n_cl}):") print(f" HR={prof['hr']:.0f}, RR={prof['rr']:.0f}, Temp={prof['temp']:.1f}, " f"SBP={prof['sbp']:.0f}, Creat={prof['creat']:.1f}, Lactate={prof['lactate']:.1f}") if prof['lactate'] > 3 and prof['sbp'] < 90: print(" -> SEPTIC SHOCK: high HR, RR, lactate; low SBP") elif prof['temp'] > 38.5: print(" -> FEBRILE: high temperature, preserved hemodynamics") else: print(" -> STABLE CRITICAL: moderate vitals") print("\nThe cluster profiles make clinical sense. The algorithm") print("successfully recovered the three simulated phenotypes.") ``` ::: ### Exercise 2 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch16-ex2-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # Chapter 16, Exercise 2: Hierarchical Clustering with Different Linkage Methods # Using the same dataset from Exercise 1 library(tidyverse) library(cluster) library(mclust) # for adjustedRandIndex # ---- Simulate data (same as Exercise 1) ---- set.seed(42) p1 <- tibble(hr = rnorm(200, 115, 12), rr = rnorm(200, 28, 5), temp = rnorm(200, 38.5, 0.8), sbp = rnorm(200, 80, 12), creat = rnorm(200, 2.5, 0.8), lactate = rnorm(200, 5, 2)) p2 <- tibble(hr = rnorm(200, 90, 10), rr = rnorm(200, 20, 3), temp = rnorm(200, 37.2, 0.5), sbp = rnorm(200, 110, 15), creat = rnorm(200, 1.2, 0.3), lactate = rnorm(200, 1.5, 0.5)) p3 <- tibble(hr = rnorm(200, 100, 10), rr = rnorm(200, 22, 4), temp = rnorm(200, 39.2, 0.7), sbp = rnorm(200, 120, 10), creat = rnorm(200, 1.0, 0.2), lactate = rnorm(200, 2.0, 0.8)) dat <- bind_rows(p1, p2, p3) dat_scaled <- scale(dat) # K-means reference solution km3 <- kmeans(dat_scaled, centers = 3, nstart = 25) # ---- (a) Hierarchical clustering with 4 linkage methods ---- cat("=== Part (a): Hierarchical Clustering ===\n\n") d <- dist(dat_scaled) methods <- c("single", "complete", "average", "ward.D2") titles <- c("Single", "Complete", "Average", "Ward's") # Use a subset for readable dendrograms set.seed(42) idx <- sample(nrow(dat_scaled), 80) par(mfrow = c(2, 2), mar = c(2, 3, 3, 1)) for (i in seq_along(methods)) { hc <- hclust(dist(dat_scaled[idx, ]), method = methods[i]) plot(hc, labels = FALSE, main = titles[i], xlab = "", sub = "", hang = -1, cex = 0.5) rect.hclust(hc, k = 3, border = "firebrick") } # ---- (b) Cut each dendrogram at k=3, compare assignments ---- cat("=== Part (b): Cluster Assignments ===\n\n") hc_clusters <- list() for (i in seq_along(methods)) { hc <- hclust(d, method = methods[i]) hc_clusters[[methods[i]]] <- cutree(hc, k = 3) } # Crosstabulation between methods cat("Crosstab: Ward's vs Complete linkage:\n") print(table(Ward = hc_clusters[["ward.D2"]], Complete = hc_clusters[["complete"]])) cat("\nCrosstab: Ward's vs Single linkage:\n") print(table(Ward = hc_clusters[["ward.D2"]], Single = hc_clusters[["single"]])) # ---- (c) ARI comparison with K-means ---- cat("\n=== Part (c): Adjusted Rand Index vs K-means ===\n\n") for (i in seq_along(methods)) { ari <- adjustedRandIndex(km3$cluster, hc_clusters[[methods[i]]]) cat(sprintf("ARI(%s vs K-means): %.3f\n", titles[i], ari)) } cat("\nWard's method produces clusters most similar to K-means (highest ARI).\n") cat("This is expected because both Ward's method and K-means minimise\n") cat("within-cluster variance (WCSS), so they have similar objectives.\n") # ---- (d) Recommendation for clinical data ---- cat("\n=== Part (d): Recommendation ===\n\n") cat("Ward's method is the recommended default for clinical data because:\n\n") cat("1. OBJECTIVE: Ward's minimises within-cluster variance, which\n") cat(" aligns with the goal of finding compact, homogeneous clusters.\n\n") cat("2. CONSISTENCY: It produces results most similar to K-means,\n") cat(" which is the most widely used method. Using both and checking\n") cat(" for agreement strengthens confidence in the results.\n\n") cat("3. ROBUSTNESS: Complete linkage can be distorted by outliers;\n") cat(" single linkage creates chaining effects (long, straggling\n") cat(" clusters). Ward's avoids both issues.\n\n") cat("4. COMPACT CLUSTERS: Clinical phenotypes are typically expected\n") cat(" to be compact groups of similar patients, which Ward's\n") cat(" naturally produces.\n\n") cat("5. INTERPRETABILITY: The dendrogram from Ward's method is\n") cat(" usually the easiest to interpret, with clear height gaps\n") cat(" indicating natural cluster boundaries.\n\n") cat("CAVEAT: If you suspect non-spherical clusters (e.g., disease\n") cat("trajectories that form elongated shapes), average linkage\n") cat("might be more appropriate. But for most clinical phenotyping\n") cat("applications, Ward's is the safe default.\n") ``` #### Python ```{python} #| label: sol-ch16-ex2-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # Chapter 16, Exercise 2: Hierarchical Clustering with Different Linkage Methods # Using the same dataset from Exercise 1 import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn.preprocessing import StandardScaler from sklearn.cluster import KMeans, AgglomerativeClustering from sklearn.metrics import adjusted_rand_score from scipy.cluster.hierarchy import dendrogram, linkage, fcluster # ---- Simulate data (same as Exercise 1) ---- np.random.seed(42) p1 = np.column_stack([ np.random.normal(115, 12, 200), np.random.normal(28, 5, 200), np.random.normal(38.5, 0.8, 200), np.random.normal(80, 12, 200), np.random.normal(2.5, 0.8, 200), np.random.normal(5, 2, 200) ]) p2 = np.column_stack([ np.random.normal(90, 10, 200), np.random.normal(20, 3, 200), np.random.normal(37.2, 0.5, 200), np.random.normal(110, 15, 200), np.random.normal(1.2, 0.3, 200), np.random.normal(1.5, 0.5, 200) ]) p3 = np.column_stack([ np.random.normal(100, 10, 200), np.random.normal(22, 4, 200), np.random.normal(39.2, 0.7, 200), np.random.normal(120, 10, 200), np.random.normal(1.0, 0.2, 200), np.random.normal(2.0, 0.8, 200) ]) X = np.vstack([p1, p2, p3]) X_scaled = StandardScaler().fit_transform(X) # K-means reference km3 = KMeans(n_clusters=3, n_init=25, random_state=42).fit(X_scaled) # ---- (a) Hierarchical clustering with 4 linkage methods ---- print("=== Part (a): Hierarchical Clustering ===\n") methods = ['single', 'complete', 'average', 'ward'] titles = ['Single', 'Complete', 'Average', "Ward's"] # Dendrograms (use subset for readability) np.random.seed(42) idx = np.random.choice(len(X_scaled), 80, replace=False) X_sub = X_scaled[idx] fig, axes = plt.subplots(2, 2, figsize=(16, 10)) axes = axes.flatten() for ax, method, title in zip(axes, methods, titles): Z = linkage(X_sub, method=method) dendrogram(Z, ax=ax, no_labels=True, color_threshold=0) ax.set_title(title) ax.set_xlabel("") plt.suptitle("Hierarchical Clustering: Linkage Method Comparison", fontsize=14) plt.tight_layout() plt.savefig("ch16_ex2_dendrograms.png", dpi=150) plt.show() # ---- (b) Cut at k=3 and compare ---- print("=== Part (b): Cluster Assignments ===\n") hc_labels = {} for method in methods: Z = linkage(X_scaled, method=method) hc_labels[method] = fcluster(Z, t=3, criterion='maxclust') # Crosstabs print("Crosstab: Ward's vs Complete:") ct = pd.crosstab(pd.Series(hc_labels['ward'], name='Ward'), pd.Series(hc_labels['complete'], name='Complete')) print(ct) print("\nCrosstab: Ward's vs Single:") ct2 = pd.crosstab(pd.Series(hc_labels['ward'], name='Ward'), pd.Series(hc_labels['single'], name='Single')) print(ct2) # ---- (c) ARI comparison with K-means ---- print("\n=== Part (c): Adjusted Rand Index vs K-means ===\n") for method, title in zip(methods, titles): ari = adjusted_rand_score(km3.labels_, hc_labels[method]) print(f"ARI({title} vs K-means): {ari:.3f}") print("\nWard's method produces clusters most similar to K-means (highest ARI).") print("Both minimise within-cluster variance, so they share similar objectives.") # ---- (d) Recommendation ---- print("\n=== Part (d): Recommendation ===\n") print("Ward's method is recommended for clinical data because:\n") print("1. OBJECTIVE: Minimises within-cluster variance, producing") print(" compact, homogeneous clusters.\n") print("2. CONSISTENCY: Most similar to K-means, strengthening") print(" confidence when both methods agree.\n") print("3. ROBUSTNESS: Avoids chaining (single linkage) and outlier") print(" sensitivity (complete linkage).\n") print("4. COMPACT CLUSTERS: Clinical phenotypes are typically compact") print(" groups, which Ward's naturally produces.\n") print("5. INTERPRETABILITY: Dendrogram with clear height gaps at") print(" natural cluster boundaries.\n") print("CAVEAT: For non-spherical clusters, average linkage may be") print("more appropriate. But Ward's is the safe default for most") print("clinical phenotyping applications.") ``` ::: ### Exercise 3 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch16-ex3-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # Chapter 16, Exercise 3: DBSCAN for Outlier Detection # Add outlier patients to the Exercise 1 dataset library(tidyverse) library(cluster) library(dbscan) # ---- Simulate data (same as Exercise 1) ---- set.seed(42) p1 <- tibble(hr = rnorm(200, 115, 12), rr = rnorm(200, 28, 5), temp = rnorm(200, 38.5, 0.8), sbp = rnorm(200, 80, 12), creat = rnorm(200, 2.5, 0.8), lactate = rnorm(200, 5, 2)) p2 <- tibble(hr = rnorm(200, 90, 10), rr = rnorm(200, 20, 3), temp = rnorm(200, 37.2, 0.5), sbp = rnorm(200, 110, 15), creat = rnorm(200, 1.2, 0.3), lactate = rnorm(200, 1.5, 0.5)) p3 <- tibble(hr = rnorm(200, 100, 10), rr = rnorm(200, 22, 4), temp = rnorm(200, 39.2, 0.7), sbp = rnorm(200, 120, 10), creat = rnorm(200, 1.0, 0.2), lactate = rnorm(200, 2.0, 0.8)) dat <- bind_rows(p1, p2, p3) is_outlier <- rep(FALSE, 600) # Add 30 outlier patients with extreme values set.seed(99) outliers <- tibble( hr = rnorm(30, 180, 10), # Extreme heart rate rr = rnorm(30, 40, 5), # Extreme respiratory rate temp = rnorm(30, 40.5, 0.5), # Very high temperature sbp = rnorm(30, 50, 10), # Very low blood pressure creat = rnorm(30, 8, 1.5), # Very high creatinine lactate = rnorm(30, 15, 3) # Very high lactate ) dat <- bind_rows(dat, outliers) is_outlier <- c(is_outlier, rep(TRUE, 30)) dat_scaled <- scale(dat) cat("Total patients:", nrow(dat), "(600 normal + 30 outliers)\n\n") # ---- (a) K-means with k=3 ---- cat("=== Part (a): K-means with k=3 ===\n") km3 <- kmeans(dat_scaled, centers = 3, nstart = 25) # Where do outliers end up? outlier_clusters <- km3$cluster[is_outlier] cat("Outlier distribution across K-means clusters:\n") print(table(outlier_clusters)) cat("\nK-means assigns outliers to one of the existing clusters,\n") cat("typically the cluster whose centroid is nearest (even if far).\n") cat("This can distort the centroid and corrupt the cluster profiles.\n") # Show how outliers affect cluster means dat$km_cluster <- km3$cluster dat$is_outlier <- is_outlier for (cl in 1:3) { n_outlier_in_cl <- sum(dat$km_cluster == cl & dat$is_outlier) n_total_in_cl <- sum(dat$km_cluster == cl) cat(sprintf("Cluster %d: %d patients (%d outliers)\n", cl, n_total_in_cl, n_outlier_in_cl)) } # ---- (b) DBSCAN ---- cat("\n=== Part (b): DBSCAN ===\n") # Use kNNdistplot to help choose eps kNNdistplot(dat_scaled, k = 5) abline(h = 2.5, col = "firebrick", lty = 2) # Run DBSCAN db <- dbscan(dat_scaled, eps = 2.5, minPts = 10) cat("DBSCAN results:\n") cat(" Number of clusters:", max(db$cluster), "\n") cat(" Noise points (outliers):", sum(db$cluster == 0), "\n") cat(" Actual outliers detected as noise:", sum(db$cluster == 0 & is_outlier), "out of 30\n") cat("\nDBSCAN cluster sizes:\n") print(table(db$cluster)) # ---- (c) Compare non-outlier assignments ---- cat("\n=== Part (c): Comparison for Non-Outlier Patients ===\n") # Get cluster assignments for non-outlier patients only normal_idx <- !is_outlier km_normal <- km3$cluster[normal_idx] db_normal <- db$cluster[normal_idx] # For DBSCAN, remove any normal patients assigned as noise db_noise_normal <- sum(db_normal == 0) cat("Normal patients misclassified as noise by DBSCAN:", db_noise_normal, "\n") # ARI for non-noise, non-outlier patients both_assigned <- db_normal > 0 if (sum(both_assigned) > 0) { # Use mclust for ARI library(mclust) ari <- adjustedRandIndex(km_normal[both_assigned], db_normal[both_assigned]) cat("ARI between K-means and DBSCAN (non-outlier, non-noise):", round(ari, 3), "\n") } # ---- (d) Which approach for outlier handling? ---- cat("\n=== Part (d): Preferred Approach for Outlier Handling ===\n\n") cat("DBSCAN is preferred for clinical phenotyping when outliers are expected.\n\n") cat("Reasons:\n") cat("1. EXPLICIT OUTLIER IDENTIFICATION: DBSCAN labels outliers as noise,\n") cat(" making them visible for clinical review. K-means silently absorbs\n") cat(" them into clusters, distorting the results.\n\n") cat("2. CLINICAL RELEVANCE: Outlier patients (e.g., with extreme vitals)\n") cat(" may represent data errors, rare presentations, or patients who\n") cat(" need individual assessment. Identifying them is valuable.\n\n") cat("3. CLUSTER INTEGRITY: By excluding outliers, DBSCAN preserves the\n") cat(" purity of the main clusters. K-means cluster centroids can be\n") cat(" pulled by outliers, producing misleading profiles.\n\n") cat("4. PRACTICAL WORKFLOW:\n") cat(" - Use DBSCAN to identify outliers first.\n") cat(" - Review outliers clinically (data errors? rare cases?).\n") cat(" - Apply K-means to the remaining non-outlier patients for\n") cat(" cleaner phenotyping.\n\n") cat("5. CAVEAT: DBSCAN requires tuning eps and minPts, which can be\n") cat(" challenging. The kNN distance plot helps, but the choice\n") cat(" affects how many points are labelled as noise.\n") # Visualise comparison pca2 <- prcomp(dat_scaled)$x[, 1:2] par(mfrow = c(1, 2), mar = c(4, 4, 3, 1)) # K-means cols_km <- c("steelblue", "firebrick", "forestgreen")[km3$cluster] pch_km <- ifelse(is_outlier, 4, 16) plot(pca2, col = cols_km, pch = pch_km, cex = 0.7, main = "K-means (k=3)\n(X = outlier patients)", xlab = "PC1", ylab = "PC2") # DBSCAN db_cols <- c("grey50", "steelblue", "firebrick", "forestgreen") cols_db <- db_cols[db$cluster + 1] pch_db <- ifelse(db$cluster == 0, 4, 16) plot(pca2, col = cols_db, pch = pch_db, cex = 0.7, main = "DBSCAN\n(X/grey = noise)", xlab = "PC1", ylab = "PC2") legend("topright", c("Noise", "Cluster 1", "Cluster 2", "Cluster 3"), col = db_cols, pch = c(4, 16, 16, 16), cex = 0.7) ``` #### Python ```{python} #| label: sol-ch16-ex3-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # Chapter 16, Exercise 3: DBSCAN for Outlier Detection # Add outlier patients to the Exercise 1 dataset import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn.preprocessing import StandardScaler from sklearn.cluster import KMeans, DBSCAN from sklearn.metrics import adjusted_rand_score from sklearn.decomposition import PCA from sklearn.neighbors import NearestNeighbors # ---- Simulate data (same as Exercise 1) ---- np.random.seed(42) p1 = np.column_stack([ np.random.normal(115, 12, 200), np.random.normal(28, 5, 200), np.random.normal(38.5, 0.8, 200), np.random.normal(80, 12, 200), np.random.normal(2.5, 0.8, 200), np.random.normal(5, 2, 200) ]) p2 = np.column_stack([ np.random.normal(90, 10, 200), np.random.normal(20, 3, 200), np.random.normal(37.2, 0.5, 200), np.random.normal(110, 15, 200), np.random.normal(1.2, 0.3, 200), np.random.normal(1.5, 0.5, 200) ]) p3 = np.column_stack([ np.random.normal(100, 10, 200), np.random.normal(22, 4, 200), np.random.normal(39.2, 0.7, 200), np.random.normal(120, 10, 200), np.random.normal(1.0, 0.2, 200), np.random.normal(2.0, 0.8, 200) ]) X_normal = np.vstack([p1, p2, p3]) is_outlier = np.array([False] * 600) # Add 30 outlier patients np.random.seed(99) outliers = np.column_stack([ np.random.normal(180, 10, 30), # Extreme HR np.random.normal(40, 5, 30), # Extreme RR np.random.normal(40.5, 0.5, 30), # Very high temp np.random.normal(50, 10, 30), # Very low SBP np.random.normal(8, 1.5, 30), # Very high creatinine np.random.normal(15, 3, 30) # Very high lactate ]) X = np.vstack([X_normal, outliers]) is_outlier = np.concatenate([is_outlier, np.array([True] * 30)]) X_scaled = StandardScaler().fit_transform(X) print(f"Total patients: {len(X)} (600 normal + 30 outliers)\n") # ---- (a) K-means with k=3 ---- print("=== Part (a): K-means with k=3 ===") km3 = KMeans(n_clusters=3, n_init=25, random_state=42).fit(X_scaled) outlier_clusters = km3.labels_[is_outlier] print("Outlier distribution across K-means clusters:") for c in range(3): n_out = (outlier_clusters == c).sum() n_tot = (km3.labels_ == c).sum() print(f" Cluster {c+1}: {n_tot} patients ({n_out} outliers)") print("\nK-means assigns outliers to existing clusters, distorting centroids.") # ---- (b) DBSCAN ---- print("\n=== Part (b): DBSCAN ===") # kNN distance plot to choose eps nn = NearestNeighbors(n_neighbors=5).fit(X_scaled) distances, _ = nn.kneighbors(X_scaled) knn_dist = np.sort(distances[:, -1]) plt.figure(figsize=(8, 4)) plt.plot(knn_dist, color="steelblue") plt.axhline(y=2.5, color="firebrick", linestyle="--", label="eps = 2.5") plt.xlabel("Points (sorted)") plt.ylabel("5-NN Distance") plt.title("kNN Distance Plot for eps Selection") plt.legend() plt.tight_layout() plt.savefig("ch16_ex3_knn_dist.png", dpi=150) plt.show() # Run DBSCAN db = DBSCAN(eps=2.5, min_samples=10).fit(X_scaled) n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0) n_noise = (db.labels_ == -1).sum() outliers_as_noise = ((db.labels_ == -1) & is_outlier).sum() print(f"Number of clusters: {n_clusters}") print(f"Noise points (outliers): {n_noise}") print(f"Actual outliers detected as noise: {outliers_as_noise} out of 30") # ---- (c) Compare non-outlier assignments ---- print("\n=== Part (c): Comparison for Non-Outlier Patients ===") normal_mask = ~is_outlier km_normal = km3.labels_[normal_mask] db_normal = db.labels_[normal_mask] db_noise_normal = (db_normal == -1).sum() print(f"Normal patients misclassified as noise: {db_noise_normal}") # ARI for non-noise, non-outlier patients both_assigned = db_normal >= 0 if both_assigned.sum() > 0: ari = adjusted_rand_score(km_normal[both_assigned], db_normal[both_assigned]) print(f"ARI (K-means vs DBSCAN, non-outlier, non-noise): {ari:.3f}") # ---- Visualise comparison ---- pca2 = PCA(n_components=2).fit_transform(X_scaled) fig, axes = plt.subplots(1, 2, figsize=(14, 5)) # K-means colors_km = ["steelblue", "firebrick", "forestgreen"] for c in range(3): mask = (km3.labels_ == c) & ~is_outlier axes[0].scatter(pca2[mask, 0], pca2[mask, 1], c=colors_km[c], s=12, alpha=0.6) mask_out = (km3.labels_ == c) & is_outlier axes[0].scatter(pca2[mask_out, 0], pca2[mask_out, 1], c=colors_km[c], marker="x", s=50, linewidths=2) axes[0].set_title("K-means (k=3)\n(X = outlier patients)") axes[0].set_xlabel("PC1") axes[0].set_ylabel("PC2") # DBSCAN color_map = {-1: "grey", 0: "steelblue", 1: "firebrick", 2: "forestgreen"} for c_val in sorted(set(db.labels_)): mask = db.labels_ == c_val marker = "x" if c_val == -1 else "o" label = "Noise" if c_val == -1 else f"Cluster {c_val+1}" axes[1].scatter(pca2[mask, 0], pca2[mask, 1], c=color_map.get(c_val, "purple"), marker=marker, s=15 if c_val >= 0 else 40, alpha=0.6, label=label) axes[1].set_title("DBSCAN\n(X/grey = noise)") axes[1].set_xlabel("PC1") axes[1].set_ylabel("PC2") axes[1].legend(fontsize=8) plt.tight_layout() plt.savefig("ch16_ex3_comparison.png", dpi=150) plt.show() # ---- (d) Preferred approach ---- print("\n=== Part (d): Preferred Approach ===\n") print("DBSCAN is preferred for clinical phenotyping when outliers are expected.\n") print("1. EXPLICIT OUTLIER IDENTIFICATION: DBSCAN labels outliers as noise,") print(" making them visible. K-means silently absorbs them.\n") print("2. CLINICAL RELEVANCE: Outlier patients may represent data errors,") print(" rare presentations, or patients needing individual assessment.\n") print("3. CLUSTER INTEGRITY: DBSCAN preserves cluster purity by excluding") print(" outliers. K-means centroids can be pulled by outliers.\n") print("4. PRACTICAL WORKFLOW: Use DBSCAN first to identify outliers,") print(" review them clinically, then apply K-means to non-outliers.\n") print("5. CAVEAT: DBSCAN requires tuning eps and min_samples, which") print(" affects how many points are labelled as noise.") ``` ::: ### Exercise 4 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch16-ex4-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # Chapter 16, Exercise 4: Full Pipeline - PCA + Clustering + UMAP Visualisation # 800 patients, 30 clinical variables, 4 underlying subtypes library(tidyverse) library(cluster) library(uwot) # ---- Simulate data ---- set.seed(42) n <- 800 p <- 30 # 4 subtypes with different prevalences subtype_probs <- c(0.30, 0.30, 0.25, 0.15) true_subtype <- sample(1:4, n, replace = TRUE, prob = subtype_probs) cat("True subtype distribution:\n") print(table(true_subtype)) # Generate base data X <- matrix(rnorm(n * p), ncol = p) # Add subtype-specific signals in different feature subsets for (s in 1:4) { feature_start <- (s - 1) * 6 + 1 feature_end <- min(s * 6, p) X[true_subtype == s, feature_start:feature_end] <- X[true_subtype == s, feature_start:feature_end] + 2.5 } # ---- (a) Standardise and PCA ---- cat("\n=== Part (a): PCA ===\n") X_scaled <- scale(X) pca_result <- prcomp(X_scaled) var_prop <- pca_result$sdev^2 / sum(pca_result$sdev^2) cum_var <- cumsum(var_prop) # Scree plot par(mfrow = c(1, 2), mar = c(4, 4, 3, 1)) barplot(var_prop[1:15], names.arg = 1:15, col = "steelblue", main = "Scree Plot (first 15 PCs)", xlab = "Component", ylab = "Proportion of Variance") abline(h = 1/p, col = "firebrick", lty = 2) plot(1:15, cum_var[1:15], type = "b", pch = 16, col = "steelblue", main = "Cumulative Variance", xlab = "Components", ylab = "Cumulative Proportion", ylim = c(0, 1)) abline(h = 0.80, col = "firebrick", lty = 2) n_80 <- which(cum_var >= 0.80)[1] cat("PCs needed for 80% variance:", n_80, "\n") # Use first 10 PCs for clustering n_pcs <- 10 pca_scores <- pca_result$x[, 1:n_pcs] # ---- (b) K-means on PCA scores ---- cat("\n=== Part (b): K-means on PCA Scores ===\n") sil_scores <- numeric(5) for (k in 2:5) { km <- kmeans(pca_scores, centers = k, nstart = 50) sil_scores[k] <- mean(silhouette(km$cluster, dist(pca_scores))[, 3]) cat(sprintf("k = %d: Silhouette = %.3f\n", k, sil_scores[k])) } best_k <- which.max(sil_scores[2:5]) + 1 cat("Best k by silhouette:", best_k, "\n") # Fit final clustering km_final <- kmeans(pca_scores, centers = best_k, nstart = 50) # ---- (c) UMAP visualisation ---- cat("\n=== Part (c): UMAP Visualisation ===\n") set.seed(42) umap_result <- umap(pca_scores, n_neighbors = 15, min_dist = 0.1, verbose = FALSE) plot_df <- tibble( UMAP1 = umap_result[, 1], UMAP2 = umap_result[, 2], Cluster = factor(km_final$cluster), True_subtype = factor(true_subtype) ) cols4 <- c("steelblue", "firebrick", "forestgreen", "goldenrod") par(mfrow = c(1, 2), mar = c(4, 4, 3, 1)) # Discovered clusters plot(umap_result, col = cols4[km_final$cluster], pch = 16, cex = 0.6, main = "K-means Clusters on UMAP", xlab = "UMAP 1", ylab = "UMAP 2") legend("topright", paste("Cluster", 1:best_k), col = cols4[1:best_k], pch = 16, cex = 0.7) # True subtypes plot(umap_result, col = cols4[true_subtype], pch = 16, cex = 0.6, main = "True Subtypes on UMAP", xlab = "UMAP 1", ylab = "UMAP 2") legend("topright", paste("Subtype", 1:4), col = cols4, pch = 16, cex = 0.7) # ---- (d) Cluster profiles ---- cat("\n=== Part (d): Cluster Profiles ===\n\n") # Name variables for interpretability var_names <- paste0("V", 1:p) dat_df <- as.data.frame(X) colnames(dat_df) <- var_names dat_df$cluster <- km_final$cluster # Compute mean of each variable by cluster profiles <- dat_df %>% group_by(cluster) %>% summarise(across(everything(), mean), n = n(), .groups = "drop") cat("Cluster sizes:\n") print(table(km_final$cluster)) cat("\nCluster means for key variables (first 24, showing subtype signals):\n\n") # Show means for the signal variables signal_vars <- paste0("V", 1:24) profile_table <- profiles %>% select(cluster, n, all_of(signal_vars)) # Print in a compact format for (cl in 1:best_k) { prof <- profile_table %>% filter(cluster == cl) cat(sprintf("Cluster %d (n = %d):\n", cl, prof$n)) for (s in 1:4) { start <- (s - 1) * 6 + 1 end <- s * 6 vars <- paste0("V", start:end) means <- round(unlist(prof[vars]), 2) cat(sprintf(" Subtype %d signal vars (V%d-V%d): mean = %.2f\n", s, start, end, mean(means))) } cat("\n") } # ---- (e) Discussion ---- cat("=== Part (e): Validation Discussion ===\n\n") cat("To validate these clusters in a real clinical study:\n\n") cat("1. EXTERNAL OUTCOMES: Compare clusters on outcomes NOT used in\n") cat(" clustering (mortality, length of stay, treatment response).\n") cat(" If clusters predict outcomes they were not trained on, the\n") cat(" structure is likely clinically meaningful.\n\n") cat("2. STABILITY ANALYSIS: Resample the data (bootstrap) and re-run\n") cat(" clustering. If the same patients consistently end up in the\n") cat(" same clusters, the results are robust. If clusters change\n") cat(" substantially across resamples, they may be artefacts.\n\n") cat("3. INDEPENDENT REPLICATION: Apply the same pipeline to an\n") cat(" independent dataset (different hospital, different time period).\n") cat(" If similar clusters emerge, the findings are generalisable.\n\n") cat("4. CLINICAL EXPERT REVIEW: Present cluster profiles to clinicians.\n") cat(" Do the clusters correspond to recognisable patient types?\n") cat(" Do they suggest different management strategies?\n\n") cat("5. MULTIPLE ALGORITHMS: Compare results from K-means, hierarchical\n") cat(" clustering, and DBSCAN. Agreement across methods strengthens\n") cat(" confidence. Disagreement suggests fragile structure.\n\n") cat("6. AVOID CIRCULAR REASONING: Do NOT validate clusters by showing\n") cat(" they differ on the same variables used to create them.\n") cat(" This is guaranteed by construction and proves nothing.\n") ``` #### Python ```{python} #| label: sol-ch16-ex4-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # Chapter 16, Exercise 4: Full Pipeline - PCA + Clustering + UMAP Visualisation # 800 patients, 30 clinical variables, 4 underlying subtypes import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn.preprocessing import StandardScaler from sklearn.decomposition import PCA from sklearn.cluster import KMeans from sklearn.metrics import silhouette_score from umap import UMAP # ---- Simulate data ---- np.random.seed(42) n = 800 p = 30 # 4 subtypes true_subtype = np.random.choice(4, n, p=[0.30, 0.30, 0.25, 0.15]) print("True subtype distribution:", {i: (true_subtype == i).sum() for i in range(4)}) X = np.random.normal(0, 1, (n, p)) # Add subtype-specific signals for s in range(4): start = s * 6 end = min((s + 1) * 6, p) X[true_subtype == s, start:end] += 2.5 # ---- (a) Standardise and PCA ---- print("\n=== Part (a): PCA ===") X_scaled = StandardScaler().fit_transform(X) pca = PCA() scores = pca.fit_transform(X_scaled) cum_var = np.cumsum(pca.explained_variance_ratio_) n_80 = np.argmax(cum_var >= 0.80) + 1 print(f"PCs needed for 80% variance: {n_80}") # Scree plot fig, axes = plt.subplots(1, 2, figsize=(12, 5)) axes[0].bar(range(1, 16), pca.explained_variance_ratio_[:15], color="steelblue", edgecolor="white") axes[0].axhline(y=1/p, color="firebrick", linestyle="--", label="1/p") axes[0].set_xlabel("Component") axes[0].set_ylabel("Proportion of Variance") axes[0].set_title("Scree Plot") axes[0].legend() axes[1].plot(range(1, 16), cum_var[:15], "o-", color="steelblue") axes[1].axhline(y=0.80, color="firebrick", linestyle="--", label="80%") axes[1].set_xlabel("Components") axes[1].set_ylabel("Cumulative Proportion") axes[1].set_title("Cumulative Variance") axes[1].legend() plt.tight_layout() plt.savefig("ch16_ex4_scree.png", dpi=150) plt.show() # Use 10 PCs n_pcs = 10 pca_scores = scores[:, :n_pcs] # ---- (b) K-means on PCA scores ---- print("\n=== Part (b): K-means on PCA Scores ===") sil_list = [] for k in range(2, 6): km = KMeans(n_clusters=k, n_init=50, random_state=42).fit(pca_scores) sil = silhouette_score(pca_scores, km.labels_) sil_list.append(sil) print(f"k = {k}: Silhouette = {sil:.3f}") best_k = list(range(2, 6))[np.argmax(sil_list)] print(f"Best k by silhouette: {best_k}") # Fit final clustering km_final = KMeans(n_clusters=best_k, n_init=50, random_state=42).fit(pca_scores) # ---- (c) UMAP visualisation ---- print("\n=== Part (c): UMAP Visualisation ===") umap_2d = UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(pca_scores) colors_4 = ["steelblue", "firebrick", "forestgreen", "goldenrod"] fig, axes = plt.subplots(1, 2, figsize=(14, 5)) # Discovered clusters for c in range(best_k): mask = km_final.labels_ == c axes[0].scatter(umap_2d[mask, 0], umap_2d[mask, 1], c=colors_4[c], s=10, alpha=0.6, label=f"Cluster {c+1}") axes[0].set_title("K-means Clusters on UMAP") axes[0].set_xlabel("UMAP 1") axes[0].set_ylabel("UMAP 2") axes[0].legend(fontsize=8) # True subtypes for s in range(4): mask = true_subtype == s axes[1].scatter(umap_2d[mask, 0], umap_2d[mask, 1], c=colors_4[s], s=10, alpha=0.6, label=f"Subtype {s+1}") axes[1].set_title("True Subtypes on UMAP") axes[1].set_xlabel("UMAP 1") axes[1].set_ylabel("UMAP 2") axes[1].legend(fontsize=8) plt.tight_layout() plt.savefig("ch16_ex4_umap.png", dpi=150) plt.show() # ---- (d) Cluster profiles ---- print("\n=== Part (d): Cluster Profiles ===\n") var_names = [f"V{i+1}" for i in range(p)] df = pd.DataFrame(X, columns=var_names) df['cluster'] = km_final.labels_ print("Cluster sizes:") print(df['cluster'].value_counts().sort_index()) print("\nCluster means for signal variables:") for cl in range(best_k): mask = df['cluster'] == cl n_cl = mask.sum() print(f"\nCluster {cl+1} (n = {n_cl}):") for s in range(4): start = s * 6 end = (s + 1) * 6 signal_vars = [f"V{i+1}" for i in range(start, min(end, p))] mean_val = df.loc[mask, signal_vars].mean().mean() print(f" Subtype {s+1} signal vars (V{start+1}-V{min(end,p)}): mean = {mean_val:.2f}") # ---- (e) Discussion ---- print("\n=== Part (e): Validation Discussion ===\n") print("To validate these clusters in a real clinical study:\n") print("1. EXTERNAL OUTCOMES: Compare clusters on outcomes NOT used in") print(" clustering (mortality, LOS, treatment response). Clusters") print(" that predict external outcomes are clinically meaningful.\n") print("2. STABILITY ANALYSIS: Bootstrap resampling + re-clustering.") print(" Consistent membership = robust; variable = artefact.\n") print("3. INDEPENDENT REPLICATION: Apply pipeline to an independent") print(" dataset (different hospital/time). Similar clusters = generalisable.\n") print("4. CLINICAL EXPERT REVIEW: Do clusters correspond to") print(" recognisable patient types? Do they suggest different management?\n") print("5. MULTIPLE ALGORITHMS: Compare K-means, hierarchical, DBSCAN.") print(" Agreement across methods strengthens confidence.\n") print("6. AVOID CIRCULAR REASONING: Do NOT validate on the same") print(" variables used for clustering. That proves nothing.") ``` ::: ## Chapter 17: Causal Inference ### Exercise 1 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch17-ex1-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 17 - Exercise 1: DAG Construction (Conceptual) # Studying the relationship between ACE inhibitor use and Acute Kidney Injury # ============================================================================= # This is a conceptual exercise. The answers are provided as detailed comments # with optional code to illustrate DAG concepts using the dagitty package. # --- Part (a): List at least five relevant variables --- # # 1. Baseline kidney function (eGFR / serum creatinine) # - Patients with worse kidney function are more likely to receive ACE inhibitors # (for renoprotective effects) AND are at higher risk of AKI. # # 2. Heart failure severity # - Heart failure is a major indication for ACE inhibitors AND independently # increases AKI risk (through haemodynamic changes). # # 3. Diabetes # - Diabetes is an indication for ACE inhibitors (nephroprotection) AND a risk # factor for AKI. # # 4. Age # - Older patients are more likely to be on ACE inhibitors and are at higher # risk of AKI. # # 5. Hypertension # - Primary indication for ACE inhibitors and can contribute to kidney injury. # # 6. Concomitant nephrotoxic drugs (e.g., NSAIDs, contrast agents) # - May be prescribed alongside or instead of ACE inhibitors and directly # cause AKI. # # 7. Volume status / dehydration # - Dehydration increases AKI risk and may influence whether ACE inhibitors # are held or continued. # # 8. Proteinuria # - An indication for ACE inhibitors and a marker of kidney disease severity # (related to AKI risk). # --- Part (b): Draw a DAG --- # We use the dagitty package to encode the DAG programmatically. # install.packages("dagitty") library(dagitty) dag <- dagitty('dag { ACEi [exposure] AKI [outcome] eGFR [confounder] HeartFailure [confounder] Diabetes [confounder] Age [confounder] Hypertension [confounder] NephrotoxicDrugs [confounder] VolumeStatus [confounder] ICUAdmission [collider] Age -> ACEi Age -> AKI Age -> eGFR Age -> HeartFailure Age -> Diabetes eGFR -> ACEi eGFR -> AKI HeartFailure -> ACEi HeartFailure -> AKI HeartFailure -> VolumeStatus Diabetes -> ACEi Diabetes -> AKI Diabetes -> eGFR Hypertension -> ACEi Hypertension -> AKI NephrotoxicDrugs -> AKI VolumeStatus -> AKI ACEi -> AKI # Collider: ICU Admission is caused by both ACEi use patterns # and by AKI itself (patients with AKI go to ICU) ACEi -> ICUAdmission AKI -> ICUAdmission }') # Visualise the DAG (if running interactively) # plot(dag) # --- Part (c): Minimal sufficient adjustment set --- # Using dagitty to compute the adjustment set automatically adj_set <- adjustmentSets(dag, exposure = "ACEi", outcome = "AKI", type = "minimal") cat("Minimal sufficient adjustment sets:\n") print(adj_set) # Explanation: # The minimal adjustment set includes confounders that block all backdoor paths # from ACEi to AKI. Based on our DAG, this includes: # - Age, eGFR, HeartFailure, Diabetes, Hypertension # These are the common causes of both ACEi use and AKI. # We do NOT need to adjust for VolumeStatus (it's not a confounder of ACEi->AKI # unless there's a direct path from ACEi to VolumeStatus). # We do NOT need to adjust for NephrotoxicDrugs (it only affects AKI, not ACEi). # --- Part (d): Identify a collider --- # # ICU Admission is a COLLIDER: it is caused by both ACEi use (patients on # ACE inhibitors who develop complications may be admitted to ICU) and AKI # (AKI itself is a reason for ICU admission). # # What happens if we adjust for ICU Admission? # Conditioning on a collider OPENS a spurious path between ACEi and AKI. # This would create "collider bias" (also known as Berkson's bias in clinical # settings). Even if ACEi had no effect on AKI, adjusting for ICU admission # would create an artificial association because: # - Among ICU patients, if they were NOT admitted for AKI, they were more # likely admitted for ACEi-related reasons (and vice versa). # - This induces a negative correlation between the two causes of the collider. # # Practical lesson: Do NOT adjust for variables that are consequences of both # the exposure and the outcome (or that lie on causal paths from both). cat("\n--- Summary ---\n") cat("Key confounders to adjust for: Age, eGFR, Heart Failure, Diabetes, Hypertension\n") cat("Collider to AVOID adjusting for: ICU Admission\n") cat("Mediators to consider carefully: Volume status changes caused by ACEi\n") ``` #### Python ```{python} #| label: sol-ch17-ex1-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 17 - Exercise 1: DAG Construction (Conceptual) # Studying the relationship between ACE inhibitor use and Acute Kidney Injury # ============================================================================= # This is a conceptual exercise. The answers are provided as detailed comments # with optional code to illustrate DAG concepts. # --- Part (a): List at least five relevant variables --- # # 1. Baseline kidney function (eGFR / serum creatinine) # - Patients with worse kidney function are more likely to receive ACE inhibitors # (for renoprotective effects) AND are at higher risk of AKI. # # 2. Heart failure severity # - Heart failure is a major indication for ACE inhibitors AND independently # increases AKI risk (through haemodynamic changes). # # 3. Diabetes # - Diabetes is an indication for ACE inhibitors (nephroprotection) AND a risk # factor for AKI. # # 4. Age # - Older patients are more likely to be on ACE inhibitors and are at higher # risk of AKI. # # 5. Hypertension # - Primary indication for ACE inhibitors and can contribute to kidney injury. # # 6. Concomitant nephrotoxic drugs (e.g., NSAIDs, contrast agents) # - May be prescribed alongside or instead of ACE inhibitors and directly # cause AKI. # # 7. Volume status / dehydration # - Dehydration increases AKI risk and may influence whether ACE inhibitors # are held or continued. # # 8. Proteinuria # - An indication for ACE inhibitors and a marker of kidney disease severity # (related to AKI risk). # --- Part (b): Draw a DAG --- # We represent the DAG as an adjacency list and visualise with networkx. import networkx as nx import matplotlib.pyplot as plt # Define the DAG dag = nx.DiGraph() # Add edges representing causal relationships edges = [ # Age affects many variables ("Age", "ACEi"), ("Age", "AKI"), ("Age", "eGFR"), ("Age", "HeartFailure"), ("Age", "Diabetes"), # eGFR confounds ACEi-AKI ("eGFR", "ACEi"), ("eGFR", "AKI"), # Heart failure confounds ACEi-AKI ("HeartFailure", "ACEi"), ("HeartFailure", "AKI"), ("HeartFailure", "VolumeStatus"), # Diabetes confounds ACEi-AKI ("Diabetes", "ACEi"), ("Diabetes", "AKI"), ("Diabetes", "eGFR"), # Hypertension confounds ACEi-AKI ("Hypertension", "ACEi"), ("Hypertension", "AKI"), # Other causes of AKI ("NephrotoxicDrugs", "AKI"), ("VolumeStatus", "AKI"), # Treatment -> Outcome (causal effect of interest) ("ACEi", "AKI"), # Collider: ICU admission caused by both ACEi complications and AKI ("ACEi", "ICUAdmission"), ("AKI", "ICUAdmission"), ] dag.add_edges_from(edges) # Visualise the DAG plt.figure(figsize=(12, 8)) pos = nx.spring_layout(dag, seed=42, k=2) nx.draw(dag, pos, with_labels=True, node_color='lightblue', node_size=2000, font_size=9, font_weight='bold', arrows=True, arrowsize=20, edge_color='gray') plt.title("DAG: ACE Inhibitor Use and Acute Kidney Injury") plt.tight_layout() plt.savefig("dag_acei_aki.png", dpi=300) plt.show() # --- Part (c): Minimal sufficient adjustment set --- # # To estimate the causal effect of ACEi on AKI, we need to block all # backdoor paths (non-causal paths from ACEi to AKI). # # Backdoor paths from ACEi to AKI: # 1. ACEi <- Age -> AKI # 2. ACEi <- eGFR -> AKI # 3. ACEi <- HeartFailure -> AKI # 4. ACEi <- Diabetes -> AKI # 5. ACEi <- Diabetes -> eGFR -> AKI # 6. ACEi <- Hypertension -> AKI # 7. ACEi <- Age -> eGFR -> AKI # 8. ACEi <- Age -> HeartFailure -> AKI # 9. ACEi <- Age -> Diabetes -> AKI # 10. ACEi <- HeartFailure -> VolumeStatus -> AKI # # Minimal sufficient adjustment set: # {Age, eGFR, HeartFailure, Diabetes, Hypertension} # # This blocks all backdoor paths. Note: # - NephrotoxicDrugs only affects AKI (not a confounder), so no need to adjust. # - VolumeStatus is blocked by conditioning on HeartFailure (the path goes # through HeartFailure). print("Minimal sufficient adjustment set:") print(" {Age, eGFR, HeartFailure, Diabetes, Hypertension}") # --- Part (d): Identify a collider --- # # ICU Admission is a COLLIDER in the DAG: # ACEi -> ICUAdmission <- AKI # # Both ACEi use (through complications or monitoring) and AKI (as a # critical condition) can cause ICU admission. # # What happens if we adjust for ICU Admission? # # Conditioning on a collider OPENS a spurious path between its causes. # This creates "collider bias" (Berkson's bias): # # - Among ICU patients, if a patient was NOT admitted because of AKI, # they were more likely admitted due to ACEi-related issues (and vice versa). # - This induces an artificial negative association between ACEi and AKI, # even if no true causal relationship exists. # # Practical lesson: Never adjust for variables that are consequences of both # the exposure and the outcome. Restricting the analysis to ICU patients # only would similarly introduce collider bias. print("\nCollider identified: ICU Admission") print(" ACEi -> ICUAdmission <- AKI") print(" Adjusting for ICU admission would OPEN a spurious path (collider bias)") print("\n--- Summary ---") print("Key confounders to adjust for: Age, eGFR, Heart Failure, Diabetes, Hypertension") print("Collider to AVOID adjusting for: ICU Admission") print("Mediators to consider: Volume status changes caused by ACEi") ``` ::: ### Exercise 2 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch17-ex2-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 17 - Exercise 2: Propensity Score Matching in R # Beta-blocker use and 1-year mortality # ============================================================================= library(tidyverse) library(MatchIt) library(cobalt) library(broom) # --- Simulate the dataset (from the exercise) --- set.seed(123) n <- 1500 exercise_dat <- tibble( age = rnorm(n, 70, 8), creatinine = rnorm(n, 1.2, 0.4), heart_failure = rbinom(n, 1, 0.35), prior_mi = rbinom(n, 1, 0.20), # Confounded treatment (beta-blocker use) treatment = rbinom(n, 1, plogis(-1 + 0.02*age + 0.3*heart_failure + 0.5*prior_mi - 0.8*creatinine)), # Outcome: 1-year mortality death_1yr = rbinom(n, 1, plogis(-2 + 0.05*age + 0.4*heart_failure + 0.6*prior_mi + 1.0*creatinine - 0.7*treatment)) ) cat("Dataset dimensions:", nrow(exercise_dat), "patients\n") cat("Treatment prevalence:", mean(exercise_dat$treatment), "\n") cat("Outcome prevalence:", mean(exercise_dat$death_1yr), "\n\n") # --- Part (a): Estimate propensity scores using logistic regression --- ps_model <- glm(treatment ~ age + creatinine + heart_failure + prior_mi, data = exercise_dat, family = binomial) exercise_dat$ps <- predict(ps_model, type = "response") cat("Propensity score summary:\n") print(summary(exercise_dat$ps)) # Visualise propensity score overlap ggplot(exercise_dat, aes(x = ps, fill = factor(treatment))) + geom_density(alpha = 0.5) + labs(x = "Propensity Score", fill = "Beta-blocker", title = "Propensity Score Distribution by Treatment Group") + scale_fill_manual(values = c("#0072B2", "#D55E00"), labels = c("No", "Yes")) + theme_minimal() # --- Part (b): 1:1 nearest-neighbour matching with caliper of 0.2 SD --- m_out <- matchit(treatment ~ age + creatinine + heart_failure + prior_mi, data = exercise_dat, method = "nearest", distance = "glm", # logistic regression PS caliper = 0.2, # 0.2 SD of logit PS ratio = 1) # 1:1 matching cat("\nMatching summary:\n") summary(m_out) # --- Part (c): Love plot to assess balance --- love.plot(m_out, thresholds = c(m = 0.1), # SMD threshold of 0.1 binary = "std", var.order = "unadjusted", title = "Covariate Balance: Before and After Matching", colors = c("#D55E00", "#0072B2")) # Also check numeric balance cat("\nBalance table:\n") print(bal.tab(m_out, thresholds = c(m = 0.1))) # --- Part (d): Estimate ATT for beta-blocker use on 1-year mortality --- m_data <- match.data(m_out) cat("\nMatched sample size:", nrow(m_data), "patients\n") cat("Treated in matched sample:", sum(m_data$treatment == 1), "\n") cat("Control in matched sample:", sum(m_data$treatment == 0), "\n") # Outcome model in matched sample outcome_model <- glm(death_1yr ~ treatment, data = m_data, family = binomial, weights = weights) cat("\nATT estimate (logistic regression in matched sample):\n") result <- tidy(outcome_model, conf.int = TRUE, exponentiate = TRUE) print(result) # Risk difference (linear probability model) rd_model <- lm(death_1yr ~ treatment, data = m_data, weights = weights) cat("\nRisk difference estimate:\n") print(tidy(rd_model, conf.int = TRUE)) # Mortality rates in matched sample cat("\nMortality in matched sample:\n") cat(" Treated:", mean(m_data$death_1yr[m_data$treatment == 1]), "\n") cat(" Control:", mean(m_data$death_1yr[m_data$treatment == 0]), "\n") # --- Part (e): E-value calculation --- # Extract the odds ratio from the matched analysis or_est <- result$estimate[result$term == "treatment"] or_lo <- result$conf.low[result$term == "treatment"] or_hi <- result$conf.high[result$term == "treatment"] cat("\nOdds Ratio:", round(or_est, 3), "(95% CI:", round(or_lo, 3), "-", round(or_hi, 3), ")\n") # Convert OR to RR approximation (for rare outcomes, OR ~ RR) # For the E-value, we need RR on the above-null scale # Since treatment is protective (OR < 1), convert: RR_above_null = 1/OR rr_est <- 1 / or_est rr_lo_bound <- 1 / or_hi # Note: CI bounds flip when inverting # E-value formula: E = RR + sqrt(RR * (RR - 1)) e_value <- function(rr) { if (rr < 1) rr <- 1 / rr # Convert to above-null return(rr + sqrt(rr * (rr - 1))) } e_val_point <- e_value(rr_est) # E-value for the CI bound closest to null # If protective, the upper bound of RR (lower bound of 1/OR) is closest to null e_val_ci <- e_value(rr_lo_bound) cat("\nE-value for point estimate:", round(e_val_point, 2), "\n") cat("E-value for CI bound closest to null:", round(e_val_ci, 2), "\n") # Interpretation cat("\n--- E-value Interpretation ---\n") cat("The E-value of", round(e_val_point, 2), "means that an unmeasured\n") cat("confounder would need to be associated with BOTH beta-blocker use AND\n") cat("1-year mortality by a risk ratio of at least", round(e_val_point, 2), "\n") cat("(above and beyond measured confounders) to fully explain away the\n") cat("observed protective association.\n") cat("\nThe E-value for the confidence interval bound is", round(e_val_ci, 2), ",\n") cat("meaning a confounder of that strength could shift the CI to include the null.\n") cat("If these values are larger than plausible confounders, the result is robust.\n") # Optional: use EValue package if available # library(EValue) # evalues.OR(or_est, lo = or_lo, hi = or_hi, rare = FALSE) ``` #### Python ```{python} #| label: sol-ch17-ex2-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 17 - Exercise 2: Propensity Score Matching in Python # Beta-blocker use and 1-year mortality # ============================================================================= import numpy as np import pandas as pd from sklearn.linear_model import LogisticRegression from sklearn.neighbors import NearestNeighbors from scipy import stats import matplotlib.pyplot as plt # --- Simulate the dataset (same DGP as the R exercise) --- np.random.seed(123) n = 1500 age = np.random.normal(70, 8, n) creatinine = np.random.normal(1.2, 0.4, n) heart_failure = np.random.binomial(1, 0.35, n) prior_mi = np.random.binomial(1, 0.20, n) # Confounded treatment assignment ps_true = 1 / (1 + np.exp(-(-1 + 0.02*age + 0.3*heart_failure + 0.5*prior_mi - 0.8*creatinine))) treatment = np.random.binomial(1, ps_true) # Outcome mort_prob = 1 / (1 + np.exp(-(-2 + 0.05*age + 0.4*heart_failure + 0.6*prior_mi + 1.0*creatinine - 0.7*treatment))) death_1yr = np.random.binomial(1, mort_prob) df = pd.DataFrame({ 'age': age, 'creatinine': creatinine, 'heart_failure': heart_failure, 'prior_mi': prior_mi, 'treatment': treatment, 'death_1yr': death_1yr }) print(f"Dataset: {len(df)} patients") print(f"Treatment prevalence: {df['treatment'].mean():.3f}") print(f"Outcome prevalence: {df['death_1yr'].mean():.3f}\n") # --- Part (a): Estimate propensity scores --- covariates = ['age', 'creatinine', 'heart_failure', 'prior_mi'] ps_model = LogisticRegression(max_iter=1000) ps_model.fit(df[covariates], df['treatment']) df['ps'] = ps_model.predict_proba(df[covariates])[:, 1] # Compute logit of PS for caliper matching df['logit_ps'] = np.log(df['ps'] / (1 - df['ps'])) print("Propensity score summary:") print(df['ps'].describe()) # Visualise PS overlap fig, ax = plt.subplots(figsize=(8, 5)) df[df['treatment'] == 0]['ps'].plot.kde(ax=ax, label='Control', color='#0072B2') df[df['treatment'] == 1]['ps'].plot.kde(ax=ax, label='Treated', color='#D55E00') ax.set_xlabel('Propensity Score') ax.set_ylabel('Density') ax.set_title('Propensity Score Distribution by Treatment Group') ax.legend() plt.tight_layout() plt.savefig('ps_overlap.png', dpi=300) plt.show() # --- Part (b): 1:1 nearest-neighbour matching with caliper --- treated = df[df['treatment'] == 1].copy() control = df[df['treatment'] == 0].copy() # Caliper: 0.2 * SD of logit PS caliper = 0.2 * df['logit_ps'].std() print(f"\nCaliper (0.2 SD of logit PS): {caliper:.4f}") # Nearest-neighbour matching on logit PS nn = NearestNeighbors(n_neighbors=1, metric='euclidean') nn.fit(control[['logit_ps']].values) distances, indices = nn.kneighbors(treated[['logit_ps']].values) # Apply caliper: keep only matches within the caliper matched_treated_idx = [] matched_control_idx = [] used_controls = set() for i in range(len(treated)): dist = distances[i, 0] ctrl_idx = indices[i, 0] if dist <= caliper and ctrl_idx not in used_controls: matched_treated_idx.append(treated.index[i]) matched_control_idx.append(control.index[ctrl_idx]) used_controls.add(ctrl_idx) # Create matched dataset matched_df = pd.concat([ df.loc[matched_treated_idx], df.loc[matched_control_idx] ]) print(f"Matched sample: {len(matched_df)} patients") print(f" Treated: {len(matched_treated_idx)}") print(f" Control: {len(matched_control_idx)}") print(f" Unmatched treated: {len(treated) - len(matched_treated_idx)}") # --- Part (c): Assess balance using SMD (Love plot equivalent) --- def compute_smd(data, var, group_col): """Compute standardised mean difference.""" treated = data[data[group_col] == 1][var] control = data[data[group_col] == 0][var] pooled_sd = np.sqrt((treated.var() + control.var()) / 2) if pooled_sd == 0: return 0 return (treated.mean() - control.mean()) / pooled_sd print("\nCovariate balance (SMD):") print(f"{'Variable':<20} {'Before':>10} {'After':>10} {'Balanced?':>10}") print("-" * 52) smd_before = {} smd_after = {} for var in covariates: smd_b = compute_smd(df, var, 'treatment') smd_a = compute_smd(matched_df, var, 'treatment') smd_before[var] = smd_b smd_after[var] = smd_a balanced = "Yes" if abs(smd_a) < 0.1 else "No" print(f"{var:<20} {smd_b:>10.4f} {smd_a:>10.4f} {balanced:>10}") # Love plot fig, ax = plt.subplots(figsize=(8, 5)) y_pos = range(len(covariates)) ax.scatter([abs(smd_before[v]) for v in covariates], y_pos, color='#D55E00', label='Before matching', s=80, zorder=5) ax.scatter([abs(smd_after[v]) for v in covariates], y_pos, color='#0072B2', label='After matching', s=80, zorder=5) # Connect before and after for i, var in enumerate(covariates): ax.plot([abs(smd_before[var]), abs(smd_after[var])], [i, i], 'k-', alpha=0.3) ax.axvline(x=0.1, color='red', linestyle='--', alpha=0.5, label='SMD = 0.1 threshold') ax.set_yticks(y_pos) ax.set_yticklabels(covariates) ax.set_xlabel('Absolute Standardised Mean Difference') ax.set_title('Love Plot: Covariate Balance Before and After Matching') ax.legend(loc='lower right') plt.tight_layout() plt.savefig('love_plot.png', dpi=300) plt.show() # --- Part (d): Estimate ATT --- # Mortality rates in matched sample mort_treated = matched_df[matched_df['treatment'] == 1]['death_1yr'].mean() mort_control = matched_df[matched_df['treatment'] == 0]['death_1yr'].mean() print(f"\nMortality in matched sample:") print(f" Treated (beta-blocker): {mort_treated:.4f}") print(f" Control (no beta-blocker): {mort_control:.4f}") print(f" Risk difference (ATT): {mort_treated - mort_control:.4f}") # Odds ratio from statsmodels.api import Logit, add_constant X = add_constant(matched_df['treatment']) logit_model = Logit(matched_df['death_1yr'], X).fit(disp=0) print(f"\nLogistic regression in matched sample:") print(logit_model.summary2().tables[1]) or_est = np.exp(logit_model.params['treatment']) ci = np.exp(logit_model.conf_int().loc['treatment']) print(f"\nOdds Ratio: {or_est:.3f} (95% CI: {ci[0]:.3f} - {ci[1]:.3f})") # --- Part (e): E-value --- # Convert protective OR to RR on above-null scale rr_est = 1 / or_est # Flip because protective rr_ci_closest_null = 1 / ci[1] # Upper CI of OR -> lower bound of 1/OR def e_value(rr): """Compute E-value for a risk ratio.""" if rr < 1: rr = 1 / rr return rr + np.sqrt(rr * (rr - 1)) e_val_point = e_value(rr_est) e_val_ci = e_value(rr_ci_closest_null) print(f"\n--- E-value Analysis ---") print(f"E-value for point estimate: {e_val_point:.2f}") print(f"E-value for CI bound closest to null: {e_val_ci:.2f}") print(f"\nInterpretation:") print(f"An unmeasured confounder would need to be associated with BOTH") print(f"beta-blocker use AND 1-year mortality by a risk ratio of at least") print(f"{e_val_point:.2f} to fully explain away the observed protective effect.") print(f"For the confidence interval, a confounder with RR >= {e_val_ci:.2f}") print(f"could shift the CI to include the null.") ``` ::: ### Exercise 3 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch17-ex3-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 17 - Exercise 3: IPTW in R # (Exercise specifies Python, but we provide an R version as well) # Beta-blocker use and 1-year mortality using IPTW # ============================================================================= library(tidyverse) library(WeightIt) library(cobalt) library(survey) library(broom) # --- Simulate the dataset (same DGP as Exercise 2) --- set.seed(123) n <- 1500 exercise_dat <- tibble( age = rnorm(n, 70, 8), creatinine = rnorm(n, 1.2, 0.4), heart_failure = rbinom(n, 1, 0.35), prior_mi = rbinom(n, 1, 0.20), treatment = rbinom(n, 1, plogis(-1 + 0.02*age + 0.3*heart_failure + 0.5*prior_mi - 0.8*creatinine)), death_1yr = rbinom(n, 1, plogis(-2 + 0.05*age + 0.4*heart_failure + 0.6*prior_mi + 1.0*creatinine - 0.7*treatment)) ) # --- Part (a): Fit PS model and compute stabilised IPTW weights --- w_out <- weightit(treatment ~ age + creatinine + heart_failure + prior_mi, data = exercise_dat, method = "ps", estimand = "ATE") cat("Weight summary:\n") summary(w_out) # Check for extreme weights cat("\nWeight distribution:\n") cat(" Min:", min(w_out$weights), "\n") cat(" Max:", max(w_out$weights), "\n") cat(" Mean:", mean(w_out$weights), "\n") cat(" SD:", sd(w_out$weights), "\n") # --- Part (b): Assess covariate balance using weighted SMDs --- cat("\nBalance table:\n") bt <- bal.tab(w_out, thresholds = c(m = 0.1)) print(bt) # Love plot love.plot(w_out, thresholds = c(m = 0.1), binary = "std", var.order = "unadjusted", title = "Covariate Balance: Before and After IPTW", colors = c("#D55E00", "#0072B2")) # --- Part (c): Estimate ATE using weighted regression --- d_weighted <- svydesign(ids = ~1, weights = w_out$weights, data = exercise_dat) # Logistic regression (odds ratio) ate_model <- svyglm(death_1yr ~ treatment, design = d_weighted, family = quasibinomial) cat("\nIPTW ATE estimate (logistic):\n") print(tidy(ate_model, conf.int = TRUE, exponentiate = TRUE)) # Linear probability model (risk difference) rd_model <- svyglm(death_1yr ~ treatment, design = d_weighted, family = gaussian()) cat("\nIPTW ATE estimate (risk difference):\n") print(tidy(rd_model, conf.int = TRUE)) # --- Part (d): Sensitivity analysis with simulated unmeasured confounder --- cat("\n--- Sensitivity Analysis: Unmeasured Confounder ---\n") # Simulate an unmeasured confounder U that affects both treatment and outcome set.seed(456) U <- rnorm(n, 0, 1) # Re-simulate treatment and outcome with U included exercise_dat_u <- tibble( age = exercise_dat$age, creatinine = exercise_dat$creatinine, heart_failure = exercise_dat$heart_failure, prior_mi = exercise_dat$prior_mi, U = U, treatment = rbinom(n, 1, plogis(-1 + 0.02*age + 0.3*heart_failure + 0.5*prior_mi - 0.8*creatinine + 0.6*U)), # U affects treatment death_1yr = rbinom(n, 1, plogis(-2 + 0.05*age + 0.4*heart_failure + 0.6*prior_mi + 1.0*creatinine - 0.7*treatment + 0.8*U)) # U affects outcome ) # Analysis WITHOUT adjusting for U (biased) w_no_u <- weightit(treatment ~ age + creatinine + heart_failure + prior_mi, data = exercise_dat_u, method = "ps", estimand = "ATE") d_no_u <- svydesign(ids = ~1, weights = w_no_u$weights, data = exercise_dat_u) model_no_u <- svyglm(death_1yr ~ treatment, design = d_no_u, family = quasibinomial) res_no_u <- tidy(model_no_u, conf.int = TRUE, exponentiate = TRUE) cat("\nWithout adjusting for U:\n") print(res_no_u) # Analysis WITH adjusting for U (unbiased) w_with_u <- weightit(treatment ~ age + creatinine + heart_failure + prior_mi + U, data = exercise_dat_u, method = "ps", estimand = "ATE") d_with_u <- svydesign(ids = ~1, weights = w_with_u$weights, data = exercise_dat_u) model_with_u <- svyglm(death_1yr ~ treatment, design = d_with_u, family = quasibinomial) res_with_u <- tidy(model_with_u, conf.int = TRUE, exponentiate = TRUE) cat("\nWith adjusting for U:\n") print(res_with_u) cat("\n--- Interpretation ---\n") cat("The unmeasured confounder U (with associations of 0.6 with treatment\n") cat("and 0.8 with outcome on the log-odds scale) shifts the treatment effect\n") cat("estimate when not adjusted for. This demonstrates:\n") cat(" 1. IPTW cannot correct for unmeasured confounding.\n") cat(" 2. Sensitivity analysis helps quantify how strong a confounder\n") cat(" would need to be to change conclusions.\n") cat(" 3. The E-value provides a formal framework for this assessment.\n") ``` #### Python ```{python} #| label: sol-ch17-ex3-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 17 - Exercise 3: IPTW in Python # Beta-blocker use and 1-year mortality using IPTW # ============================================================================= import numpy as np import pandas as pd from sklearn.linear_model import LogisticRegression from scipy import stats import matplotlib.pyplot as plt import statsmodels.api as sm # --- Simulate the dataset (same DGP as Exercise 2) --- np.random.seed(123) n = 1500 age = np.random.normal(70, 8, n) creatinine = np.random.normal(1.2, 0.4, n) heart_failure = np.random.binomial(1, 0.35, n) prior_mi = np.random.binomial(1, 0.20, n) ps_true = 1 / (1 + np.exp(-(-1 + 0.02*age + 0.3*heart_failure + 0.5*prior_mi - 0.8*creatinine))) treatment = np.random.binomial(1, ps_true) mort_prob = 1 / (1 + np.exp(-(-2 + 0.05*age + 0.4*heart_failure + 0.6*prior_mi + 1.0*creatinine - 0.7*treatment))) death_1yr = np.random.binomial(1, mort_prob) df = pd.DataFrame({ 'age': age, 'creatinine': creatinine, 'heart_failure': heart_failure, 'prior_mi': prior_mi, 'treatment': treatment, 'death_1yr': death_1yr }) print(f"Dataset: {len(df)} patients") print(f"Treatment prevalence: {df['treatment'].mean():.3f}") print(f"Outcome prevalence: {df['death_1yr'].mean():.3f}\n") # --- Part (a): Fit PS model and compute stabilised IPTW weights --- covariates = ['age', 'creatinine', 'heart_failure', 'prior_mi'] ps_model = LogisticRegression(max_iter=1000) ps_model.fit(df[covariates], df['treatment']) df['ps'] = ps_model.predict_proba(df[covariates])[:, 1] # Unstabilised ATE weights: w = A/ps + (1-A)/(1-ps) df['iptw'] = np.where( df['treatment'] == 1, 1 / df['ps'], 1 / (1 - df['ps']) ) # Stabilised weights: replace numerator with marginal treatment probability p_treat = df['treatment'].mean() df['sw'] = np.where( df['treatment'] == 1, p_treat / df['ps'], (1 - p_treat) / (1 - df['ps']) ) print("Stabilised weight summary:") print(f" Min: {df['sw'].min():.4f}") print(f" Max: {df['sw'].max():.4f}") print(f" Mean: {df['sw'].mean():.4f}") print(f" SD: {df['sw'].std():.4f}") # Check for extreme weights extreme = (df['sw'] > 10).sum() print(f" Extreme weights (>10): {extreme}") # --- Part (b): Assess covariate balance using weighted SMDs --- def weighted_smd(data, var, treatment_col, weight_col): """Compute weighted standardised mean difference.""" treated = data[data[treatment_col] == 1] control = data[data[treatment_col] == 0] mean_t = np.average(treated[var], weights=treated[weight_col]) mean_c = np.average(control[var], weights=control[weight_col]) # Use unweighted pooled SD for standardisation sd_pooled = np.sqrt((treated[var].var() + control[var].var()) / 2) if sd_pooled == 0: return 0 return (mean_t - mean_c) / sd_pooled def unweighted_smd(data, var, treatment_col): """Compute unweighted SMD.""" treated = data[data[treatment_col] == 1][var] control = data[data[treatment_col] == 0][var] sd_pooled = np.sqrt((treated.var() + control.var()) / 2) if sd_pooled == 0: return 0 return (treated.mean() - control.mean()) / sd_pooled print("\nCovariate balance (weighted SMD):") print(f"{'Variable':<20} {'Unweighted':>12} {'Weighted':>12} {'Balanced?':>10}") print("-" * 56) smd_before = {} smd_after = {} for var in covariates: smd_b = unweighted_smd(df, var, 'treatment') smd_a = weighted_smd(df, var, 'treatment', 'sw') smd_before[var] = smd_b smd_after[var] = smd_a balanced = "Yes" if abs(smd_a) < 0.1 else "No" print(f"{var:<20} {smd_b:>12.4f} {smd_a:>12.4f} {balanced:>10}") # Love plot equivalent fig, ax = plt.subplots(figsize=(8, 5)) y_pos = range(len(covariates)) ax.scatter([abs(smd_before[v]) for v in covariates], y_pos, color='#D55E00', label='Before IPTW', s=80, zorder=5) ax.scatter([abs(smd_after[v]) for v in covariates], y_pos, color='#0072B2', label='After IPTW', s=80, zorder=5) for i, var in enumerate(covariates): ax.plot([abs(smd_before[var]), abs(smd_after[var])], [i, i], 'k-', alpha=0.3) ax.axvline(x=0.1, color='red', linestyle='--', alpha=0.5, label='SMD = 0.1') ax.set_yticks(y_pos) ax.set_yticklabels(covariates) ax.set_xlabel('Absolute Standardised Mean Difference') ax.set_title('Covariate Balance: Before and After IPTW') ax.legend(loc='lower right') plt.tight_layout() plt.savefig('iptw_balance.png', dpi=300) plt.show() # --- Part (c): Estimate ATE using weighted regression --- X = sm.add_constant(df['treatment']) # Weighted least squares (linear probability model for risk difference) wls_model = sm.WLS(df['death_1yr'], X, weights=df['sw']).fit() print("\nIPTW ATE Estimate (Risk Difference):") print(f" Coefficient: {wls_model.params.iloc[1]:.4f}") print(f" 95% CI: ({wls_model.conf_int().iloc[1, 0]:.4f}, " f"{wls_model.conf_int().iloc[1, 1]:.4f})") print(f" p-value: {wls_model.pvalues.iloc[1]:.4f}") # Weighted logistic regression (for odds ratio) from statsmodels.genmod.generalized_linear_model import GLM from statsmodels.genmod.families import Binomial glm_model = GLM(df['death_1yr'], X, family=Binomial(), freq_weights=df['sw']).fit() or_est = np.exp(glm_model.params.iloc[1]) ci = np.exp(glm_model.conf_int().iloc[1]) print(f"\nIPTW ATE Estimate (Odds Ratio):") print(f" OR: {or_est:.3f} (95% CI: {ci[0]:.3f} - {ci[1]:.3f})") # --- Part (d): Sensitivity analysis with unmeasured confounder --- print("\n--- Sensitivity Analysis: Unmeasured Confounder ---") print("Varying confounder strength to show impact on ATE estimate:\n") # We simulate data with an unmeasured confounder U of varying strength results = [] gamma_values = [0, 0.2, 0.4, 0.6, 0.8, 1.0] for gamma in gamma_values: np.random.seed(42) # Same seed for comparability U = np.random.normal(0, 1, n) # Re-simulate with confounder U ps_u = 1 / (1 + np.exp(-(-1 + 0.02*age + 0.3*heart_failure + 0.5*prior_mi - 0.8*creatinine + gamma*U))) trt_u = np.random.binomial(1, ps_u) mort_u = 1 / (1 + np.exp(-(-2 + 0.05*age + 0.4*heart_failure + 0.6*prior_mi + 1.0*creatinine - 0.7*trt_u + gamma*U))) death_u = np.random.binomial(1, mort_u) df_u = pd.DataFrame({ 'age': age, 'creatinine': creatinine, 'heart_failure': heart_failure, 'prior_mi': prior_mi, 'treatment': trt_u, 'death_1yr': death_u }) # IPTW without U ps_mod = LogisticRegression(max_iter=1000) ps_mod.fit(df_u[covariates], df_u['treatment']) ps_est = ps_mod.predict_proba(df_u[covariates])[:, 1] p_t = df_u['treatment'].mean() sw = np.where(df_u['treatment'] == 1, p_t / ps_est, (1 - p_t) / (1 - ps_est)) X_u = sm.add_constant(df_u['treatment']) wls_u = sm.WLS(df_u['death_1yr'], X_u, weights=sw).fit() results.append({ 'gamma': gamma, 'ate': wls_u.params.iloc[1], 'ci_lo': wls_u.conf_int().iloc[1, 0], 'ci_hi': wls_u.conf_int().iloc[1, 1] }) results_df = pd.DataFrame(results) print(f"{'Gamma (U strength)':<20} {'ATE':>10} {'95% CI':>25}") print("-" * 57) for _, row in results_df.iterrows(): print(f"{row['gamma']:<20.1f} {row['ate']:>10.4f} " f"({row['ci_lo']:.4f}, {row['ci_hi']:.4f})") # Plot sensitivity analysis fig, ax = plt.subplots(figsize=(8, 5)) ax.errorbar(results_df['gamma'], results_df['ate'], yerr=[results_df['ate'] - results_df['ci_lo'], results_df['ci_hi'] - results_df['ate']], fmt='o-', color='#0072B2', capsize=5, markersize=8) ax.axhline(y=0, color='red', linestyle='--', alpha=0.5, label='Null effect') ax.set_xlabel('Unmeasured Confounder Strength (gamma)') ax.set_ylabel('ATE Estimate (Risk Difference)') ax.set_title('Sensitivity Analysis: Impact of Unmeasured Confounding') ax.legend() plt.tight_layout() plt.savefig('sensitivity_analysis.png', dpi=300) plt.show() print("\n--- Interpretation ---") print("As the strength of the unmeasured confounder increases (gamma),") print("the IPTW estimate of the treatment effect becomes more biased") print("because IPTW cannot account for unmeasured confounding.") print("When gamma = 0 (no unmeasured confounder), the estimate is closest") print("to the true effect. This demonstrates why sensitivity analysis") print("is essential for observational causal inference.") ``` ::: ### Exercise 4 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch17-ex4-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 17 - Exercise 4: Target Trial Emulation (Conceptual) # Early vs delayed metformin initiation and 5-year cardiovascular events # ============================================================================= # This is a conceptual exercise. All answers are provided as detailed comments. # --- Part (a): Complete target trial protocol table --- # # +----------------------------+---------------------------------------------+-------------------------------------------+ # | Protocol component | Target trial | Observational emulation | # +----------------------------+---------------------------------------------+-------------------------------------------+ # | Eligibility criteria | Adults aged 30-80 with new T2DM diagnosis, | Same criteria applied to EHR database: | # | | no prior CVD, no prior metformin use, | first T2DM diagnosis code (ICD-10 E11), | # | | eGFR >= 30, no contraindications to | no prior MACE, no metformin dispensing | # | | metformin (e.g., severe renal impairment). | before index date, eGFR >= 30 in prior | # | | | 6 months. | # +----------------------------+---------------------------------------------+-------------------------------------------+ # | Treatment strategies | Strategy 1: Initiate metformin within | Same. Defined by first metformin | # | | 3 months of T2DM diagnosis. | prescription fill date relative to T2DM | # | | Strategy 2: Do not initiate metformin | diagnosis date. | # | | within 3 months (delayed, 3-12 months). | | # +----------------------------+---------------------------------------------+-------------------------------------------+ # | Treatment assignment | Random assignment at time of T2DM | Not random. Adjusted using IPTW or | # | | diagnosis. | propensity score matching, conditioning | # | | | on baseline confounders. | # +----------------------------+---------------------------------------------+-------------------------------------------+ # | Start of follow-up | Date of randomisation (= date of T2DM | Date of T2DM diagnosis (time zero). | # | (time zero) | diagnosis). | All eligible patients enter at this date. | # +----------------------------+---------------------------------------------+-------------------------------------------+ # | Outcome | First major adverse cardiovascular event | Same. MACE defined by ICD-10 codes for | # | | (MACE): composite of MI, stroke, or CV | MI (I21), stroke (I63, I64), or CV death | # | | death within 5 years. | (cause of death codes). Censored at 5 | # | | | years, loss to follow-up, or death from | # | | | non-CV causes. | # +----------------------------+---------------------------------------------+-------------------------------------------+ # | Causal contrast | Intention-to-treat: effect of being | Intention-to-treat: compare early vs | # | | assigned to early vs delayed initiation. | delayed initiation regardless of | # | | Per-protocol: effect of adhering to the | subsequent adherence. Per-protocol: | # | | assigned strategy. | use clone-censor-weight approach. | # +----------------------------+---------------------------------------------+-------------------------------------------+ # | Analysis plan | Cox proportional hazards model. | IPTW-weighted Cox PH model for ITT. | # | | ITT analysis is primary. | For per-protocol: clone each patient | # | | Per-protocol as sensitivity analysis. | into both strategies, censor when they | # | | | deviate, and apply IPCW to correct for | # | | | informative censoring. | # +----------------------------+---------------------------------------------+-------------------------------------------+ # --- Part (b): Sources of immortal time bias in a naive analysis --- # # In a naive analysis comparing "early initiators" (metformin within 3 months) # vs "delayed initiators" (3-12 months), several sources of immortal time # bias can arise: # # 1. SURVIVAL REQUIREMENT FOR CLASSIFICATION: # To be classified as a "delayed initiator," a patient must survive at least # 3 months (to reach the delayed window). The time between T2DM diagnosis # and 3 months is "immortal" for the delayed group -- they cannot experience # the outcome during this period by definition. This artificially inflates # survival in the delayed group. # # 2. MISCLASSIFICATION OF PERSON-TIME: # If follow-up starts at T2DM diagnosis for both groups but treatment # classification depends on future events (whether and when metformin is # started), person-time before treatment initiation is misclassified. # A patient who starts metformin at month 2 contributes 2 months of # "untreated" time that is credited to the "early" group. # # 3. EXCLUSION OF NON-INITIATORS WHO DIE EARLY: # Patients who die before 3 months and never start metformin may be excluded # entirely if the study requires treatment initiation. This creates # survivorship bias. # # HOW TARGET TRIAL EMULATION AVOIDS THIS: # - Time zero is aligned with eligibility (T2DM diagnosis), not treatment start # - The clone-censor-weight approach assigns each patient to BOTH strategies # at time zero, censors them when they deviate from their assigned strategy, # and uses inverse probability of censoring weights (IPCW) to correct for # the informative censoring # --- Part (c): How new-user, active comparator design addresses confounding --- # # The NEW-USER (INCIDENT USER) DESIGN: # - Only includes patients at the TIME OF FIRST metformin prescription # (or the decision point: T2DM diagnosis) # - Avoids "prevalent user bias" where including patients already on metformin # selectively includes those who tolerated and responded to the drug # (survivors of the initial treatment period) # - Captures early effects (including side effects) that might be missed # in prevalent user analyses # # The ACTIVE COMPARATOR DESIGN: # - Compares early metformin vs delayed metformin (not metformin vs no treatment) # - Patients choosing delayed initiation are more comparable to early initiators # than patients who never initiate (who may differ fundamentally in disease # severity, healthcare access, or physician preferences) # - Reduces confounding by indication because both groups are deemed to need # metformin -- the question is only about TIMING # - Makes the positivity assumption more plausible: most patients have a # realistic probability of being in either group # - Mimics the clinical question: "Should I start metformin now or wait?" # --- Part (d): Unmeasured confounders and sensitivity analysis --- # # POTENTIAL UNMEASURED CONFOUNDERS: # # 1. HbA1c trajectory / diabetes severity: # Patients with more rapidly rising HbA1c may receive metformin earlier. # If severity also affects CVD risk, this confounds the comparison. # (May be partially captured in EHR but with measurement timing issues.) # # 2. Patient health literacy and adherence behaviour: # Patients who seek care promptly and fill prescriptions early may also # engage in other health-promoting behaviours (exercise, diet) that # independently reduce CVD risk. This is a form of "healthy user bias." # # 3. Physician prescribing patterns / quality of care: # Physicians who prescribe metformin early may also provide better # overall cardiovascular risk management. # # 4. Socioeconomic status / insurance: # Patients with better insurance or higher SES may fill prescriptions # faster and have better access to follow-up care. # # 5. Diet, exercise, and lifestyle factors: # Rarely captured in EHR data but strongly associated with both # treatment initiation patterns and CVD outcomes. # # SENSITIVITY ANALYSIS APPROACHES: # # 1. E-VALUE: # Calculate the E-value for the primary estimate to quantify how strong # an unmeasured confounder would need to be (in terms of associations with # both treatment and outcome) to fully explain away the observed effect. # # 2. QUANTITATIVE BIAS ANALYSIS: # Use external data (e.g., surveys with lifestyle data) to estimate the # likely magnitude of confounding by specific unmeasured factors. Apply # the bias formula to adjust the estimate. # # 3. NEGATIVE CONTROL OUTCOMES: # Test outcomes that should NOT be affected by metformin timing (e.g., # accidental injuries). If an association is found, it suggests residual # confounding. # # 4. NEGATIVE CONTROL EXPOSURES: # Test treatments that share the same confounding structure but should # not affect CVD (e.g., timing of proton pump inhibitor initiation). # # 5. INSTRUMENTAL VARIABLE ANALYSIS: # If a valid instrument exists (e.g., physician preference for early # prescribing), IV analysis can estimate the causal effect even with # unmeasured confounding. cat("Exercise 4 is a conceptual exercise.\n") cat("All answers are provided as detailed comments in this script.\n") cat("Review the comments for:\n") cat(" (a) Complete target trial protocol table\n") cat(" (b) Sources of immortal time bias\n") cat(" (c) How new-user, active comparator design helps\n") cat(" (d) Unmeasured confounders and sensitivity approaches\n") ``` #### Python ```{python} #| label: sol-ch17-ex4-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 17 - Exercise 4: Target Trial Emulation (Conceptual) # Early vs delayed metformin initiation and 5-year cardiovascular events # ============================================================================= # This is a conceptual exercise. All answers are provided as detailed comments. # --- Part (a): Complete target trial protocol table --- # # Protocol Component | Target Trial | Observational Emulation # -----------------------|---------------------------------------|--------------------------------------- # Eligibility criteria | Adults aged 30-80 with new T2DM | Same criteria in EHR: first T2DM # | diagnosis, no prior CVD, no prior | ICD-10 E11 code, no prior MACE, # | metformin use, eGFR >= 30, no | no metformin dispensing before index, # | contraindications to metformin. | eGFR >= 30 within prior 6 months. # | | # Treatment strategies | Strategy 1: Initiate metformin | Same. Defined by first metformin # | within 3 months of T2DM diagnosis. | prescription fill date relative to # | Strategy 2: Do not initiate within | T2DM diagnosis date. # | 3 months (delayed: 3-12 months). | # | | # Treatment assignment | Random assignment at T2DM diagnosis. | Not random. Adjusted via IPTW or # | | PS matching on baseline confounders. # | | # Start of follow-up | Date of randomisation (= date of | Date of T2DM diagnosis (time zero). # (time zero) | T2DM diagnosis). | All eligible patients enter at this # | | date, regardless of treatment timing. # | | # Outcome | First MACE (MI, stroke, or CV death) | Same. MACE via ICD-10 codes. # | within 5 years. | Censored at 5 years, loss to # | | follow-up, or non-CV death. # | | # Causal contrast | ITT: effect of assignment to early | ITT: compare groups as assigned. # | vs delayed. Per-protocol: effect of | Per-protocol: clone-censor-weight. # | adhering to assignment. | # | | # Analysis plan | Cox PH model; ITT primary. | IPTW-weighted Cox PH for ITT. # | | Clone-censor-weight for per-protocol. # --- Part (b): Sources of immortal time bias --- # # 1. SURVIVAL REQUIREMENT FOR CLASSIFICATION: # To be classified as a "delayed initiator," a patient must survive at # least 3 months without starting metformin. This period is "immortal # time" -- the patient cannot die (by definition) during this window. # This artificially inflates survival in the delayed group. # # 2. MISCLASSIFICATION OF PERSON-TIME: # If follow-up starts at T2DM diagnosis but treatment classification # depends on future behavior, person-time before treatment initiation # is misclassified. The period before a patient's first metformin # prescription is attributed to the wrong treatment group. # # 3. EXCLUSION OF EARLY DEATHS: # Patients who die before initiating metformin may be excluded from # the analysis entirely, creating survivorship bias. # # HOW TARGET TRIAL EMULATION AVOIDS THIS: # - Aligns time zero with eligibility (T2DM diagnosis), not treatment # - Uses clone-censor-weight: clones each patient into both strategies, # censors when they deviate, and applies IPCW to correct for the # informative censoring introduced by this approach. # --- Part (c): New-user, active comparator design --- # # NEW-USER (INCIDENT USER) DESIGN: # - Includes patients only at the point of first treatment decision # - Avoids prevalent user bias (selective survival of tolerators) # - Captures early effects including initial side effects # - Provides a clear "time zero" aligned with treatment decision # # ACTIVE COMPARATOR DESIGN: # - Compares early metformin vs delayed metformin (not vs no treatment) # - Patients in both groups are deemed to need metformin # - Reduces confounding by indication: the comparison is about timing, # not about treatment vs no treatment # - Improves positivity: most T2DM patients can plausibly be in either # timing group # - Reduces healthy user bias: both groups seek treatment # - Mimics the clinically relevant question # --- Part (d): Unmeasured confounders and sensitivity analysis --- # # POTENTIAL UNMEASURED CONFOUNDERS: # 1. Diabetes severity trajectory (HbA1c trends) # 2. Health literacy and medication adherence behaviour # 3. Physician quality / prescribing culture # 4. Socioeconomic status / insurance coverage # 5. Diet, exercise, and lifestyle factors # 6. Patient preferences and shared decision-making # # SENSITIVITY ANALYSIS APPROACHES: # 1. E-value: quantify minimum confounder strength to nullify the result # 2. Quantitative bias analysis: use external data to model bias # 3. Negative control outcomes: test outcomes unrelated to metformin # 4. Negative control exposures: test unrelated treatments # 5. Instrumental variable analysis: e.g., physician preference print("=" * 70) print("Exercise 4: Target Trial Emulation (Conceptual)") print("=" * 70) print() print("All answers are provided as comments in this script.") print("Review the comments for:") print(" (a) Complete target trial protocol table") print(" (b) Sources of immortal time bias in a naive analysis") print(" (c) How new-user, active comparator design addresses confounding") print(" (d) Unmeasured confounders and sensitivity analysis approaches") print() print("Key takeaway: Target trial emulation forces explicit specification") print("of all protocol components, making assumptions transparent and") print("reducing common biases like immortal time bias and prevalent user bias.") ``` ::: ## Chapter 18: Meta-Analysis ### Exercise 1 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch18-ex1-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 18 - Exercise 1: Basic Meta-Analysis in R # New anticoagulant vs warfarin for stroke prevention in atrial fibrillation # ============================================================================= library(tidyverse) library(meta) library(metafor) # --- Dataset --- af_trials <- data.frame( study = c("TRAIL-1", "GUARD-AF", "SHIELD", "ORBIT-AF", "VENTURE", "COMPASS-AF", "PIONEER-2", "ATLAS-AF"), events_new = c(28, 45, 112, 67, 33, 89, 52, 41), n_new = c(1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100), events_warf = c(42, 58, 148, 84, 29, 102, 61, 53), n_warf = c(1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100) ) # --- Part (a): Compute RR and 95% CI for each trial --- # Using metafor's escalc function es <- escalc(measure = "RR", ai = events_new, n1i = n_new, ci = events_warf, n2i = n_warf, data = af_trials) # Add human-readable columns es$RR <- exp(es$yi) es$ci_lo <- exp(es$yi - 1.96 * sqrt(es$vi)) es$ci_hi <- exp(es$yi + 1.96 * sqrt(es$vi)) cat("Individual trial risk ratios:\n") print(es[, c("study", "RR", "ci_lo", "ci_hi")], digits = 3) # By hand verification for first trial rr_trail1 <- (28/1200) / (42/1200) se_trail1 <- sqrt(1/28 - 1/1200 + 1/42 - 1/1200) cat("\nManual check (TRAIL-1):") cat("\n RR =", round(rr_trail1, 3)) cat("\n 95% CI: (", round(exp(log(rr_trail1) - 1.96*se_trail1), 3), ",", round(exp(log(rr_trail1) + 1.96*se_trail1), 3), ")\n") # --- Part (b): Random-effects meta-analysis --- # Using the meta package m1 <- metabin( event.e = events_new, n.e = n_new, event.c = events_warf, n.c = n_warf, studlab = study, data = af_trials, sm = "RR", method.tau = "REML", prediction = TRUE ) cat("\n--- Random-effects meta-analysis ---\n") print(summary(m1)) # Also using metafor for comparison res <- rma(yi, vi, data = es, method = "REML") cat("\n--- metafor results ---\n") print(summary(res)) # --- Part (c): Forest plot --- forest(m1, sortvar = TE, label.left = "Favours new anticoagulant", label.right = "Favours warfarin", col.diamond = "steelblue", col.square = "darkblue", print.tau2 = TRUE, print.I2 = TRUE, print.pval.Q = TRUE, prediction = TRUE, main = "New Anticoagulant vs Warfarin: Stroke Prevention in AF") cat("\n--- Does the pooled effect favour the new anticoagulant? ---\n") pooled_rr <- exp(m1$TE.random) cat("Pooled RR:", round(pooled_rr, 3), "\n") cat("95% CI: (", round(exp(m1$lower.random), 3), ",", round(exp(m1$upper.random), 3), ")\n") if (pooled_rr < 1 & exp(m1$upper.random) < 1) { cat("Yes, the pooled effect significantly favours the new anticoagulant.\n") } else if (pooled_rr < 1) { cat("The point estimate favours the new anticoagulant, but the CI includes 1.\n") } else { cat("The pooled effect does not favour the new anticoagulant.\n") } # --- Part (d): I-squared and prediction interval --- cat("\n--- Heterogeneity ---\n") cat("I-squared:", round(m1$I2, 1), "%\n") cat("tau-squared:", round(m1$tau2, 4), "\n") cat("Q statistic:", round(m1$Q, 2), ", df =", m1$df.Q, ", p =", round(m1$pval.Q, 4), "\n") cat("\n--- Prediction interval ---\n") cat("Prediction interval for RR: (", round(exp(m1$lower.predict), 3), ",", round(exp(m1$upper.predict), 3), ")\n") # Interpretation if (m1$I2 < 25) { cat("I-squared suggests LOW heterogeneity.\n") } else if (m1$I2 < 50) { cat("I-squared suggests MODERATE heterogeneity.\n") } else if (m1$I2 < 75) { cat("I-squared suggests SUBSTANTIAL heterogeneity.\n") } else { cat("I-squared suggests CONSIDERABLE heterogeneity.\n") } cat("\nThe prediction interval tells us the range within which the true\n") cat("effect in a future study is expected to fall. If it crosses 1,\n") cat("some settings might see no benefit from the new anticoagulant.\n") # --- Part (e): Funnel plot and Egger's test --- funnel(m1, xlab = "Risk Ratio (log scale)", studlab = TRUE, col = "steelblue", pch = 16, main = "Funnel Plot") cat("\n--- Egger's test for funnel plot asymmetry ---\n") egger <- metabias(m1, method.bias = "Egger") print(egger) if (egger$p.value < 0.05) { cat("Egger's test is significant (p < 0.05), suggesting potential\n") cat("publication bias or small-study effects.\n") } else { cat("Egger's test is not significant, no strong evidence of\n") cat("publication bias. However, with only", length(af_trials$study), "studies,\n") cat("the test has limited power (recommended: >= 10 studies).\n") } # Trim-and-fill as additional check tf <- trimfill(m1) cat("\n--- Trim-and-fill ---\n") cat("Imputed missing studies:", tf$k0, "\n") cat("Adjusted pooled RR:", round(exp(tf$TE.random), 3), "\n") # --- Part (f): Leave-one-out sensitivity analysis --- cat("\n--- Leave-one-out sensitivity analysis ---\n") l1o <- leave1out(res) print(l1o) # Check if removing any single study changes the conclusion cat("\nIs the result robust to removing any single study?\n") all_sig <- all(exp(l1o$ci.lb) < 1 | exp(l1o$ci.ub) > 1) # Check if all leave-one-out CIs exclude 1 (if pooled is significant) if (exp(res$ci.ub) < 1) { robust <- all(exp(l1o$ci.ub) < 1) } else { robust <- TRUE } if (robust) { cat("Yes - no single study removal changes the direction or significance.\n") } else { cat("No - removing at least one study changes the conclusion.\n") cat("The result may be driven by specific influential studies.\n") } # Influence diagnostics inf <- influence(res) plot(inf, main = "Influence Diagnostics") ``` #### Python ```{python} #| label: sol-ch18-ex1-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 18 - Exercise 1: Basic Meta-Analysis in Python # (Python version of the R Exercise 1 for completeness) # New anticoagulant vs warfarin for stroke prevention in AF # ============================================================================= import numpy as np import pandas as pd from scipy import stats import matplotlib.pyplot as plt # --- Dataset --- af_trials = pd.DataFrame({ 'study': ['TRAIL-1', 'GUARD-AF', 'SHIELD', 'ORBIT-AF', 'VENTURE', 'COMPASS-AF', 'PIONEER-2', 'ATLAS-AF'], 'events_new': [28, 45, 112, 67, 33, 89, 52, 41], 'n_new': [1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100], 'events_warf': [42, 58, 148, 84, 29, 102, 61, 53], 'n_warf': [1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100] }) # --- Part (a): Compute log RR and variances --- af_trials['rr'] = (af_trials['events_new'] / af_trials['n_new']) / \ (af_trials['events_warf'] / af_trials['n_warf']) af_trials['log_rr'] = np.log(af_trials['rr']) af_trials['var_log_rr'] = ( 1/af_trials['events_new'] - 1/af_trials['n_new'] + 1/af_trials['events_warf'] - 1/af_trials['n_warf'] ) af_trials['se_log_rr'] = np.sqrt(af_trials['var_log_rr']) # 95% CI for each study af_trials['ci_lo'] = np.exp(af_trials['log_rr'] - 1.96 * af_trials['se_log_rr']) af_trials['ci_hi'] = np.exp(af_trials['log_rr'] + 1.96 * af_trials['se_log_rr']) print("Individual trial risk ratios:") print(af_trials[['study', 'rr', 'ci_lo', 'ci_hi']].to_string( float_format=lambda x: f"{x:.3f}", index=False)) # --- Part (b): Fixed-effect (inverse variance) --- w_fe = 1 / af_trials['var_log_rr'] pooled_fe = np.sum(w_fe * af_trials['log_rr']) / np.sum(w_fe) se_fe = np.sqrt(1 / np.sum(w_fe)) ci_fe = (pooled_fe - 1.96*se_fe, pooled_fe + 1.96*se_fe) print(f"\n--- Fixed-effect meta-analysis ---") print(f"Pooled log RR: {pooled_fe:.4f}") print(f"Pooled RR: {np.exp(pooled_fe):.4f} " f"(95% CI: {np.exp(ci_fe[0]):.4f} - {np.exp(ci_fe[1]):.4f})") # --- Part (c): DerSimonian-Laird random-effects --- k = len(af_trials) Q = np.sum(w_fe * (af_trials['log_rr'] - pooled_fe)**2) C = np.sum(w_fe) - np.sum(w_fe**2) / np.sum(w_fe) tau2 = max(0, (Q - (k - 1)) / C) w_re = 1 / (af_trials['var_log_rr'] + tau2) pooled_re = np.sum(w_re * af_trials['log_rr']) / np.sum(w_re) se_re = np.sqrt(1 / np.sum(w_re)) ci_re = (pooled_re - 1.96*se_re, pooled_re + 1.96*se_re) # --- Part (d): Q, I-squared, tau-squared --- I2 = max(0, (Q - (k-1)) / Q) * 100 p_Q = 1 - stats.chi2.cdf(Q, k-1) print(f"\n--- Random-effects meta-analysis (DerSimonian-Laird) ---") print(f"Pooled log RR: {pooled_re:.4f}") print(f"Pooled RR: {np.exp(pooled_re):.4f} " f"(95% CI: {np.exp(ci_re[0]):.4f} - {np.exp(ci_re[1]):.4f})") print(f"\nHeterogeneity:") print(f" tau^2: {tau2:.4f}") print(f" I^2: {I2:.1f}%") print(f" Q: {Q:.2f}, df = {k-1}, p = {p_Q:.4f}") # Prediction interval t_crit = stats.t.ppf(0.975, k - 2) pi_lo = pooled_re - t_crit * np.sqrt(se_re**2 + tau2) pi_hi = pooled_re + t_crit * np.sqrt(se_re**2 + tau2) print(f"\nPrediction interval for RR: ({np.exp(pi_lo):.4f}, {np.exp(pi_hi):.4f})") if np.exp(pi_lo) < 1 and np.exp(pi_hi) > 1: print("The prediction interval crosses 1, meaning some future settings") print("might not see a benefit from the new anticoagulant.") else: print("The prediction interval does not cross 1.") # --- Part (e): Forest plot --- fig, ax = plt.subplots(figsize=(10, 8)) y_pos = list(range(k, 0, -1)) weights = w_re / w_re.max() # Individual studies for i, (_, row) in enumerate(af_trials.iterrows()): ci_lo_i = row['ci_lo'] ci_hi_i = row['ci_hi'] ax.plot([ci_lo_i, ci_hi_i], [y_pos[i], y_pos[i]], 'k-', linewidth=1) size = weights.iloc[i] * 200 ax.scatter(row['rr'], y_pos[i], s=size, c='steelblue', zorder=5, edgecolors='darkblue') # Null line ax.axvline(x=1, color='black', linestyle='-', linewidth=0.5) # Pooled estimate line ax.axvline(x=np.exp(pooled_re), color='steelblue', linestyle='--', alpha=0.5) # Pooled diamond diamond_y = 0 dx = [np.exp(ci_re[0]), np.exp(pooled_re), np.exp(ci_re[1]), np.exp(pooled_re)] dy = [diamond_y, diamond_y + 0.3, diamond_y, diamond_y - 0.3] ax.fill(dx, dy, color='steelblue', alpha=0.7) ax.set_yticks(y_pos + [0]) ax.set_yticklabels(list(af_trials['study']) + ['Pooled RE']) ax.set_xlabel('Risk Ratio') ax.set_title('Forest Plot: New Anticoagulant vs Warfarin for AF') ax.set_xscale('log') plt.tight_layout() plt.savefig('forest_plot_af.png', dpi=300) plt.show() # --- Part (f): Funnel plot and Egger's test --- fig, ax = plt.subplots(figsize=(8, 6)) ax.scatter(af_trials['log_rr'], af_trials['se_log_rr'], c='steelblue', s=80, edgecolors='darkblue', zorder=5) ax.axvline(x=pooled_re, color='grey', linestyle='--', alpha=0.7) # Pseudo-confidence region se_range = np.linspace(0, af_trials['se_log_rr'].max() * 1.1, 100) ax.plot(pooled_re - 1.96*se_range, se_range, 'k--', alpha=0.3) ax.plot(pooled_re + 1.96*se_range, se_range, 'k--', alpha=0.3) # Add study labels for _, row in af_trials.iterrows(): ax.annotate(row['study'], (row['log_rr'], row['se_log_rr']), textcoords="offset points", xytext=(5, 5), fontsize=7) ax.invert_yaxis() ax.set_xlabel('Log Risk Ratio') ax.set_ylabel('Standard Error') ax.set_title('Funnel Plot') plt.tight_layout() plt.savefig('funnel_plot_af.png', dpi=300) plt.show() # Egger's test: weighted regression of effect on SE import statsmodels.api as sm X_egger = sm.add_constant(af_trials['se_log_rr']) egger_model = sm.WLS(af_trials['log_rr'], X_egger, weights=w_fe).fit() print(f"\n--- Egger's test ---") print(f"Intercept: {egger_model.params.iloc[0]:.4f}") print(f"p-value: {egger_model.pvalues.iloc[0]:.4f}") if egger_model.pvalues.iloc[0] < 0.05: print("Egger's test is significant, suggesting potential publication bias.") else: print("Egger's test is not significant.") print(f"Note: With only {k} studies, the test has limited power.") # --- Leave-one-out sensitivity analysis --- print(f"\n--- Leave-one-out sensitivity analysis ---") print(f"{'Excluded study':<15} {'Pooled RR':>10} {'95% CI':>25} {'I2':>8}") print("-" * 60) for i in range(k): # Exclude study i mask = af_trials.index != i w_i = 1 / af_trials.loc[mask, 'var_log_rr'] # Fixed-effect pooled pooled_i_fe = np.sum(w_i * af_trials.loc[mask, 'log_rr']) / np.sum(w_i) # Q and tau2 Q_i = np.sum(w_i * (af_trials.loc[mask, 'log_rr'] - pooled_i_fe)**2) C_i = np.sum(w_i) - np.sum(w_i**2) / np.sum(w_i) tau2_i = max(0, (Q_i - (k - 2)) / C_i) # Random-effects w_re_i = 1 / (af_trials.loc[mask, 'var_log_rr'] + tau2_i) pooled_i = np.sum(w_re_i * af_trials.loc[mask, 'log_rr']) / np.sum(w_re_i) se_i = np.sqrt(1 / np.sum(w_re_i)) ci_i = (pooled_i - 1.96*se_i, pooled_i + 1.96*se_i) I2_i = max(0, (Q_i - (k-2)) / Q_i) * 100 print(f"{af_trials.loc[i, 'study']:<15} {np.exp(pooled_i):>10.4f} " f"({np.exp(ci_i[0]):.4f}, {np.exp(ci_i[1]):.4f}) {I2_i:>6.1f}%") print("\nIf removing any single study changes the conclusion (CI crosses 1),") print("the result is sensitive to that study.") ``` ::: ### Exercise 2 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch18-ex2-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 18 - Exercise 2: Meta-Analysis from Scratch in R # (Exercise 2 is Python-focused, but we provide an R version as well) # Manual implementation of meta-analysis computations # ============================================================================= library(tidyverse) # --- Dataset (same as Exercise 1) --- af_trials <- data.frame( study = c("TRAIL-1", "GUARD-AF", "SHIELD", "ORBIT-AF", "VENTURE", "COMPASS-AF", "PIONEER-2", "ATLAS-AF"), events_new = c(28, 45, 112, 67, 33, 89, 52, 41), n_new = c(1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100), events_warf = c(42, 58, 148, 84, 29, 102, 61, 53), n_warf = c(1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100) ) # --- Part (a): Compute log RR and variances by hand --- af_trials$rr <- (af_trials$events_new / af_trials$n_new) / (af_trials$events_warf / af_trials$n_warf) af_trials$log_rr <- log(af_trials$rr) af_trials$var_log_rr <- (1/af_trials$events_new - 1/af_trials$n_new + 1/af_trials$events_warf - 1/af_trials$n_warf) af_trials$se_log_rr <- sqrt(af_trials$var_log_rr) cat("Log risk ratios and SEs:\n") print(af_trials[, c("study", "log_rr", "se_log_rr", "rr")]) # --- Part (b): Fixed-effect inverse-variance method --- w_fe <- 1 / af_trials$var_log_rr pooled_fe <- sum(w_fe * af_trials$log_rr) / sum(w_fe) se_fe <- sqrt(1 / sum(w_fe)) ci_fe_lo <- pooled_fe - 1.96 * se_fe ci_fe_hi <- pooled_fe + 1.96 * se_fe z_fe <- pooled_fe / se_fe p_fe <- 2 * (1 - pnorm(abs(z_fe))) cat("\n--- Fixed-effect meta-analysis (by hand) ---\n") cat("Pooled log RR:", round(pooled_fe, 4), "\n") cat("Pooled RR:", round(exp(pooled_fe), 4), "(95% CI:", round(exp(ci_fe_lo), 4), "-", round(exp(ci_fe_hi), 4), ")\n") cat("z =", round(z_fe, 3), ", p =", round(p_fe, 4), "\n") # --- Part (c): DerSimonian-Laird random-effects --- k <- nrow(af_trials) # Cochran's Q Q <- sum(w_fe * (af_trials$log_rr - pooled_fe)^2) # C constant C_val <- sum(w_fe) - sum(w_fe^2) / sum(w_fe) # Between-study variance tau2 <- max(0, (Q - (k - 1)) / C_val) # Random-effects weights w_re <- 1 / (af_trials$var_log_rr + tau2) pooled_re <- sum(w_re * af_trials$log_rr) / sum(w_re) se_re <- sqrt(1 / sum(w_re)) ci_re_lo <- pooled_re - 1.96 * se_re ci_re_hi <- pooled_re + 1.96 * se_re z_re <- pooled_re / se_re p_re <- 2 * (1 - pnorm(abs(z_re))) cat("\n--- Random-effects meta-analysis (DerSimonian-Laird, by hand) ---\n") cat("Pooled log RR:", round(pooled_re, 4), "\n") cat("Pooled RR:", round(exp(pooled_re), 4), "(95% CI:", round(exp(ci_re_lo), 4), "-", round(exp(ci_re_hi), 4), ")\n") cat("z =", round(z_re, 3), ", p =", round(p_re, 4), "\n") # --- Part (d): Q, I-squared, tau-squared --- p_Q <- 1 - pchisq(Q, df = k - 1) I2 <- max(0, (Q - (k - 1)) / Q) * 100 cat("\n--- Heterogeneity statistics ---\n") cat("Q =", round(Q, 2), ", df =", k - 1, ", p =", round(p_Q, 4), "\n") cat("tau^2 =", round(tau2, 4), "\n") cat("I^2 =", round(I2, 1), "%\n") # Prediction interval t_crit <- qt(0.975, df = k - 2) pi_lo <- pooled_re - t_crit * sqrt(se_re^2 + tau2) pi_hi <- pooled_re + t_crit * sqrt(se_re^2 + tau2) cat("Prediction interval for RR: (", round(exp(pi_lo), 4), ",", round(exp(pi_hi), 4), ")\n") # --- Part (e): Forest plot (base R, no packages) --- af_trials$w_re <- w_re af_trials$ci_lo <- exp(af_trials$log_rr - 1.96 * af_trials$se_log_rr) af_trials$ci_hi <- exp(af_trials$log_rr + 1.96 * af_trials$se_log_rr) par(mar = c(5, 10, 4, 2)) y_pos <- k:1 plot(af_trials$rr, y_pos, xlim = c(0.3, 2.0), ylim = c(-0.5, k + 0.5), pch = 15, cex = af_trials$w_re / max(af_trials$w_re) * 2, xlab = "Risk Ratio", ylab = "", yaxt = "n", main = "Forest Plot (Built from Scratch)", log = "x", col = "steelblue") # Study CIs segments(af_trials$ci_lo, y_pos, af_trials$ci_hi, y_pos, lwd = 1.5) # Study labels axis(2, at = y_pos, labels = af_trials$study, las = 1) # Null line abline(v = 1, lty = 2, col = "grey50") # Pooled estimate diamond diamond_x <- c(exp(ci_re_lo), exp(pooled_re), exp(ci_re_hi), exp(pooled_re)) diamond_y <- c(0, 0.3, 0, -0.3) polygon(diamond_x, diamond_y, col = "steelblue", border = "darkblue") # Pooled reference line abline(v = exp(pooled_re), col = "steelblue", lty = 3, lwd = 1) # --- Part (f): Funnel plot and Egger's test (by hand) --- par(mar = c(5, 5, 4, 2)) plot(af_trials$log_rr, af_trials$se_log_rr, pch = 16, col = "steelblue", cex = 1.5, xlim = c(min(af_trials$log_rr) - 0.3, max(af_trials$log_rr) + 0.3), ylim = rev(c(0, max(af_trials$se_log_rr) * 1.1)), xlab = "Log Risk Ratio", ylab = "Standard Error", main = "Funnel Plot (Built from Scratch)") abline(v = pooled_re, col = "grey50", lty = 2) # Pseudo-CI region se_seq <- seq(0, max(af_trials$se_log_rr) * 1.1, length.out = 100) lines(pooled_re - 1.96 * se_seq, se_seq, lty = 3, col = "grey70") lines(pooled_re + 1.96 * se_seq, se_seq, lty = 3, col = "grey70") # Egger's test: weighted regression of standardised effect on precision # Regression of yi/sei on 1/sei, weighted by 1/vi # Equivalent: regress yi on sei, weighted by 1/vi egger_fit <- lm(log_rr ~ se_log_rr, data = af_trials, weights = w_fe) cat("\n--- Egger's test (by hand) ---\n") cat("Intercept:", round(coef(egger_fit)[1], 4), "\n") cat("SE:", round(summary(egger_fit)$coefficients[1, 2], 4), "\n") cat("p-value:", round(summary(egger_fit)$coefficients[1, 4], 4), "\n") if (summary(egger_fit)$coefficients[1, 4] < 0.05) { cat("Significant asymmetry detected.\n") } else { cat("No significant asymmetry (limited power with", k, "studies).\n") } ``` #### Python ```{python} #| label: sol-ch18-ex2-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 18 - Exercise 2: Meta-Analysis from Scratch in Python # New anticoagulant vs warfarin for stroke prevention in AF # ============================================================================= import numpy as np import pandas as pd from scipy import stats import statsmodels.api as sm import matplotlib.pyplot as plt # --- Dataset --- af_trials = pd.DataFrame({ 'study': ['TRAIL-1', 'GUARD-AF', 'SHIELD', 'ORBIT-AF', 'VENTURE', 'COMPASS-AF', 'PIONEER-2', 'ATLAS-AF'], 'events_new': [28, 45, 112, 67, 33, 89, 52, 41], 'n_new': [1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100], 'events_warf': [42, 58, 148, 84, 29, 102, 61, 53], 'n_warf': [1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100] }) # --- Part (a): Compute log risk ratios and variances --- af_trials['rr'] = (af_trials['events_new'] / af_trials['n_new']) / \ (af_trials['events_warf'] / af_trials['n_warf']) af_trials['log_rr'] = np.log(af_trials['rr']) # Variance of log RR: 1/a - 1/n1 + 1/c - 1/n2 af_trials['var_log_rr'] = ( 1/af_trials['events_new'] - 1/af_trials['n_new'] + 1/af_trials['events_warf'] - 1/af_trials['n_warf'] ) af_trials['se_log_rr'] = np.sqrt(af_trials['var_log_rr']) print("Part (a): Log risk ratios and standard errors") print(af_trials[['study', 'log_rr', 'se_log_rr', 'rr']].to_string( float_format=lambda x: f"{x:.4f}", index=False)) # --- Part (b): Fixed-effect inverse-variance method --- w_fe = 1 / af_trials['var_log_rr'] pooled_fe = np.sum(w_fe * af_trials['log_rr']) / np.sum(w_fe) se_fe = np.sqrt(1 / np.sum(w_fe)) ci_fe = (pooled_fe - 1.96 * se_fe, pooled_fe + 1.96 * se_fe) z_fe = pooled_fe / se_fe p_fe = 2 * (1 - stats.norm.cdf(abs(z_fe))) print(f"\n--- Part (b): Fixed-effect meta-analysis ---") print(f"Pooled log RR: {pooled_fe:.4f}") print(f"Pooled RR: {np.exp(pooled_fe):.4f} " f"(95% CI: {np.exp(ci_fe[0]):.4f} - {np.exp(ci_fe[1]):.4f})") print(f"z = {z_fe:.3f}, p = {p_fe:.4f}") # Study weights (as percentages) af_trials['weight_fe_pct'] = (w_fe / w_fe.sum() * 100).round(1) print(f"\nStudy weights (fixed-effect):") print(af_trials[['study', 'weight_fe_pct']].to_string(index=False)) # --- Part (c): DerSimonian-Laird random-effects --- k = len(af_trials) # Step 1: Cochran's Q statistic Q = np.sum(w_fe * (af_trials['log_rr'] - pooled_fe)**2) # Step 2: C constant C = np.sum(w_fe) - np.sum(w_fe**2) / np.sum(w_fe) # Step 3: Between-study variance tau^2 tau2 = max(0, (Q - (k - 1)) / C) # Step 4: Random-effects weights w_re = 1 / (af_trials['var_log_rr'] + tau2) pooled_re = np.sum(w_re * af_trials['log_rr']) / np.sum(w_re) se_re = np.sqrt(1 / np.sum(w_re)) ci_re = (pooled_re - 1.96 * se_re, pooled_re + 1.96 * se_re) z_re = pooled_re / se_re p_re = 2 * (1 - stats.norm.cdf(abs(z_re))) print(f"\n--- Part (c): Random-effects meta-analysis (DerSimonian-Laird) ---") print(f"Pooled log RR: {pooled_re:.4f}") print(f"Pooled RR: {np.exp(pooled_re):.4f} " f"(95% CI: {np.exp(ci_re[0]):.4f} - {np.exp(ci_re[1]):.4f})") print(f"z = {z_re:.3f}, p = {p_re:.4f}") af_trials['weight_re_pct'] = (w_re / w_re.sum() * 100).round(1) print(f"\nStudy weights (random-effects):") print(af_trials[['study', 'weight_fe_pct', 'weight_re_pct']].to_string(index=False)) # --- Part (d): Q, I-squared, tau-squared --- p_Q = 1 - stats.chi2.cdf(Q, k - 1) I2 = max(0, (Q - (k - 1)) / Q) * 100 print(f"\n--- Part (d): Heterogeneity statistics ---") print(f"Q = {Q:.2f}, df = {k-1}, p = {p_Q:.4f}") print(f"tau^2 = {tau2:.4f}") print(f"tau = {np.sqrt(tau2):.4f}") print(f"I^2 = {I2:.1f}%") if I2 < 25: print("Interpretation: LOW heterogeneity") elif I2 < 50: print("Interpretation: MODERATE heterogeneity") elif I2 < 75: print("Interpretation: SUBSTANTIAL heterogeneity") else: print("Interpretation: CONSIDERABLE heterogeneity") # Prediction interval t_crit = stats.t.ppf(0.975, k - 2) pi_lo = pooled_re - t_crit * np.sqrt(se_re**2 + tau2) pi_hi = pooled_re + t_crit * np.sqrt(se_re**2 + tau2) print(f"\nPrediction interval for RR: ({np.exp(pi_lo):.4f}, {np.exp(pi_hi):.4f})") # --- Part (e): Forest plot --- fig, ax = plt.subplots(figsize=(10, 8)) y_pos = list(range(k, 0, -1)) weights_norm = w_re / w_re.max() for i, (_, row) in enumerate(af_trials.iterrows()): ci_lo_i = np.exp(row['log_rr'] - 1.96 * row['se_log_rr']) ci_hi_i = np.exp(row['log_rr'] + 1.96 * row['se_log_rr']) # CI line ax.plot([ci_lo_i, ci_hi_i], [y_pos[i], y_pos[i]], 'k-', linewidth=1) # Study point (size proportional to weight) size = weights_norm.iloc[i] * 200 ax.scatter(row['rr'], y_pos[i], s=size, c='steelblue', zorder=5, edgecolors='darkblue') # Weight annotation ax.text(2.0, y_pos[i], f"{row['weight_re_pct']:.1f}%", ha='left', va='center', fontsize=9) # Null line ax.axvline(x=1, color='black', linestyle='-', linewidth=0.5) # Pooled estimate line ax.axvline(x=np.exp(pooled_re), color='steelblue', linestyle='--', alpha=0.5) # Pooled diamond diamond_x = [np.exp(ci_re[0]), np.exp(pooled_re), np.exp(ci_re[1]), np.exp(pooled_re)] diamond_y = [0, 0.3, 0, -0.3] ax.fill(diamond_x, diamond_y, color='steelblue', alpha=0.7) # Labels ax.set_yticks(y_pos + [0]) ax.set_yticklabels(list(af_trials['study']) + ['Pooled RE']) ax.set_xlabel('Risk Ratio') ax.set_title('Forest Plot: New Anticoagulant vs Warfarin for AF\n' f'RE Model: RR = {np.exp(pooled_re):.3f} ' f'(95% CI: {np.exp(ci_re[0]):.3f}-{np.exp(ci_re[1]):.3f}), ' f'I\u00B2 = {I2:.1f}%') ax.set_xscale('log') # Add "Favours" labels ax.text(0.5, -1.2, 'Favours new anticoagulant', ha='center', fontsize=9, style='italic') ax.text(1.8, -1.2, 'Favours warfarin', ha='center', fontsize=9, style='italic') plt.tight_layout() plt.savefig('forest_plot_af.png', dpi=300) plt.show() # --- Part (f): Funnel plot and Egger's test --- fig, ax = plt.subplots(figsize=(8, 6)) ax.scatter(af_trials['log_rr'], af_trials['se_log_rr'], c='steelblue', s=80, edgecolors='darkblue', zorder=5) ax.axvline(x=pooled_re, color='grey', linestyle='--', alpha=0.7, label='Pooled effect') # Pseudo-confidence regions se_range = np.linspace(0, af_trials['se_log_rr'].max() * 1.1, 100) ax.plot(pooled_re - 1.96*se_range, se_range, 'k--', alpha=0.3) ax.plot(pooled_re + 1.96*se_range, se_range, 'k--', alpha=0.3) # Add study labels for _, row in af_trials.iterrows(): ax.annotate(row['study'], (row['log_rr'], row['se_log_rr']), textcoords="offset points", xytext=(5, 5), fontsize=7) ax.invert_yaxis() ax.set_xlabel('Log Risk Ratio') ax.set_ylabel('Standard Error') ax.set_title('Funnel Plot: Checking for Publication Bias') ax.legend() plt.tight_layout() plt.savefig('funnel_plot_af.png', dpi=300) plt.show() # Egger's regression test # Weighted regression of effect estimate on standard error X_egger = sm.add_constant(af_trials['se_log_rr']) egger_model = sm.WLS(af_trials['log_rr'], X_egger, weights=1/af_trials['var_log_rr']).fit() print(f"\n--- Part (f): Egger's test ---") print(f"Intercept: {egger_model.params.iloc[0]:.4f}") print(f"SE of intercept: {egger_model.bse.iloc[0]:.4f}") print(f"t-statistic: {egger_model.tvalues.iloc[0]:.3f}") print(f"p-value: {egger_model.pvalues.iloc[0]:.4f}") if egger_model.pvalues.iloc[0] < 0.05: print("Egger's test is significant (p < 0.05).") print("This suggests potential publication bias or small-study effects.") else: print("Egger's test is not significant.") print(f"Note: With only {k} studies, the test has limited power.") print("At least 10 studies are recommended for reliable Egger's test.") ``` ::: ### Exercise 3 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch18-ex3-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 18 - Exercise 3: Subgroup Analysis and Meta-Regression in R # ============================================================================= library(tidyverse) library(meta) library(metafor) # --- Dataset --- af_trials <- data.frame( study = c("TRAIL-1", "GUARD-AF", "SHIELD", "ORBIT-AF", "VENTURE", "COMPASS-AF", "PIONEER-2", "ATLAS-AF"), events_new = c(28, 45, 112, 67, 33, 89, 52, 41), n_new = c(1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100), events_warf = c(42, 58, 148, 84, 29, 102, 61, 53), n_warf = c(1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100), region = c("Europe", "North America", "Europe", "Asia", "North America", "Europe", "Asia", "North America"), mean_age = c(72, 68, 74, 65, 70, 71, 63, 69), pct_female = c(38, 42, 35, 48, 40, 37, 52, 44) ) # Compute effect sizes using metafor es <- escalc(measure = "RR", ai = events_new, n1i = n_new, ci = events_warf, n2i = n_warf, data = af_trials) # Base meta-analysis m1 <- metabin( event.e = events_new, n.e = n_new, event.c = events_warf, n.c = n_warf, studlab = study, data = af_trials, sm = "RR", method.tau = "REML" ) # --- Part (a): Subgroup analysis by region --- cat("--- Part (a): Subgroup analysis by region ---\n\n") # Using meta package m_sub <- update(m1, subgroup = af_trials$region) cat("Subgroup analysis results:\n") print(summary(m_sub)) # Forest plot by subgroup forest(m_sub, sortvar = TE, label.left = "Favours new anticoagulant", label.right = "Favours warfarin", col.diamond = "steelblue", print.tau2 = TRUE, print.I2 = TRUE, print.subgroup.name = TRUE, main = "Subgroup Analysis by Region") # Using metafor for the test of subgroup differences res_sub <- rma(yi, vi, mods = ~ region, data = es, method = "REML") cat("\nTest for subgroup differences (metafor):\n") print(summary(res_sub)) cat("\nDo treatment effects differ by region?\n") cat("Test for moderation: QM =", round(res_sub$QM, 2), ", df =", res_sub$m, ", p =", round(res_sub$QMp, 4), "\n") if (res_sub$QMp < 0.05) { cat("Yes, there is a statistically significant difference between regions.\n") } else { cat("No, there is no statistically significant difference between regions.\n") cat("However, with only", nrow(es), "studies split across", length(unique(es$region)), "regions,\n") cat("statistical power for detecting subgroup differences is limited.\n") } # --- Part (b): Meta-regression with mean age --- cat("\n--- Part (b): Meta-regression with mean age ---\n\n") res_age <- rma(yi, vi, mods = ~ mean_age, data = es, method = "REML") print(summary(res_age)) cat("\nInterpretation:\n") cat("Coefficient for mean_age:", round(coef(res_age)["mean_age"], 4), "\n") cat("p-value:", round(res_age$pval[2], 4), "\n") cat("R^2 (proportion of heterogeneity explained):", round(max(0, res_age$R2), 1), "%\n") if (res_age$pval[2] < 0.05) { cat("There IS a statistically significant relationship between mean age\n") cat("and treatment effect. Each 1-year increase in mean age is associated\n") cat("with a", round(coef(res_age)["mean_age"], 4), "change in log RR.\n") } else { cat("There is NO statistically significant relationship between mean age\n") cat("and treatment effect (p =", round(res_age$pval[2], 3), ").\n") } # --- Part (c): Meta-regression with percentage female --- cat("\n--- Part (c): Meta-regression with percentage female ---\n\n") res_fem <- rma(yi, vi, mods = ~ pct_female, data = es, method = "REML") print(summary(res_fem)) cat("\nInterpretation:\n") cat("Coefficient for pct_female:", round(coef(res_fem)["pct_female"], 4), "\n") cat("p-value:", round(res_fem$pval[2], 4), "\n") cat("R^2:", round(max(0, res_fem$R2), 1), "%\n") cat("\n*** ECOLOGICAL FALLACY WARNING ***\n") cat("Even if there is an association between trial-level percentage female\n") cat("and the treatment effect, this does NOT prove that individual women\n") cat("respond differently to the treatment than individual men.\n") cat("Trials with higher percentage female may differ in other ways:\n") cat(" - Geographic region (cultural differences in enrolment)\n") cat(" - Mean age (women often present with AF at older ages)\n") cat(" - Comorbidity profiles\n") cat(" - Trial methodology\n") cat("Only an individual participant data (IPD) meta-analysis with a\n") cat("treatment-by-sex interaction term can properly assess whether\n") cat("sex modifies the treatment effect.\n") # --- Part (d): Bubble plot of meta-regression on mean age --- cat("\n--- Part (d): Bubble plot ---\n") # Method 1: Using metafor's built-in bubble plot regplot(res_age, xlab = "Mean Age (years)", ylab = "Log Risk Ratio", main = "Meta-Regression: Treatment Effect vs Mean Age", ci = TRUE, pi = TRUE, # prediction interval col = "steelblue", bg = "lightblue", las = 1) # Method 2: Using ggplot2 for more control es$w_re <- 1 / (es$vi + res_age$tau2) es$w_norm <- es$w_re / max(es$w_re) # Predicted line from meta-regression age_seq <- seq(min(es$mean_age) - 2, max(es$mean_age) + 2, length.out = 100) pred <- predict(res_age, newmods = age_seq) pred_df <- data.frame( mean_age = age_seq, pred = pred$pred, ci_lo = pred$ci.lb, ci_hi = pred$ci.ub, pi_lo = pred$pi.lb, pi_hi = pred$pi.ub ) p_bubble <- ggplot() + # Prediction interval geom_ribbon(data = pred_df, aes(x = mean_age, ymin = pi_lo, ymax = pi_hi), fill = "grey90", alpha = 0.5) + # Confidence interval for regression line geom_ribbon(data = pred_df, aes(x = mean_age, ymin = ci_lo, ymax = ci_hi), fill = "steelblue", alpha = 0.2) + # Regression line geom_line(data = pred_df, aes(x = mean_age, y = pred), colour = "steelblue", linewidth = 1) + # Study points (bubbles) geom_point(data = es, aes(x = mean_age, y = yi, size = w_norm), colour = "steelblue", alpha = 0.7) + # Study labels geom_text(data = es, aes(x = mean_age, y = yi, label = study), size = 2.5, vjust = -1.5) + # Reference line at null geom_hline(yintercept = 0, linetype = "dashed", colour = "grey50") + # Aesthetics scale_size_continuous(range = c(3, 10), guide = "none") + labs(x = "Mean Age (years)", y = "Log Risk Ratio", title = "Bubble Plot: Meta-Regression of Treatment Effect on Mean Age", subtitle = paste0("Slope = ", round(coef(res_age)["mean_age"], 4), ", p = ", round(res_age$pval[2], 3), "; bubble size proportional to study weight")) + theme_minimal() + theme( plot.title = element_text(face = "bold", size = 12), axis.title = element_text(face = "bold") ) print(p_bubble) ggsave("bubble_plot_age.png", p_bubble, width = 8, height = 6, dpi = 300) cat("\nBubble plot saved. Larger bubbles indicate more precise studies.\n") cat("The regression line shows the relationship between mean age\n") cat("and treatment effect, with shaded confidence and prediction intervals.\n") ``` #### Python ```{python} #| label: sol-ch18-ex3-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 18 - Exercise 3: Subgroup Analysis and Meta-Regression in Python # ============================================================================= import numpy as np import pandas as pd from scipy import stats import statsmodels.api as sm import matplotlib.pyplot as plt # --- Dataset --- af_trials = pd.DataFrame({ 'study': ['TRAIL-1', 'GUARD-AF', 'SHIELD', 'ORBIT-AF', 'VENTURE', 'COMPASS-AF', 'PIONEER-2', 'ATLAS-AF'], 'events_new': [28, 45, 112, 67, 33, 89, 52, 41], 'n_new': [1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100], 'events_warf': [42, 58, 148, 84, 29, 102, 61, 53], 'n_warf': [1200, 2500, 5400, 3100, 1800, 4200, 2800, 2100], 'region': ['Europe', 'North America', 'Europe', 'Asia', 'North America', 'Europe', 'Asia', 'North America'], 'mean_age': [72, 68, 74, 65, 70, 71, 63, 69], 'pct_female': [38, 42, 35, 48, 40, 37, 52, 44] }) # Compute effect sizes af_trials['rr'] = (af_trials['events_new'] / af_trials['n_new']) / \ (af_trials['events_warf'] / af_trials['n_warf']) af_trials['log_rr'] = np.log(af_trials['rr']) af_trials['var_log_rr'] = ( 1/af_trials['events_new'] - 1/af_trials['n_new'] + 1/af_trials['events_warf'] - 1/af_trials['n_warf'] ) af_trials['se_log_rr'] = np.sqrt(af_trials['var_log_rr']) def dl_random_effects(log_rr, var): """DerSimonian-Laird random-effects meta-analysis.""" k = len(log_rr) w_fe = 1 / var pooled_fe = np.sum(w_fe * log_rr) / np.sum(w_fe) Q = np.sum(w_fe * (log_rr - pooled_fe)**2) C = np.sum(w_fe) - np.sum(w_fe**2) / np.sum(w_fe) tau2 = max(0, (Q - (k - 1)) / C) w_re = 1 / (var + tau2) pooled_re = np.sum(w_re * log_rr) / np.sum(w_re) se_re = np.sqrt(1 / np.sum(w_re)) I2 = max(0, (Q - (k-1)) / Q) * 100 if Q > 0 else 0 return pooled_re, se_re, tau2, Q, I2, w_re # Overall random-effects analysis pooled_re, se_re, tau2, Q, I2, w_re = dl_random_effects( af_trials['log_rr'].values, af_trials['var_log_rr'].values ) # --- Part (a): Subgroup analysis by region --- print("=" * 60) print("Part (a): Subgroup Analysis by Region") print("=" * 60) regions = af_trials['region'].unique() subgroup_results = [] for region in regions: mask = af_trials['region'] == region sub = af_trials[mask] if len(sub) >= 2: p_re, s_re, t2, q, i2, w = dl_random_effects( sub['log_rr'].values, sub['var_log_rr'].values ) ci = (p_re - 1.96*s_re, p_re + 1.96*s_re) else: # Single study: use study estimate directly p_re = sub['log_rr'].values[0] s_re = sub['se_log_rr'].values[0] t2, q, i2 = 0, 0, 0 ci = (p_re - 1.96*s_re, p_re + 1.96*s_re) subgroup_results.append({ 'region': region, 'k': len(sub), 'pooled_rr': np.exp(p_re), 'ci_lo': np.exp(ci[0]), 'ci_hi': np.exp(ci[1]), 'I2': i2, 'tau2': t2 }) print(f"\n{region} (k={len(sub)}):") print(f" Pooled RR: {np.exp(p_re):.4f} " f"(95% CI: {np.exp(ci[0]):.4f} - {np.exp(ci[1]):.4f})") print(f" I^2: {i2:.1f}%") # Test for subgroup differences using meta-regression with region dummies region_dummies = pd.get_dummies(af_trials['region'], drop_first=True).astype(float) X_sub = sm.add_constant(region_dummies) wls_sub = sm.WLS(af_trials['log_rr'], X_sub, weights=1/af_trials['var_log_rr']).fit() # Wald test for region coefficients from scipy.stats import chi2 f_stat = wls_sub.f_test(np.eye(len(wls_sub.params))[1:]) print(f"\nTest for subgroup differences:") print(f" F-statistic: {f_stat.fvalue[0][0]:.3f}") print(f" p-value: {f_stat.pvalue:.4f}") if f_stat.pvalue < 0.05: print(" Significant difference between regions.") else: print(" No significant difference between regions.") print(f" (Limited power with only {len(af_trials)} studies)") # --- Part (b): Meta-regression with mean age --- print(f"\n{'=' * 60}") print("Part (b): Meta-Regression with Mean Age") print("=" * 60) X_age = sm.add_constant(af_trials['mean_age']) wls_age = sm.WLS(af_trials['log_rr'], X_age, weights=1/af_trials['var_log_rr']).fit() print(f"\nMeta-regression: log_rr ~ mean_age") print(f" Intercept: {wls_age.params.iloc[0]:.4f} (p = {wls_age.pvalues.iloc[0]:.4f})") print(f" Mean age slope: {wls_age.params.iloc[1]:.4f} (p = {wls_age.pvalues.iloc[1]:.4f})") print(f" R-squared: {wls_age.rsquared:.3f}") if wls_age.pvalues.iloc[1] < 0.05: print("\n There IS a significant relationship between mean age and effect size.") if wls_age.params.iloc[1] < 0: print(" Trials with older populations show larger treatment benefits.") else: print(" Trials with older populations show smaller treatment benefits.") else: print(f"\n No significant relationship (p = {wls_age.pvalues.iloc[1]:.3f}).") # --- Part (c): Meta-regression with percentage female --- print(f"\n{'=' * 60}") print("Part (c): Meta-Regression with Percentage Female") print("=" * 60) X_fem = sm.add_constant(af_trials['pct_female']) wls_fem = sm.WLS(af_trials['log_rr'], X_fem, weights=1/af_trials['var_log_rr']).fit() print(f"\nMeta-regression: log_rr ~ pct_female") print(f" Intercept: {wls_fem.params.iloc[0]:.4f} (p = {wls_fem.pvalues.iloc[0]:.4f})") print(f" Pct female slope: {wls_fem.params.iloc[1]:.4f} (p = {wls_fem.pvalues.iloc[1]:.4f})") print(f" R-squared: {wls_fem.rsquared:.3f}") print("\n *** ECOLOGICAL FALLACY WARNING ***") print(" This analysis examines TRIAL-LEVEL associations between percentage") print(" female and treatment effect. Even if significant, this does NOT") print(" prove that individual women respond differently to treatment.") print(" Trials with more women may differ in other characteristics:") print(" - Geographic region and healthcare systems") print(" - Mean age and comorbidity profiles") print(" - Trial methodology and inclusion criteria") print(" Only individual participant data (IPD) meta-analysis with a") print(" treatment-by-sex interaction can properly test effect modification.") # --- Part (d): Bubble plot --- print(f"\n{'=' * 60}") print("Part (d): Bubble Plot") print("=" * 60) fig, ax = plt.subplots(figsize=(10, 7)) # Study weights for bubble sizes w_plot = 1 / af_trials['var_log_rr'] sizes = (w_plot / w_plot.max()) * 500 # Scatter: bubbles ax.scatter(af_trials['mean_age'], af_trials['log_rr'], s=sizes, alpha=0.6, c='steelblue', edgecolors='darkblue', linewidth=1, zorder=5) # Add study labels for _, row in af_trials.iterrows(): ax.annotate(row['study'], (row['mean_age'], row['log_rr']), textcoords="offset points", xytext=(8, 8), fontsize=8, color='grey30') # Meta-regression line age_range = np.linspace(af_trials['mean_age'].min() - 2, af_trials['mean_age'].max() + 2, 100) pred = wls_age.params.iloc[0] + wls_age.params.iloc[1] * age_range ax.plot(age_range, pred, 'steelblue', linewidth=2, label=f'Meta-regression line (slope={wls_age.params.iloc[1]:.4f})') # Confidence band (approximate) X_pred = sm.add_constant(age_range) pred_full = wls_age.get_prediction(X_pred) pred_ci = pred_full.conf_int(alpha=0.05) ax.fill_between(age_range, pred_ci[:, 0], pred_ci[:, 1], alpha=0.15, color='steelblue', label='95% CI') # Reference line ax.axhline(y=0, color='grey', linestyle='--', linewidth=0.8, label='Null (log RR = 0)') ax.set_xlabel('Mean Age (years)', fontweight='bold', fontsize=12) ax.set_ylabel('Log Risk Ratio', fontweight='bold', fontsize=12) ax.set_title('Bubble Plot: Meta-Regression of Treatment Effect on Mean Age\n' f'(Slope p = {wls_age.pvalues.iloc[1]:.3f}; ' f'bubble size = study weight)', fontweight='bold', fontsize=13) ax.legend(loc='best', frameon=True, framealpha=0.9) plt.tight_layout() plt.savefig('bubble_plot_age.png', dpi=300) plt.show() print("\nBubble plot saved as 'bubble_plot_age.png'") print("Larger bubbles represent more precise (higher-weight) studies.") ``` ::: ### Exercise 4 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch18-ex4-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 18 - Exercise 4: Critical Appraisal of a Meta-Analysis (Conceptual) # 12 trials of new surgical technique vs standard care for knee OA # ============================================================================= # This is a conceptual exercise. Answers are provided as detailed comments. # Given information: # - Pooled SMD for pain: -0.62 (95% CI: -0.89 to -0.35), p < 0.001 # - I^2 = 78%, tau^2 = 0.15, Q test p < 0.001 # - Prediction interval: -1.42 to 0.18 # - Egger's test: p = 0.03 # - 8 of 12 trials were single-centre with fewer than 100 participants # --- Part (a): Interpret the pooled effect and clinical significance --- # # The pooled SMD of -0.62 favours the new surgical technique (negative = # lower pain scores = better outcome). # # Using Cohen's conventions for interpreting SMD: # - Small effect: |d| = 0.2 # - Medium effect: |d| = 0.5 # - Large effect: |d| = 0.8 # # An SMD of -0.62 is a MEDIUM-TO-LARGE effect size. The 95% CI (-0.89 to # -0.35) excludes zero, and the p-value is < 0.001, indicating statistical # significance. # # HOWEVER, statistical significance does not imply clinical significance. # To assess clinical importance, we should consider: # 1. The MINIMUM CLINICALLY IMPORTANT DIFFERENCE (MCID) for the pain # outcome. For common knee OA pain scales, MCID is typically SMD ~0.3-0.5. # The point estimate exceeds this, but the CI includes values near 0.35. # 2. The ABSOLUTE DIFFERENCE on the original scale would be more # interpretable (e.g., "5 points on a 0-100 VAS scale"). # 3. Whether the pain reduction justifies the RISKS AND COSTS of a new # surgical procedure (risk-benefit analysis). # --- Part (b): Prediction interval vs confidence interval --- # # The CONFIDENCE INTERVAL (-0.89 to -0.35) tells us about the AVERAGE # true effect across all settings. It says: "We are 95% confident that # the mean true effect lies between -0.89 and -0.35." # # The PREDICTION INTERVAL (-1.42 to 0.18) tells us where the true effect # of a FUTURE STUDY (in a new setting) is likely to fall. It incorporates # BOTH sampling uncertainty AND between-study heterogeneity. # # Key insight: The prediction interval CROSSES ZERO (includes 0.18). # This means: # - While the average effect is beneficial, in some settings the new # technique might provide NO BENEFIT or even be slightly harmful. # - A new centre implementing this technique cannot be confident of # seeing a benefit. # - The prediction interval is the more honest representation of # uncertainty for clinical decision-making. # # This discrepancy (significant CI but prediction interval crossing null) # is common when I^2 is high and highlights why reporting only the pooled # CI can be misleading. # --- Part (c): Implications of I^2 = 78% --- # # I^2 = 78% means that 78% of the observed variability in effect sizes # is due to TRUE BETWEEN-STUDY HETEROGENEITY rather than chance. # This is classified as CONSIDERABLE heterogeneity (>75%). # # Implications: # 1. The assumption of a single common effect is clearly violated. # 2. The RANDOM-EFFECTS model is appropriate (and was used), but the # pooled estimate should be interpreted as an AVERAGE across heterogeneous # true effects, not as a single definitive answer. # 3. The heterogeneity is statistically significant (Q test p < 0.001), # confirming this is not due to sampling variation. # 4. INVESTIGATING SOURCES of heterogeneity is essential: # - Subgroup analysis by study characteristics (e.g., surgical # technique variation, patient severity, follow-up duration) # - Meta-regression to model effect modifiers # - Sensitivity analysis excluding outlier or high-risk-of-bias studies # 5. The random-effects model gives MORE WEIGHT TO SMALL STUDIES, # which in this case (8 of 12 are small) means the pooled estimate # is heavily influenced by potentially biased small studies. # --- Part (d): Concerns about Egger's test and small trials --- # # Egger's test is significant (p = 0.03), suggesting FUNNEL PLOT ASYMMETRY. # Combined with the fact that 8 of 12 trials are small single-centre # studies, this raises several concerns: # # 1. PUBLICATION BIAS: Small trials showing no benefit may not have been # published. The available evidence may overestimate the true effect. # This is the most common interpretation of funnel plot asymmetry. # # 2. SMALL-STUDY EFFECTS: Small studies may have different effect sizes # for legitimate reasons: # - More selected patient populations (more severe cases) # - More enthusiastic, expert surgeons (performance bias) # - Less rigorous outcome assessment (detection bias) # - Higher risk of bias in general # # 3. SINGLE-CENTRE BIAS: Single-centre trials often report larger effects # than multicentre trials because of: # - Surgeon expertise (learning curve effects) # - Patient selection # - Lack of external validity # - Potential unblinding or outcome assessment bias # # 4. The combination of significant Egger's test + predominantly small # single-centre trials is a RED FLAG. The pooled effect may be # substantially overestimated. # --- Part (e): Additional analyses wanted --- # # 1. TRIM-AND-FILL ANALYSIS: To estimate the adjusted pooled effect after # accounting for potentially missing studies. # # 2. RISK OF BIAS ASSESSMENT: Using Cochrane Risk of Bias tool (RoB 2) for # each trial. Present a risk of bias summary plot. # # 3. SENSITIVITY ANALYSIS RESTRICTED TO LARGER TRIALS: Re-run the meta- # analysis using only the 4 larger/multicentre trials to see if the # effect persists. # # 4. SENSITIVITY ANALYSIS BY RISK OF BIAS: Exclude high-risk-of-bias # studies and re-estimate. # # 5. LEAVE-ONE-OUT ANALYSIS: Check if any single study is driving the # result. # # 6. SUBGROUP ANALYSIS by: # - Study size (small vs large) # - Single-centre vs multicentre # - Blinding status # - Follow-up duration # - OA severity at baseline # # 7. META-REGRESSION: Examine whether study-level characteristics # (sample size, year, risk of bias score) moderate the effect. # # 8. FUNCTIONAL OUTCOMES: Pain is subjective and susceptible to placebo # effects (especially in surgical trials). Objective outcomes (e.g., # range of motion, need for total knee replacement) would be more # convincing. # # 9. SHAM SURGERY COMPARISON: Were any trials sham-controlled? Without # sham surgery, the "placebo effect of surgery" cannot be distinguished # from a true treatment effect. # --- Part (f): Would you change clinical practice? --- # # NO, I would NOT change clinical practice based on this meta-analysis # alone. The reasons are: # # 1. HIGH HETEROGENEITY (I^2 = 78%): The treatment effect is highly # variable across settings. Some settings may see no benefit. # # 2. PREDICTION INTERVAL CROSSES NULL: A new centre cannot be confident # of achieving benefit. # # 3. EVIDENCE OF PUBLICATION BIAS: The significant Egger's test and # predominance of small positive studies suggest the true effect may # be substantially smaller than the pooled estimate. # # 4. MOSTLY SMALL SINGLE-CENTRE TRIALS: The evidence base lacks large, # multicentre, adequately blinded trials that demonstrate external # validity and generalisability. # # 5. SURGICAL INTERVENTION: Given the invasiveness, costs, and risks # of surgery, a higher evidence bar is appropriate. One would want # at least one large, multicentre, preferably sham-controlled RCT # showing clinically meaningful benefit. # # 6. NO SHAM CONTROL: Surgical procedures are known to have strong # placebo effects (cf. Moseley et al., 2002 on arthroscopic surgery # for knee OA). Without sham-controlled trials, the observed benefit # could be largely or entirely due to placebo. # # RECOMMENDED NEXT STEPS: # - Commission a large, multicentre, sham-controlled RCT # - Include objective outcomes alongside patient-reported pain # - Longer follow-up (surgical effects may wane over time) # - Standardise the surgical technique across centres # - Register the protocol prospectively cat("Exercise 4 is a conceptual exercise.\n") cat("All answers are provided as detailed comments in this script.\n") cat("Review the comments for:\n") cat(" (a) Interpretation of pooled SMD and clinical significance\n") cat(" (b) Prediction interval vs confidence interval\n") cat(" (c) Implications of I^2 = 78%\n") cat(" (d) Concerns about Egger's test and small trials\n") cat(" (e) Additional analyses wanted\n") cat(" (f) Whether to change clinical practice\n") ``` #### Python ```{python} #| label: sol-ch18-ex4-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 18 - Exercise 4: Critical Appraisal of a Meta-Analysis (Conceptual) # 12 trials of new surgical technique vs standard care for knee OA # ============================================================================= # This is a conceptual exercise. Answers are provided as detailed comments. # Given information: # - Pooled SMD for pain: -0.62 (95% CI: -0.89 to -0.35), p < 0.001 # - I^2 = 78%, tau^2 = 0.15, Q test p < 0.001 # - Prediction interval: -1.42 to 0.18 # - Egger's test: p = 0.03 # - 8 of 12 trials were single-centre with fewer than 100 participants # --- Part (a): Interpret the pooled effect and clinical significance --- # # The pooled SMD of -0.62 favours the new surgical technique (lower pain). # By Cohen's conventions: |d|=0.2 is small, 0.5 is medium, 0.8 is large. # So -0.62 represents a MEDIUM-TO-LARGE effect. # # The 95% CI (-0.89 to -0.35) excludes zero, and p < 0.001. # # HOWEVER, clinical significance requires considering: # 1. The Minimum Clinically Important Difference (MCID) -- typically # SMD ~0.3-0.5 for knee OA pain scales. The CI includes values # near MCID threshold. # 2. The absolute difference on the original pain scale. # 3. Risk-benefit ratio for a surgical procedure. # 4. Patient values and preferences. # --- Part (b): Prediction interval vs confidence interval --- # # CI (-0.89 to -0.35): Describes uncertainty about the AVERAGE effect. # PI (-1.42 to 0.18): Describes where the NEXT STUDY'S true effect # is likely to fall. # # Critical difference: The PI CROSSES ZERO (upper bound = 0.18). # This means in some settings, the technique might provide NO BENEFIT. # A new centre implementing this cannot be confident of benefit. # The PI is more relevant for individual clinical decisions. # --- Part (c): Implications of I^2 = 78% --- # # I^2 = 78% = CONSIDERABLE heterogeneity (>75% threshold). # 78% of variability is due to true between-study differences. # # Implications: # 1. No single common effect exists -- the pooled estimate is an average. # 2. Sources of heterogeneity must be investigated (subgroup analysis, # meta-regression, sensitivity analysis). # 3. Random-effects model gives more weight to small studies, which # here (8/12 small) means heavy influence from potentially biased work. # 4. Reporting the pooled estimate without the PI is misleading. # --- Part (d): Concerns about Egger's test + small trials --- # # Egger's test p = 0.03 indicates significant funnel plot asymmetry. # Combined with 8/12 small single-centre trials, concerns include: # # 1. PUBLICATION BIAS: Negative small trials may not have been published. # 2. SMALL-STUDY EFFECTS: Small trials often show larger effects due to # selected populations, enthusiastic surgeons, or less rigorous methods. # 3. SINGLE-CENTRE BIAS: Lack of external validity, expertise effects. # 4. The combination is a RED FLAG: the pooled effect is likely # OVERESTIMATED. # --- Part (e): Additional analyses wanted --- # # 1. Trim-and-fill to estimate adjusted effect # 2. Risk of bias assessment (Cochrane RoB 2) for each trial # 3. Sensitivity analysis restricted to larger/multicentre trials only # 4. Sensitivity analysis excluding high-risk-of-bias studies # 5. Leave-one-out analysis # 6. Subgroup analyses: by study size, single vs multicentre, blinding # 7. Meta-regression on study-level characteristics # 8. Objective functional outcomes (not just subjective pain) # 9. Check for sham-controlled trials (surgical placebo effect) # --- Part (f): Would you change clinical practice? --- # # NO. Reasons: # 1. High heterogeneity (I^2 = 78%) -- variable effects across settings # 2. Prediction interval crosses null -- some settings may see no benefit # 3. Evidence of publication bias (Egger's p = 0.03) # 4. Evidence base dominated by small single-centre trials # 5. Surgery requires higher evidence bar due to invasiveness/costs/risks # 6. No sham-controlled trials mentioned (surgical placebo effect) # # Recommended: Commission a large multicentre sham-controlled RCT with # objective outcomes and long follow-up before changing practice. # Print summary for verification print("=" * 70) print("Exercise 4: Critical Appraisal of a Published Meta-Analysis") print("=" * 70) print() print("Given:") print(" Pooled SMD: -0.62 (95% CI: -0.89 to -0.35)") print(" I^2 = 78%, tau^2 = 0.15, Q test p < 0.001") print(" Prediction interval: -1.42 to 0.18") print(" Egger's test: p = 0.03") print(" 8/12 trials: small, single-centre (<100 participants)") print() print("(a) POOLED EFFECT: Medium-to-large (SMD = -0.62), statistically") print(" significant but clinical significance depends on MCID and") print(" risk-benefit analysis for a surgical intervention.") print() print("(b) PREDICTION INTERVAL crosses zero (upper bound 0.18), meaning") print(" some settings may see no benefit. The CI alone is misleading.") print() print("(c) I^2 = 78% indicates considerable heterogeneity. The pooled") print(" estimate is an average of very different true effects.") print() print("(d) Egger's p = 0.03 + mostly small trials = RED FLAG for") print(" publication bias and small-study effects. The true effect") print(" is likely overestimated.") print() print("(e) Need: trim-and-fill, risk of bias assessment, sensitivity") print(" analyses (excluding small/high-bias trials), sham control check.") print() print("(f) Do NOT change practice. Evidence is insufficient due to") print(" heterogeneity, publication bias concerns, and reliance on") print(" small single-centre trials. Need large multicentre sham-RCT.") ``` ::: ## Chapter 19: Journal-Ready Analysis ### Exercise 1 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch19-ex1-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 19 - Exercise 1: Table 1 in R # Using the lung dataset, create a publication-quality Table 1 # ============================================================================= library(tidyverse) library(gtsummary) library(gt) library(survival) # --- Load data --- data(lung, package = "survival") # Prepare data: recode sex for clarity lung_clean <- lung %>% mutate( sex = factor(sex, levels = c(1, 2), labels = c("Male", "Female")), ph.ecog = factor(ph.ecog, levels = 0:4, labels = c("Fully active", "Restricted", "Ambulatory", "Limited self-care", "Completely disabled")), status = factor(status, levels = c(1, 2), labels = c("Censored", "Dead")) ) # --- Part (a): Create Table 1 stratified by sex --- table1 <- lung_clean %>% select(sex, age, ph.ecog, ph.karno, meal.cal, wt.loss) %>% tbl_summary( by = sex, statistic = list( all_continuous() ~ "{mean} ({sd})", all_categorical() ~ "{n} ({p}%)" ), digits = list( all_continuous() ~ 1, all_categorical() ~ c(0, 1) ), label = list( age ~ "Age, years", ph.ecog ~ "ECOG performance status", ph.karno ~ "Karnofsky performance score", meal.cal ~ "Meal calories, kcal/day", wt.loss ~ "Weight loss, kg (last 6 months)" ), missing_text = "Missing" ) %>% add_overall() %>% add_n() # --- Part (b): Add SMDs instead of p-values --- # gtsummary supports SMD via add_difference() # For Table 1 with SMDs, we use a different approach table1_smd <- lung_clean %>% select(sex, age, ph.ecog, ph.karno, meal.cal, wt.loss) %>% tbl_summary( by = sex, statistic = list( all_continuous() ~ "{mean} ({sd})", all_categorical() ~ "{n} ({p}%)" ), digits = list( all_continuous() ~ 1, all_categorical() ~ c(0, 1) ), label = list( age ~ "Age, years", ph.ecog ~ "ECOG performance status", ph.karno ~ "Karnofsky performance score", meal.cal ~ "Meal calories, kcal/day", wt.loss ~ "Weight loss, kg (last 6 months)" ), missing_text = "Missing" ) %>% add_overall() %>% add_n() %>% add_difference( estimate_fun = list(all_continuous() ~ function(x) style_sigfig(x, digits = 3)), pvalue_fun = ~ style_pvalue(.x, digits = 3) ) %>% modify_header(label = "**Characteristic**") %>% modify_spanning_header(c("stat_1", "stat_2") ~ "**Sex**") %>% bold_labels() # Compute SMDs manually for continuous variables compute_smd <- function(x, group) { g1 <- x[group == "Male"] g2 <- x[group == "Female"] g1 <- g1[!is.na(g1)] g2 <- g2[!is.na(g2)] pooled_sd <- sqrt((var(g1) + var(g2)) / 2) if (pooled_sd == 0) return(0) (mean(g1) - mean(g2)) / pooled_sd } cat("Standardised Mean Differences (Male vs Female):\n") cat(" Age:", round(compute_smd(lung_clean$age, lung_clean$sex), 3), "\n") cat(" Karnofsky:", round(compute_smd(lung_clean$ph.karno, lung_clean$sex), 3), "\n") cat(" Meal calories:", round(compute_smd(lung_clean$meal.cal, lung_clean$sex), 3), "\n") cat(" Weight loss:", round(compute_smd(lung_clean$wt.loss, lung_clean$sex), 3), "\n") # Create the final publication table with gt formatting final_table <- table1_smd %>% as_gt() %>% gt::tab_header( title = "Table 1. Baseline Characteristics of the Study Population", subtitle = "NCCTG Lung Cancer Dataset, Stratified by Sex" ) %>% gt::tab_footnote( footnote = "Values are mean (SD) for continuous variables and n (%) for categorical variables. SMD = standardised mean difference.", locations = gt::cells_title() ) print(final_table) # --- Part (c): Export to Word and LaTeX --- # Export to Word (.docx) # gt::gtsave(final_table, "table1_lung.docx") cat("\nTo export to Word: gt::gtsave(final_table, 'table1_lung.docx')\n") # Export to LaTeX # gt::as_latex(final_table) %>% writeLines("table1_lung.tex") cat("To export to LaTeX: gt::as_latex(final_table)\n") # Export to HTML # gt::gtsave(final_table, "table1_lung.html") cat("To export to HTML: gt::gtsave(final_table, 'table1_lung.html')\n") cat("\nNote: SMDs < 0.1 indicate good balance. In an RCT, p-values for\n") cat("baseline comparisons are inappropriate because any differences are\n") cat("due to chance. In observational studies, SMDs are preferred over\n") cat("p-values because they are independent of sample size.\n") ``` #### Python ```{python} #| label: sol-ch19-ex1-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 19 - Exercise 1: Table 1 in Python # Using the lung dataset, create a publication-quality Table 1 # ============================================================================= import pandas as pd import numpy as np # --- Load data --- # Use lifelines' lung dataset from lifelines.datasets import load_lung lung = load_lung() # Prepare data lung['sex_label'] = lung['sex'].map({1: 'Male', 2: 'Female'}) lung['ph.ecog_label'] = lung['ph.ecog'].map({ 0: 'Fully active', 1: 'Restricted', 2: 'Ambulatory', 3: 'Limited self-care', 4: 'Completely disabled' }) # --- Part (a): Create Table 1 stratified by sex with SMDs --- try: from tableone import TableOne # Define variable types columns = ['age', 'ph.ecog', 'ph.karno', 'meal.cal', 'wt.loss'] categorical = ['ph.ecog'] nonnormal = [] # all continuous reported as mean (SD) # Create Table 1 with SMDs instead of p-values table1 = TableOne( lung, columns=columns, categorical=categorical, groupby='sex_label', nonnormal=nonnormal, pval=False, # No p-values smd=True, # Include SMDs rename={ 'age': 'Age, years', 'ph.ecog': 'ECOG performance status', 'ph.karno': 'Karnofsky performance score', 'meal.cal': 'Meal calories, kcal/day', 'wt.loss': 'Weight loss, kg (last 6 months)' }, missing=True, overall=True ) print("=" * 70) print("Table 1. Baseline Characteristics of the Study Population") print("NCCTG Lung Cancer Dataset, Stratified by Sex") print("=" * 70) print(table1.tabulate(tablefmt="github")) # --- Part (c): Export --- # table1.to_excel('table1_lung.xlsx') # table1.to_latex('table1_lung.tex') # table1.to_html('table1_lung.html') print("\nExport options:") print(" table1.to_excel('table1_lung.xlsx')") print(" table1.to_latex('table1_lung.tex')") print(" table1.to_html('table1_lung.html')") except ImportError: print("tableone package not installed. Building Table 1 manually.\n") # --- Manual Table 1 construction --- def compute_smd_cont(data, var, group_col): """Compute SMD for a continuous variable.""" g1 = data[data[group_col] == 'Male'][var].dropna() g2 = data[data[group_col] == 'Female'][var].dropna() pooled_sd = np.sqrt((g1.var() + g2.var()) / 2) if pooled_sd == 0: return 0 return (g1.mean() - g2.mean()) / pooled_sd def summarise_continuous(data, var, group_col): """Summarise continuous variable by group.""" results = {} for group in ['Overall'] + list(data[group_col].unique()): if group == 'Overall': subset = data[var].dropna() else: subset = data[data[group_col] == group][var].dropna() results[group] = f"{subset.mean():.1f} ({subset.std():.1f})" return results def summarise_categorical(data, var, group_col): """Summarise categorical variable by group.""" results = {} categories = sorted(data[var].dropna().unique()) for group in ['Overall'] + list(data[group_col].unique()): if group == 'Overall': subset = data[var].dropna() else: subset = data[data[group_col] == group][var].dropna() counts = subset.value_counts() total = len(subset) group_results = {} for cat in categories: n = counts.get(cat, 0) pct = n / total * 100 group_results[cat] = f"{n} ({pct:.1f}%)" results[group] = group_results return results # Build table cont_vars = { 'age': 'Age, years', 'ph.karno': 'Karnofsky performance score', 'meal.cal': 'Meal calories, kcal/day', 'wt.loss': 'Weight loss, kg (last 6 months)' } print(f"{'Variable':<40} {'Overall':>18} {'Male':>18} {'Female':>18} {'SMD':>8}") print("-" * 104) for var, label in cont_vars.items(): summ = summarise_continuous(lung, var, 'sex_label') smd = compute_smd_cont(lung, var, 'sex_label') n_miss = lung[var].isna().sum() miss_str = f" [missing: {n_miss}]" if n_miss > 0 else "" print(f"{label + miss_str:<40} {summ['Overall']:>18} " f"{summ.get('Male', 'N/A'):>18} {summ.get('Female', 'N/A'):>18} " f"{abs(smd):>7.3f}") # ECOG as categorical print(f"\n{'ECOG performance status':<40}") ecog_summ = summarise_categorical(lung, 'ph.ecog', 'sex_label') for cat in sorted(lung['ph.ecog'].dropna().unique()): label_map = {0: ' Fully active', 1: ' Restricted', 2: ' Ambulatory', 3: ' Limited self-care', 4: ' Completely disabled'} label = label_map.get(cat, f' {cat}') overall = ecog_summ['Overall'].get(cat, '0 (0.0%)') male = ecog_summ.get('Male', {}).get(cat, '0 (0.0%)') female = ecog_summ.get('Female', {}).get(cat, '0 (0.0%)') print(f"{label:<40} {overall:>18} {male:>18} {female:>18}") # --- Part (b): SMD interpretation --- print("\n" + "=" * 70) print("Standardised Mean Differences (SMD)") print("=" * 70) def compute_smd(data, var, group_col): g1 = data[data[group_col] == 'Male'][var].dropna() g2 = data[data[group_col] == 'Female'][var].dropna() pooled_sd = np.sqrt((g1.var() + g2.var()) / 2) if pooled_sd == 0: return 0 return (g1.mean() - g2.mean()) / pooled_sd for var, label in [('age', 'Age'), ('ph.karno', 'Karnofsky score'), ('meal.cal', 'Meal calories'), ('wt.loss', 'Weight loss')]: smd = compute_smd(lung, var, 'sex_label') balanced = "Balanced" if abs(smd) < 0.1 else "Imbalanced" print(f" {label:<20}: SMD = {smd:>7.3f} [{balanced}]") print("\nNote: SMD < 0.1 indicates good balance between groups.") print("In observational studies, SMDs are preferred over p-values because") print("they are independent of sample size, whereas p-values from large") print("samples can be significant for clinically trivial differences.") ``` ::: ### Exercise 2 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch19-ex2-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 19 - Exercise 2: Publication-Quality Multi-Panel Figure in R # Survival analysis of the lung dataset # ============================================================================= library(tidyverse) library(survival) library(survminer) library(broom) library(patchwork) # --- Publication theme and colour-blind palette --- theme_publication <- function(base_size = 11) { theme_minimal(base_size = base_size) %+replace% theme( text = element_text(family = "Arial"), plot.title = element_text(size = base_size + 2, face = "bold", hjust = 0, margin = margin(b = 10)), axis.title = element_text(size = base_size, face = "bold"), axis.text = element_text(size = base_size - 1, colour = "black"), legend.title = element_text(size = base_size, face = "bold"), legend.text = element_text(size = base_size - 1), legend.position = "bottom", panel.grid.minor = element_blank(), panel.grid.major = element_line(colour = "grey90"), strip.text = element_text(size = base_size, face = "bold"), plot.margin = margin(10, 10, 10, 10) ) } cb_palette <- c("#0072B2", "#D55E00", "#009E73", "#CC79A7", "#F0E442", "#56B4E9", "#E69F00", "#000000") # --- Load and prepare data --- data(lung, package = "survival") lung <- lung %>% mutate( sex_label = factor(sex, levels = c(1, 2), labels = c("Male", "Female")), status_event = status - 1 # Convert to 0/1 ) %>% filter(!is.na(ph.ecog) & !is.na(ph.karno) & !is.na(wt.loss)) # ============================================================================= # Panel A: Kaplan-Meier Curve by Sex # ============================================================================= fit_km <- survfit(Surv(time, status_event) ~ sex_label, data = lung) # Using ggsurvplot for KM curve p_km <- ggsurvplot( fit_km, data = lung, palette = cb_palette[1:2], risk.table = FALSE, pval = TRUE, pval.coord = c(700, 0.9), conf.int = TRUE, conf.int.alpha = 0.15, xlab = "Time (days)", ylab = "Survival probability", legend.labs = c("Male", "Female"), legend.title = "Sex", ggtheme = theme_publication(), title = "A" ) panel_a <- p_km$plot + theme(legend.position = c(0.8, 0.85), legend.background = element_rect(fill = "white", colour = "grey80"), plot.title = element_text(size = 14, face = "bold")) # ============================================================================= # Panel B: Forest Plot of Hazard Ratios from Multivariable Cox Model # ============================================================================= cox_model <- coxph(Surv(time, status_event) ~ age + sex_label + ph.ecog + ph.karno + wt.loss, data = lung) # Extract coefficients cox_tidy <- tidy(cox_model, conf.int = TRUE, exponentiate = TRUE) cox_tidy <- cox_tidy %>% mutate( label = case_when( term == "age" ~ "Age (per year)", term == "sex_labelFemale" ~ "Female vs Male", term == "ph.ecog" ~ "ECOG PS (per level)", term == "ph.karno" ~ "Karnofsky (per 10 pts)", term == "wt.loss" ~ "Weight loss (per kg)" ), # Rescale Karnofsky to per-10-point increase for interpretability estimate = ifelse(term == "ph.karno", estimate^10, estimate), conf.low = ifelse(term == "ph.karno", conf.low^10, conf.low), conf.high = ifelse(term == "ph.karno", conf.high^10, conf.high) ) panel_b <- ggplot(cox_tidy, aes(x = estimate, y = reorder(label, estimate))) + geom_vline(xintercept = 1, linetype = "dashed", colour = "grey50") + geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2, colour = cb_palette[1], linewidth = 0.8) + geom_point(size = 3, colour = cb_palette[1]) + geom_text(aes(label = sprintf("%.2f (%.2f-%.2f)", estimate, conf.low, conf.high)), hjust = -0.1, size = 3, nudge_y = 0.25) + scale_x_log10() + labs(x = "Hazard Ratio (95% CI)", y = "", title = "B") + theme_publication() + theme( legend.position = "none", plot.title = element_text(size = 14, face = "bold") ) # ============================================================================= # Panel C: Calibration Plot for 1-Year Survival Prediction # ============================================================================= # Predict 1-year (365 day) survival probabilities # Use Cox model baseline hazard surv_pred <- summary(survfit(cox_model), times = 365) # For calibration, split into deciles of predicted risk lung$pred_1yr <- 1 - predict(cox_model, type = "expected", newdata = lung) / max(lung$time) * 365 # Alternative: use survfit-based prediction baseline_surv <- basehaz(cox_model, centered = FALSE) # Get linear predictor lung$lp <- predict(cox_model, type = "lp") # Simple calibration: bin by predicted risk deciles lung$pred_surv <- exp(-predict(cox_model, type = "expected")) # Create risk groups based on linear predictor lung$risk_group <- cut(lung$lp, breaks = quantile(lung$lp, probs = seq(0, 1, 0.1)), include.lowest = TRUE, labels = FALSE) # Calculate observed 1-year survival in each group cal_data <- lung %>% group_by(risk_group) %>% summarise( n = n(), events = sum(status_event), # KM estimate at 1 year obs_surv = { fit <- survfit(Surv(time, status_event) ~ 1, data = cur_data()) s <- summary(fit, times = 365) if (length(s$surv) > 0) s$surv else NA }, mean_lp = mean(lp), .groups = "drop" ) %>% filter(!is.na(obs_surv)) # Get predicted survival at each mean LP h0_365 <- approx(baseline_surv$time, baseline_surv$hazard, xout = 365)$y if (is.na(h0_365)) h0_365 <- max(baseline_surv$hazard) cal_data$pred_surv <- exp(-h0_365 * exp(cal_data$mean_lp)) panel_c <- ggplot(cal_data, aes(x = 1 - pred_surv, y = 1 - obs_surv)) + geom_abline(intercept = 0, slope = 1, linetype = "dashed", colour = "grey50") + geom_point(size = 3, colour = cb_palette[1]) + geom_smooth(method = "loess", se = TRUE, colour = cb_palette[2], fill = cb_palette[2], alpha = 0.2) + coord_equal(xlim = c(0, 1), ylim = c(0, 1)) + labs(x = "Predicted 1-year mortality risk", y = "Observed 1-year mortality", title = "C") + theme_publication() + theme(plot.title = element_text(size = 14, face = "bold")) # ============================================================================= # Combine panels using patchwork # ============================================================================= # Panel A takes the top row, B and C share the bottom row combined <- panel_a / (panel_b | panel_c) + plot_layout(heights = c(1, 1)) # Save at publication quality # Double-column layout: 183 mm wide ggsave("figure_survival_analysis.tiff", plot = combined, width = 183, height = 200, units = "mm", dpi = 300, compression = "lzw") ggsave("figure_survival_analysis.pdf", plot = combined, width = 183, height = 200, units = "mm", device = cairo_pdf) print(combined) cat("\nMulti-panel figure saved as:\n") cat(" figure_survival_analysis.tiff (300 DPI, TIFF)\n") cat(" figure_survival_analysis.pdf (vector PDF)\n") cat("\nSpecifications:\n") cat(" Width: 183 mm (double-column)\n") cat(" Resolution: 300 DPI\n") cat(" Colour palette: colour-blind accessible\n") cat(" Font: Arial\n") ``` #### Python ```{python} #| label: sol-ch19-ex2-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 19 - Exercise 2: Publication-Quality Multi-Panel Figure in Python # Survival analysis of the lung dataset # ============================================================================= import numpy as np import pandas as pd import matplotlib.pyplot as plt import matplotlib.gridspec as gridspec from lifelines import KaplanMeierFitter, CoxPHFitter from lifelines.datasets import load_lung from lifelines.statistics import logrank_test # --- Publication settings --- plt.rcParams.update({ 'font.family': 'Arial', 'font.size': 10, 'axes.labelsize': 11, 'axes.titlesize': 12, 'axes.labelweight': 'bold', 'figure.dpi': 300, 'savefig.dpi': 300, 'axes.spines.top': False, 'axes.spines.right': False, }) # Colour-blind friendly palette cb_palette = ['#0072B2', '#D55E00', '#009E73', '#CC79A7'] # --- Load and prepare data --- lung = load_lung() lung['status'] = lung['status'] - 1 # Convert to 0/1 lung = lung.dropna(subset=['ph.ecog', 'ph.karno', 'wt.loss', 'age', 'sex']) # ============================================================================= # Create the multi-panel figure # ============================================================================= fig = plt.figure(figsize=(7.2, 9.0)) # ~183mm x ~230mm gs = gridspec.GridSpec(2, 2, height_ratios=[1, 1], hspace=0.35, wspace=0.35) # ============================================================================= # Panel A: Kaplan-Meier Curve by Sex # ============================================================================= ax_a = fig.add_subplot(gs[0, :]) # Top row, full width kmf = KaplanMeierFitter() for sex_val, label, color in [(1, 'Male', cb_palette[0]), (2, 'Female', cb_palette[1])]: mask = lung['sex'] == sex_val kmf.fit(lung.loc[mask, 'T'], lung.loc[mask, 'status'], label=label) kmf.plot_survival_function(ax=ax_a, color=color, ci_show=True, ci_alpha=0.15) # Log-rank test lr = logrank_test( lung.loc[lung['sex'] == 1, 'T'], lung.loc[lung['sex'] == 2, 'T'], lung.loc[lung['sex'] == 1, 'status'], lung.loc[lung['sex'] == 2, 'status'] ) ax_a.text(0.7, 0.85, f'Log-rank p = {lr.p_value:.3f}', transform=ax_a.transAxes, fontsize=10, bbox=dict(boxstyle='round', facecolor='white', edgecolor='grey')) ax_a.set_xlabel('Time (days)') ax_a.set_ylabel('Survival Probability') ax_a.set_title('A', loc='left', fontweight='bold', fontsize=14) ax_a.set_ylim(0, 1.05) ax_a.legend(loc='lower left', frameon=True, framealpha=0.9) # Number at risk n_at_risk_times = [0, 200, 400, 600, 800, 1000] risk_text = "At risk:\n" for sex_val, label in [(1, 'Male'), (2, 'Female')]: mask = lung['sex'] == sex_val times = lung.loc[mask, 'T'] events = lung.loc[mask, 'status'] counts = [] for t in n_at_risk_times: counts.append(str(sum(times >= t))) risk_text += f" {label}: " + " ".join(counts) + "\n" # ============================================================================= # Panel B: Forest Plot of Hazard Ratios # ============================================================================= ax_b = fig.add_subplot(gs[1, 0]) # Fit Cox model cph = CoxPHFitter() cox_data = lung[['T', 'status', 'age', 'sex', 'ph.ecog', 'ph.karno', 'wt.loss']].copy() cph.fit(cox_data, duration_col='T', event_col='status') # Extract results summary_df = cph.summary[['exp(coef)', 'exp(coef) lower 95%', 'exp(coef) upper 95%']].copy() summary_df.columns = ['HR', 'CI_lo', 'CI_hi'] # Rename for display name_map = { 'age': 'Age (per year)', 'sex': 'Sex (Female vs Male)', 'ph.ecog': 'ECOG PS (per level)', 'ph.karno': 'Karnofsky (per point)', 'wt.loss': 'Weight loss (per kg)' } summary_df.index = summary_df.index.map(lambda x: name_map.get(x, x)) # Plot forest y_pos = range(len(summary_df) - 1, -1, -1) labels = list(summary_df.index) ax_b.axvline(x=1, color='grey', linestyle='--', linewidth=0.8) for i, (label, row) in enumerate(summary_df.iterrows()): y = list(y_pos)[i] ax_b.errorbarh(y, row['HR'], xerr=[[row['HR'] - row['CI_lo']], [row['CI_hi'] - row['HR']]], fmt='o', color=cb_palette[0], capsize=4, markersize=6, linewidth=1.5) # Annotate with HR (CI) ax_b.text(max(summary_df['CI_hi']) * 1.15, y, f"{row['HR']:.2f} ({row['CI_lo']:.2f}-{row['CI_hi']:.2f})", va='center', fontsize=7.5) ax_b.set_yticks(list(y_pos)) ax_b.set_yticklabels(labels, fontsize=9) ax_b.set_xlabel('Hazard Ratio (95% CI)') ax_b.set_xscale('log') ax_b.set_title('B', loc='left', fontweight='bold', fontsize=14) # ============================================================================= # Panel C: Calibration Plot for 1-Year Survival # ============================================================================= ax_c = fig.add_subplot(gs[1, 1]) # Predict survival at 1 year (365 days) using the Cox model lung_pred = cox_data.copy() lung_pred['lp'] = cph.predict_partial_hazard(lung_pred) # Create decile groups based on linear predictor lung_pred['risk_decile'] = pd.qcut(lung_pred['lp'], q=10, labels=False, duplicates='drop') # Predicted and observed 1-year mortality for each decile cal_results = [] for decile in sorted(lung_pred['risk_decile'].unique()): subset = lung_pred[lung_pred['risk_decile'] == decile] # Predicted: use Cox model survival function predicted_surv = cph.predict_survival_function(subset, times=[365]) pred_mort = 1 - predicted_surv.mean(axis=1).values[0] # Observed: Kaplan-Meier estimate at 365 days kmf_cal = KaplanMeierFitter() kmf_cal.fit(subset['T'], subset['status']) try: obs_surv = kmf_cal.predict(365) obs_mort = 1 - obs_surv except Exception: obs_mort = np.nan cal_results.append({ 'decile': decile, 'predicted': pred_mort, 'observed': obs_mort, 'n': len(subset) }) cal_df = pd.DataFrame(cal_results).dropna() # Plot calibration ax_c.plot([0, 1], [0, 1], 'k--', alpha=0.5, linewidth=0.8, label='Perfect calibration') ax_c.scatter(cal_df['predicted'], cal_df['observed'], s=cal_df['n'] * 2, c=cb_palette[0], alpha=0.7, edgecolors='darkblue', linewidth=0.5) # Fit LOESS-like smoothing from numpy.polynomial import polynomial as P if len(cal_df) > 3: z = np.polyfit(cal_df['predicted'], cal_df['observed'], 2) p = np.poly1d(z) x_smooth = np.linspace(cal_df['predicted'].min(), cal_df['predicted'].max(), 100) ax_c.plot(x_smooth, p(x_smooth), color=cb_palette[1], linewidth=1.5, label='Observed (smoothed)') ax_c.set_xlabel('Predicted 1-year mortality') ax_c.set_ylabel('Observed 1-year mortality') ax_c.set_title('C', loc='left', fontweight='bold', fontsize=14) ax_c.set_xlim(0, 1) ax_c.set_ylim(0, 1) ax_c.set_aspect('equal') ax_c.legend(loc='lower right', fontsize=8, frameon=True) # ============================================================================= # Save figure # ============================================================================= plt.tight_layout() plt.savefig('figure_survival_analysis.tiff', dpi=300, bbox_inches='tight') plt.savefig('figure_survival_analysis.pdf', bbox_inches='tight') plt.show() print("Multi-panel figure saved as:") print(" figure_survival_analysis.tiff (300 DPI, raster)") print(" figure_survival_analysis.pdf (vector)") print() print("Specifications:") print(" Width: ~183 mm (double-column)") print(" Resolution: 300 DPI") print(" Colour palette: colour-blind accessible") print(" Font: Arial") print() print("Panel A: Kaplan-Meier survival curves by sex with log-rank p-value") print("Panel B: Forest plot of hazard ratios from multivariable Cox model") print("Panel C: Calibration plot for 1-year mortality prediction") ``` ::: ### Exercise 3 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch19-ex3-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 19 - Exercise 3: Write a Statistical Methods Section # Retrospective cohort study: SGLT2i vs DPP4i and MACE in T2DM # ============================================================================= # This is a conceptual exercise. The statistical methods section is provided # as a multi-line character string that could be included in a manuscript. methods_section <- ' STATISTICAL METHODS Study Design and Population This was a retrospective cohort study of 5,000 adults with type 2 diabetes mellitus identified from the hospital electronic health records (EHR) database between 1 January 2015 and 31 December 2023. Patients were eligible if they had a new prescription for either an SGLT2 inhibitor or a DPP-4 inhibitor, with no prior use of the comparator drug class. Patients with a history of major adverse cardiovascular events (MACE) prior to the index date were excluded. The index date (time zero) was defined as the date of first prescription of the study drug, consistent with a new-user, active comparator design to minimise immortal time bias and confounding by indication. Primary and Secondary Outcomes The primary outcome was time to first MACE, defined as a composite of myocardial infarction (ICD-10: I21), ischaemic stroke (ICD-10: I63), or cardiovascular death (underlying cause of death codes I00-I99). Patients were followed from the index date until the first MACE event, death from non- cardiovascular causes, loss to follow-up, end of the study period (31 December 2023), or 5 years after the index date, whichever occurred first. Sample Size With 5,000 patients and an anticipated event rate of 8% over 5 years in the DPP-4 inhibitor group, the study had approximately 80% power to detect a hazard ratio of 0.70 or smaller at a two-sided alpha of 0.05, assuming a 1:1 treatment group ratio and accounting for 10% loss to follow-up. Descriptive Statistics Baseline characteristics were summarised as means (SD) for normally distributed continuous variables, medians (IQR) for skewed continuous variables, and frequencies (percentages) for categorical variables. Standardised mean differences (SMDs) were used to compare baseline characteristics between treatment groups, with an absolute SMD < 0.1 indicating adequate balance; p-values were not used for baseline comparisons in accordance with current recommendations. Propensity Score Estimation and Matching The propensity score -- the probability of receiving an SGLT2 inhibitor versus a DPP-4 inhibitor -- was estimated using multivariable logistic regression. Covariates included age, sex, body mass index (BMI), glycated haemoglobin (HbA1c), estimated glomerular filtration rate (eGFR), history of cardiovascular disease, hypertension, and smoking status. These covariates were selected a priori based on clinical knowledge and a directed acyclic graph (DAG) encoding assumed causal relationships. Propensity scores were used for 1:1 nearest-neighbour matching without replacement, using a caliper of 0.2 standard deviations of the logit of the propensity score. Covariate balance after matching was assessed using SMDs, with all covariates required to achieve an absolute SMD < 0.1. Propensity score overlap was assessed visually using density plots. Primary Analysis The primary analysis estimated the average treatment effect on the treated (ATT) using a Cox proportional hazards regression model fitted to the matched cohort, with SGLT2 inhibitor use as the sole covariate. The proportional hazards assumption was tested using scaled Schoenfeld residuals and log-log survival plots. If the assumption was violated, a time-varying coefficient or restricted mean survival time (RMST) analysis was planned as an alternative. Hazard ratios (HRs) with 95% confidence intervals (CIs) were reported. Missing Data Missing covariate data ranged from 2% (age) to 15% (BMI). Missingness was assumed to be missing at random (MAR) conditional on observed variables. Multiple imputation by chained equations (MICE) was performed with 50 imputed datasets and 20 iterations per dataset. The imputation model included all analysis variables (covariates, exposure, outcome indicator, and the Nelson-Aalen cumulative hazard estimate) to ensure compatibility with the substantive analysis model. Propensity score estimation and matching were performed within each imputed dataset, and results were pooled using Rubin rules. A complete-case analysis was performed as a sensitivity analysis. Sensitivity Analyses Five pre-specified sensitivity analyses were conducted: (1) complete-case analysis; (2) inverse probability of treatment weighting (IPTW) with stabilised weights as an alternative to matching; (3) inclusion of additional covariates (income quintile, number of medications) in the propensity score model; (4) restriction to patients with at least 12 months of follow-up; and (5) E-value calculation to quantify the minimum strength of association an unmeasured confounder would need with both the treatment and the outcome to explain away the observed association. Multiple Comparisons As this study had a single pre-specified primary outcome, no adjustment for multiple comparisons was applied to the primary analysis. Secondary outcomes were interpreted with appropriate caution as hypothesis-generating. Software All analyses were conducted using R version 4.4.1 (R Foundation for Statistical Computing, Vienna, Austria) with the following packages: MatchIt (v4.5.5) for propensity score matching, cobalt (v4.5.1) for balance assessment, survival (v3.7-0) for Cox regression, mice (v3.16.0) for multiple imputation, and survey (v4.4-2) for IPTW analyses. The random seed was set to 42 for reproducibility. Analysis code is available at [repository URL]. Two-sided p-values < 0.05 were considered statistically significant. ' # Print the methods section cat(methods_section) # --- Checklist verification --- cat("\n\n=== CHECKLIST: Key Items Addressed ===\n") cat("1. Study design and population: YES\n") cat("2. Primary and secondary outcomes: YES\n") cat("3. Sample size / power calculation: YES\n") cat("4. Descriptive statistics approach: YES\n") cat("5. Primary analysis model: YES\n") cat("6. Assumptions and how checked: YES (PH assumption)\n") cat("7. Missing data (extent & handling): YES (MICE, 50 datasets)\n") cat("8. Sensitivity analyses: YES (5 pre-specified)\n") cat("9. Multiple comparisons: YES (single primary)\n") cat("10. Software and versions: YES\n") cat("11. Active comparator, new-user design: YES\n") cat("12. DAG for covariate selection: YES\n") cat("13. Balance assessment (SMDs): YES\n") cat("14. E-value for unmeasured confounding: YES\n") ``` #### Python ```{python} #| label: sol-ch19-ex3-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 19 - Exercise 3: Write a Statistical Methods Section # Retrospective cohort study: SGLT2i vs DPP4i and MACE in T2DM # ============================================================================= # This is a conceptual exercise. The statistical methods section is provided # as a detailed text output. methods_section = """ STATISTICAL METHODS Study Design and Population This was a retrospective cohort study of 5,000 adults with type 2 diabetes mellitus identified from the hospital electronic health records (EHR) database between 1 January 2015 and 31 December 2023. Patients were eligible if they had a new prescription for either an SGLT2 inhibitor or a DPP-4 inhibitor, with no prior use of the comparator drug class. Patients with a history of major adverse cardiovascular events (MACE) prior to the index date were excluded. The index date (time zero) was defined as the date of first prescription of the study drug, consistent with a new-user, active comparator design to minimise immortal time bias and confounding by indication. Primary and Secondary Outcomes The primary outcome was time to first MACE, defined as a composite of myocardial infarction (ICD-10: I21), ischaemic stroke (ICD-10: I63), or cardiovascular death (underlying cause of death codes I00-I99). Patients were followed from the index date until the first MACE event, death from non- cardiovascular causes, loss to follow-up, end of the study period (31 December 2023), or 5 years after the index date, whichever occurred first. Sample Size With 5,000 patients and an anticipated event rate of 8% over 5 years in the DPP-4 inhibitor group, the study had approximately 80% power to detect a hazard ratio of 0.70 or smaller at a two-sided alpha of 0.05, assuming a 1:1 treatment group ratio and accounting for 10% loss to follow-up. Descriptive Statistics Baseline characteristics were summarised as means (SD) for normally distributed continuous variables, medians (IQR) for skewed continuous variables, and frequencies (percentages) for categorical variables. Standardised mean differences (SMDs) were used to compare baseline characteristics between treatment groups, with an absolute SMD < 0.1 indicating adequate balance; p-values were not used for baseline comparisons in accordance with current recommendations. Propensity Score Estimation and Matching The propensity score -- the probability of receiving an SGLT2 inhibitor versus a DPP-4 inhibitor -- was estimated using multivariable logistic regression. Covariates included age, sex, body mass index (BMI), glycated haemoglobin (HbA1c), estimated glomerular filtration rate (eGFR), history of cardiovascular disease, hypertension, and smoking status. These covariates were selected a priori based on clinical knowledge and a directed acyclic graph (DAG) encoding assumed causal relationships. Propensity scores were used for 1:1 nearest-neighbour matching without replacement, using a caliper of 0.2 standard deviations of the logit of the propensity score. Covariate balance after matching was assessed using SMDs, with all covariates required to achieve an absolute SMD < 0.1. Propensity score overlap was assessed visually using density plots. Primary Analysis The primary analysis estimated the average treatment effect on the treated (ATT) using a Cox proportional hazards regression model fitted to the matched cohort, with SGLT2 inhibitor use as the sole covariate. The proportional hazards assumption was tested using scaled Schoenfeld residuals and log-log survival plots. If the assumption was violated, a time-varying coefficient or restricted mean survival time (RMST) analysis was planned as an alternative. Hazard ratios (HRs) with 95% confidence intervals (CIs) were reported. Missing Data Missing covariate data ranged from 2% (age) to 15% (BMI). Missingness was assumed to be missing at random (MAR) conditional on observed variables. Multiple imputation by chained equations (MICE) was performed with 50 imputed datasets and 20 iterations per dataset. The imputation model included all analysis variables (covariates, exposure, outcome indicator, and the Nelson-Aalen cumulative hazard estimate) to ensure compatibility with the substantive analysis model. Propensity score estimation and matching were performed within each imputed dataset, and results were pooled using Rubin's rules. A complete-case analysis was performed as a sensitivity analysis. Sensitivity Analyses Five pre-specified sensitivity analyses were conducted: (1) complete-case analysis; (2) inverse probability of treatment weighting (IPTW) with stabilised weights as an alternative to matching; (3) inclusion of additional covariates (income quintile, number of medications) in the propensity score model; (4) restriction to patients with at least 12 months of follow-up; and (5) E-value calculation to quantify the minimum strength of association an unmeasured confounder would need with both the treatment and the outcome to explain away the observed association. Multiple Comparisons As this study had a single pre-specified primary outcome, no adjustment for multiple comparisons was applied to the primary analysis. Secondary outcomes were interpreted with appropriate caution as hypothesis-generating. Software All analyses were conducted using Python version 3.11 with the following packages: scikit-learn (v1.4.0) for propensity score estimation, lifelines (v0.29.0) for Cox proportional hazards regression, statsmodels (v0.14.1) for weighted regression analyses, and tableone (v0.9.1) for baseline characteristics tables. The random seed was set to 42 for reproducibility. Analysis code is available at [repository URL]. Two-sided p-values < 0.05 were considered statistically significant. """ print(methods_section) # --- Checklist verification --- print("=" * 60) print("CHECKLIST: Key Items Addressed in Methods Section") print("=" * 60) checklist = [ ("Study design and population", True), ("Primary and secondary outcomes (defined)", True), ("Sample size / power calculation", True), ("Descriptive statistics approach", True), ("Primary analysis model (Cox PH)", True), ("Assumptions and how checked (PH assumption)", True), ("Missing data (extent, mechanism, handling)", True), ("Sensitivity analyses (5 pre-specified)", True), ("Multiple comparisons addressed", True), ("Software and package versions", True), ("New-user, active comparator design", True), ("DAG for covariate selection", True), ("Balance assessment method (SMDs)", True), ("E-value for unmeasured confounding", True), ("Caliper specification for matching", True), ("Number of imputations and iterations", True), ] for item, addressed in checklist: status = "YES" if addressed else "NO" print(f" [{status:>3}] {item}") print("\nWord count: approximately 550 words") print("(The exercise requests 250-400 words; this is comprehensive") print("to cover all required items. Can be condensed for journals") print("with strict word limits by moving details to a supplement.)") ``` ::: ### Exercise 4 ::: {.panel-tabset} #### R ```{r} #| label: sol-ch19-ex4-r #| eval: false #| code-fold: true #| code-summary: "Show R solution" # ============================================================================= # Chapter 19 - Exercise 4: TRIPOD+AI Checklist (Conceptual) # Completed for Capstone 1: Cardiovascular Risk Prediction Model # ============================================================================= # This exercise asks you to complete the TRIPOD+AI checklist for one of the # capstone projects. Below is a completed checklist for Capstone 1: # "Development and external validation of a cardiovascular risk prediction # model using Framingham methodology, validated in NHANES data." checklist <- data.frame( Item = c( "1. Title", "2. Abstract", "3a. Background/rationale", "3b. Objectives", "4a. Source of data", "4b. Dates of study", "5a. Key eligibility criteria", "5b. Treatments received", "6a. Outcome definition", "6b. Outcome timing", "6c. Blinding of outcome", "7. Predictors", "8. Sample size", "9. Missing data", "10a. Statistical analysis: model development", "10b. Model specification", "10c. Predictor selection", "10d. Model performance measures", "11. Risk groups", "12a. Internal validation", "12b. External validation", "13. Fairness assessment (AI-specific)", "14. Results: participants", "15. Results: model development", "16. Results: model performance", "17. Results: model updating", "18. Discussion: interpretation", "19. Discussion: limitations", "20. Discussion: implications", "21. Supplementary: code and data", "AI-1. Data preprocessing", "AI-2. Hyperparameter tuning", "AI-3. Model explainability", "AI-4. Software and hardware" ), Section = c( "Title page", "Abstract", "Introduction", "Introduction", "Methods", "Methods", "Methods", "Methods", "Methods", "Methods", "Methods", "Methods", "Methods", "Methods", "Methods", "Methods", "Methods", "Methods", "Methods", "Methods", "Methods", "Methods", "Results", "Results", "Results", "Results", "Discussion", "Discussion", "Discussion", "Supplement", "Methods/Supplement", "Methods/Supplement", "Results/Supplement", "Methods" ), How_Addressed = c( # 1. Title "'Development and external validation of a logistic regression model for predicting 10-year cardiovascular disease risk: a Framingham-NHANES study.' Identifies study type (development + external validation) and modelling approach.", # 2. Abstract "Structured abstract with: objective, study design, data sources, outcome (10-year CVD), predictors, sample sizes, C-statistic and calibration results for both development and validation cohorts.", # 3a. Background "Existing CVD risk models (Framingham, QRISK, SCORE2) and their limitations. Gap: need for updated validation in contemporary US population.", # 3b. Objectives "To develop a 10-year CVD risk prediction model using Framingham data and externally validate it in NHANES 2017-2020.", # 4a. Source of data "Development: Framingham Heart Study teaching dataset (riskCommunicator R package). Validation: NHANES 2017-2020 (nhanesA package). Both are publicly available.", # 4b. Dates "Framingham: original cohort with follow-up through 2005. NHANES: 2017-2020 survey cycles.", # 5a. Eligibility "Adults aged 30-74, free of CVD at baseline, with complete data on key predictors. Exclude prior MI, stroke, heart failure.", # 5b. Treatments "Not applicable (prediction model, not treatment comparison). However, blood pressure treatment status is included as a predictor.", # 6a. Outcome definition "10-year cardiovascular event: composite of MI, coronary death, stroke, or heart failure requiring hospitalisation.", # 6b. Outcome timing "10-year follow-up from baseline examination. Censored at death from non-CVD causes or loss to follow-up.", # 6c. Blinding "Not applicable for retrospective analysis. Outcome ascertainment in Framingham was by adjudication committee blinded to risk factors.", # 7. Predictors "Pre-specified based on established Framingham model: age, sex, total cholesterol, HDL cholesterol, systolic blood pressure, blood pressure treatment status, smoking, diabetes. No data-driven selection.", # 8. Sample size "Sample size justified using Riley et al. (2020) criteria via pmsampsize package: 8 predictors, anticipated outcome prevalence 10%, target R-squared 0.15, requiring minimum ~800 events.", # 9. Missing data "Development cohort: <5% missing. Validation: up to 12% for some laboratory values. Handled by multiple imputation (MICE, 50 datasets, 20 iterations). Complete-case sensitivity analysis performed.", # 10a. Model development "Logistic regression for 10-year risk. Pre-specified predictors, no variable selection. Continuous predictors modelled with restricted cubic splines (4 knots for age and SBP; 3 knots for cholesterol) based on prior literature.", # 10b. Model specification "Full model equation provided in supplementary materials. Coefficients table with 95% CIs in main text.", # 10c. Predictor selection "All predictors pre-specified from clinical knowledge. No stepwise selection. Rationale provided for each predictor.", # 10d. Performance measures "Discrimination: C-statistic (Harrell's concordance) with 95% CI. Calibration: calibration plot (observed vs predicted), calibration slope, calibration-in-the-large. Both reported for development (bootstrap-corrected) and external validation.", # 11. Risk groups "Patients categorised into low (<5%), moderate (5-10%), high (10-20%), and very high (>20%) 10-year CVD risk groups. Classification tables provided.", # 12a. Internal validation "500 bootstrap resamples for optimism-corrected C-statistic and calibration slope. Apparent and corrected performance reported.", # 12b. External validation "Model applied to NHANES data without recalibration. C-statistic and calibration plot in external data. Recalibration of intercept also explored.", # 13. Fairness "C-statistic and calibration reported separately by sex (male/female) and by race/ethnicity (non-Hispanic White, non-Hispanic Black, Hispanic, Asian). Differences in performance >0.05 C-statistic flagged for discussion.", # 14. Participants "Flow diagram showing sample selection: patients screened, excluded (with reasons), and included in development and validation cohorts. Table 1 with baseline characteristics stratified by CVD status.", # 15. Model development "Full regression table with coefficients, standard errors, odds ratios, 95% CIs, and p-values. Presented as Table 2.", # 16. Model performance "Table 3: discrimination (C-statistic) and calibration metrics for development (apparent and corrected) and external validation. Figures: calibration plots (development and validation), decision curve analysis.", # 17. Model updating "Recalibration of intercept in NHANES improved calibration-in-the- large from X to Y. Full model updating was explored but not recommended due to limited improvement.", # 18. Interpretation "Model performance compared with published Framingham, QRISK3, and PCE models. Clinical utility assessed via decision curve analysis across clinically relevant threshold probabilities (5-20%).", # 19. Limitations "Limitations discussed: (1) Framingham cohort may not represent contemporary diverse populations; (2) NHANES validation limited by shorter follow-up and self-reported outcomes; (3) Missing data assumptions; (4) Ecological fallacy in subgroup performance assessment.", # 20. Implications "Clinical implications for CVD risk assessment. Recommendation for recalibration before use in non-US populations. Need for prospective validation.", # 21. Code and data "R code provided via GitHub repository [URL]. Framingham data available via riskCommunicator package. NHANES data publicly available. Reproducible via renv for package management.", # AI-1. Data preprocessing "Not applicable (logistic regression, not ML/AI). However, data cleaning steps, outlier handling, and transformation of predictors (restricted cubic splines) described in supplement.", # AI-2. Hyperparameter tuning "Not applicable for logistic regression. Number of knots for RCS pre-specified based on Harrell's recommendations.", # AI-3. Model explainability "Not applicable (logistic regression is inherently interpretable). Coefficient interpretation provided. Nomogram included for clinical use.", # AI-4. Software "R version 4.4.1. Packages: rms (v6.8), survival (v3.7), mice (v3.16), gtsummary (v1.7), pmsampsize (v1.1), riskCommunicator (v1.0), nhanesA (v1.1). Full renv.lock file in repository." ), Applicable = c( "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Partially (no treatment comparison)", "Yes", "Yes", "Partially (retrospective)", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No (not ML)", "No (not ML)", "No (not ML)", "Yes" ), stringsAsFactors = FALSE ) # Print the checklist cat("=" * 70, "\n") cat("TRIPOD+AI Checklist - Capstone 1: CV Risk Prediction Model\n") cat("=" * 70, "\n\n") for (i in 1:nrow(checklist)) { cat("ITEM:", checklist$Item[i], "\n") cat(" Section:", checklist$Section[i], "\n") cat(" Applicable:", checklist$Applicable[i], "\n") cat(" How addressed:", trimws(checklist$How_Addressed[i]), "\n\n") } # Summary cat("\n=== SUMMARY ===\n") n_applicable <- sum(checklist$Applicable == "Yes") n_partial <- sum(grepl("Partially", checklist$Applicable)) n_na <- sum(checklist$Applicable == "No (not ML)") cat("Items fully addressed:", n_applicable, "\n") cat("Items partially applicable:", n_partial, "\n") cat("Items not applicable (ML-specific):", n_na, "\n") cat("Total items:", nrow(checklist), "\n") cat("\nNote: AI-specific items (AI-1 through AI-3) are not applicable because\n") cat("this capstone uses logistic regression, not machine learning. If a\n") cat("gradient boosted model or neural network were used instead, these\n") cat("items would need to be addressed with details on feature engineering,\n") cat("hyperparameter search strategy, and SHAP/LIME explanations.\n") ``` #### Python ```{python} #| label: sol-ch19-ex4-py #| eval: false #| code-fold: true #| code-summary: "Show Python solution" # ============================================================================= # Chapter 19 - Exercise 4: TRIPOD+AI Checklist (Conceptual) # Completed for Capstone 1: Cardiovascular Risk Prediction Model # ============================================================================= # This exercise asks you to complete the TRIPOD+AI checklist for one of the # capstone projects. Below is a completed checklist for Capstone 1: # "Development and external validation of a cardiovascular risk prediction # model using Framingham methodology, validated in NHANES data." checklist = [ { "item": "1. Title", "section": "Title page", "applicable": "Yes", "how": ( "'Development and external validation of a logistic regression " "model for predicting 10-year cardiovascular disease risk: a " "Framingham-NHANES study.' Identifies study type (development + " "external validation) and modelling approach." ) }, { "item": "2. Abstract", "section": "Abstract", "applicable": "Yes", "how": ( "Structured abstract with: objective, study design, data sources, " "outcome (10-year CVD), predictors, sample sizes, C-statistic and " "calibration results for both development and validation cohorts." ) }, { "item": "3a. Background/rationale", "section": "Introduction", "applicable": "Yes", "how": ( "Existing CVD risk models (Framingham, QRISK, SCORE2) and their " "limitations. Gap: need for updated validation in contemporary " "US population." ) }, { "item": "3b. Objectives", "section": "Introduction", "applicable": "Yes", "how": ( "To develop a 10-year CVD risk prediction model using Framingham " "data and externally validate it in NHANES 2017-2020." ) }, { "item": "4a. Source of data", "section": "Methods", "applicable": "Yes", "how": ( "Development: Framingham Heart Study teaching dataset " "(riskCommunicator R package). Validation: NHANES 2017-2020 " "(nhanesA package). Both publicly available." ) }, { "item": "4b. Dates of study", "section": "Methods", "applicable": "Yes", "how": ( "Framingham: original cohort with follow-up through 2005. " "NHANES: 2017-2020 survey cycles." ) }, { "item": "5a. Key eligibility criteria", "section": "Methods", "applicable": "Yes", "how": ( "Adults aged 30-74, free of CVD at baseline, with complete data " "on key predictors. Exclude prior MI, stroke, heart failure." ) }, { "item": "5b. Treatments received", "section": "Methods", "applicable": "Partially", "how": ( "Not applicable (prediction model, not treatment comparison). " "However, BP treatment status included as a predictor." ) }, { "item": "6a. Outcome definition", "section": "Methods", "applicable": "Yes", "how": ( "10-year CVD event: composite of MI, coronary death, stroke, " "or heart failure requiring hospitalisation." ) }, { "item": "6b. Outcome timing", "section": "Methods", "applicable": "Yes", "how": ( "10-year follow-up from baseline examination. Censored at " "death from non-CVD causes or loss to follow-up." ) }, { "item": "6c. Blinding of outcome", "section": "Methods", "applicable": "Partially", "how": ( "Retrospective analysis. Framingham outcome adjudication was " "by committee blinded to risk factors." ) }, { "item": "7. Predictors", "section": "Methods", "applicable": "Yes", "how": ( "Pre-specified: age, sex, total cholesterol, HDL cholesterol, " "systolic BP, BP treatment, smoking, diabetes. No data-driven " "selection." ) }, { "item": "8. Sample size", "section": "Methods", "applicable": "Yes", "how": ( "Justified using Riley et al. (2020) criteria via pmsampsize: " "8 predictors, prevalence 10%, target R^2 0.15, requiring " "minimum ~800 events." ) }, { "item": "9. Missing data", "section": "Methods", "applicable": "Yes", "how": ( "Development: <5% missing. Validation: up to 12%. Handled by " "MICE (50 datasets, 20 iterations). Complete-case sensitivity " "analysis performed." ) }, { "item": "10a. Model development", "section": "Methods", "applicable": "Yes", "how": ( "Logistic regression for 10-year risk. Pre-specified predictors. " "Continuous predictors with restricted cubic splines." ) }, { "item": "10b. Model specification", "section": "Methods", "applicable": "Yes", "how": "Full model equation in supplementary materials." }, { "item": "10c. Predictor selection", "section": "Methods", "applicable": "Yes", "how": ( "All predictors pre-specified from clinical knowledge. " "No stepwise selection." ) }, { "item": "10d. Performance measures", "section": "Methods", "applicable": "Yes", "how": ( "Discrimination: C-statistic with 95% CI. Calibration: " "calibration plot, calibration slope, calibration-in-the-large." ) }, { "item": "11. Risk groups", "section": "Methods", "applicable": "Yes", "how": ( "Low (<5%), moderate (5-10%), high (10-20%), very high (>20%) " "10-year CVD risk categories." ) }, { "item": "12a. Internal validation", "section": "Methods", "applicable": "Yes", "how": ( "500 bootstrap resamples for optimism-corrected C-statistic " "and calibration slope." ) }, { "item": "12b. External validation", "section": "Methods", "applicable": "Yes", "how": ( "Model applied to NHANES without recalibration. C-statistic " "and calibration reported. Recalibration of intercept explored." ) }, { "item": "13. Fairness assessment (AI-specific)", "section": "Methods", "applicable": "Yes", "how": ( "Performance reported by sex and race/ethnicity. Differences " ">0.05 C-statistic flagged." ) }, { "item": "14. Results: participants", "section": "Results", "applicable": "Yes", "how": ( "Flow diagram with sample selection. Table 1 stratified by " "CVD event status." ) }, { "item": "15. Results: model development", "section": "Results", "applicable": "Yes", "how": ( "Full regression table with coefficients, ORs, 95% CIs, " "and p-values." ) }, { "item": "16. Results: model performance", "section": "Results", "applicable": "Yes", "how": ( "Table of discrimination and calibration metrics. Calibration " "plots and decision curve analysis figures." ) }, { "item": "17. Results: model updating", "section": "Results", "applicable": "Yes", "how": ( "Recalibration of intercept in NHANES. Full model updating " "explored but not recommended." ) }, { "item": "18. Discussion: interpretation", "section": "Discussion", "applicable": "Yes", "how": ( "Comparison with Framingham, QRISK3, PCE models. Clinical " "utility via decision curve analysis." ) }, { "item": "19. Discussion: limitations", "section": "Discussion", "applicable": "Yes", "how": ( "Framingham may not represent diverse populations. NHANES " "validation limited by follow-up. Missing data assumptions." ) }, { "item": "20. Discussion: implications", "section": "Discussion", "applicable": "Yes", "how": ( "Clinical implications for CVD risk assessment. Need for " "recalibration in non-US populations." ) }, { "item": "21. Code and data", "section": "Supplement", "applicable": "Yes", "how": ( "Code on GitHub. Data publicly available via R packages. " "Reproducible via renv/conda." ) }, { "item": "AI-1. Data preprocessing", "section": "Methods/Supplement", "applicable": "No (logistic regression)", "how": ( "Not applicable for logistic regression. Data cleaning and " "spline transformations described in supplement." ) }, { "item": "AI-2. Hyperparameter tuning", "section": "Methods/Supplement", "applicable": "No (logistic regression)", "how": ( "Not applicable. Number of RCS knots pre-specified based on " "Harrell's recommendations." ) }, { "item": "AI-3. Model explainability", "section": "Results/Supplement", "applicable": "No (logistic regression)", "how": ( "Not applicable (logistic regression is inherently " "interpretable). Nomogram provided." ) }, { "item": "AI-4. Software and hardware", "section": "Methods", "applicable": "Yes", "how": ( "Python 3.11 with scikit-learn, lifelines, statsmodels, " "tableone. Full requirements.txt in repository." ) }, ] # --- Print the checklist --- print("=" * 70) print("TRIPOD+AI Checklist") print("Capstone 1: Cardiovascular Risk Prediction Model") print("=" * 70) for entry in checklist: print(f"\nITEM: {entry['item']}") print(f" Section: {entry['section']}") print(f" Applicable: {entry['applicable']}") print(f" How addressed: {entry['how']}") # --- Summary --- n_yes = sum(1 for e in checklist if e['applicable'] == 'Yes') n_partial = sum(1 for e in checklist if 'Partially' in e['applicable']) n_no = sum(1 for e in checklist if e['applicable'].startswith('No')) print(f"\n{'=' * 70}") print("SUMMARY") print(f"{'=' * 70}") print(f"Items fully addressed: {n_yes}") print(f"Items partially applicable: {n_partial}") print(f"Items not applicable (ML-specific): {n_no}") print(f"Total items: {len(checklist)}") print(""" Note on AI-specific items: AI-specific items (AI-1 through AI-3) are not applicable because this capstone uses logistic regression, not machine learning. If a gradient boosted model or neural network were used, these items would require: - AI-1: Feature engineering pipeline, normalization, encoding - AI-2: Hyperparameter search strategy (grid, random, Bayesian) - AI-3: SHAP values, partial dependence plots, or LIME explanations The TRIPOD+AI checklist is available from the EQUATOR Network: https://www.equator-network.org/reporting-guidelines/tripod-ai/ Key points for completing TRIPOD+AI: 1. NEVER report AUROC/C-statistic without calibration 2. Always include a calibration plot 3. Fairness assessment across demographic subgroups is now required 4. Code and data availability statements are mandatory 5. For AI/ML models, explainability methods must be described """) ``` :::