12 Decision Trees, Random Forests, and Gradient Boosting

12.1 Decision Trees (CART)

A decision tree is perhaps the most intuitive machine learning model. It makes predictions by asking a series of yes/no questions about the input features, splitting the data at each step into increasingly homogeneous groups. The result looks like an inverted tree — hence the name.

12.1.1 How Trees Split

The algorithm works top-down, greedily selecting the best split at each node:

Consider every feature and every possible split point.
For each candidate split, measure how much it improves the “purity” of the resulting child nodes.
Choose the split that maximizes purity improvement.
Repeat recursively for each child node until a stopping criterion is met (e.g., minimum node size, maximum depth, or no further improvement).

12.1.2 Splitting Criteria

For classification trees, the two most common measures of impurity are:

Gini impurity: For a node with $K$ classes and class proportions $p_1, \ldots, p_K$:

\[G = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2\]

Gini impurity is 0 when a node is pure (all one class) and maximized when classes are equally mixed. For a binary problem, the maximum Gini is 0.5 (at $p = 0.5$).

Entropy (information gain):

\[H = -\sum_{k=1}^{K} p_k \log_2(p_k)\]

Entropy is 0 when the node is pure and maximized at $\log_2(K)$ when classes are equally distributed.

In practice, Gini and entropy produce very similar trees. Gini is the default in most implementations because it is slightly faster to compute (no logarithm).

For regression trees, the splitting criterion is typically the reduction in variance (or equivalently, the reduction in residual sum of squares).

12.1.3 Clinical Example: Hospital Readmission

Let us build a decision tree to predict 30-day hospital readmission — a common quality metric tied to reimbursement penalties in the United States.

Code

# Simulate hospital readmission data
set.seed(42)
n <- 1000

readmit_data <- tibble(
  age = rnorm(n, 68, 12),
  length_of_stay = rpois(n, 5) + 1,
  num_comorbidities = rpois(n, 3),
  prior_admissions = rpois(n, 1),
  discharge_hemoglobin = rnorm(n, 11, 2),
  discharge_creatinine = rlnorm(n, 0.2, 0.5),
  has_diabetes = rbinom(n, 1, 0.35),
  has_chf = rbinom(n, 1, 0.25),
  readmitted = factor(
    rbinom(n, 1, plogis(-3 + 0.02 * (rnorm(n, 68, 12) - 68) +
                          0.15 * rpois(n, 1) +
                          0.1 * rpois(n, 3) +
                          0.3 * rbinom(n, 1, 0.25) -
                          0.1 * rnorm(n, 11, 2))),
    labels = c("No", "Yes")
  )
)

cat("Readmission rate:", mean(readmit_data$readmitted == "Yes"), "\n")
cat("N =", nrow(readmit_data), "\n")

Code

# Fit a decision tree
tree_model <- rpart(
  readmitted ~ age + length_of_stay + num_comorbidities +
    prior_admissions + discharge_hemoglobin +
    discharge_creatinine + has_diabetes + has_chf,
  data = readmit_data,
  method = "class",
  control = rpart.control(cp = 0.01, maxdepth = 5, minsplit = 20)
)

# Visualize
rpart.plot(tree_model, type = 4, extra = 106, under = TRUE,
           box.palette = "RdYlGn", roundint = FALSE,
           main = "Decision Tree: 30-Day Readmission")

Code

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import cross_val_score, StratifiedKFold
import matplotlib.pyplot as plt

np.random.seed(42)
n = 1000

df = pd.DataFrame({
    'age': np.random.normal(68, 12, n),
    'length_of_stay': np.random.poisson(5, n) + 1,
    'num_comorbidities': np.random.poisson(3, n),
    'prior_admissions': np.random.poisson(1, n),
    'discharge_hemoglobin': np.random.normal(11, 2, n),
    'discharge_creatinine': np.random.lognormal(0.2, 0.5, n),
    'has_diabetes': np.random.binomial(1, 0.35, n),
    'has_chf': np.random.binomial(1, 0.25, n)
})
y = np.random.binomial(1, 0.18, n)

feature_names = list(df.columns)

# Fit decision tree
tree = DecisionTreeClassifier(max_depth=5, min_samples_split=20,
                               min_samples_leaf=10, random_state=42)
tree.fit(df, y)

fig, ax = plt.subplots(figsize=(16, 8))
plot_tree(tree, feature_names=feature_names, class_names=['No', 'Yes'],
          filled=True, rounded=True, ax=ax, fontsize=8, max_depth=3)
ax.set_title("Decision Tree: 30-Day Readmission")
plt.tight_layout()
plt.show()

12.1.4 Reading a Decision Tree

Each node in the tree displays:

The splitting rule (e.g., “age >= 72”)
The predicted class (the majority class in that node)
The class probabilities
The percentage of observations that reach this node

To classify a new patient, start at the root and follow the branches based on the patient’s feature values until you reach a terminal node (leaf). The prediction is the majority class of that leaf.

12.1.5 Pruning: Controlling Tree Complexity

An unpruned tree will grow until every leaf is pure (or nearly so), perfectly memorizing the training data — a textbook case of overfitting. Pruning cuts back the tree to reduce complexity.

Two strategies:

Pre-pruning (early stopping): Set constraints before growing the tree — maximum depth, minimum samples per leaf, minimum information gain to split. Simple but can stop too early.
Post-pruning (cost-complexity pruning): Grow a full tree, then prune branches that provide minimal improvement. The complexity parameter (cp) in R’s rpart controls this: higher cp = more pruning. Select cp via cross-validation.

Code

# View the CP table
printcp(tree_model)

# Plot CV error vs cp
plotcp(tree_model)

# Prune to optimal cp
optimal_cp <- tree_model$cptable[which.min(tree_model$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(tree_model, cp = optimal_cp)

cat("Optimal cp:", optimal_cp, "\n")
cat("Number of terminal nodes (full):", sum(tree_model$frame$var == "<leaf>"), "\n")
cat("Number of terminal nodes (pruned):", sum(pruned_tree$frame$var == "<leaf>"), "\n")

Code

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Compare trees of different depths
depths = [2, 3, 5, 7, 10, 15, None]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = []
for d in depths:
    t = DecisionTreeClassifier(max_depth=d, random_state=42)
    scores = cross_val_score(t, df, y, cv=cv, scoring='roc_auc')
    depth_label = str(d) if d is not None else "None (full)"
    results.append({'max_depth': depth_label,
                    'mean_auc': scores.mean(),
                    'std_auc': scores.std()})
    print(f"max_depth={depth_label:>6s}: AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")

best = max(results, key=lambda x: x['mean_auc'])
print(f"\nBest: max_depth={best['max_depth']} with AUC={best['mean_auc']:.3f}")

12.2 Why Single Trees Are Unstable

Decision trees have a critical weakness: high variance. Small changes in the training data can produce a completely different tree. This instability means that:

Predictions are unreliable for individual trees.
Results may not replicate across datasets.
Performance on new data is often poor.

The solution is to combine many trees into an ensemble. Two main strategies exist: bagging (reduce variance by averaging) and boosting (reduce bias by learning sequentially).

12.3 Bagging: Bootstrap Aggregating

Bagging (Bootstrap AGGregatING), introduced by Leo Breiman in 1996, reduces variance by:

Drawing $B$ bootstrap samples (random samples with replacement) from the training data.
Fitting a separate decision tree to each bootstrap sample.
Averaging predictions (regression) or taking a majority vote (classification) across all $B$ trees.

By averaging many high-variance, low-bias trees, bagging dramatically reduces the overall variance without increasing bias. The key insight: the variance of an average of $B$ identically distributed random variables decreases with $B$ (provided they are not perfectly correlated).

12.4 Random Forests

Random forests extend bagging with one additional trick: at each split, instead of considering all features, the algorithm randomly selects a subset of features (typically $\sqrt{p}$ for classification or $p/3$ for regression, where $p$ is the total number of features).

This feature randomization decorrelates the trees. In bagging without feature randomization, if one feature is very strong, most trees will split on it first, making the trees correlated. Random forests force trees to consider different features, producing more diverse trees whose average is more stable.

12.4.1 Key Hyperparameters

Parameter	What It Controls	Typical Default
`num.trees` / `n_estimators`	Number of trees in the forest	500–2000
`mtry` / `max_features`	Features considered at each split	$\sqrt{p}$ (classification), $p/3$ (regression)
`min.node.size` / `min_samples_leaf`	Minimum observations in a leaf	1 (classification), 5 (regression)
`max.depth`	Maximum tree depth	Unlimited (let trees grow fully)

12.4.2 Out-of-Bag (OOB) Error

Each bootstrap sample includes about 63% of the original observations (due to sampling with replacement). The remaining 37% — the out-of-bag observations — were not used to build that particular tree. Random forests use these OOB observations as a built-in validation set: each observation is predicted by the trees that did not include it in their bootstrap sample.

The OOB error is a nearly unbiased estimate of the test error — no separate validation set required. This is one of the most elegant features of random forests.

12.4.3 Fitting a Random Forest

Code

library(ranger)

# Fit random forest
set.seed(42)
rf_model <- ranger(
  readmitted ~ age + length_of_stay + num_comorbidities +
    prior_admissions + discharge_hemoglobin +
    discharge_creatinine + has_diabetes + has_chf,
  data = readmit_data,
  num.trees = 500,
  mtry = 3,
  min.node.size = 10,
  importance = "impurity",
  probability = TRUE,   # predict probabilities
  seed = 42
)

cat("OOB prediction error:", rf_model$prediction.error, "\n")

Code

# Variable importance
importance_df <- tibble(
  Variable = names(rf_model$variable.importance),
  Importance = rf_model$variable.importance
) %>%
  arrange(desc(Importance))

ggplot(importance_df, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_col(fill = "#2E86AB") +
  coord_flip() +
  labs(x = NULL, y = "Importance (Gini impurity reduction)",
       title = "Random Forest Variable Importance") +
  theme_minimal(base_size = 14)

Code

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)

rf = RandomForestClassifier(
    n_estimators=500,
    max_features='sqrt',
    min_samples_leaf=10,
    oob_score=True,
    random_state=42,
    n_jobs=-1
)
rf.fit(df, y)

print(f"OOB accuracy: {rf.oob_score_:.3f}")

# Cross-validated AUC
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(rf, df, y, cv=cv, scoring='roc_auc')
print(f"CV AUC: {auc_scores.mean():.3f} (+/- {auc_scores.std():.3f})")

Code

importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=True)

fig, ax = plt.subplots(figsize=(8, 5))
importances.plot(kind='barh', ax=ax, color='#2E86AB')
ax.set_xlabel("Importance (Gini impurity reduction)")
ax.set_title("Random Forest Variable Importance")
plt.tight_layout()
plt.show()

12.4.4 Variable Importance: Two Measures

Impurity-based importance (default): Total reduction in Gini impurity (or variance) from splits on that feature, summed across all trees. Fast but can be biased toward high-cardinality features.
Permutation importance: For each feature, randomly shuffle its values and measure how much the OOB (or test) error increases. More reliable but slower. This is the recommended approach for inference.

Variable Importance Is Not Causal

A feature with high importance in a random forest is predictive of the outcome, not necessarily causally related to it. A correlated proxy (e.g., number of medications as a proxy for disease severity) may appear important even though intervening on it would have no effect. Do not confuse prediction with causation.

12.5 Gradient Boosting

While bagging reduces variance by averaging independent trees, gradient boosting reduces bias by building trees sequentially, where each new tree corrects the mistakes of the previous ensemble.

12.5.1 The Intuition: Learning from Mistakes

Fit a simple tree to the data.
Compute the residuals (errors) from this tree.
Fit a new tree to predict these residuals.
Add this new tree to the ensemble (scaled by a learning rate).
Repeat steps 2–4 for $B$ iterations.

Each new tree focuses on the observations the current ensemble gets wrong. Over many iterations, the ensemble becomes increasingly accurate.

12.5.2 The Mathematics (Brief)

Gradient boosting minimizes a loss function $L(y, \hat{y})$ by gradient descent in function space. At iteration $m$:

\[\hat{f}_m(x) = \hat{f}_{m-1}(x) + \eta \cdot h_m(x)\]

where $h_m(x)$ is a tree fitted to the negative gradient of the loss (the “pseudo-residuals”) and $\eta$ is the learning rate (also called the shrinkage parameter).

12.5.3 Key Hyperparameters

Parameter	What It Controls	Typical Range
`n_estimators` / `nrounds`	Number of boosting iterations	100–5000
`learning_rate` / `eta`	Step size for each tree	0.01–0.3
`max_depth`	Depth of each tree	3–8 (shallow trees!)
`subsample`	Fraction of data used per tree	0.5–1.0
`colsample_bytree`	Fraction of features per tree	0.5–1.0
`min_child_weight`	Minimum sum of instance weights in a leaf	1–10
`reg_alpha` (L1) / `reg_lambda` (L2)	Regularization	0–10

The Learning Rate / Number of Trees Tradeoff

A smaller learning rate requires more trees but generally produces better results. A common strategy: set the learning rate to 0.01–0.05, then tune the number of trees via early stopping (stop when validation error stops improving).

12.5.4 XGBoost and LightGBM

XGBoost (eXtreme Gradient Boosting) by Chen and Guestrin (2016) added regularization, efficient computation, and handling of missing values to gradient boosting, making it the dominant algorithm for tabular data competitions and many clinical prediction tasks.

LightGBM by Microsoft further improves efficiency with gradient-based one-side sampling (GOSS) and exclusive feature bundling. It is often faster than XGBoost on large datasets.

Code

library(xgboost)

# Prepare data for xgboost (requires numeric matrix)
X_train <- readmit_data %>%
  select(age, length_of_stay, num_comorbidities, prior_admissions,
         discharge_hemoglobin, discharge_creatinine, has_diabetes, has_chf) %>%
  as.matrix()

y_train <- as.numeric(readmit_data$readmitted == "Yes")

dtrain <- xgb.DMatrix(data = X_train, label = y_train)

# Fit XGBoost with cross-validation to find optimal nrounds
set.seed(42)
xgb_cv <- xgb.cv(
  params = list(
    objective = "binary:logistic",
    eval_metric = "auc",
    eta = 0.05,
    max_depth = 4,
    subsample = 0.8,
    colsample_bytree = 0.8
  ),
  data = dtrain,
  nrounds = 1000,
  nfold = 5,
  early_stopping_rounds = 50,
  verbose = 0
)

cat("Best iteration:", xgb_cv$best_iteration, "\n")
cat("Best CV AUC:", max(xgb_cv$evaluation_log$test_auc_mean), "\n")

Code

# Fit final model with optimal nrounds
xgb_model <- xgb.train(
  params = list(
    objective = "binary:logistic",
    eval_metric = "auc",
    eta = 0.05,
    max_depth = 4,
    subsample = 0.8,
    colsample_bytree = 0.8
  ),
  data = dtrain,
  nrounds = xgb_cv$best_iteration,
  verbose = 0
)

# Variable importance
importance_matrix <- xgb.importance(model = xgb_model)
xgb.plot.importance(importance_matrix, top_n = 8,
                     main = "XGBoost Variable Importance")

Code

import xgboost as xgb
from sklearn.model_selection import cross_val_score, StratifiedKFold
import numpy as np

np.random.seed(42)

xgb_model = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='auc',
    early_stopping_rounds=50,
    random_state=42,
    use_label_encoder=False
)

# Cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    xgb.XGBClassifier(
        n_estimators=500, learning_rate=0.05, max_depth=4,
        subsample=0.8, colsample_bytree=0.8,
        random_state=42, use_label_encoder=False, eval_metric='logloss'
    ),
    df, y, cv=cv, scoring='roc_auc'
)
print(f"XGBoost CV AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")

Code

# Fit and plot importance
xgb_final = xgb.XGBClassifier(
    n_estimators=500, learning_rate=0.05, max_depth=4,
    subsample=0.8, colsample_bytree=0.8,
    random_state=42, use_label_encoder=False, eval_metric='logloss'
)
xgb_final.fit(df, y)

fig, ax = plt.subplots(figsize=(8, 5))
xgb.plot_importance(xgb_final, ax=ax, importance_type='gain',
                     title='XGBoost Feature Importance (Gain)')
plt.tight_layout()
plt.show()

12.6 Hyperparameter Tuning

All ensemble methods have hyperparameters that must be tuned. Two common strategies:

12.6.1 Grid Search

Exhaustively try every combination of specified parameter values. Reliable but computationally expensive — the number of combinations grows multiplicatively.

12.6.2 Random Search

Randomly sample hyperparameter combinations. Bergstra and Bengio (2012) showed that random search is more efficient than grid search because not all hyperparameters are equally important. Random search explores more values of the important parameters for the same computational budget.

Code

library(tidymodels)

# Define the model with tunable parameters
rf_spec <- rand_forest(
  trees = 500,
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Create recipe
rf_recipe <- recipe(readmitted ~ ., data = readmit_data)

# Workflow
rf_wf <- workflow() %>%
  add_model(rf_spec) %>%
  add_recipe(rf_recipe)

# Define tuning grid
rf_grid <- grid_random(
  mtry(range = c(2, 7)),
  min_n(range = c(5, 30)),
  size = 20
)

# Cross-validation
set.seed(42)
folds <- vfold_cv(readmit_data, v = 5, strata = readmitted)

# Tune (this may take a moment)
rf_tune_results <- tune_grid(
  rf_wf,
  resamples = folds,
  grid = rf_grid,
  metrics = metric_set(roc_auc)
)

# Best parameters
show_best(rf_tune_results, metric = "roc_auc", n = 5)

Code

# Select best and finalize
best_params <- select_best(rf_tune_results, metric = "roc_auc")
cat("Best mtry:", best_params$mtry, "\n")
cat("Best min_n:", best_params$min_n, "\n")

final_wf <- finalize_workflow(rf_wf, best_params)
final_fit <- fit(final_wf, data = readmit_data)

Code

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
import numpy as np

np.random.seed(42)

# Define parameter distributions
param_dist = {
    'n_estimators': [100, 200, 500],
    'max_features': ['sqrt', 'log2', 3, 5],
    'min_samples_leaf': [5, 10, 15, 20, 30],
    'max_depth': [5, 10, 15, 20, None]
}

rf = RandomForestClassifier(random_state=42, n_jobs=-1)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Random search
search = RandomizedSearchCV(
    rf, param_dist, n_iter=20, cv=cv, scoring='roc_auc',
    random_state=42, n_jobs=-1, verbose=0
)
search.fit(df, y)

print(f"Best AUC: {search.best_score_:.3f}")
print(f"Best parameters: {search.best_params_}")

12.7 Comparing Methods: When to Use What

Method	Strengths	Weaknesses	Best For
Single tree	Interpretable, handles non-linearity	Unstable, overfits	Exploration, simple rules
Random forest	Robust, low tuning needed, OOB error	Less interpretable than single tree	General-purpose prediction
Gradient boosting	Often highest accuracy, flexible	More tuning, can overfit, slower	Maximizing prediction accuracy
Logistic regression	Interpretable, well-understood inference	Assumes linearity (without transforms)	Inference, small samples, reporting ORs

Practical Advice for Clinical Prediction Models

Start with logistic regression as your baseline. If you need better prediction and have enough data (typically hundreds of events), try a random forest. If you want to squeeze out the last bit of performance and are willing to tune carefully, use XGBoost. Always compare against the logistic regression baseline — if the improvement is marginal, prefer the simpler model.

12.8 Clinical Example: Readmission Risk Stratification

Let us bring everything together by comparing all three methods on the hospital readmission dataset.

Code

library(tidymodels)

set.seed(42)
folds <- vfold_cv(readmit_data, v = 10, strata = readmitted)

# Logistic regression
lr_spec <- logistic_reg() %>% set_engine("glm")
lr_wf <- workflow() %>%
  add_model(lr_spec) %>%
  add_recipe(recipe(readmitted ~ ., data = readmit_data))
lr_res <- fit_resamples(lr_wf, resamples = folds,
                        metrics = metric_set(roc_auc))

# Random forest
rf_spec <- rand_forest(trees = 500, mtry = 3, min_n = 10) %>%
  set_engine("ranger") %>%
  set_mode("classification")
rf_wf <- workflow() %>%
  add_model(rf_spec) %>%
  add_recipe(recipe(readmitted ~ ., data = readmit_data))
rf_res <- fit_resamples(rf_wf, resamples = folds,
                        metrics = metric_set(roc_auc))

# Boosted trees (via xgboost engine in tidymodels)
xgb_spec <- boost_tree(trees = 500, tree_depth = 4, learn_rate = 0.05,
                        min_n = 10) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
xgb_wf <- workflow() %>%
  add_model(xgb_spec) %>%
  add_recipe(recipe(readmitted ~ ., data = readmit_data))
xgb_res <- fit_resamples(xgb_wf, resamples = folds,
                         metrics = metric_set(roc_auc))

# Collect and compare results
bind_rows(
  collect_metrics(lr_res) %>% mutate(model = "Logistic Regression"),
  collect_metrics(rf_res) %>% mutate(model = "Random Forest"),
  collect_metrics(xgb_res) %>% mutate(model = "XGBoost")
) %>%
  select(model, .metric, mean, std_err) %>%
  arrange(desc(mean))

Code

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import numpy as np
import pandas as pd

np.random.seed(42)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

models = {
    'Logistic Regression': make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    'Random Forest': RandomForestClassifier(
        n_estimators=500, max_features='sqrt', min_samples_leaf=10,
        random_state=42, n_jobs=-1
    ),
    'Gradient Boosting': GradientBoostingClassifier(
        n_estimators=500, learning_rate=0.05, max_depth=4,
        subsample=0.8, random_state=42
    )
}

results = []
for name, model in models.items():
    scores = cross_val_score(model, df, y, cv=cv, scoring='roc_auc')
    results.append({
        'Model': name,
        'Mean AUC': f"{scores.mean():.3f}",
        'Std': f"{scores.std():.3f}"
    })
    print(f"{name:25s}: AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")

print(pd.DataFrame(results).to_string(index=False))

12.9 Exercises

12.9.1 Exercise 1: Build and Prune a Classification Tree

Using the readmission dataset (or a dataset of your choice):

Fit a full, unpruned classification tree. How many terminal nodes does it have?
Use cross-validation to find the optimal complexity parameter (cp in R) or maximum depth (in Python).
Prune the tree and plot it. How does it compare to the full tree?
What are the top 3 splitting variables? Do they make clinical sense?

Code

# Fit a full tree (set cp very low to avoid early pruning)
full_tree <- rpart(readmitted ~ ., data = readmit_data, method = "class",
                   control = rpart.control(cp = 0.001))

# Your code:
# 1. Count terminal nodes
# 2. Use printcp() and plotcp() to find optimal cp
# 3. Prune and plot
# 4. Examine variable importance: full_tree$variable.importance

Code

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Your code:
# 1. Fit trees with different max_depth values
# 2. Cross-validate each
# 3. Plot the optimal tree
# 4. Examine feature_importances_

12.9.2 Exercise 2: Random Forest vs XGBoost Tuning Challenge

Split the readmission data into 80% training and 20% test sets.
Using the training set with 5-fold cross-validation:
1. Tune a random forest (try different mtry/max_features and min_n/min_samples_leaf values).
2. Tune an XGBoost model (try different learning_rate, max_depth, and n_estimators values).
Select the best hyperparameters for each model.
Evaluate both models on the held-out test set. Which performs better?
Create variable importance plots for both models. Do they agree on the most important features?

Code

# Split data
set.seed(42)
split <- initial_split(readmit_data, prop = 0.8, strata = readmitted)
train <- training(split)
test <- testing(split)

# Your code: define models with tune(), create grids, run tune_grid()
# Finalize models, predict on test set, compare

Code

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Your code: define param_dist for RF and XGBoost
# Run RandomizedSearchCV for each
# Evaluate on test set with roc_auc_score

12.9.3 Exercise 3: Interpreting a Clinical Prediction Model

You have built a gradient-boosted model for 30-day hospital readmission. Your hospital administration asks you to present the results.

Write a non-technical summary (3–4 sentences) explaining what the model does and how well it performs, suitable for a hospital quality committee.
The model identifies “number of prior admissions” and “discharge creatinine” as the two most important predictors. Explain to a clinical audience what this means and does not mean (recall the caution about variable importance and causation).
How would you propose to validate this model before deploying it in clinical practice? What could go wrong if you skip external validation?

12.10 Summary

Method	How It Works	Key Advantage	Key Risk
Decision tree	Sequential yes/no splits	Highly interpretable	Overfits, unstable
Random forest	Average of many decorrelated trees	Robust, minimal tuning	Less interpretable
Gradient boosting	Sequential trees correcting errors	Often best accuracy	Can overfit, many hyperparameters

The progression from single trees to random forests to gradient boosting follows a clear logic:

Single trees are interpretable but unstable and prone to overfitting.
Random forests reduce variance by averaging many decorrelated trees, with the OOB error providing a built-in validation estimate.
Gradient boosting reduces both bias and variance through sequential learning, often achieving the best predictive accuracy at the cost of more tuning.

For clinical prediction models, always compare against a logistic regression baseline. Use cross-validation for honest evaluation, tune hyperparameters systematically, and remember that the most complex model is not always the best choice.

12.11 References and Further Reading

The foundational paper on random forests is Breiman (2001), “Random Forests,” published in Machine Learning. This paper introduced the algorithm, the concept of out-of-bag error estimation, and permutation-based variable importance. It remains one of the most cited papers in machine learning.

For XGBoost, the key reference is Chen and Guestrin (2016), “XGBoost: A Scalable Tree Boosting System,” presented at KDD 2016. This paper describes the algorithmic innovations (regularized objective, approximate split finding, sparsity-aware computation) that made XGBoost the dominant method for tabular data.

Smits et al. (2026) provide an accessible treatment of tree-based methods in the context of clinical research, including practical guidance on when ensemble methods offer meaningful advantages over logistic regression and how to interpret variable importance in clinical terms.

For a thorough treatment of the CART algorithm, see Breiman, Friedman, Olshen, and Stone’s Classification and Regression Trees (1984, Wadsworth). Although the original text is somewhat dated, the core ideas remain current.

Hastie, Tibshirani, and Friedman’s The Elements of Statistical Learning (Springer, 2nd edition, 2009) covers bagging (Section 8.7), random forests (Chapter 15), and boosting (Chapter 10) with mathematical rigor. James et al.’s An Introduction to Statistical Learning (ISLR, 2nd edition, 2021) provides a gentler introduction in Chapters 8 and covers tree-based methods with R lab exercises.

Boehmke and Greenwell’s Hands-On Machine Learning with R dedicates several chapters to tree-based methods with practical tidymodels code, and is an excellent companion for the R exercises in this chapter.

For the gradient boosting framework, Friedman (2001), “Greedy Function Approximation: A Gradient Boosting Machine,” in Annals of Statistics, is the foundational statistical treatment. The paper introduced the general gradient boosting framework that XGBoost and LightGBM build upon.

Regarding clinical applications and best practices, the TRIPOD statement (Collins et al., 2015) provides guidelines for reporting clinical prediction models, including those based on machine learning. Any readmission model (or similar clinical prediction tool) developed with these methods should be reported following TRIPOD guidelines and externally validated before clinical deployment.

# Decision Trees, Random Forests, and Gradient Boosting {#sec-trees-ensembles} ```{r} #| label: setup-trees #| include: false library(tidyverse) library(tidymodels) library(rpart) library(rpart.plot) library(ranger) library(xgboost) library(vip) library(knitr) opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE) ``` ## Decision Trees (CART) A **decision tree** is perhaps the most intuitive machine learning model. It makes predictions by asking a series of yes/no questions about the input features, splitting the data at each step into increasingly homogeneous groups. The result looks like an inverted tree — hence the name. ### How Trees Split The algorithm works top-down, greedily selecting the best split at each node: 1. Consider every feature and every possible split point. 2. For each candidate split, measure how much it improves the "purity" of the resulting child nodes. 3. Choose the split that maximizes purity improvement. 4. Repeat recursively for each child node until a stopping criterion is met (e.g., minimum node size, maximum depth, or no further improvement). ### Splitting Criteria For **classification trees**, the two most common measures of impurity are: **Gini impurity**: For a node with $K$ classes and class proportions $p_1, \ldots, p_K$: $$G = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$ Gini impurity is 0 when a node is pure (all one class) and maximized when classes are equally mixed. For a binary problem, the maximum Gini is 0.5 (at $p = 0.5$). **Entropy (information gain)**: $$H = -\sum_{k=1}^{K} p_k \log_2(p_k)$$ Entropy is 0 when the node is pure and maximized at $\log_2(K)$ when classes are equally distributed. In practice, Gini and entropy produce very similar trees. Gini is the default in most implementations because it is slightly faster to compute (no logarithm). For **regression trees**, the splitting criterion is typically the **reduction in variance** (or equivalently, the reduction in residual sum of squares). ### Clinical Example: Hospital Readmission Let us build a decision tree to predict 30-day hospital readmission — a common quality metric tied to reimbursement penalties in the United States. ::: {.panel-tabset} #### R ```{r} #| label: tree-data-r # Simulate hospital readmission data set.seed(42) n <- 1000 readmit_data <- tibble( age = rnorm(n, 68, 12), length_of_stay = rpois(n, 5) + 1, num_comorbidities = rpois(n, 3), prior_admissions = rpois(n, 1), discharge_hemoglobin = rnorm(n, 11, 2), discharge_creatinine = rlnorm(n, 0.2, 0.5), has_diabetes = rbinom(n, 1, 0.35), has_chf = rbinom(n, 1, 0.25), readmitted = factor( rbinom(n, 1, plogis(-3 + 0.02 * (rnorm(n, 68, 12) - 68) + 0.15 * rpois(n, 1) + 0.1 * rpois(n, 3) + 0.3 * rbinom(n, 1, 0.25) - 0.1 * rnorm(n, 11, 2))), labels = c("No", "Yes") ) ) cat("Readmission rate:", mean(readmit_data$readmitted == "Yes"), "\n") cat("N =", nrow(readmit_data), "\n") ``` ```{r} #| label: tree-fit-r #| fig-cap: "A classification tree for 30-day readmission. Each node shows the predicted class, probability, and percentage of observations." # Fit a decision tree tree_model <- rpart( readmitted ~ age + length_of_stay + num_comorbidities + prior_admissions + discharge_hemoglobin + discharge_creatinine + has_diabetes + has_chf, data = readmit_data, method = "class", control = rpart.control(cp = 0.01, maxdepth = 5, minsplit = 20) ) # Visualize rpart.plot(tree_model, type = 4, extra = 106, under = TRUE, box.palette = "RdYlGn", roundint = FALSE, main = "Decision Tree: 30-Day Readmission") ``` #### Python ```{python} #| label: tree-fit-python import numpy as np import pandas as pd from sklearn.tree import DecisionTreeClassifier, plot_tree from sklearn.model_selection import cross_val_score, StratifiedKFold import matplotlib.pyplot as plt np.random.seed(42) n = 1000 df = pd.DataFrame({ 'age': np.random.normal(68, 12, n), 'length_of_stay': np.random.poisson(5, n) + 1, 'num_comorbidities': np.random.poisson(3, n), 'prior_admissions': np.random.poisson(1, n), 'discharge_hemoglobin': np.random.normal(11, 2, n), 'discharge_creatinine': np.random.lognormal(0.2, 0.5, n), 'has_diabetes': np.random.binomial(1, 0.35, n), 'has_chf': np.random.binomial(1, 0.25, n) }) y = np.random.binomial(1, 0.18, n) feature_names = list(df.columns) # Fit decision tree tree = DecisionTreeClassifier(max_depth=5, min_samples_split=20, min_samples_leaf=10, random_state=42) tree.fit(df, y) fig, ax = plt.subplots(figsize=(16, 8)) plot_tree(tree, feature_names=feature_names, class_names=['No', 'Yes'], filled=True, rounded=True, ax=ax, fontsize=8, max_depth=3) ax.set_title("Decision Tree: 30-Day Readmission") plt.tight_layout() plt.show() ``` ::: ### Reading a Decision Tree Each node in the tree displays: - The **splitting rule** (e.g., "age >= 72") - The **predicted class** (the majority class in that node) - The **class probabilities** - The **percentage of observations** that reach this node To classify a new patient, start at the root and follow the branches based on the patient's feature values until you reach a terminal node (leaf). The prediction is the majority class of that leaf. ### Pruning: Controlling Tree Complexity An unpruned tree will grow until every leaf is pure (or nearly so), perfectly memorizing the training data — a textbook case of overfitting. **Pruning** cuts back the tree to reduce complexity. Two strategies: 1. **Pre-pruning** (early stopping): Set constraints *before* growing the tree — maximum depth, minimum samples per leaf, minimum information gain to split. Simple but can stop too early. 2. **Post-pruning** (cost-complexity pruning): Grow a full tree, then prune branches that provide minimal improvement. The **complexity parameter (cp)** in R's `rpart` controls this: higher cp = more pruning. Select cp via cross-validation. ::: {.panel-tabset} #### R ```{r} #| label: prune-r #| fig-cap: "Cross-validation error as a function of the complexity parameter. The optimal cp minimizes the cross-validated error." # View the CP table printcp(tree_model) # Plot CV error vs cp plotcp(tree_model) # Prune to optimal cp optimal_cp <- tree_model$cptable[which.min(tree_model$cptable[, "xerror"]), "CP"] pruned_tree <- prune(tree_model, cp = optimal_cp) cat("Optimal cp:", optimal_cp, "\n") cat("Number of terminal nodes (full):", sum(tree_model$frame$var == "<leaf>"), "\n") cat("Number of terminal nodes (pruned):", sum(pruned_tree$frame$var == "<leaf>"), "\n") ``` #### Python ```{python} #| label: prune-python from sklearn.tree import DecisionTreeClassifier from sklearn.model_selection import cross_val_score import numpy as np # Compare trees of different depths depths = [2, 3, 5, 7, 10, 15, None] cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) results = [] for d in depths: t = DecisionTreeClassifier(max_depth=d, random_state=42) scores = cross_val_score(t, df, y, cv=cv, scoring='roc_auc') depth_label = str(d) if d is not None else "None (full)" results.append({'max_depth': depth_label, 'mean_auc': scores.mean(), 'std_auc': scores.std()}) print(f"max_depth={depth_label:>6s}: AUC = {scores.mean():.3f} (+/- {scores.std():.3f})") best = max(results, key=lambda x: x['mean_auc']) print(f"\nBest: max_depth={best['max_depth']} with AUC={best['mean_auc']:.3f}") ``` ::: ## Why Single Trees Are Unstable Decision trees have a critical weakness: **high variance**. Small changes in the training data can produce a completely different tree. This instability means that: - Predictions are unreliable for individual trees. - Results may not replicate across datasets. - Performance on new data is often poor. The solution is to combine many trees into an **ensemble**. Two main strategies exist: **bagging** (reduce variance by averaging) and **boosting** (reduce bias by learning sequentially). ## Bagging: Bootstrap Aggregating **Bagging** (Bootstrap AGGregatING), introduced by Leo Breiman in 1996, reduces variance by: 1. Drawing $B$ bootstrap samples (random samples with replacement) from the training data. 2. Fitting a separate decision tree to each bootstrap sample. 3. Averaging predictions (regression) or taking a majority vote (classification) across all $B$ trees. By averaging many high-variance, low-bias trees, bagging dramatically reduces the overall variance without increasing bias. The key insight: the variance of an average of $B$ identically distributed random variables decreases with $B$ (provided they are not perfectly correlated). ## Random Forests **Random forests** extend bagging with one additional trick: at each split, instead of considering all features, the algorithm randomly selects a **subset of features** (typically $\sqrt{p}$ for classification or $p/3$ for regression, where $p$ is the total number of features). This feature randomization **decorrelates the trees**. In bagging without feature randomization, if one feature is very strong, most trees will split on it first, making the trees correlated. Random forests force trees to consider different features, producing more diverse trees whose average is more stable. ### Key Hyperparameters | Parameter | What It Controls | Typical Default | |-----------|-----------------|----------------| | `num.trees` / `n_estimators` | Number of trees in the forest | 500–2000 | | `mtry` / `max_features` | Features considered at each split | $\sqrt{p}$ (classification), $p/3$ (regression) | | `min.node.size` / `min_samples_leaf` | Minimum observations in a leaf | 1 (classification), 5 (regression) | | `max.depth` | Maximum tree depth | Unlimited (let trees grow fully) | ### Out-of-Bag (OOB) Error Each bootstrap sample includes about 63% of the original observations (due to sampling with replacement). The remaining 37% — the **out-of-bag** observations — were not used to build that particular tree. Random forests use these OOB observations as a built-in validation set: each observation is predicted by the trees that did *not* include it in their bootstrap sample. The OOB error is a nearly unbiased estimate of the test error — no separate validation set required. This is one of the most elegant features of random forests. ### Fitting a Random Forest ::: {.panel-tabset} #### R ```{r} #| label: rf-r library(ranger) # Fit random forest set.seed(42) rf_model <- ranger( readmitted ~ age + length_of_stay + num_comorbidities + prior_admissions + discharge_hemoglobin + discharge_creatinine + has_diabetes + has_chf, data = readmit_data, num.trees = 500, mtry = 3, min.node.size = 10, importance = "impurity", probability = TRUE, # predict probabilities seed = 42 ) cat("OOB prediction error:", rf_model$prediction.error, "\n") ``` ```{r} #| label: rf-importance-r #| fig-cap: "Variable importance from the random forest model, measured by Gini impurity reduction." # Variable importance importance_df <- tibble( Variable = names(rf_model$variable.importance), Importance = rf_model$variable.importance ) %>% arrange(desc(Importance)) ggplot(importance_df, aes(x = reorder(Variable, Importance), y = Importance)) + geom_col(fill = "#2E86AB") + coord_flip() + labs(x = NULL, y = "Importance (Gini impurity reduction)", title = "Random Forest Variable Importance") + theme_minimal(base_size = 14) ``` #### Python ```{python} #| label: rf-python from sklearn.ensemble import RandomForestClassifier from sklearn.model_selection import cross_val_score, StratifiedKFold import numpy as np import pandas as pd import matplotlib.pyplot as plt np.random.seed(42) rf = RandomForestClassifier( n_estimators=500, max_features='sqrt', min_samples_leaf=10, oob_score=True, random_state=42, n_jobs=-1 ) rf.fit(df, y) print(f"OOB accuracy: {rf.oob_score_:.3f}") # Cross-validated AUC cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) auc_scores = cross_val_score(rf, df, y, cv=cv, scoring='roc_auc') print(f"CV AUC: {auc_scores.mean():.3f} (+/- {auc_scores.std():.3f})") ``` ```{python} #| label: rf-importance-python #| fig-cap: "Variable importance from the random forest, measured by mean decrease in Gini impurity." importances = pd.Series(rf.feature_importances_, index=feature_names) importances = importances.sort_values(ascending=True) fig, ax = plt.subplots(figsize=(8, 5)) importances.plot(kind='barh', ax=ax, color='#2E86AB') ax.set_xlabel("Importance (Gini impurity reduction)") ax.set_title("Random Forest Variable Importance") plt.tight_layout() plt.show() ``` ::: ### Variable Importance: Two Measures 1. **Impurity-based importance** (default): Total reduction in Gini impurity (or variance) from splits on that feature, summed across all trees. Fast but can be biased toward high-cardinality features. 2. **Permutation importance**: For each feature, randomly shuffle its values and measure how much the OOB (or test) error increases. More reliable but slower. This is the recommended approach for inference. ::: {.callout-warning} ## Variable Importance Is Not Causal A feature with high importance in a random forest is *predictive* of the outcome, not necessarily *causally* related to it. A correlated proxy (e.g., number of medications as a proxy for disease severity) may appear important even though intervening on it would have no effect. Do not confuse prediction with causation. ::: ## Gradient Boosting While bagging reduces variance by averaging independent trees, **gradient boosting** reduces bias by building trees *sequentially*, where each new tree corrects the mistakes of the previous ensemble. ### The Intuition: Learning from Mistakes 1. Fit a simple tree to the data. 2. Compute the **residuals** (errors) from this tree. 3. Fit a new tree to predict these residuals. 4. Add this new tree to the ensemble (scaled by a learning rate). 5. Repeat steps 2–4 for $B$ iterations. Each new tree focuses on the observations the current ensemble gets wrong. Over many iterations, the ensemble becomes increasingly accurate. ### The Mathematics (Brief) Gradient boosting minimizes a loss function $L(y, \hat{y})$ by gradient descent in function space. At iteration $m$: $$\hat{f}_m(x) = \hat{f}_{m-1}(x) + \eta \cdot h_m(x)$$ where $h_m(x)$ is a tree fitted to the negative gradient of the loss (the "pseudo-residuals") and $\eta$ is the **learning rate** (also called the shrinkage parameter). ### Key Hyperparameters | Parameter | What It Controls | Typical Range | |-----------|-----------------|---------------| | `n_estimators` / `nrounds` | Number of boosting iterations | 100–5000 | | `learning_rate` / `eta` | Step size for each tree | 0.01–0.3 | | `max_depth` | Depth of each tree | 3–8 (shallow trees!) | | `subsample` | Fraction of data used per tree | 0.5–1.0 | | `colsample_bytree` | Fraction of features per tree | 0.5–1.0 | | `min_child_weight` | Minimum sum of instance weights in a leaf | 1–10 | | `reg_alpha` (L1) / `reg_lambda` (L2) | Regularization | 0–10 | ::: {.callout-tip} ## The Learning Rate / Number of Trees Tradeoff A smaller learning rate requires more trees but generally produces better results. A common strategy: set the learning rate to 0.01–0.05, then tune the number of trees via early stopping (stop when validation error stops improving). ::: ### XGBoost and LightGBM **XGBoost** (eXtreme Gradient Boosting) by Chen and Guestrin (2016) added regularization, efficient computation, and handling of missing values to gradient boosting, making it the dominant algorithm for tabular data competitions and many clinical prediction tasks. **LightGBM** by Microsoft further improves efficiency with gradient-based one-side sampling (GOSS) and exclusive feature bundling. It is often faster than XGBoost on large datasets. ::: {.panel-tabset} #### R ```{r} #| label: xgb-r library(xgboost) # Prepare data for xgboost (requires numeric matrix) X_train <- readmit_data %>% select(age, length_of_stay, num_comorbidities, prior_admissions, discharge_hemoglobin, discharge_creatinine, has_diabetes, has_chf) %>% as.matrix() y_train <- as.numeric(readmit_data$readmitted == "Yes") dtrain <- xgb.DMatrix(data = X_train, label = y_train) # Fit XGBoost with cross-validation to find optimal nrounds set.seed(42) xgb_cv <- xgb.cv( params = list( objective = "binary:logistic", eval_metric = "auc", eta = 0.05, max_depth = 4, subsample = 0.8, colsample_bytree = 0.8 ), data = dtrain, nrounds = 1000, nfold = 5, early_stopping_rounds = 50, verbose = 0 ) cat("Best iteration:", xgb_cv$best_iteration, "\n") cat("Best CV AUC:", max(xgb_cv$evaluation_log$test_auc_mean), "\n") ``` ```{r} #| label: xgb-fit-r # Fit final model with optimal nrounds xgb_model <- xgb.train( params = list( objective = "binary:logistic", eval_metric = "auc", eta = 0.05, max_depth = 4, subsample = 0.8, colsample_bytree = 0.8 ), data = dtrain, nrounds = xgb_cv$best_iteration, verbose = 0 ) # Variable importance importance_matrix <- xgb.importance(model = xgb_model) xgb.plot.importance(importance_matrix, top_n = 8, main = "XGBoost Variable Importance") ``` #### Python ```{python} #| label: xgb-python import xgboost as xgb from sklearn.model_selection import cross_val_score, StratifiedKFold import numpy as np np.random.seed(42) xgb_model = xgb.XGBClassifier( n_estimators=1000, learning_rate=0.05, max_depth=4, subsample=0.8, colsample_bytree=0.8, eval_metric='auc', early_stopping_rounds=50, random_state=42, use_label_encoder=False ) # Cross-validation cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) scores = cross_val_score( xgb.XGBClassifier( n_estimators=500, learning_rate=0.05, max_depth=4, subsample=0.8, colsample_bytree=0.8, random_state=42, use_label_encoder=False, eval_metric='logloss' ), df, y, cv=cv, scoring='roc_auc' ) print(f"XGBoost CV AUC: {scores.mean():.3f} (+/- {scores.std():.3f})") ``` ```{python} #| label: xgb-importance-python #| fig-cap: "XGBoost variable importance based on the number of times each feature is used in splits (weight)." # Fit and plot importance xgb_final = xgb.XGBClassifier( n_estimators=500, learning_rate=0.05, max_depth=4, subsample=0.8, colsample_bytree=0.8, random_state=42, use_label_encoder=False, eval_metric='logloss' ) xgb_final.fit(df, y) fig, ax = plt.subplots(figsize=(8, 5)) xgb.plot_importance(xgb_final, ax=ax, importance_type='gain', title='XGBoost Feature Importance (Gain)') plt.tight_layout() plt.show() ``` ::: ## Hyperparameter Tuning All ensemble methods have hyperparameters that must be tuned. Two common strategies: ### Grid Search Exhaustively try every combination of specified parameter values. Reliable but computationally expensive — the number of combinations grows multiplicatively. ### Random Search Randomly sample hyperparameter combinations. Bergstra and Bengio (2012) showed that random search is more efficient than grid search because not all hyperparameters are equally important. Random search explores more values of the important parameters for the same computational budget. ::: {.panel-tabset} #### R ```{r} #| label: tune-r library(tidymodels) # Define the model with tunable parameters rf_spec <- rand_forest( trees = 500, mtry = tune(), min_n = tune() ) %>% set_engine("ranger") %>% set_mode("classification") # Create recipe rf_recipe <- recipe(readmitted ~ ., data = readmit_data) # Workflow rf_wf <- workflow() %>% add_model(rf_spec) %>% add_recipe(rf_recipe) # Define tuning grid rf_grid <- grid_random( mtry(range = c(2, 7)), min_n(range = c(5, 30)), size = 20 ) # Cross-validation set.seed(42) folds <- vfold_cv(readmit_data, v = 5, strata = readmitted) # Tune (this may take a moment) rf_tune_results <- tune_grid( rf_wf, resamples = folds, grid = rf_grid, metrics = metric_set(roc_auc) ) # Best parameters show_best(rf_tune_results, metric = "roc_auc", n = 5) ``` ```{r} #| label: tune-finalize-r # Select best and finalize best_params <- select_best(rf_tune_results, metric = "roc_auc") cat("Best mtry:", best_params$mtry, "\n") cat("Best min_n:", best_params$min_n, "\n") final_wf <- finalize_workflow(rf_wf, best_params) final_fit <- fit(final_wf, data = readmit_data) ``` #### Python ```{python} #| label: tune-python from sklearn.ensemble import RandomForestClassifier from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold import numpy as np np.random.seed(42) # Define parameter distributions param_dist = { 'n_estimators': [100, 200, 500], 'max_features': ['sqrt', 'log2', 3, 5], 'min_samples_leaf': [5, 10, 15, 20, 30], 'max_depth': [5, 10, 15, 20, None] } rf = RandomForestClassifier(random_state=42, n_jobs=-1) cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) # Random search search = RandomizedSearchCV( rf, param_dist, n_iter=20, cv=cv, scoring='roc_auc', random_state=42, n_jobs=-1, verbose=0 ) search.fit(df, y) print(f"Best AUC: {search.best_score_:.3f}") print(f"Best parameters: {search.best_params_}") ``` ::: ## Comparing Methods: When to Use What | Method | Strengths | Weaknesses | Best For | |--------|-----------|------------|----------| | **Single tree** | Interpretable, handles non-linearity | Unstable, overfits | Exploration, simple rules | | **Random forest** | Robust, low tuning needed, OOB error | Less interpretable than single tree | General-purpose prediction | | **Gradient boosting** | Often highest accuracy, flexible | More tuning, can overfit, slower | Maximizing prediction accuracy | | **Logistic regression** | Interpretable, well-understood inference | Assumes linearity (without transforms) | Inference, small samples, reporting ORs | ::: {.callout-note} ## Practical Advice for Clinical Prediction Models Start with logistic regression as your baseline. If you need better prediction and have enough data (typically hundreds of events), try a random forest. If you want to squeeze out the last bit of performance and are willing to tune carefully, use XGBoost. Always compare against the logistic regression baseline — if the improvement is marginal, prefer the simpler model. ::: ## Clinical Example: Readmission Risk Stratification Let us bring everything together by comparing all three methods on the hospital readmission dataset. ::: {.panel-tabset} #### R ```{r} #| label: comparison-r library(tidymodels) set.seed(42) folds <- vfold_cv(readmit_data, v = 10, strata = readmitted) # Logistic regression lr_spec <- logistic_reg() %>% set_engine("glm") lr_wf <- workflow() %>% add_model(lr_spec) %>% add_recipe(recipe(readmitted ~ ., data = readmit_data)) lr_res <- fit_resamples(lr_wf, resamples = folds, metrics = metric_set(roc_auc)) # Random forest rf_spec <- rand_forest(trees = 500, mtry = 3, min_n = 10) %>% set_engine("ranger") %>% set_mode("classification") rf_wf <- workflow() %>% add_model(rf_spec) %>% add_recipe(recipe(readmitted ~ ., data = readmit_data)) rf_res <- fit_resamples(rf_wf, resamples = folds, metrics = metric_set(roc_auc)) # Boosted trees (via xgboost engine in tidymodels) xgb_spec <- boost_tree(trees = 500, tree_depth = 4, learn_rate = 0.05, min_n = 10) %>% set_engine("xgboost") %>% set_mode("classification") xgb_wf <- workflow() %>% add_model(xgb_spec) %>% add_recipe(recipe(readmitted ~ ., data = readmit_data)) xgb_res <- fit_resamples(xgb_wf, resamples = folds, metrics = metric_set(roc_auc)) # Collect and compare results bind_rows( collect_metrics(lr_res) %>% mutate(model = "Logistic Regression"), collect_metrics(rf_res) %>% mutate(model = "Random Forest"), collect_metrics(xgb_res) %>% mutate(model = "XGBoost") ) %>% select(model, .metric, mean, std_err) %>% arrange(desc(mean)) ``` #### Python ```{python} #| label: comparison-python from sklearn.linear_model import LogisticRegression from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier from sklearn.model_selection import cross_val_score, StratifiedKFold from sklearn.preprocessing import StandardScaler from sklearn.pipeline import make_pipeline import numpy as np import pandas as pd np.random.seed(42) cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42) models = { 'Logistic Regression': make_pipeline( StandardScaler(), LogisticRegression(max_iter=1000) ), 'Random Forest': RandomForestClassifier( n_estimators=500, max_features='sqrt', min_samples_leaf=10, random_state=42, n_jobs=-1 ), 'Gradient Boosting': GradientBoostingClassifier( n_estimators=500, learning_rate=0.05, max_depth=4, subsample=0.8, random_state=42 ) } results = [] for name, model in models.items(): scores = cross_val_score(model, df, y, cv=cv, scoring='roc_auc') results.append({ 'Model': name, 'Mean AUC': f"{scores.mean():.3f}", 'Std': f"{scores.std():.3f}" }) print(f"{name:25s}: AUC = {scores.mean():.3f} (+/- {scores.std():.3f})") print(pd.DataFrame(results).to_string(index=False)) ``` ::: ## Exercises ### Exercise 1: Build and Prune a Classification Tree Using the readmission dataset (or a dataset of your choice): 1. Fit a full, unpruned classification tree. How many terminal nodes does it have? 2. Use cross-validation to find the optimal complexity parameter (cp in R) or maximum depth (in Python). 3. Prune the tree and plot it. How does it compare to the full tree? 4. What are the top 3 splitting variables? Do they make clinical sense? ::: {.panel-tabset} #### R Starter Code ```{r} #| label: ex1-tree-r #| eval: false # Fit a full tree (set cp very low to avoid early pruning) full_tree <- rpart(readmitted ~ ., data = readmit_data, method = "class", control = rpart.control(cp = 0.001)) # Your code: # 1. Count terminal nodes # 2. Use printcp() and plotcp() to find optimal cp # 3. Prune and plot # 4. Examine variable importance: full_tree$variable.importance ``` #### Python Starter Code ```{python} #| label: ex1-tree-python #| eval: false from sklearn.tree import DecisionTreeClassifier from sklearn.model_selection import cross_val_score # Your code: # 1. Fit trees with different max_depth values # 2. Cross-validate each # 3. Plot the optimal tree # 4. Examine feature_importances_ ``` ::: ### Exercise 2: Random Forest vs XGBoost Tuning Challenge 1. Split the readmission data into 80% training and 20% test sets. 2. Using the training set with 5-fold cross-validation: a. Tune a random forest (try different `mtry`/`max_features` and `min_n`/`min_samples_leaf` values). b. Tune an XGBoost model (try different `learning_rate`, `max_depth`, and `n_estimators` values). 3. Select the best hyperparameters for each model. 4. Evaluate both models on the held-out test set. Which performs better? 5. Create variable importance plots for both models. Do they agree on the most important features? ::: {.panel-tabset} #### R Starter Code ```{r} #| label: ex2-tune-r #| eval: false # Split data set.seed(42) split <- initial_split(readmit_data, prop = 0.8, strata = readmitted) train <- training(split) test <- testing(split) # Your code: define models with tune(), create grids, run tune_grid() # Finalize models, predict on test set, compare ``` #### Python Starter Code ```{python} #| label: ex2-tune-python #| eval: false from sklearn.model_selection import train_test_split, RandomizedSearchCV from sklearn.ensemble import RandomForestClassifier import xgboost as xgb X_train, X_test, y_train, y_test = train_test_split( df, y, test_size=0.2, stratify=y, random_state=42 ) # Your code: define param_dist for RF and XGBoost # Run RandomizedSearchCV for each # Evaluate on test set with roc_auc_score ``` ::: ### Exercise 3: Interpreting a Clinical Prediction Model You have built a gradient-boosted model for 30-day hospital readmission. Your hospital administration asks you to present the results. 1. Write a non-technical summary (3–4 sentences) explaining what the model does and how well it performs, suitable for a hospital quality committee. 2. The model identifies "number of prior admissions" and "discharge creatinine" as the two most important predictors. Explain to a clinical audience what this means and *does not* mean (recall the caution about variable importance and causation). 3. How would you propose to validate this model before deploying it in clinical practice? What could go wrong if you skip external validation? ## Summary | Method | How It Works | Key Advantage | Key Risk | |--------|-------------|---------------|----------| | Decision tree | Sequential yes/no splits | Highly interpretable | Overfits, unstable | | Random forest | Average of many decorrelated trees | Robust, minimal tuning | Less interpretable | | Gradient boosting | Sequential trees correcting errors | Often best accuracy | Can overfit, many hyperparameters | The progression from single trees to random forests to gradient boosting follows a clear logic: - **Single trees** are interpretable but unstable and prone to overfitting. - **Random forests** reduce variance by averaging many decorrelated trees, with the OOB error providing a built-in validation estimate. - **Gradient boosting** reduces both bias and variance through sequential learning, often achieving the best predictive accuracy at the cost of more tuning. For clinical prediction models, always compare against a logistic regression baseline. Use cross-validation for honest evaluation, tune hyperparameters systematically, and remember that the most complex model is not always the best choice. ## References and Further Reading The foundational paper on random forests is Breiman (2001), "Random Forests," published in *Machine Learning*. This paper introduced the algorithm, the concept of out-of-bag error estimation, and permutation-based variable importance. It remains one of the most cited papers in machine learning. For XGBoost, the key reference is Chen and Guestrin (2016), "XGBoost: A Scalable Tree Boosting System," presented at KDD 2016. This paper describes the algorithmic innovations (regularized objective, approximate split finding, sparsity-aware computation) that made XGBoost the dominant method for tabular data. Smits et al. (2026) provide an accessible treatment of tree-based methods in the context of clinical research, including practical guidance on when ensemble methods offer meaningful advantages over logistic regression and how to interpret variable importance in clinical terms. For a thorough treatment of the CART algorithm, see Breiman, Friedman, Olshen, and Stone's *Classification and Regression Trees* (1984, Wadsworth). Although the original text is somewhat dated, the core ideas remain current. Hastie, Tibshirani, and Friedman's *The Elements of Statistical Learning* (Springer, 2nd edition, 2009) covers bagging (Section 8.7), random forests (Chapter 15), and boosting (Chapter 10) with mathematical rigor. James et al.'s *An Introduction to Statistical Learning* (ISLR, 2nd edition, 2021) provides a gentler introduction in Chapters 8 and covers tree-based methods with R lab exercises. Boehmke and Greenwell's *Hands-On Machine Learning with R* dedicates several chapters to tree-based methods with practical tidymodels code, and is an excellent companion for the R exercises in this chapter. For the gradient boosting framework, Friedman (2001), "Greedy Function Approximation: A Gradient Boosting Machine," in *Annals of Statistics*, is the foundational statistical treatment. The paper introduced the general gradient boosting framework that XGBoost and LightGBM build upon. Regarding clinical applications and best practices, the TRIPOD statement (Collins et al., 2015) provides guidelines for reporting clinical prediction models, including those based on machine learning. Any readmission model (or similar clinical prediction tool) developed with these methods should be reported following TRIPOD guidelines and externally validated before clinical deployment.