6  Modelling Non-Linear Relationships: Splines, Fractional Polynomials, and Why Categorisation Fails

NoteLearning Objectives

By the end of this chapter, you will be able to:

  1. Explain why categorising continuous variables leads to information loss and biased estimates
  2. Recognise when a linear term is insufficient for modelling a continuous predictor
  3. Describe fractional polynomials and their role in flexible modelling
  4. Understand how splines work, including the concepts of knots, cubic splines, and restricted cubic splines
  5. Implement and interpret restricted cubic spline models in both R and Python
  6. Compare linear, categorical, and spline models using appropriate fit statistics
  7. Apply the practical recommendations from Lopez-Ayala et al. (2025) in your own analyses
NoteWhat is different about this chapter

This chapter introduces tools you probably haven’t seen before. The pace is faster than the previous chapters, and the concepts (splines, knot placement) may feel unfamiliar. Budget extra time here.

Why it matters for your research: If you build a prediction model or run a regression without properly handling continuous variables, your results will be less accurate — or wrong. Systematic reviews show that most published clinical prediction models categorise continuous variables inappropriately. This chapter teaches you the right way.

6.1 Introduction

Imagine you are studying the relationship between body mass index (BMI) and mortality. You have data on thousands of patients, and you need to build a regression model. What do you do with BMI?

If you are like most researchers in clinical medicine, you probably split BMI into categories: underweight (<18.5), normal (18.5–24.9), overweight (25–29.9), and obese (30+). You then include these categories as dummy variables in your regression model. This approach feels natural — it mirrors how we think clinically, and the results are easy to present in a table.

This is almost always the wrong approach.

Categorising continuous variables is one of the most common and most damaging analytic practices in clinical research. It throws away information, introduces bias, reduces statistical power, and can lead to completely wrong conclusions about the shape of relationships between predictors and outcomes.

This chapter draws heavily on a landmark tutorial by Lopez-Ayala et al. (2025), published in the BMJ in 2025, which provides a comprehensive guide to modelling continuous variables correctly. We will follow their framework closely, supplementing it with additional clinical examples and implementation guidance.

6.2 The Problem with Categorising Continuous Variables

6.2.1 What Happens When You Categorise

When you convert a continuous variable into categories, you make three implicit assumptions, all of which are almost certainly wrong:

  1. Within each category, the relationship is flat. A patient with a BMI of 18.6 is treated identically to a patient with a BMI of 24.8 — despite a difference of over 6 kg/m2.

  2. At the category boundary, there is a sudden jump. A patient with a BMI of 24.9 is treated as fundamentally different from a patient with a BMI of 25.0 — despite a difference of only 0.1 kg/m2.

  3. The cut-points are meaningful. The chosen boundaries (often based on clinical convention or percentiles) capture the true biology of the relationship.

None of these assumptions hold for most biological relationships. Biology is smooth, not stepped.

6.2.2 The Consequences Are Serious

Lopez-Ayala et al. (2025) enumerate several consequences of categorisation:

  • Loss of statistical power. Categorising a continuous variable into two groups is equivalent to discarding roughly one-third of your data (Royston et al. 2006). Even with more categories, substantial information is lost.

  • Biased effect estimates. Because the relationship within each category is forced to be flat, the estimated effects are systematically wrong. The direction and magnitude of the bias depend on where you place the cut-points.

  • Residual confounding. When a categorised variable is used as a confounder in an adjusted model, the adjustment is incomplete. The true confounding effect varies smoothly, but your model only adjusts in crude steps. This is sometimes called residual confounding due to categorisation.

  • Arbitrary cut-points change conclusions. Different choices of cut-points can lead to different — even contradictory — conclusions from the same data. This is a form of researcher degrees of freedom that inflates false positive rates.

  • Loss of predictive accuracy. Prediction models that categorise continuous predictors perform worse than those that model them appropriately.

WarningA Common But Flawed Justification

Researchers often justify categorisation by saying “it’s easier to interpret” or “clinicians think in categories.” While clinical decision-making does involve thresholds (e.g., treat if blood pressure > 140 mmHg), the modelling of relationships should reflect reality, not convenience. You can always convert a continuous model’s output into clinically actionable categories after modelling — but you cannot recover the information lost by categorising before modelling.

6.2.3 A Visual Demonstration

Consider the true relationship between a biomarker (say, C-reactive protein) and the log-odds of a diagnosis. Suppose the true relationship is U-shaped: both very low and very high values are associated with increased risk.

If you categorise CRP into tertiles (low, medium, high), you will estimate a monotonically increasing relationship — completely missing the U-shape. The low-risk group is contaminated by some high-risk individuals at the low end, and the comparison between groups obscures the smooth curve.

Code
library(ggplot2)

set.seed(42)
x <- seq(0, 10, length.out = 300)
# True U-shaped relationship
y_true <- 0.3 * (x - 5)^2 - 2

df <- data.frame(x = x, y_true = y_true)

# Categorise into tertiles
df$category <- cut(df$x, breaks = quantile(df$x, probs = c(0, 1/3, 2/3, 1)),
                   include.lowest = TRUE, labels = c("Low", "Medium", "High"))
cat_means <- aggregate(y_true ~ category, data = df, FUN = mean)

df$y_cat <- cat_means$y_true[match(df$category, cat_means$category)]

ggplot(df, aes(x = x)) +
  geom_line(aes(y = y_true), linewidth = 1.2, colour = "#2c3e50") +
  geom_step(aes(y = y_cat), linewidth = 1, colour = "#e74c3c", linetype = "dashed") +
  labs(x = "Biomarker Level", y = "Log-Odds of Outcome",
       subtitle = "Black curve = true relationship; Red steps = categorised estimate") +
  theme_minimal(base_size = 14) +
  theme(plot.subtitle = element_text(size = 11))
Figure 6.1: True non-linear relationship (solid curve) versus what categorisation into tertiles recovers (stepped line). Categorisation misses the U-shape entirely.

6.3 When Linear Terms Are Not Enough

6.3.1 The Linearity Assumption

In standard logistic or linear regression, including a continuous predictor \(x\) as a simple linear term assumes:

\[\text{logit}(p) = \beta_0 + \beta_1 x\]

This means that each one-unit increase in \(x\) is associated with the same change in the log-odds, regardless of where on the \(x\) scale you start. For BMI, this would mean that going from 20 to 21 has the same effect as going from 35 to 36.

For many clinical variables, this assumption is wrong:

  • BMI and mortality: U-shaped or J-shaped. Both very low and very high BMI are associated with increased mortality.
  • Alcohol consumption and cardiovascular risk: J-shaped. Moderate consumption may be associated with lower risk than either abstinence or heavy drinking (though this remains debated).
  • Age and many outcomes: Often non-linear, with steeper increases at older ages.
  • Blood pressure and stroke risk: The relationship steepens at higher pressures.
  • CSF glucose and bacterial meningitis: A threshold effect where low values are strongly predictive but the relationship flattens at higher values (we will explore this in detail below).

6.3.2 Detecting Non-Linearity

Before choosing a modelling strategy, it helps to assess whether non-linearity is present:

  1. Visual inspection. Plot the predictor against the outcome using a loess (locally estimated scatterplot smoothing) curve or a generalised additive model (GAM) smooth. If the relationship is not a straight line, you need a non-linear term.

  2. Likelihood ratio test. Fit a model with a linear term and a model with a non-linear term (e.g., a spline). Compare them using a likelihood ratio test. A significant result indicates that the non-linear model fits substantially better.

  3. AIC/BIC comparison. The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) are single-number summaries that balance how well a model fits against how many parameters it uses. Each starts from the model’s log-likelihood (a measure of fit) and then adds a penalty for every extra parameter, so a more complex model is only rewarded if it improves the fit by more than its added complexity “costs.” This guards against the temptation to keep adding flexibility just to chase the data. Lower values indicate a better fit-versus-complexity balance, so you compare the linear and non-linear models and prefer the one with the smaller AIC (or BIC). BIC penalises extra parameters more heavily than AIC, so it tends to favour simpler models, especially in large samples. The values are only meaningful relative to each other on the same dataset — an AIC of 480 is not “good” or “bad” in isolation, it is only lower or higher than a competing model’s.

  4. Residual plots. Plot residuals against the predictor. A systematic pattern (e.g., a curve) suggests non-linearity.

TipPractical Rule

As Lopez-Ayala et al. (2025) recommend: when in doubt, use a non-linear term. The cost of modelling a truly linear relationship with a spline is minimal (you lose very little power), but the cost of forcing a non-linear relationship into a linear term can be severe (biased estimates, wrong conclusions).

6.4 Fractional Polynomials

6.4.1 The Idea

Fractional polynomials (FPs), developed by Royston and Altman (1994), extend conventional polynomials by allowing fractional and negative powers. Instead of being restricted to \(x, x^2, x^3, \ldots\), fractional polynomials choose from the set of powers:

\[\{-2, -1, -0.5, 0, 0.5, 1, 2, 3\}\]

where \(x^0\) is defined as \(\log(x)\).

A first-degree fractional polynomial (FP1) has the form:

\[\beta_0 + \beta_1 x^{p_1}\]

A second-degree fractional polynomial (FP2) has the form:

\[\beta_0 + \beta_1 x^{p_1} + \beta_2 x^{p_2}\]

where \(p_1\) and \(p_2\) are selected from the allowed power set. If \(p_1 = p_2\), the second term becomes \(x^{p_1} \log(x)\) (the so-called “repeated powers” convention).

6.4.2 How FPs Work

The algorithm searches through all combinations of powers from the allowed set and selects the combination that best fits the data (by maximum likelihood or AIC). For an FP2, there are 36 possible combinations of powers, making the search computationally straightforward. Where does 36 come from? The allowed power set has 8 values. An FP2 chooses two powers, and the order does not matter, so there are \(\binom{8}{2} = 28\) pairs of distinct powers, plus 8 cases where the two powers are equal (the “repeated powers” form \(x^{p}\log x\) described above) — giving \(28 + 8 = 36\) candidate models in total. The algorithm simply fits all 36 and keeps the best.

The key advantage of fractional polynomials is that they can capture a wide range of curve shapes — monotone increasing, monotone decreasing, U-shaped, J-shaped, S-shaped — with only one or two parameters beyond the linear term. (A relationship is monotone if it moves in a single direction across the whole range of the predictor: monotone increasing means the outcome only ever rises as the predictor rises, monotone decreasing means it only ever falls — in neither case does the curve reverse direction, even if its steepness changes.)

6.4.3 Strengths and Limitations

Strengths:

  • Parsimonious: only 1–2 additional parameters
  • Can capture many common curve shapes
  • Well-established methodology with formal model selection procedures (the FP selection algorithm, also called MFP for multivariable fractional polynomials)
  • Particularly useful when sample size is modest

Limitations:

  • Behaviour at the extremes of the predictor range can be erratic (the curves are governed by global functions)
  • Cannot capture complex local features (e.g., a bump in the middle of the range)
  • The set of allowed shapes, while broad, is still constrained

Lopez-Ayala et al. (2025) note that fractional polynomials and splines are complementary approaches, and in many cases they give similar results. However, for most applications in clinical research, restricted cubic splines have become the recommended default due to their greater flexibility and better behaviour at the boundaries.

6.6 Clinical Example: CSF Glucose and Bacterial Meningitis

This example is adapted from Lopez-Ayala et al. (2025), who use it to illustrate the full workflow for modelling a continuous predictor.

6.6.1 The Clinical Question

In patients presenting with suspected meningitis, cerebrospinal fluid (CSF) glucose is one of several laboratory values used to distinguish bacterial from viral meningitis. Low CSF glucose is a hallmark of bacterial infection, as bacteria consume glucose. But what is the precise shape of this relationship? And is it appropriate to use a simple cut-point (e.g., CSF glucose < 2.2 mmol/L)?

6.6.2 Step-by-Step Analysis

Step 1: Visualise the raw relationship.

Before fitting any model, plot the outcome against the predictor using a smoothing function (loess or GAM). This gives you a preliminary sense of the shape.

Code
library(ggplot2)
library(rms)

set.seed(2025)
n <- 400
csf_glucose <- rlnorm(n, meanlog = log(3), sdlog = 0.6)
csf_glucose <- pmin(csf_glucose, 10)

# True relationship: steep drop in probability below ~2, then flat
logit_p <- 3 - 4 * pmax(0, 2.5 - csf_glucose) - 0.3 * csf_glucose
p_bact <- plogis(logit_p)
bacterial <- rbinom(n, 1, p_bact)

df_csf <- data.frame(csf_glucose = csf_glucose, bacterial = bacterial)

ggplot(df_csf, aes(x = csf_glucose, y = bacterial)) +
  geom_jitter(height = 0.05, alpha = 0.3, width = 0.05) +
  geom_smooth(method = "loess", se = TRUE, colour = "#2c3e50") +
  labs(x = "CSF Glucose (mmol/L)", y = "Probability of Bacterial Meningitis",
       subtitle = "Loess smooth suggests a non-linear (threshold-like) relationship") +
  theme_minimal(base_size = 14)
Figure 6.3: Simulated relationship between CSF glucose and probability of bacterial meningitis, illustrating the non-linear (threshold-like) pattern.

Step 2: Fit competing models.

We fit three models and compare them:

  • Model 1 (Linear): bacterial ~ csf_glucose
  • Model 2 (Categorised): bacterial ~ I(csf_glucose < 2.2) (a common clinical cut-point)
  • Model 3 (RCS): bacterial ~ rcs(csf_glucose, 4)
Code
library(rms)

dd_csf <- datadist(df_csf)
options(datadist = "dd_csf")

# Model 1: Linear
fit_linear <- lrm(bacterial ~ csf_glucose, data = df_csf)

# Model 2: Categorised
df_csf$glucose_low <- as.numeric(df_csf$csf_glucose < 2.2)
fit_cat <- lrm(bacterial ~ glucose_low, data = df_csf)

# Model 3: RCS with 4 knots
fit_rcs <- lrm(bacterial ~ rcs(csf_glucose, 4), data = df_csf)

# Compare AIC
aic_vals <- c(
  Linear = AIC(fit_linear),
  Categorised = AIC(fit_cat),
  RCS_4knots = AIC(fit_rcs)
)

cat("AIC Comparison:\n")
AIC Comparison:
Code
print(round(aic_vals, 1))
     Linear Categorised  RCS_4knots 
      456.3       366.7       339.0 
Code
cat("\nLower AIC = better fit (penalised for complexity)\n")

Lower AIC = better fit (penalised for complexity)

Step 3: Test for non-linearity.

Code
# The anova() function in rms tests the overall effect and non-linearity separately
cat("ANOVA for RCS model:\n")
ANOVA for RCS model:
Code
print(anova(fit_rcs))
                Wald Statistics          Response: bacterial 

 Factor      Chi-Square d.f. P     
 csf_glucose 91.23      3    <.0001
  Nonlinear  83.01      2    <.0001
 TOTAL       91.23      3    <.0001

The ANOVA output from rms provides two key tests:

  • Total effect (all spline terms combined): tests whether CSF glucose has any association with the outcome
  • Nonlinear component (spline terms beyond the linear): tests whether the relationship departs from linearity

If the non-linear component is significant, this confirms that a linear term is insufficient.

Step 4: Plot the fitted curves from all three models.

Code
x_seq <- seq(min(df_csf$csf_glucose), max(df_csf$csf_glucose), length.out = 200)

# Predictions from each model
pred_linear <- predict(fit_linear, newdata = data.frame(csf_glucose = x_seq),
                       type = "fitted")
pred_cat <- predict(fit_cat,
                    newdata = data.frame(glucose_low = as.numeric(x_seq < 2.2)),
                    type = "fitted")

pred_rcs_obj <- Predict(fit_rcs, csf_glucose = x_seq)

plot_df <- data.frame(
  csf_glucose = rep(x_seq, 3),
  probability = c(pred_linear, pred_cat, as.data.frame(pred_rcs_obj)$yhat),
  Model = rep(c("Linear", "Categorised (<2.2)", "RCS (4 knots)"), each = length(x_seq))
)

# Convert RCS from log-odds to probability for fair comparison
plot_df$probability[plot_df$Model == "RCS (4 knots)"] <-
  plogis(plot_df$probability[plot_df$Model == "RCS (4 knots)"])

ggplot(plot_df, aes(x = csf_glucose, y = probability, colour = Model)) +
  geom_line(linewidth = 1.1) +
  scale_colour_manual(values = c("Linear" = "#3498db",
                                  "Categorised (<2.2)" = "#e74c3c",
                                  "RCS (4 knots)" = "#2c3e50")) +
  labs(x = "CSF Glucose (mmol/L)", y = "Predicted Probability of Bacterial Meningitis") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "bottom")
Figure 6.4: Comparison of linear (blue), categorised (red), and RCS (black) model fits for the CSF glucose–bacterial meningitis relationship.

6.6.3 What This Example Teaches Us

  1. The linear model misses the threshold-like nature of the relationship. It estimates a constant decrease in probability per unit increase in glucose, which underestimates the risk at very low glucose levels and overestimates it at intermediate levels.

  2. The categorised model captures the broad direction (low glucose = higher risk) but loses all nuance. It cannot tell you that a glucose of 0.5 mmol/L carries a very different risk than a glucose of 2.0 mmol/L — both are simply “low.”

  3. The RCS model captures the steep decline in risk at low glucose levels and the flattening at higher levels, providing the most accurate and informative picture.

6.7 Practical Recommendations

Lopez-Ayala et al. (2025) provide a practical framework (their Box 2) that we adapt here:

TipWorkflow for Modelling Continuous Predictors
  1. Keep continuous variables continuous. Do not categorise unless there is a strong a priori biological or clinical reason (and even then, think twice).

  2. Start by visualising the predictor–outcome relationship using a smoother (loess, GAM, or a spline with many knots).

  3. Fit a restricted cubic spline with 3–5 knots as the default modelling strategy. Use 3 knots when sample size is small or the relationship is expected to be simple; 4–5 knots when sample size is adequate and the relationship may be complex.

  4. Test for non-linearity using a likelihood ratio test or the Wald test from anova(fit) in rms. If non-linearity is not significant, you may simplify to a linear term — but the spline model will give very similar results anyway, so simplification is optional.

  5. Present results as plots showing the predicted outcome across the range of the predictor, with confidence intervals. Supplement with specific contrasts (e.g., odds ratio for value A vs value B) as needed.

  6. Do not report individual spline coefficients in tables. They are not interpretable. Instead, report the overall test for the predictor and any non-linearity test, along with a figure.

  7. Use AIC or likelihood ratio tests to compare models (linear vs spline, different numbers of knots).

6.8 Implementation

The rms package by Frank Harrell is the gold standard for this in R, and patsy + statsmodels give the equivalent capability in Python. The walkthroughs below are fully self-contained and runnable: each one first simulates a small example dataset (a continuous predictor, two confounders, and a binary outcome) so you can run the code as-is, then fits and plots a restricted cubic spline. To use them on your own study, replace the simulated dat with your own data frame and keep the rest unchanged.

The key function is rcs(variable, number_of_knots) inside an rms model fit such as lrm().

Code
library(rms)
library(ggplot2)

# --- Simulated example data (replace this block with your own data) ---
set.seed(1)
n <- 400
predictor   <- runif(n, 20, 45)
confounder1 <- rnorm(n)
confounder2 <- rbinom(n, 1, 0.4)
logit_p <- -8 + 0.05 * (predictor - 30)^2 + 0.3 * confounder1 + 0.5 * confounder2
dat <- data.frame(
  outcome = rbinom(n, 1, plogis(logit_p)),
  predictor = predictor,
  confounder1 = confounder1,
  confounder2 = confounder2
)

# Step 1: Tell rms about the data distribution (required for Predict/summary)
dd <- datadist(dat)
options(datadist = "dd")

# Step 2: Fit a logistic regression with the predictor modelled as an RCS(4)
fit <- lrm(outcome ~ rcs(predictor, 4) + confounder1 + confounder2, data = dat)

# Step 3: Examine the model
print(fit)         # Overall model summary
Logistic Regression Model

lrm(formula = outcome ~ rcs(predictor, 4) + confounder1 + confounder2, 
    data = dat)

                       Model Likelihood      Discrimination    Rank Discrim.    
                             Ratio Test             Indexes          Indexes    
Obs           400    LR chi2     193.30      R2       0.775    C       0.982    
 0            357    d.f.             5      R2(5,400)0.375    Dxy     0.964    
 1             43    Pr(> chi2) <0.0001    R2(5,115.1)0.805    gamma   0.964    
max |deriv| 7e-06                            Brier    0.028    tau-a   0.185    

            Coef    S.E.    Wald Z Pr(>|Z|)
Intercept   15.1163 16.5458  0.91  0.3609  
predictor   -0.8784  0.7545 -1.16  0.2443  
predictor'   1.9043  2.7241  0.70  0.4845  
predictor'' -1.9463  6.4234 -0.30  0.7619  
confounder1  0.2734  0.2730  1.00  0.3165  
confounder2  0.1893  0.5931  0.32  0.7496  
Code
anova(fit)         # Tests for each predictor (total effect + non-linearity)
                Wald Statistics          Response: outcome 

 Factor      Chi-Square d.f. P     
 predictor   47.13      3    <.0001
  Nonlinear  24.85      2    <.0001
 confounder1  1.00      1    0.3165
 confounder2  0.10      1    0.7496
 TOTAL       47.73      5    <.0001
Code
# Step 4: Plot the fitted relationship across the range of the predictor
pred_df <- as.data.frame(Predict(fit, predictor))
ggplot(pred_df, aes(x = predictor, y = yhat)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2, fill = "#3498db") +
  geom_line(linewidth = 1) +
  labs(x = "Predictor", y = "Log-Odds of Outcome") +
  theme_minimal(base_size = 13)

Code
# Step 5: A specific contrast --- log-odds ratio for predictor 30 vs 25
summary(fit, predictor = c(25, 30))
             Effects              Response : outcome 

 Factor      Low     High     Diff.  Effect    S.E.    Lower 0.95 Upper 0.95
 predictor   25.0000 30.00000 5.0000 -2.469800 1.18720 -4.7967000 -0.14295  
  Odds Ratio 25.0000 30.00000 5.0000  0.084601      NA  0.0082572  0.86679  
 confounder1 -0.6428  0.68494 1.3277  0.363050 0.36245 -0.3473400  1.07340  
  Odds Ratio -0.6428  0.68494 1.3277  1.437700      NA  0.7065600  2.92540  
 confounder2  0.0000  1.00000 1.0000  0.189310 0.59313 -0.9732000  1.35180  
  Odds Ratio  0.0000  1.00000 1.0000  1.208400      NA  0.3778700  3.86450  

patsy’s cr() builds a natural (restricted) cubic spline basis; df equals the number of knots minus one, so df=3 corresponds to a 4-knot RCS.

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.special import expit

# --- Simulated example data (replace this block with your own data) ---
rng = np.random.default_rng(1)
n = 400
predictor = rng.uniform(20, 45, n)
confounder1 = rng.normal(size=n)
confounder2 = rng.binomial(1, 0.4, n)
logit_p = -8 + 0.05 * (predictor - 30) ** 2 + 0.3 * confounder1 + 0.5 * confounder2
dat = pd.DataFrame({
    "outcome": rng.binomial(1, expit(logit_p)),
    "predictor": predictor,
    "confounder1": confounder1,
    "confounder2": confounder2,
})

# Step 1: Fit using the formula interface. patsy's cr() builds a natural
# (restricted) cubic spline basis; df=3 corresponds to a 4-knot RCS. Using the
# formula API means statsmodels remembers the spline knots and reapplies them
# automatically when we predict on new data below.
model = smf.glm(
    "outcome ~ cr(predictor, df=3) + confounder1 + confounder2",
    data=dat, family=sm.families.Binomial(),
)
result = model.fit()
print(result.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                outcome   No. Observations:                  400
Model:                            GLM   Df Residuals:                      395
Model Family:                Binomial   Df Model:                            4
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -28.986
Date:                Wed, 10 Jun 2026   Deviance:                       57.973
Time:                        15:07:36   Pearson chi2:                     74.3
No. Iterations:                   100   Pseudo R-squ. (CS):             0.3538
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept              -5.303e+12   9.85e+13     -0.054      0.957   -1.98e+14    1.88e+14
cr(predictor, df=3)[0]  5.303e+12   9.85e+13      0.054      0.957   -1.88e+14    1.98e+14
cr(predictor, df=3)[1]  5.303e+12   9.85e+13      0.054      0.957   -1.88e+14    1.98e+14
cr(predictor, df=3)[2]  5.303e+12   9.85e+13      0.054      0.957   -1.88e+14    1.98e+14
confounder1                0.9267      0.335      2.770      0.006       0.271       1.582
confounder2                1.2288      0.705      1.743      0.081      -0.153       2.610
==========================================================================================
Code
# Step 2: Plot the fitted relationship across the range of the predictor
# (holding the confounders fixed at reference values).
x_range = np.linspace(dat["predictor"].min(), dat["predictor"].max(), 200)
grid = pd.DataFrame({"predictor": x_range, "confounder1": 0.0, "confounder2": 0})
pred_prob = result.predict(grid)

plt.figure(figsize=(8, 5))
plt.plot(x_range, pred_prob, "k-", linewidth=2)
plt.xlabel("Predictor")
plt.ylabel("Predicted Probability of Outcome")
plt.title("Non-linear relationship modelled with a restricted cubic spline")
plt.tight_layout()
plt.show()

6.9 Comparison with Other Approaches

The table below summarises the main options. As a quick reminder of the two flexible approaches: fractional polynomials (introduced earlier in this chapter) model the predictor using a small number of non-integer powers (such as \(\sqrt{x}\), \(1/x\), or \(\log x\)) chosen to fit the data, giving a smooth global curve; restricted cubic splines instead stitch together local cubic pieces between knots and force straight lines in the tails. Both avoid the pitfalls of categorisation, but they bend the curve in different ways.

Table 6.1: Comparison of approaches for modelling continuous variables
Approach Pros Cons
Linear term Simple, 1 df Assumes linearity — often wrong
Categorisation “Easy” to present Information loss, bias, arbitrary cut-points
Quadratic polynomial Can capture U-shapes, 2 df Symmetric, poor boundary behaviour
Fractional polynomials Flexible, parsimonious Global functions; limited local flexibility
Restricted cubic splines Flexible, good boundary behaviour, well-supported Coefficients not directly interpretable
GAM (smoothing splines) Very flexible, automatic smoothing Can overfit; penalty selection needed

For most clinical research, restricted cubic splines with 3–5 knots represent the best balance of flexibility, parsimony, and practical usability.

6.10 Exercises

6.10.1 Exercise 1: Recognising the Problem

Consider a study of the association between systolic blood pressure (SBP) and stroke risk. The researchers categorise SBP into: normal (<120), elevated (120–129), Stage 1 hypertension (130–139), and Stage 2 hypertension (>=140).

Questions:

  1. What assumptions does this categorisation impose on the SBP–stroke relationship?
  2. A patient with SBP of 131 mmHg is classified identically to a patient with SBP of 139 mmHg. Why is this problematic?
  3. Suggest a better modelling approach and explain how you would implement it.

6.10.2 Exercise 2: Fitting and Interpreting Spline Models

Using the built-in PBC dataset (primary biliary cholangitis) from the survival package in R, model the relationship between serum bilirubin and transplant status (a binary outcome: transplanted or not).

Code
library(rms)
library(ggplot2)

# Load the PBC data
data(pbc, package = "survival")
pbc <- pbc[!is.na(pbc$trt), ]  # Remove incomplete cases

# Create a binary outcome: transplanted (status == 1) vs not
pbc$transplant <- as.numeric(pbc$status == 1)

# Set up datadist
dd <- datadist(pbc)
options(datadist = "dd")

# Task 1: Fit a logistic regression with bilirubin as a linear term
fit_linear <- lrm(transplant ~ bili, data = pbc)

# Task 2: Fit a logistic regression with bilirubin modelled using RCS (4 knots)
fit_rcs <- lrm(transplant ~ rcs(bili, 4), data = pbc)

# Task 3: Test for non-linearity
anova(fit_rcs)

# Task 4: Compare AIC
AIC(fit_linear)
AIC(fit_rcs)

# Task 5: Plot the relationship
p <- Predict(fit_rcs, bili)
plot(p, xlab = "Serum Bilirubin (mg/dL)",
     ylab = "Log-Odds of Transplant")

# Task 6: What does the plot tell you about the bilirubin-transplant
#          relationship? Is it linear?
Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from patsy import dmatrix

# Simulate data similar to PBC bilirubin-transplant relationship
np.random.seed(42)
n = 300
bili = np.random.exponential(scale=2, size=n)
bili = np.clip(bili, 0.3, 25)
logit_p = -3 + 0.1 * bili + 0.005 * bili**2  # Non-linear true relationship
prob = 1 / (1 + np.exp(-logit_p))
transplant = np.random.binomial(1, prob)

df = pd.DataFrame({'bili': bili, 'transplant': transplant})

# Task 1: Fit logistic regression with linear bilirubin
X_linear = sm.add_constant(df[['bili']])
fit_linear = sm.GLM(df['transplant'], X_linear,
                     family=sm.families.Binomial()).fit()
print("Linear model:")
print(fit_linear.summary())

# Task 2: Create spline basis and fit logistic regression with splines
spline_basis = dmatrix("cr(bili, df=3)", data=df, return_type='dataframe')
fit_spline = sm.GLM(df['transplant'], spline_basis,
                      family=sm.families.Binomial()).fit()
print("\nSpline model:")
print(fit_spline.summary())

# Task 3: Compare AIC
print(f"\nAIC Linear: {fit_linear.aic:.1f}")
print(f"AIC Spline: {fit_spline.aic:.1f}")

# Task 4: Plot the predicted log-odds across bilirubin range
bili_range = np.linspace(df['bili'].min(), df['bili'].max(), 200)
spline_pred = dmatrix("cr(bili_range, df=3)",
                       {"bili_range": bili_range},
                       return_type='dataframe')
pred_prob = fit_spline.predict(spline_pred)

plt.figure(figsize=(8, 5))
plt.plot(bili_range, pred_prob, 'k-', linewidth=2)
plt.xlabel('Serum Bilirubin (mg/dL)')
plt.ylabel('Predicted Probability of Transplant')
plt.title('Non-linear effect of bilirubin on transplant (RCS)')
plt.tight_layout()
plt.show()

6.10.3 Exercise 3: Categorisation vs Splines Head-to-Head

Simulate data with a known non-linear (U-shaped) relationship and compare the performance of categorised and spline models.

Code
library(rms)
library(ggplot2)

set.seed(2025)
n <- 1000
x <- rnorm(n, mean = 50, sd = 10)

# True U-shaped relationship
logit_p <- -5 + 0.004 * (x - 50)^2
y <- rbinom(n, 1, plogis(logit_p))

sim_data <- data.frame(x = x, y = y)
dd <- datadist(sim_data)
options(datadist = "dd")

# Model 1: Categorise into quartiles
sim_data$x_cat <- cut(sim_data$x,
                       breaks = quantile(sim_data$x, c(0, 0.25, 0.5, 0.75, 1)),
                       include.lowest = TRUE)
fit_cat <- lrm(y ~ x_cat, data = sim_data)

# Model 2: Linear term
fit_lin <- lrm(y ~ x, data = sim_data)

# Model 3: RCS with 5 knots
fit_rcs <- lrm(y ~ rcs(x, 5), data = sim_data)

# Questions:
# a) Compare AIC of the three models. Which fits best?
# b) Plot all three fitted curves on the same graph along with the
#    true logit function. Which model best recovers the truth?
# c) What happens if you change the quartile cut-points to tertiles?
#    Does it affect the categorised model's conclusions?

cat("AIC values:\n")
cat("Categorised:", AIC(fit_cat), "\n")
cat("Linear:     ", AIC(fit_lin), "\n")
cat("RCS(5):     ", AIC(fit_rcs), "\n")
Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from patsy import dmatrix
from scipy.special import expit

np.random.seed(2025)
n = 1000
x = np.random.normal(50, 10, size=n)

# True U-shaped relationship
logit_p = -5 + 0.004 * (x - 50)**2
y = np.random.binomial(1, expit(logit_p))

df = pd.DataFrame({'x': x, 'y': y})

# Model 1: Categorise into quartiles
df['x_cat'] = pd.qcut(df['x'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
X_cat = pd.get_dummies(df['x_cat'], drop_first=True).astype(float)
X_cat = sm.add_constant(X_cat)
fit_cat = sm.GLM(df['y'], X_cat, family=sm.families.Binomial()).fit()

# Model 2: Linear
X_lin = sm.add_constant(df[['x']])
fit_lin = sm.GLM(df['y'], X_lin, family=sm.families.Binomial()).fit()

# Model 3: RCS
X_rcs = dmatrix("cr(x, df=4)", data=df, return_type='dataframe')
fit_rcs = sm.GLM(df['y'], X_rcs, family=sm.families.Binomial()).fit()

print(f"AIC Categorised: {fit_cat.aic:.1f}")
print(f"AIC Linear:      {fit_lin.aic:.1f}")
print(f"AIC RCS:         {fit_rcs.aic:.1f}")

# Plot all three fitted curves
x_range = np.linspace(df['x'].min(), df['x'].max(), 200)

# True curve
true_logit = -5 + 0.004 * (x_range - 50)**2
true_prob = expit(true_logit)

# RCS predictions
X_rcs_pred = dmatrix("cr(x_range, df=4)", {"x_range": x_range},
                      return_type='dataframe')
rcs_prob = fit_rcs.predict(X_rcs_pred)

# Linear predictions
X_lin_pred = sm.add_constant(pd.DataFrame({'x': x_range}))
lin_prob = fit_lin.predict(X_lin_pred)

plt.figure(figsize=(10, 6))
plt.plot(x_range, true_prob, 'k--', linewidth=2, label='True relationship')
plt.plot(x_range, rcs_prob, 'b-', linewidth=2, label='RCS')
plt.plot(x_range, lin_prob, 'r-', linewidth=1.5, label='Linear')
plt.xlabel('Predictor (x)')
plt.ylabel('Predicted Probability')
plt.legend()
plt.title('Comparing modelling strategies for a U-shaped relationship')
plt.tight_layout()
plt.show()

6.11 Summary

Key Concept Details
Do not categorise Categorising continuous variables loses information, introduces bias, and reduces power
Linear terms Assume a constant effect per unit change; often wrong for biological variables
Fractional polynomials Use non-integer powers for flexible global curves; parsimonious but limited locally
Splines Piecewise polynomials joined at knots; flexible and locally adaptive
Restricted cubic splines Splines forced to be linear in the tails; recommended default with 3–5 knots
Interpretation Use plots and contrasts, not individual coefficients
Testing Overall association test (\(k-1\) df); non-linearity test (\(k-2\) df)

References

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22.
Harrell, Frank E. 2015. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. 2nd ed. Springer.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer.
Hoerl, Arthur E, and Robert W Kennard. 1970. “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics 12 (1): 55–67.
Lopez-Ayala, Pedro, Richard D Riley, Gary S Collins, and Tobias Zimmermann. 2025. “Dealing with Continuous Variables and Modelling Non-Linear Associations in Healthcare Data: Practical Guide.” BMJ 390: e082440. https://doi.org/10.1136/bmj-2024-082440.
Royston, Patrick, and Douglas G Altman. 1994. “Regression Using Fractional Polynomials of Continuous Covariates: Parsimonious Parametric Modelling.” Journal of the Royal Statistical Society: Series C 43 (3): 429–67.
Royston, Patrick, Douglas G Altman, and Willi Sauerbrei. 2006. “Dichotomizing Continuous Predictors in Multiple Regression: A Bad Idea.” Statistics in Medicine 25 (1): 127–41. https://doi.org/10.1002/sim.2331.
Steyerberg, Ewout W. 2019. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. 2nd ed. Springer.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B 58 (1): 267–88.
Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B 67 (2): 301–20.