20  Causal Inference for Observational Health Data

20.1 Introduction

Most clinical research relies on observational data — electronic health records, claims databases, registries, and cohort studies. Yet the questions we care about most are causal: Does this drug reduce mortality? Would this surgery improve outcomes compared with conservative management? Answering causal questions with observational data requires careful methodology, because the fundamental problem is that patients who receive a treatment are systematically different from those who do not.

This chapter equips you with the conceptual frameworks and practical tools to move from “Drug X is associated with lower mortality” to a defensible causal claim — or, equally important, to recognise when the data cannot support one.

ImportantThe stakes are real

Confounding by indication — where sicker patients receive more aggressive treatment — has led to published studies concluding that treatments cause harm when they actually help, and vice versa. The Women’s Health Initiative overturned decades of observational evidence on hormone replacement therapy. Causal inference methods do not guarantee correct answers, but they make your assumptions explicit and testable.

20.2 Correlation vs Causation in Medicine

20.2.1 Why simple regression is not enough

Consider a hospital database showing that patients who receive mechanical ventilation have higher mortality than those who do not. A naive analyst might conclude ventilation is harmful. But the confounding is obvious: ventilation is given to the sickest patients.

This is confounding by indication — the indication for treatment is itself a risk factor for the outcome. Standard regression can adjust for measured confounders, but:

  1. You must know which variables to include (and which to exclude).
  2. You must model the relationships correctly (linearity, interactions).
  3. You cannot adjust for unmeasured confounders.

Causal inference methods address problems 1 and 2 more systematically, and provide tools to assess sensitivity to problem 3.

20.2.2 The potential outcomes framework

The modern causal inference framework, formalised by Rubin (1974) and extended by Hernan and Robins, rests on potential outcomes:

  • \(Y^{a=1}\): the outcome a patient would have under treatment
  • \(Y^{a=0}\): the outcome a patient would have under control

The individual causal effect is \(Y^{a=1} - Y^{a=0}\), but we can never observe both for the same person. This is the fundamental problem of causal inference.

We can, however, estimate the average treatment effect (ATE):

\[ \text{ATE} = E[Y^{a=1}] - E[Y^{a=0}] \]

or the average treatment effect on the treated (ATT):

\[ \text{ATT} = E[Y^{a=1} - Y^{a=0} \mid A=1] \]

The key assumption that allows us to estimate these from observational data is exchangeability (no unmeasured confounding): conditional on measured covariates \(L\), treatment assignment is independent of potential outcomes.

\[ Y^a \perp\!\!\!\perp A \mid L \]

20.3 Directed Acyclic Graphs (DAGs)

20.3.1 What is a DAG?

A directed acyclic graph is a visual representation of your causal assumptions. Each node represents a variable. Each arrow represents a direct causal effect. “Acyclic” means no variable can cause itself through a chain of arrows.

Figure 20.1: A simple DAG showing confounding. Age affects both statin use and cardiovascular death.

20.3.2 Key structures in DAGs

There are three fundamental structures you must recognise:

1. Confounders (common causes)

A variable that causes both treatment and outcome. You MUST adjust for confounders to remove bias.

L --> A --> Y
L ---------> Y

If L is diabetes severity, A is insulin therapy, and Y is HbA1c, then diabetes severity confounds the insulin-HbA1c relationship.

2. Mediators (intermediate variables)

A variable on the causal pathway from treatment to outcome. You should NOT adjust for mediators if you want the total effect.

A --> M --> Y

If A is exercise, M is weight loss, and Y is blood pressure, then weight loss mediates the exercise-BP relationship. Adjusting for weight loss would block part of the effect you want to estimate.

3. Colliders (common effects)

A variable caused by both treatment and outcome (or by variables on each path). You should NOT adjust for colliders — doing so creates a spurious association (collider bias).

A --> C <-- Y

If A is obesity, Y is genetic predisposition to diabetes, and C is being in a diabetes clinic, then conditioning on clinic attendance (a collider) creates a spurious negative association between obesity and genetic risk.

TipThe “d-separation” rule of thumb

A path between treatment and outcome is blocked if:

  • It passes through a variable you condition on (confounder or mediator), OR
  • It passes through a collider you do NOT condition on.

A path is open if it is not blocked. Bias exists when there are open non-causal paths (backdoor paths) between treatment and outcome.

20.3.3 Drawing your own DAG

Before any causal analysis, draw a DAG. Steps:

  1. List all variables relevant to treatment assignment and the outcome.
  2. Use clinical knowledge to draw arrows (not data — DAGs encode assumptions).
  3. Identify all backdoor paths from treatment to outcome.
  4. Determine the minimal adjustment set that blocks all backdoor paths without opening new ones.

Tools like DAGitty automate step 4 once you have drawn the DAG.

20.4 Propensity Score Methods

20.4.1 The propensity score

The propensity score is the probability of receiving treatment given observed covariates:

\[ e(L) = P(A=1 \mid L) \]

Rosenbaum and Rubin (1983) showed that, if exchangeability holds conditional on \(L\), it also holds conditional on the scalar \(e(L)\). This reduces a high-dimensional adjustment problem to a single dimension.

20.4.2 Estimating the propensity score

We typically estimate \(e(L)\) using logistic regression (or more flexible methods like gradient boosting):

Code
# Simulate clinical data
set.seed(42)
n <- 2000

dat <- tibble(
  age = rnorm(n, 65, 10),
  diabetes = rbinom(n, 1, 0.3),
  egfr = rnorm(n, 60, 15),
  smoking = rbinom(n, 1, 0.25),
  # Treatment assignment depends on covariates (confounding)
  ps_true = plogis(-2 + 0.03 * age + 0.5 * diabetes - 0.02 * egfr + 0.4 * smoking),
  treatment = rbinom(n, 1, ps_true),
  # Outcome depends on treatment AND covariates
  mortality = rbinom(n, 1, plogis(-3 + 0.04 * age + 0.6 * diabetes -
                                    0.03 * egfr + 0.3 * smoking -
                                    0.5 * treatment))
)

# Estimate propensity score via logistic regression
ps_model <- glm(treatment ~ age + diabetes + egfr + smoking,
                data = dat, family = binomial)
dat$ps <- predict(ps_model, type = "response")

# Visualise overlap
ggplot(dat, aes(x = ps, fill = factor(treatment))) +
  geom_density(alpha = 0.5) +
  labs(x = "Propensity Score", fill = "Treatment",
       title = "Propensity Score Distribution by Treatment Group") +
  theme_minimal()
WarningPositivity assumption

Every patient must have a non-zero probability of receiving either treatment. If some patients have propensity scores near 0 or 1, the overlap assumption is violated and estimates become unstable. Always check the propensity score distributions visually.

20.4.3 Propensity score matching

Matching creates a pseudo-randomised subset by pairing treated and control patients with similar propensity scores.

Code
# Using MatchIt for nearest-neighbour matching with caliper
m_out <- matchit(treatment ~ age + diabetes + egfr + smoking,
                 data = dat,
                 method = "nearest",
                 distance = "glm",        # logistic regression PS
                 caliper = 0.2,           # 0.2 SD of logit PS
                 ratio = 1)               # 1:1 matching

summary(m_out)

# Check covariate balance using cobalt
love.plot(m_out,
          thresholds = c(m = 0.1),    # SMD threshold of 0.1
          binary = "std",
          title = "Covariate Balance: Before and After Matching")

# Extract matched data and estimate treatment effect
m_data <- match.data(m_out)

# Outcome model in matched sample
outcome_model <- glm(mortality ~ treatment,
                     data = m_data,
                     family = binomial,
                     weights = weights)

tidy(outcome_model, conf.int = TRUE, exponentiate = TRUE)

Checking balance is the most important step after matching. A standardised mean difference (SMD) below 0.1 for all covariates is the conventional threshold for acceptable balance.

20.4.4 Inverse probability of treatment weighting (IPTW)

Instead of discarding unmatched patients, IPTW reweights the entire sample so that the distribution of covariates is balanced between groups:

\[ w_i = \frac{A_i}{e(L_i)} + \frac{1 - A_i}{1 - e(L_i)} \]

This estimates the ATE. For the ATT, the weights are:

\[ w_i = A_i + \frac{(1 - A_i) \cdot e(L_i)}{1 - e(L_i)} \]

Code
# Using WeightIt
w_out <- weightit(treatment ~ age + diabetes + egfr + smoking,
                  data = dat,
                  method = "ps",
                  estimand = "ATE")

# Check balance
bal.tab(w_out, thresholds = c(m = 0.1))

# Stabilised weights (recommended to reduce variance)
summary(w_out)

# Estimate treatment effect with IPTW
library(survey)
d_weighted <- svydesign(ids = ~1, weights = w_out$weights, data = dat)
svyglm(mortality ~ treatment, design = d_weighted, family = quasibinomial) |>
  tidy(conf.int = TRUE, exponentiate = TRUE)
NoteStabilised weights

Extreme weights can inflate variance. Stabilised weights replace the numerator 1 with \(P(A=1)\) for treated and \(P(A=0)\) for untreated. The WeightIt package does this by default. Always check that weights are not excessively large (e.g., max weight > 20 warrants investigation).

20.4.5 Doubly robust estimation

Doubly robust methods combine a propensity score model and an outcome model. The estimate is consistent if either model is correctly specified (but not necessarily both). This provides an extra layer of protection against model misspecification.

The augmented inverse probability weighted (AIPW) estimator is:

\[ \hat{\tau}_{DR} = \frac{1}{n} \sum_{i=1}^n \left[ \frac{A_i Y_i}{e(L_i)} - \frac{A_i - e(L_i)}{e(L_i)} \hat{\mu}_1(L_i) \right] - \frac{1}{n} \sum_{i=1}^n \left[ \frac{(1-A_i) Y_i}{1-e(L_i)} + \frac{A_i - e(L_i)}{1-e(L_i)} \hat{\mu}_0(L_i) \right] \]

where \(\hat{\mu}_a(L)\) is the predicted outcome under treatment \(a\).

In practice, packages like AIPW in R or DoWhy in Python handle this computation.

20.5 Target Trial Emulation

20.5.1 The framework

Target trial emulation, developed by Hernan and Robins, asks: “What randomised trial would we like to conduct?” and then designs the observational analysis to mimic that trial as closely as possible.

The key components to specify are:

Protocol component Target trial Observational emulation
Eligibility criteria Defined inclusion/exclusion Same criteria applied to database
Treatment strategies e.g., start drug A vs drug B Same, defined at “time zero”
Treatment assignment Random Adjusted via PS methods or g-estimation
Start of follow-up Randomisation date Date eligibility + treatment criteria met
Outcome e.g., 5-year mortality Same
Causal contrast Intention-to-treat or per-protocol Same (with appropriate methods)
Analysis plan Intention-to-treat analysis Clone-censor-weight or similar

20.5.2 Why this reduces bias

Target trial emulation directly addresses several common biases:

Immortal time bias: In a naive observational study comparing drug users to non-users, the time between cohort entry and first prescription is “immortal” — the patient must survive to receive the drug. This artificially inflates survival in the treated group. Target trial emulation avoids this by aligning “time zero” (start of follow-up) with treatment assignment.

Selection bias from prevalent user designs: Studying patients already on a drug (prevalent users) conflates treatment initiation effects with survival conditioning. Target trial emulation uses a new-user (incident user) design.

Vague eligibility and time zero: When eligibility criteria and the start of follow-up are not clearly defined, analyses can be plagued by look-ahead bias. The target trial framework forces explicit specification.

TipThe TARGET reporting guideline (2025)

The TARGET (TrAnsparent ReportinG of observational studies Emulating a Target trial) guideline provides a structured checklist for reporting target trial emulation studies. It was published in the BMJ in 2025 and should be followed for any observational study making causal claims.

20.5.3 Example: Statin initiation and cardiovascular events

Suppose we want to estimate the effect of initiating statins within 6 months of a type 2 diabetes diagnosis on 5-year cardiovascular events.

Target trial specification:

  • Eligibility: Adults aged 40–75 with new T2DM diagnosis, no prior CVD, no prior statin use
  • Treatment strategies: (1) Initiate statin within 6 months, (2) Do not initiate statin within 6 months
  • Time zero: Date of T2DM diagnosis
  • Outcome: First major adverse cardiovascular event (MACE) within 5 years
  • Analysis: Intention-to-treat using IPTW for confounding, clone-censor-weight for treatment strategy adherence

20.6 Regression Discontinuity Design

Regression discontinuity (RD) exploits arbitrary thresholds in clinical decision-making. When treatment is assigned based on whether a continuous variable exceeds a cutoff, patients just above and just below the cutoff are essentially randomly assigned.

Clinical example: Many guidelines recommend antihypertensive therapy when systolic blood pressure exceeds 140 mmHg. Patients with SBP of 141 are very similar to patients with SBP of 139, but the former are much more likely to receive treatment. Comparing outcomes near the cutoff provides a quasi-experimental estimate.

Key requirements for a valid RD design:

  1. Treatment probability changes sharply at the cutoff (first stage)
  2. Other covariates do not jump at the cutoff (continuity assumption)
  3. Patients cannot precisely manipulate their score (no bunching)

RD designs are less common in clinical research than propensity score methods, but when applicable, they provide strong evidence because the assumptions are more testable.

20.7 Implementation in Python

Python’s causal inference ecosystem is maturing. Here we demonstrate propensity score matching and IPTW using dowhy and causalinference.

Code
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from scipy import stats
import matplotlib.pyplot as plt

# Simulate clinical data
np.random.seed(42)
n = 2000

age = np.random.normal(65, 10, n)
diabetes = np.random.binomial(1, 0.3, n)
egfr = np.random.normal(60, 15, n)
smoking = np.random.binomial(1, 0.25, n)

# Confounded treatment assignment
ps_true = 1 / (1 + np.exp(-(-2 + 0.03*age + 0.5*diabetes - 0.02*egfr + 0.4*smoking)))
treatment = np.random.binomial(1, ps_true)

# Outcome
mortality_prob = 1 / (1 + np.exp(-(-3 + 0.04*age + 0.6*diabetes -
                                     0.03*egfr + 0.3*smoking -
                                     0.5*treatment)))
mortality = np.random.binomial(1, mortality_prob)

df = pd.DataFrame({
    'age': age, 'diabetes': diabetes, 'egfr': egfr,
    'smoking': smoking, 'treatment': treatment, 'mortality': mortality
})

# --- Propensity Score Estimation ---
covariates = ['age', 'diabetes', 'egfr', 'smoking']
ps_model = LogisticRegression(max_iter=1000)
ps_model.fit(df[covariates], df['treatment'])
df['ps'] = ps_model.predict_proba(df[covariates])[:, 1]

# --- IPTW ---
# Compute ATE weights
df['iptw'] = np.where(
    df['treatment'] == 1,
    1 / df['ps'],
    1 / (1 - df['ps'])
)

# Stabilised weights
p_treat = df['treatment'].mean()
df['sw'] = np.where(
    df['treatment'] == 1,
    p_treat / df['ps'],
    (1 - p_treat) / (1 - df['ps'])
)

# Check balance: weighted standardised mean differences
def weighted_smd(data, var, treatment_col, weight_col):
    treated = data[data[treatment_col] == 1]
    control = data[data[treatment_col] == 0]
    mean_t = np.average(treated[var], weights=treated[weight_col])
    mean_c = np.average(control[var], weights=control[weight_col])
    sd_pooled = np.sqrt((treated[var].var() + control[var].var()) / 2)
    return (mean_t - mean_c) / sd_pooled

print("Covariate balance (weighted SMD):")
for var in covariates:
    smd = weighted_smd(df, var, 'treatment', 'sw')
    print(f"  {var}: {smd:.4f}")

# --- Weighted outcome estimation ---
from statsmodels.api import WLS, add_constant

X = add_constant(df['treatment'])
wls_model = WLS(df['mortality'], X, weights=df['sw']).fit()
print("\nIPTW Treatment Effect Estimate:")
print(f"  Coefficient: {wls_model.params[1]:.4f}")
print(f"  95% CI: ({wls_model.conf_int()[1][0]:.4f}, {wls_model.conf_int()[1][1]:.4f})")

20.7.1 Using DoWhy for causal analysis

Code
# DoWhy provides a structured causal inference workflow
# Install: pip install dowhy

import dowhy
from dowhy import CausalModel

# Step 1: Model - encode causal assumptions
model = CausalModel(
    data=df,
    treatment='treatment',
    outcome='mortality',
    common_causes=['age', 'diabetes', 'egfr', 'smoking']
)

# Step 2: Identify - find valid adjustment strategy
identified_estimand = model.identify_effect()
print(identified_estimand)

# Step 3: Estimate - using propensity score weighting
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.propensity_score_weighting",
    method_params={"weighting_scheme": "ips_stabilized_weight"}
)
print(f"Causal Estimate: {estimate.value:.4f}")

# Step 4: Refute - sensitivity analysis
refutation = model.refute_estimate(
    identified_estimand, estimate,
    method_name="random_common_cause"
)
print(refutation)

20.8 Sensitivity Analysis for Unmeasured Confounding

No observational study can prove the absence of unmeasured confounding. Sensitivity analysis quantifies how strong unmeasured confounding would need to be to explain away the observed effect.

The E-value (VanderWeele and Ding, 2017) provides a simple summary: it is the minimum strength of association (on the risk ratio scale) that an unmeasured confounder would need to have with both the treatment and the outcome to fully explain away the observed treatment-outcome association.

\[ \text{E-value} = RR + \sqrt{RR \times (RR - 1)} \]

where \(RR\) is the observed risk ratio. The E-value for the confidence interval bound closest to the null is also reported.

Code
# E-value calculation
library(EValue)

# Suppose our IPTW analysis found RR = 0.65 (95% CI: 0.50, 0.85)
# Convert protective RR to above-null: 1/0.65 = 1.538
evalues.RR(est = 1/0.65, lo = 1/0.85, hi = 1/0.50)

20.9 Practical Guidance: Choosing Your Method

Scenario Recommended approach
Binary treatment, many controls available Propensity score matching
Binary treatment, need full sample IPTW
Concern about model misspecification Doubly robust estimation
Time-varying treatment Marginal structural models (IPTW over time)
Natural threshold for treatment Regression discontinuity
Emulating a specific trial design Target trial emulation framework

20.10 Exercises

20.10.1 Exercise 1: DAG construction (Conceptual)

You are studying the relationship between ACE inhibitor use and acute kidney injury (AKI) in hospitalised patients.

  1. List at least five variables that might be relevant to this relationship.
  2. Draw a DAG (on paper or using DAGitty) encoding your causal assumptions.
  3. Identify the minimal sufficient adjustment set for estimating the effect of ACE inhibitors on AKI.
  4. Identify at least one potential collider. What would happen if you adjusted for it?

20.10.2 Exercise 2: Propensity score matching in R

Using the simulated dataset below, perform propensity score matching and estimate the treatment effect.

Code
set.seed(123)
n <- 1500
exercise_dat <- tibble(
  age = rnorm(n, 70, 8),
  creatinine = rnorm(n, 1.2, 0.4),
  heart_failure = rbinom(n, 1, 0.35),
  prior_mi = rbinom(n, 1, 0.20),
  # Confounded treatment (beta-blocker use)
  treatment = rbinom(n, 1, plogis(-1 + 0.02*age + 0.3*heart_failure +
                                     0.5*prior_mi - 0.8*creatinine)),
  # Outcome: 1-year mortality
  death_1yr = rbinom(n, 1, plogis(-2 + 0.05*age + 0.4*heart_failure +
                                     0.6*prior_mi + 1.0*creatinine -
                                     0.7*treatment))
)
  1. Estimate propensity scores using logistic regression.
  2. Perform 1:1 nearest-neighbour matching with a caliper of 0.2 SD.
  3. Create a Love plot to assess balance before and after matching.
  4. Estimate the ATT for beta-blocker use on 1-year mortality.
  5. Calculate the E-value for your estimate. Interpret it.

20.10.3 Exercise 3: IPTW in Python

Using the same data-generating process as Exercise 2 (replicate it in Python):

Code
import numpy as np
import pandas as pd

np.random.seed(123)
n = 1500
age = np.random.normal(70, 8, n)
creatinine = np.random.normal(1.2, 0.4, n)
heart_failure = np.random.binomial(1, 0.35, n)
prior_mi = np.random.binomial(1, 0.20, n)

ps_true = 1 / (1 + np.exp(-(-1 + 0.02*age + 0.3*heart_failure +
                               0.5*prior_mi - 0.8*creatinine)))
treatment = np.random.binomial(1, ps_true)

mort_prob = 1 / (1 + np.exp(-(-2 + 0.05*age + 0.4*heart_failure +
                                 0.6*prior_mi + 1.0*creatinine -
                                 0.7*treatment)))
death_1yr = np.random.binomial(1, mort_prob)

df = pd.DataFrame({
    'age': age, 'creatinine': creatinine, 'heart_failure': heart_failure,
    'prior_mi': prior_mi, 'treatment': treatment, 'death_1yr': death_1yr
})
  1. Fit a propensity score model and compute stabilised IPTW weights.
  2. Assess covariate balance using weighted standardised mean differences.
  3. Estimate the ATE using weighted regression.
  4. Perform a sensitivity analysis: add a simulated unmeasured confounder and show how the estimate changes.

20.10.4 Exercise 4: Target trial emulation (Conceptual)

You have access to a large electronic health records database and want to estimate the effect of early metformin initiation (within 3 months of type 2 diabetes diagnosis) vs delayed initiation (3–12 months after diagnosis) on 5-year cardiovascular events.

  1. Write a complete target trial protocol table (eligibility, treatment strategies, assignment, start of follow-up, outcome, causal contrast, analysis plan).
  2. Identify potential sources of immortal time bias in a naive analysis.
  3. Describe how the new-user, active comparator design addresses confounding by indication.
  4. What unmeasured confounders might remain? How would you conduct a sensitivity analysis?

20.11 Summary

Concept Key point
Confounding by indication Sicker patients get more treatment — naive comparisons are biased
DAGs Encode causal assumptions visually; identify what to adjust for
Propensity score \(P(\text{treatment} \mid \text{covariates})\); reduces dimensionality
Matching Creates balanced pseudo-randomised groups; check balance
IPTW Reweights full sample; preserves sample size
Doubly robust Combines PS + outcome model; consistent if either is correct
Target trial emulation Design observational analysis to mimic an ideal RCT
E-value Quantifies robustness to unmeasured confounding

20.12 References and Further Reading

  1. Hernan MA, Robins JM. Causal Inference: What If. Chapman & Hall/CRC, 2024. Freely available at https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/. The definitive modern textbook on causal inference.

  2. Rosenbaum PR. Observational Studies. 2nd ed. Springer, 2002. The classic text on propensity score methods and sensitivity analysis.

  3. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55.

  4. TARGET Guideline. TrAnsparent ReportinG of observational studies Emulating a Target trial. BMJ. 2025. Reporting checklist for target trial emulation studies.

  5. VanderWeele TJ, Ding P. Sensitivity analysis in observational research: introducing the E-value. Annals of Internal Medicine. 2017;167(4):268–274.

  6. Textor J, van der Zander B, Gilthorpe MS, et al. Robust causal inference using directed acyclic graphs: the R package ‘dagitty’. International Journal of Epidemiology. 2016;45(6):1887–1894.

  7. Ho DE, Imai K, King G, Stuart EA. MatchIt: nonparametric preprocessing for parametric causal inference. Journal of Statistical Software. 2011;42(8):1–28.

  8. Greifer N. WeightIt: weighting for covariate balance in observational studies. R package. https://ngreifer.github.io/WeightIt/.

  9. Sharma A, Kiciman E. DoWhy: An end-to-end library for causal inference. arXiv:2011.04216, 2020.

  10. Lodi S, Phillips A, Lundgren J, et al. Effect estimates in randomized trials and observational studies: comparing apples with apples. American Journal of Epidemiology. 2019;188(8):1569–1577.