20 Causal Inference for Observational Health Data
20.1 Introduction
Most clinical research relies on observational data — electronic health records, claims databases, registries, and cohort studies. Yet the questions we care about most are causal: Does this drug reduce mortality? Would this surgery improve outcomes compared with conservative management? Answering causal questions with observational data requires careful methodology, because the fundamental problem is that patients who receive a treatment are systematically different from those who do not.
This chapter equips you with the conceptual frameworks and practical tools to move from “Drug X is associated with lower mortality” to a defensible causal claim — or, equally important, to recognise when the data cannot support one.
Confounding by indication — where sicker patients receive more aggressive treatment — has led to published studies concluding that treatments cause harm when they actually help, and vice versa. The Women’s Health Initiative overturned decades of observational evidence on hormone replacement therapy. Causal inference methods do not guarantee correct answers, but they make your assumptions explicit and testable.
20.2 Correlation vs Causation in Medicine
20.2.1 Why simple regression is not enough
Consider a hospital database showing that patients who receive mechanical ventilation have higher mortality than those who do not. A naive analyst might conclude ventilation is harmful. But the confounding is obvious: ventilation is given to the sickest patients.
This is confounding by indication — the indication for treatment is itself a risk factor for the outcome. Standard regression can adjust for measured confounders, but:
- You must know which variables to include (and which to exclude).
- You must model the relationships correctly (linearity, interactions).
- You cannot adjust for unmeasured confounders.
Causal inference methods address problems 1 and 2 more systematically, and provide tools to assess sensitivity to problem 3.
20.2.2 The potential outcomes framework
The modern causal inference framework, formalised by Rubin (1974) and extended by Hernan and Robins, rests on potential outcomes:
- \(Y^{a=1}\): the outcome a patient would have under treatment
- \(Y^{a=0}\): the outcome a patient would have under control
The individual causal effect is \(Y^{a=1} - Y^{a=0}\), but we can never observe both for the same person. This is the fundamental problem of causal inference.
We can, however, estimate the average treatment effect (ATE):
\[ \text{ATE} = E[Y^{a=1}] - E[Y^{a=0}] \]
or the average treatment effect on the treated (ATT):
\[ \text{ATT} = E[Y^{a=1} - Y^{a=0} \mid A=1] \]
The key assumption that allows us to estimate these from observational data is exchangeability (no unmeasured confounding): conditional on measured covariates \(L\), treatment assignment is independent of potential outcomes.
\[ Y^a \perp\!\!\!\perp A \mid L \]
20.3 Directed Acyclic Graphs (DAGs)
20.3.1 What is a DAG?
A directed acyclic graph is a visual representation of your causal assumptions. Each node represents a variable. Each arrow represents a direct causal effect. “Acyclic” means no variable can cause itself through a chain of arrows.
20.3.2 Key structures in DAGs
There are three fundamental structures you must recognise:
1. Confounders (common causes)
A variable that causes both treatment and outcome. You MUST adjust for confounders to remove bias.
L --> A --> Y
L ---------> Y
If L is diabetes severity, A is insulin therapy, and Y is HbA1c, then diabetes severity confounds the insulin-HbA1c relationship.
2. Mediators (intermediate variables)
A variable on the causal pathway from treatment to outcome. You should NOT adjust for mediators if you want the total effect.
A --> M --> Y
If A is exercise, M is weight loss, and Y is blood pressure, then weight loss mediates the exercise-BP relationship. Adjusting for weight loss would block part of the effect you want to estimate.
3. Colliders (common effects)
A variable caused by both treatment and outcome (or by variables on each path). You should NOT adjust for colliders — doing so creates a spurious association (collider bias).
A --> C <-- Y
If A is obesity, Y is genetic predisposition to diabetes, and C is being in a diabetes clinic, then conditioning on clinic attendance (a collider) creates a spurious negative association between obesity and genetic risk.
A path between treatment and outcome is blocked if:
- It passes through a variable you condition on (confounder or mediator), OR
- It passes through a collider you do NOT condition on.
A path is open if it is not blocked. Bias exists when there are open non-causal paths (backdoor paths) between treatment and outcome.
20.3.3 Drawing your own DAG
Before any causal analysis, draw a DAG. Steps:
- List all variables relevant to treatment assignment and the outcome.
- Use clinical knowledge to draw arrows (not data — DAGs encode assumptions).
- Identify all backdoor paths from treatment to outcome.
- Determine the minimal adjustment set that blocks all backdoor paths without opening new ones.
Tools like DAGitty automate step 4 once you have drawn the DAG.
20.4 Propensity Score Methods
20.4.1 The propensity score
The propensity score is the probability of receiving treatment given observed covariates:
\[ e(L) = P(A=1 \mid L) \]
Rosenbaum and Rubin (1983) showed that, if exchangeability holds conditional on \(L\), it also holds conditional on the scalar \(e(L)\). This reduces a high-dimensional adjustment problem to a single dimension.
20.4.2 Estimating the propensity score
We typically estimate \(e(L)\) using logistic regression (or more flexible methods like gradient boosting):
Code
# Simulate clinical data
set.seed(42)
n <- 2000
dat <- tibble(
age = rnorm(n, 65, 10),
diabetes = rbinom(n, 1, 0.3),
egfr = rnorm(n, 60, 15),
smoking = rbinom(n, 1, 0.25),
# Treatment assignment depends on covariates (confounding)
ps_true = plogis(-2 + 0.03 * age + 0.5 * diabetes - 0.02 * egfr + 0.4 * smoking),
treatment = rbinom(n, 1, ps_true),
# Outcome depends on treatment AND covariates
mortality = rbinom(n, 1, plogis(-3 + 0.04 * age + 0.6 * diabetes -
0.03 * egfr + 0.3 * smoking -
0.5 * treatment))
)
# Estimate propensity score via logistic regression
ps_model <- glm(treatment ~ age + diabetes + egfr + smoking,
data = dat, family = binomial)
dat$ps <- predict(ps_model, type = "response")
# Visualise overlap
ggplot(dat, aes(x = ps, fill = factor(treatment))) +
geom_density(alpha = 0.5) +
labs(x = "Propensity Score", fill = "Treatment",
title = "Propensity Score Distribution by Treatment Group") +
theme_minimal()Every patient must have a non-zero probability of receiving either treatment. If some patients have propensity scores near 0 or 1, the overlap assumption is violated and estimates become unstable. Always check the propensity score distributions visually.
20.4.3 Propensity score matching
Matching creates a pseudo-randomised subset by pairing treated and control patients with similar propensity scores.
Code
# Using MatchIt for nearest-neighbour matching with caliper
m_out <- matchit(treatment ~ age + diabetes + egfr + smoking,
data = dat,
method = "nearest",
distance = "glm", # logistic regression PS
caliper = 0.2, # 0.2 SD of logit PS
ratio = 1) # 1:1 matching
summary(m_out)
# Check covariate balance using cobalt
love.plot(m_out,
thresholds = c(m = 0.1), # SMD threshold of 0.1
binary = "std",
title = "Covariate Balance: Before and After Matching")
# Extract matched data and estimate treatment effect
m_data <- match.data(m_out)
# Outcome model in matched sample
outcome_model <- glm(mortality ~ treatment,
data = m_data,
family = binomial,
weights = weights)
tidy(outcome_model, conf.int = TRUE, exponentiate = TRUE)Checking balance is the most important step after matching. A standardised mean difference (SMD) below 0.1 for all covariates is the conventional threshold for acceptable balance.
20.4.4 Inverse probability of treatment weighting (IPTW)
Instead of discarding unmatched patients, IPTW reweights the entire sample so that the distribution of covariates is balanced between groups:
\[ w_i = \frac{A_i}{e(L_i)} + \frac{1 - A_i}{1 - e(L_i)} \]
This estimates the ATE. For the ATT, the weights are:
\[ w_i = A_i + \frac{(1 - A_i) \cdot e(L_i)}{1 - e(L_i)} \]
Code
# Using WeightIt
w_out <- weightit(treatment ~ age + diabetes + egfr + smoking,
data = dat,
method = "ps",
estimand = "ATE")
# Check balance
bal.tab(w_out, thresholds = c(m = 0.1))
# Stabilised weights (recommended to reduce variance)
summary(w_out)
# Estimate treatment effect with IPTW
library(survey)
d_weighted <- svydesign(ids = ~1, weights = w_out$weights, data = dat)
svyglm(mortality ~ treatment, design = d_weighted, family = quasibinomial) |>
tidy(conf.int = TRUE, exponentiate = TRUE)Extreme weights can inflate variance. Stabilised weights replace the numerator 1 with \(P(A=1)\) for treated and \(P(A=0)\) for untreated. The WeightIt package does this by default. Always check that weights are not excessively large (e.g., max weight > 20 warrants investigation).
20.4.5 Doubly robust estimation
Doubly robust methods combine a propensity score model and an outcome model. The estimate is consistent if either model is correctly specified (but not necessarily both). This provides an extra layer of protection against model misspecification.
The augmented inverse probability weighted (AIPW) estimator is:
\[ \hat{\tau}_{DR} = \frac{1}{n} \sum_{i=1}^n \left[ \frac{A_i Y_i}{e(L_i)} - \frac{A_i - e(L_i)}{e(L_i)} \hat{\mu}_1(L_i) \right] - \frac{1}{n} \sum_{i=1}^n \left[ \frac{(1-A_i) Y_i}{1-e(L_i)} + \frac{A_i - e(L_i)}{1-e(L_i)} \hat{\mu}_0(L_i) \right] \]
where \(\hat{\mu}_a(L)\) is the predicted outcome under treatment \(a\).
In practice, packages like AIPW in R or DoWhy in Python handle this computation.
20.5 Target Trial Emulation
20.5.1 The framework
Target trial emulation, developed by Hernan and Robins, asks: “What randomised trial would we like to conduct?” and then designs the observational analysis to mimic that trial as closely as possible.
The key components to specify are:
| Protocol component | Target trial | Observational emulation |
|---|---|---|
| Eligibility criteria | Defined inclusion/exclusion | Same criteria applied to database |
| Treatment strategies | e.g., start drug A vs drug B | Same, defined at “time zero” |
| Treatment assignment | Random | Adjusted via PS methods or g-estimation |
| Start of follow-up | Randomisation date | Date eligibility + treatment criteria met |
| Outcome | e.g., 5-year mortality | Same |
| Causal contrast | Intention-to-treat or per-protocol | Same (with appropriate methods) |
| Analysis plan | Intention-to-treat analysis | Clone-censor-weight or similar |
20.5.2 Why this reduces bias
Target trial emulation directly addresses several common biases:
Immortal time bias: In a naive observational study comparing drug users to non-users, the time between cohort entry and first prescription is “immortal” — the patient must survive to receive the drug. This artificially inflates survival in the treated group. Target trial emulation avoids this by aligning “time zero” (start of follow-up) with treatment assignment.
Selection bias from prevalent user designs: Studying patients already on a drug (prevalent users) conflates treatment initiation effects with survival conditioning. Target trial emulation uses a new-user (incident user) design.
Vague eligibility and time zero: When eligibility criteria and the start of follow-up are not clearly defined, analyses can be plagued by look-ahead bias. The target trial framework forces explicit specification.
The TARGET (TrAnsparent ReportinG of observational studies Emulating a Target trial) guideline provides a structured checklist for reporting target trial emulation studies. It was published in the BMJ in 2025 and should be followed for any observational study making causal claims.
20.5.3 Example: Statin initiation and cardiovascular events
Suppose we want to estimate the effect of initiating statins within 6 months of a type 2 diabetes diagnosis on 5-year cardiovascular events.
Target trial specification:
- Eligibility: Adults aged 40–75 with new T2DM diagnosis, no prior CVD, no prior statin use
- Treatment strategies: (1) Initiate statin within 6 months, (2) Do not initiate statin within 6 months
- Time zero: Date of T2DM diagnosis
- Outcome: First major adverse cardiovascular event (MACE) within 5 years
- Analysis: Intention-to-treat using IPTW for confounding, clone-censor-weight for treatment strategy adherence
20.6 Regression Discontinuity Design
Regression discontinuity (RD) exploits arbitrary thresholds in clinical decision-making. When treatment is assigned based on whether a continuous variable exceeds a cutoff, patients just above and just below the cutoff are essentially randomly assigned.
Clinical example: Many guidelines recommend antihypertensive therapy when systolic blood pressure exceeds 140 mmHg. Patients with SBP of 141 are very similar to patients with SBP of 139, but the former are much more likely to receive treatment. Comparing outcomes near the cutoff provides a quasi-experimental estimate.
Key requirements for a valid RD design:
- Treatment probability changes sharply at the cutoff (first stage)
- Other covariates do not jump at the cutoff (continuity assumption)
- Patients cannot precisely manipulate their score (no bunching)
RD designs are less common in clinical research than propensity score methods, but when applicable, they provide strong evidence because the assumptions are more testable.
20.7 Implementation in Python
Python’s causal inference ecosystem is maturing. Here we demonstrate propensity score matching and IPTW using dowhy and causalinference.
Code
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from scipy import stats
import matplotlib.pyplot as plt
# Simulate clinical data
np.random.seed(42)
n = 2000
age = np.random.normal(65, 10, n)
diabetes = np.random.binomial(1, 0.3, n)
egfr = np.random.normal(60, 15, n)
smoking = np.random.binomial(1, 0.25, n)
# Confounded treatment assignment
ps_true = 1 / (1 + np.exp(-(-2 + 0.03*age + 0.5*diabetes - 0.02*egfr + 0.4*smoking)))
treatment = np.random.binomial(1, ps_true)
# Outcome
mortality_prob = 1 / (1 + np.exp(-(-3 + 0.04*age + 0.6*diabetes -
0.03*egfr + 0.3*smoking -
0.5*treatment)))
mortality = np.random.binomial(1, mortality_prob)
df = pd.DataFrame({
'age': age, 'diabetes': diabetes, 'egfr': egfr,
'smoking': smoking, 'treatment': treatment, 'mortality': mortality
})
# --- Propensity Score Estimation ---
covariates = ['age', 'diabetes', 'egfr', 'smoking']
ps_model = LogisticRegression(max_iter=1000)
ps_model.fit(df[covariates], df['treatment'])
df['ps'] = ps_model.predict_proba(df[covariates])[:, 1]
# --- IPTW ---
# Compute ATE weights
df['iptw'] = np.where(
df['treatment'] == 1,
1 / df['ps'],
1 / (1 - df['ps'])
)
# Stabilised weights
p_treat = df['treatment'].mean()
df['sw'] = np.where(
df['treatment'] == 1,
p_treat / df['ps'],
(1 - p_treat) / (1 - df['ps'])
)
# Check balance: weighted standardised mean differences
def weighted_smd(data, var, treatment_col, weight_col):
treated = data[data[treatment_col] == 1]
control = data[data[treatment_col] == 0]
mean_t = np.average(treated[var], weights=treated[weight_col])
mean_c = np.average(control[var], weights=control[weight_col])
sd_pooled = np.sqrt((treated[var].var() + control[var].var()) / 2)
return (mean_t - mean_c) / sd_pooled
print("Covariate balance (weighted SMD):")
for var in covariates:
smd = weighted_smd(df, var, 'treatment', 'sw')
print(f" {var}: {smd:.4f}")
# --- Weighted outcome estimation ---
from statsmodels.api import WLS, add_constant
X = add_constant(df['treatment'])
wls_model = WLS(df['mortality'], X, weights=df['sw']).fit()
print("\nIPTW Treatment Effect Estimate:")
print(f" Coefficient: {wls_model.params[1]:.4f}")
print(f" 95% CI: ({wls_model.conf_int()[1][0]:.4f}, {wls_model.conf_int()[1][1]:.4f})")20.7.1 Using DoWhy for causal analysis
Code
# DoWhy provides a structured causal inference workflow
# Install: pip install dowhy
import dowhy
from dowhy import CausalModel
# Step 1: Model - encode causal assumptions
model = CausalModel(
data=df,
treatment='treatment',
outcome='mortality',
common_causes=['age', 'diabetes', 'egfr', 'smoking']
)
# Step 2: Identify - find valid adjustment strategy
identified_estimand = model.identify_effect()
print(identified_estimand)
# Step 3: Estimate - using propensity score weighting
estimate = model.estimate_effect(
identified_estimand,
method_name="backdoor.propensity_score_weighting",
method_params={"weighting_scheme": "ips_stabilized_weight"}
)
print(f"Causal Estimate: {estimate.value:.4f}")
# Step 4: Refute - sensitivity analysis
refutation = model.refute_estimate(
identified_estimand, estimate,
method_name="random_common_cause"
)
print(refutation)20.8 Sensitivity Analysis for Unmeasured Confounding
No observational study can prove the absence of unmeasured confounding. Sensitivity analysis quantifies how strong unmeasured confounding would need to be to explain away the observed effect.
The E-value (VanderWeele and Ding, 2017) provides a simple summary: it is the minimum strength of association (on the risk ratio scale) that an unmeasured confounder would need to have with both the treatment and the outcome to fully explain away the observed treatment-outcome association.
\[ \text{E-value} = RR + \sqrt{RR \times (RR - 1)} \]
where \(RR\) is the observed risk ratio. The E-value for the confidence interval bound closest to the null is also reported.
20.9 Practical Guidance: Choosing Your Method
| Scenario | Recommended approach |
|---|---|
| Binary treatment, many controls available | Propensity score matching |
| Binary treatment, need full sample | IPTW |
| Concern about model misspecification | Doubly robust estimation |
| Time-varying treatment | Marginal structural models (IPTW over time) |
| Natural threshold for treatment | Regression discontinuity |
| Emulating a specific trial design | Target trial emulation framework |
20.10 Exercises
20.10.1 Exercise 1: DAG construction (Conceptual)
You are studying the relationship between ACE inhibitor use and acute kidney injury (AKI) in hospitalised patients.
- List at least five variables that might be relevant to this relationship.
- Draw a DAG (on paper or using DAGitty) encoding your causal assumptions.
- Identify the minimal sufficient adjustment set for estimating the effect of ACE inhibitors on AKI.
- Identify at least one potential collider. What would happen if you adjusted for it?
20.10.2 Exercise 2: Propensity score matching in R
Using the simulated dataset below, perform propensity score matching and estimate the treatment effect.
Code
set.seed(123)
n <- 1500
exercise_dat <- tibble(
age = rnorm(n, 70, 8),
creatinine = rnorm(n, 1.2, 0.4),
heart_failure = rbinom(n, 1, 0.35),
prior_mi = rbinom(n, 1, 0.20),
# Confounded treatment (beta-blocker use)
treatment = rbinom(n, 1, plogis(-1 + 0.02*age + 0.3*heart_failure +
0.5*prior_mi - 0.8*creatinine)),
# Outcome: 1-year mortality
death_1yr = rbinom(n, 1, plogis(-2 + 0.05*age + 0.4*heart_failure +
0.6*prior_mi + 1.0*creatinine -
0.7*treatment))
)- Estimate propensity scores using logistic regression.
- Perform 1:1 nearest-neighbour matching with a caliper of 0.2 SD.
- Create a Love plot to assess balance before and after matching.
- Estimate the ATT for beta-blocker use on 1-year mortality.
- Calculate the E-value for your estimate. Interpret it.
20.10.3 Exercise 3: IPTW in Python
Using the same data-generating process as Exercise 2 (replicate it in Python):
Code
import numpy as np
import pandas as pd
np.random.seed(123)
n = 1500
age = np.random.normal(70, 8, n)
creatinine = np.random.normal(1.2, 0.4, n)
heart_failure = np.random.binomial(1, 0.35, n)
prior_mi = np.random.binomial(1, 0.20, n)
ps_true = 1 / (1 + np.exp(-(-1 + 0.02*age + 0.3*heart_failure +
0.5*prior_mi - 0.8*creatinine)))
treatment = np.random.binomial(1, ps_true)
mort_prob = 1 / (1 + np.exp(-(-2 + 0.05*age + 0.4*heart_failure +
0.6*prior_mi + 1.0*creatinine -
0.7*treatment)))
death_1yr = np.random.binomial(1, mort_prob)
df = pd.DataFrame({
'age': age, 'creatinine': creatinine, 'heart_failure': heart_failure,
'prior_mi': prior_mi, 'treatment': treatment, 'death_1yr': death_1yr
})- Fit a propensity score model and compute stabilised IPTW weights.
- Assess covariate balance using weighted standardised mean differences.
- Estimate the ATE using weighted regression.
- Perform a sensitivity analysis: add a simulated unmeasured confounder and show how the estimate changes.
20.10.4 Exercise 4: Target trial emulation (Conceptual)
You have access to a large electronic health records database and want to estimate the effect of early metformin initiation (within 3 months of type 2 diabetes diagnosis) vs delayed initiation (3–12 months after diagnosis) on 5-year cardiovascular events.
- Write a complete target trial protocol table (eligibility, treatment strategies, assignment, start of follow-up, outcome, causal contrast, analysis plan).
- Identify potential sources of immortal time bias in a naive analysis.
- Describe how the new-user, active comparator design addresses confounding by indication.
- What unmeasured confounders might remain? How would you conduct a sensitivity analysis?
20.11 Summary
| Concept | Key point |
|---|---|
| Confounding by indication | Sicker patients get more treatment — naive comparisons are biased |
| DAGs | Encode causal assumptions visually; identify what to adjust for |
| Propensity score | \(P(\text{treatment} \mid \text{covariates})\); reduces dimensionality |
| Matching | Creates balanced pseudo-randomised groups; check balance |
| IPTW | Reweights full sample; preserves sample size |
| Doubly robust | Combines PS + outcome model; consistent if either is correct |
| Target trial emulation | Design observational analysis to mimic an ideal RCT |
| E-value | Quantifies robustness to unmeasured confounding |
20.12 References and Further Reading
Hernan MA, Robins JM. Causal Inference: What If. Chapman & Hall/CRC, 2024. Freely available at https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/. The definitive modern textbook on causal inference.
Rosenbaum PR. Observational Studies. 2nd ed. Springer, 2002. The classic text on propensity score methods and sensitivity analysis.
Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55.
TARGET Guideline. TrAnsparent ReportinG of observational studies Emulating a Target trial. BMJ. 2025. Reporting checklist for target trial emulation studies.
VanderWeele TJ, Ding P. Sensitivity analysis in observational research: introducing the E-value. Annals of Internal Medicine. 2017;167(4):268–274.
Textor J, van der Zander B, Gilthorpe MS, et al. Robust causal inference using directed acyclic graphs: the R package ‘dagitty’. International Journal of Epidemiology. 2016;45(6):1887–1894.
Ho DE, Imai K, King G, Stuart EA. MatchIt: nonparametric preprocessing for parametric causal inference. Journal of Statistical Software. 2011;42(8):1–28.
Greifer N. WeightIt: weighting for covariate balance in observational studies. R package. https://ngreifer.github.io/WeightIt/.
Sharma A, Kiciman E. DoWhy: An end-to-end library for causal inference. arXiv:2011.04216, 2020.
Lodi S, Phillips A, Lundgren J, et al. Effect estimates in randomized trials and observational studies: comparing apples with apples. American Journal of Epidemiology. 2019;188(8):1569–1577.