4 Statistical Inference: What We Can and Cannot Conclude

4.1 Introduction

Imagine you run a clinical trial comparing a new antihypertensive drug to placebo. After 12 weeks, the treatment group’s mean systolic blood pressure dropped by 8.3 mmHg more than the placebo group. But how confident should you be in that number? Would you see the same result if you repeated the trial? Could the difference be explained by chance alone? And even if the effect is “real,” is 8.3 mmHg enough to matter for patient health?

These are the questions that statistical inference answers. Inference is the bridge between the data you observed (your sample) and the truth you want to know (the population). This chapter will equip you to cross that bridge carefully, avoiding the many pitfalls that trip up even experienced researchers.

We will use a single running example throughout: a randomized controlled trial (RCT) of Drug X versus placebo for systolic blood pressure reduction. This trial enrolled 200 participants (100 per arm), measured blood pressure at baseline and 12 weeks, and found a mean difference of 8.3 mmHg (SD = 14.2 mmHg) favoring Drug X.

4.2 Point Estimates and Sampling Variability

4.2.1 What Is a Point Estimate?

A point estimate is a single number calculated from your sample that serves as your best guess for an unknown population parameter. Common examples:

What you want to know	Point estimate
Population mean blood pressure reduction	Sample mean ($\bar{x}$)
Population proportion who respond to treatment	Sample proportion ($\hat{p}$)
Population odds ratio for treatment effect	Sample odds ratio ($\widehat{OR}$)

In our trial, the point estimate for the mean difference in blood pressure reduction is 8.3 mmHg. This is a single number – our best guess – but it comes with uncertainty.

4.2.2 Sampling Variability

If we ran the same trial again with 200 new participants from the same population, we would almost certainly get a different point estimate. Maybe 7.1 mmHg, or 9.8 mmHg, or 6.5 mmHg. This natural fluctuation from sample to sample is called sampling variability.

The key insight: every point estimate is “wrong” in the sense that it almost never equals the exact population parameter. The question is how wrong it might be.
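To make sampling variability concrete, the following minimal Python sketch simulates many repetitions of a trial like ours. It assumes normally distributed blood pressure reductions, a true difference of 8.3 mmHg, an SD of 14.2 mmHg in each arm, and a mean placebo reduction of zero; the function name `simulate_trial` and the choice of 2,000 repetitions are illustrative assumptions, not part of the trial itself.

```python
import random
import statistics

def simulate_trial(true_diff, sd, n_per_arm, rng):
    """Return the observed mean difference (drug minus placebo) from one simulated trial."""
    placebo = [rng.gauss(0.0, sd) for _ in range(n_per_arm)]
    drug = [rng.gauss(true_diff, sd) for _ in range(n_per_arm)]
    return statistics.mean(drug) - statistics.mean(placebo)

rng = random.Random(42)  # fixed seed so the run is reproducible
estimates = [simulate_trial(8.3, 14.2, 100, rng) for _ in range(2000)]

print("First five estimates:", [round(e, 1) for e in estimates[:5]])
print(f"Mean of estimates: {statistics.mean(estimates):.2f} mmHg")
print(f"SD of estimates:   {statistics.stdev(estimates):.2f} mmHg")
```

No single simulated trial reproduces 8.3 exactly, yet the estimates cluster around it. The SD of the estimates summarizes how far one trial's result typically falls from the true value.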

4.2.3 The Standard Error

The standard error (SE) quantifies how much a point estimate is expected to vary across repeated samples. For a sample mean:

\[SE(\bar{x}) = \frac{s}{\sqrt{n}}\]

where $s$ is the sample standard deviation and $n$ is the sample size.

For the difference between two independent means:

\[SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\]

In our trial, assuming equal SDs of 14.2 mmHg in each group:

\[SE = \sqrt{\frac{14.2^2}{100} + \frac{14.2^2}{100}} = \sqrt{2.016 + 2.016} = \sqrt{4.033} \approx 2.01 \text{ mmHg}\]

This tells us: if we repeated this trial many times, the observed mean differences would scatter around the true difference with a standard deviation of about 2 mmHg.
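The arithmetic above can be checked with a few lines of Python. This is a small sketch; the helper name `se_diff_means` is an illustrative choice, and the second call (400 per arm) is a hypothetical scenario added to show how the SE shrinks with sample size.

```python
import math

def se_diff_means(sd1, n1, sd2, n2):
    """Standard error of the difference between two independent sample means."""
    return math.sqrt(sd1**2 / n1 + sd2**2 / n2)

se = se_diff_means(14.2, 100, 14.2, 100)
print(f"SE with 100 per arm: {se:.2f} mmHg")

# Quadrupling the sample size halves the standard error.
se_large = se_diff_means(14.2, 400, 14.2, 400)
print(f"SE with 400 per arm: {se_large:.2f} mmHg")
```

Because $n$ sits under a square root, precision is expensive: halving the SE requires four times as many participants.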

Important

Standard deviation vs. standard error: The standard deviation (SD = 14.2) describes variability among individual patients. The standard error (SE = 2.01) describes variability of the estimated mean difference across repeated trials. They answer different questions. For a single sample mean with $n > 1$, the SE is always smaller than the SD, because dividing by $\sqrt{n}$ reflects the fact that means are less variable than individual observations.

4.3 Confidence Intervals: What They Actually Mean

4.3.1 Constructing a 95% Confidence Interval

A 95% confidence interval (CI) for our mean difference is:

\[\bar{x} \pm t^* \times SE = 8.3 \pm 1.97 \times 2.01 = (4.34, \ 12.26) \text{ mmHg}\]

where $t^*$ is the critical value from the $t$-distribution with 198 degrees of freedom (approximately 1.97; it approaches 1.96 as the sample size grows).
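As a quick check of the interval, the sketch below recomputes it from the summary statistics. It takes the critical value of 1.97 as given rather than deriving it from the $t$-distribution; the variable names are illustrative.

```python
import math

mean_diff = 8.3   # observed mean difference (mmHg)
sd = 14.2         # SD assumed equal in both arms (mmHg)
n_per_arm = 100
t_star = 1.97     # critical value used in the text, taken as given here

se = math.sqrt(sd**2 / n_per_arm + sd**2 / n_per_arm)
margin = t_star * se
lower, upper = mean_diff - margin, mean_diff + margin

print(f"SE = {se:.2f} mmHg, margin of error = {margin:.2f} mmHg")
print(f"95% CI: ({lower:.2f}, {upper:.2f}) mmHg")
```

The margin of error is simply the critical value times the SE, so anything that shrinks the SE narrows the interval.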

4.3.2 The Correct Interpretation

Here is one of the most commonly misunderstood concepts in statistics.

Common Misconception

Wrong: “There is a 95% probability that the true mean difference lies between 4.34 and 12.26 mmHg.”

Right: “If we repeated this trial many times and computed a 95% CI each time, approximately 95% of those intervals would contain the true mean difference.”

Why does this distinction matter? The true population parameter is a fixed (unknown) number – it does not have a probability of being “in” or “out” of an interval. The interval is the random quantity, not the parameter. Each time you run a new study, you get a new interval. Some of those intervals capture the truth; some do not. A 95% CI means that the procedure works 95% of the time.

In practice, a useful way to think about it: the confidence interval gives you a range of plausible values for the population parameter, given the data you observed. Values inside the interval are reasonably compatible with your data; values outside are less compatible, and increasingly so the farther they lie from the interval.

4.3.3 What the Width Tells You

The width of a confidence interval reflects precision. A narrow CI means you have a precise estimate; a wide CI means substantial uncertainty. Width depends on:

Sample size – larger $n$ gives narrower CIs
Variability – less variable data gives narrower CIs
Confidence level – 99% CIs are wider than 95% CIs (more confidence requires more hedging)

In our trial, the CI of (4.34, 12.26) is moderately wide – the true effect could be as small as about 4 mmHg or as large as 12 mmHg. This matters clinically: a 4 mmHg reduction and a 12 mmHg reduction may lead to very different treatment decisions.

4.4 P-values: What They Are, What They Are Not

4.4.1 Definition

A p-value is the probability of observing a test statistic as extreme as (or more extreme than) the one you calculated, assuming the null hypothesis is true.

For our trial, the null hypothesis is $H_0: \mu_{\text{Drug X}} - \mu_{\text{Placebo}} = 0$ (no difference). The test statistic is:

\[t = \frac{8.3 - 0}{2.01} = 4.13\]

With 198 degrees of freedom, $p < 0.001$.
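The test statistic and an approximate p-value can be reproduced with the Python standard library. Note the assumption: this sketch uses the standard normal distribution in place of the $t$-distribution with 198 degrees of freedom, so the p-value it prints is slightly smaller than the exact one. With this many participants the difference is small and the conclusion ($p < 0.001$) is unchanged.

```python
import math
from statistics import NormalDist

mean_diff = 8.3
se = math.sqrt(14.2**2 / 100 + 14.2**2 / 100)

# Test statistic: how many standard errors the estimate lies from the null value of 0.
t_stat = (mean_diff - 0) / se

# Two-sided tail area under a standard normal curve (an approximation to the
# t-distribution with 198 df, whose tails are slightly heavier).
p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(f"t = {t_stat:.2f}")
print(f"approximate two-sided p = {p_approx:.1e}")
```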

4.4.2 What a P-value Is

The p-value measures the compatibility between your data and the null hypothesis
A small p-value means your data would be unlikely if the null hypothesis were true
It quantifies evidence against $H_0$, not evidence for any particular alternative

4.4.3 What a P-value Is NOT

The American Statistical Association published a landmark statement in 2016 (Wasserstein & Lazar, 2016) because p-values are so widely misinterpreted. Here are the key points:

P-value Misconceptions – From the ASA Statement

A p-value is NOT the probability that the null hypothesis is true. A p-value of 0.03 does not mean there is a 3% chance the drug doesn’t work.
A p-value is NOT the probability that the result was produced by chance. It is calculated assuming chance is the only explanation.
Statistical significance does NOT mean clinical or practical importance. A tiny, meaningless effect can have $p < 0.001$ with a large enough sample.
A large p-value does NOT mean the null hypothesis is true. It means you failed to find strong evidence against it – absence of evidence is not evidence of absence.
The p-value does NOT measure the size of an effect. It mixes effect size with sample size. Always report the effect size and confidence interval alongside the p-value.

4.4.4 The “Statistical Significance” Problem in Medicine

The threshold of $p < 0.05$ is deeply entrenched in medical research. But there is nothing magical about 0.05. It is an arbitrary convention, originally proposed as a convenient rule of thumb by Ronald Fisher, not as a rigid decision boundary.

Problems with dichotomizing at $p = 0.05$:

A result with $p = 0.049$ is treated as fundamentally different from $p = 0.051$, even though they are essentially identical in evidential strength.
Journals preferentially publish “significant” results, creating publication bias.
Researchers may unconsciously (or consciously) engage in p-hacking – trying multiple analyses until one crosses the 0.05 threshold.
In large clinical databases, nearly everything is “statistically significant” because of enormous sample sizes, regardless of clinical relevance.

A better approach: Report the point estimate, confidence interval, and p-value. Interpret them together. Judge results by their clinical importance, not just their p-value. As the ASA statement concludes: “No single index should substitute for scientific reasoning.”

4.5 Effect Sizes That Matter Clinically

4.5.1 Why Effect Sizes Matter

Statistical significance tells you that your data would be surprising if the true effect were exactly zero. It says nothing about whether the effect is large enough to matter. In medicine, this distinction is critical: a drug that lowers blood pressure by 0.5 mmHg might achieve $p < 0.001$ in a trial of 50,000 people, but no clinician would prescribe it based on that effect.
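The sketch below illustrates how the same 0.5 mmHg effect moves from "non-significant" to "highly significant" purely because the trial gets larger. It assumes an SD of 14.2 mmHg (borrowed from the running example, since no SD is given for this hypothetical drug) and uses a normal approximation; the function name `two_sided_p` is illustrative.

```python
import math
from statistics import NormalDist

def two_sided_p(diff, sd, n_per_arm):
    """Approximate two-sided p-value for a difference in means (normal approximation)."""
    se = math.sqrt(2 * sd**2 / n_per_arm)
    z = diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same 0.5 mmHg effect, increasingly large trials (total n = 2 * n_per_arm).
for n_per_arm in (100, 1_000, 25_000):
    p = two_sided_p(diff=0.5, sd=14.2, n_per_arm=n_per_arm)
    print(f"n per arm = {n_per_arm:>6}: p = {p:.4f}")
```

The effect size never changes across the three rows; only the sample size does. This is why a p-value alone cannot tell you whether an effect matters.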

4.5.2 Cohen’s d (Standardized Mean Difference)

Cohen’s $d$ expresses the difference between two means in standard deviation units:

\[d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}}\]

For our trial: $d = 8.3 / 14.2 = 0.58$, which is conventionally considered a “medium” effect.

Cohen’s $d$	Interpretation
0.2	Small
0.5	Medium
0.8	Large

Note

These benchmarks (from Cohen, 1988) are rough guidelines, not rigid rules. A “small” effect by Cohen’s standards might be highly clinically meaningful (e.g., a small reduction in mortality), while a “large” effect might be trivial in another context. Always interpret effect sizes in the clinical context.
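For reference, Cohen's $d$ can be computed from group summaries as sketched below, using the usual pooled standard deviation $s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$. The individual arm means (10.3 and 2.0 mmHg) are hypothetical values chosen only so that their difference matches the 8.3 mmHg in our trial.

```python
import math

def cohens_d(mean1, mean2, sd1, sd2, n1, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Hypothetical arm means whose difference is 8.3 mmHg; SDs and sizes from the running example.
d = cohens_d(mean1=10.3, mean2=2.0, sd1=14.2, sd2=14.2, n1=100, n2=100)
print(f"Cohen's d = {d:.2f}")
```

When both arms share the same SD, the pooled SD equals that common value, which is why dividing 8.3 by 14.2 directly gives the same answer.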

4.5.3 Odds Ratios and Risk Ratios

For binary outcomes (e.g., death, hospitalization, disease incidence), effect sizes are typically reported as:

Risk Ratio (Relative Risk): \[RR = \frac{P(\text{event} \mid \text{treatment})}{P(\text{event} \mid \text{control})}\]

An RR of 0.75 means the treatment group had 75% the risk of the control group, i.e., a 25% relative risk reduction.

Odds Ratio: \[OR = \frac{\text{odds of event in treatment}}{\text{odds of event in control}} = \frac{p_1 / (1 - p_1)}{p_2 / (1 - p_2)}\]

Odds ratios are the natural output of logistic regression. When the outcome is rare (< 10%), the OR approximates the RR. When the outcome is common, the OR exaggerates the effect compared to the RR.

Risk Difference (Absolute Risk Reduction): \[RD = P(\text{event} \mid \text{control}) - P(\text{event} \mid \text{treatment})\]

If 15% of control patients are hospitalized vs. 10% of treatment patients, the RD = 0.05 (or 5 percentage points).
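To see how the odds ratio drifts away from the risk ratio as outcomes become common, consider this small sketch. The two scenarios are made-up numbers chosen so that both have the same risk ratio of 0.5; the function names are illustrative.

```python
def risk_ratio(p_treat, p_ctrl):
    """Risk in the treatment group divided by risk in the control group."""
    return p_treat / p_ctrl

def odds_ratio(p_treat, p_ctrl):
    """Odds in the treatment group divided by odds in the control group."""
    odds_treat = p_treat / (1 - p_treat)
    odds_ctrl = p_ctrl / (1 - p_ctrl)
    return odds_treat / odds_ctrl

# Two hypothetical scenarios with the same risk ratio but different baseline risks,
# given as (treatment risk, control risk).
scenarios = {"rare outcome": (0.01, 0.02), "common outcome": (0.30, 0.60)}

for label, (p_treat, p_ctrl) in scenarios.items():
    rr = risk_ratio(p_treat, p_ctrl)
    or_ = odds_ratio(p_treat, p_ctrl)
    rd = p_ctrl - p_treat
    print(f"{label}: RR = {rr:.2f}, OR = {or_:.2f}, RD = {rd:.2f}")
```

For the rare outcome the OR is nearly identical to the RR. For the common outcome the OR is much further from 1, so reading it as if it were a risk ratio would overstate the benefit. The risk difference also differs sharply between the scenarios even though the RR is the same.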

4.5.4 Number Needed to Treat (NNT)

The NNT is perhaps the most clinician-friendly effect size:

\[NNT = \frac{1}{RD} = \frac{1}{|p_1 - p_2|}\]

Using our example: $NNT = 1 / 0.05 = 20$. This means you need to treat 20 patients to prevent one additional hospitalization. A low NNT is desirable. For reference:

Statins for secondary prevention of MI: NNT around 30-80
Aspirin for acute MI: NNT around 40
Antibiotics for strep pharyngitis to prevent rheumatic fever: NNT around 4,000
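A small helper for computing the NNT is sketched below. It rounds up to a whole patient, which is the usual convention; the function name `nnt` and the guard against floating-point noise are implementation choices made for this example.

```python
import math

def nnt(p_control, p_treatment):
    """Number needed to treat, rounded up to a whole patient by convention."""
    rd = abs(p_control - p_treatment)
    if rd == 0:
        raise ValueError("NNT is undefined when the risk difference is zero")
    # round() removes floating-point noise (e.g. 20.000000000000004) before ceil()
    return math.ceil(round(1 / rd, 9))

print("Hospitalization example (15% vs. 10%):", nnt(0.15, 0.10))
print("Mortality of 12% vs. 9%:", nnt(0.12, 0.09))
```

Because the NNT is the reciprocal of the risk difference, it grows very quickly as the absolute benefit shrinks, which is how a treatment with a respectable relative risk reduction can still have an NNT in the thousands when the baseline risk is low.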

4.5.5 Statistically Significant but Clinically Meaningless

Consider a large pragmatic trial (n = 20,000) of a new diabetes drug that lowers HbA1c by 0.1% compared to the current standard of care, with $p = 0.002$. Is this clinically meaningful?

Almost certainly not. A 0.1% reduction in HbA1c is below the threshold most endocrinologists consider clinically relevant (typically 0.3-0.5%). The p-value is tiny because the sample size is enormous – even a trivial true effect becomes “significant” with enough data.

This is why you should always report and interpret effect sizes (with confidence intervals), not just p-values.

4.6 Type I and Type II Errors

4.6.1 The Two Kinds of Mistakes

When making a decision based on a hypothesis test, there are two ways to be wrong:

	$H_0$ is true (no effect)	$H_0$ is false (real effect)
Reject $H_0$	Type I error ($\alpha$)	Correct (Power)
Fail to reject $H_0$	Correct	Type II error ($\beta$)

Type I error (false positive): You conclude the drug works, but it actually doesn’t. The probability of this error, given that $H_0$ is true, is $\alpha$ (typically set at 0.05).
Type II error (false negative): You fail to detect an effect that actually exists. The probability of this error, given that a real effect of a specified size exists, is $\beta$.

4.6.2 Statistical Power

Power = $1 - \beta$ = the probability of correctly detecting a real effect.

Power depends on:

Effect size – larger effects are easier to detect
Sample size – more data provides more power
Significance level ($\alpha$) – higher $\alpha$ gives more power (but more false positives)
Variability – less noise makes signals easier to detect

In clinical trials, we typically aim for 80% power (i.e., $\beta = 0.20$). This means we accept a 20% chance of missing a true effect – a remarkably high false-negative rate that is often underappreciated.
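The dependence of power on sample size and $\alpha$ can be sketched with a normal approximation. The planning values below (a true difference of 5 mmHg with SD = 15 mmHg) are hypothetical, and the function name `approx_power` is illustrative. Exact calculations based on the $t$-distribution give slightly lower power.

```python
import math
from statistics import NormalDist

def approx_power(delta, sd, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison of means.

    Uses the normal approximation and ignores the negligible chance of
    rejecting in the wrong direction.
    """
    z = NormalDist()
    se = sd * math.sqrt(2 / n_per_arm)
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(abs(delta) / se - z_crit)

# Hypothetical planning values: a 5 mmHg true difference with SD = 15 mmHg.
for n in (50, 100, 150, 200):
    print(f"n per arm = {n:>3}: power = {approx_power(5, 15, n):.2f}")
```

Changing `delta`, `sd`, or `alpha` in the call shows the other three dependencies listed above: larger effects, less noise, and a more lenient significance level all raise power.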

Clinical Implication

Many clinical trials are underpowered – they have too few participants to detect realistic effect sizes. A “negative” trial (one that fails to reach $p < 0.05$) does not necessarily mean the treatment is ineffective. It may simply mean the trial was too small. Always check the confidence interval: if it includes both clinically meaningful and null effects, the trial was inconclusive, not negative.

4.6.3 A Power Calculation Example

Suppose you are planning a trial of Drug X and expect a 5 mmHg difference (SD = 15 mmHg). How many participants per group do you need for 80% power at $\alpha = 0.05$?

\[n = \frac{2(z_{\alpha/2} + z_\beta)^2 \sigma^2}{\Delta^2} = \frac{2(1.96 + 0.84)^2 \times 15^2}{5^2} = \frac{2 \times 7.84 \times 225}{25} \approx 141.1 \text{ per group}\]

Because you cannot enroll a fraction of a participant, sample sizes are always rounded up: you need 142 participants per group, or 284 total.
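The same formula can be wrapped in a short function, rounding up to the next whole participant. This is a sketch using the normal approximation shown above; the function name `n_per_group` is illustrative, and the second and third calls are hypothetical variations on the planning scenario.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided, two-sample comparison of means
    (normal approximation), rounded up to a whole participant."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    n = 2 * (z_alpha + z_beta) ** 2 * sd**2 / delta**2
    return math.ceil(n)

print("80% power, 5 mmHg:  ", n_per_group(delta=5, sd=15))
print("90% power, 5 mmHg:  ", n_per_group(delta=5, sd=15, power=0.90))
print("80% power, 2.5 mmHg:", n_per_group(delta=2.5, sd=15))
```

Halving the expected effect roughly quadruples the required sample size, because $\Delta$ is squared in the denominator. Calculators based on the $t$-distribution usually return a slightly larger number than this approximation.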

4.7 Multiple Testing

4.7.1 The Problem

When you perform multiple statistical tests, the probability of at least one false positive increases rapidly. If you test 20 independent hypotheses at $\alpha = 0.05$, the probability of at least one Type I error is:

\[1 - (1 - 0.05)^{20} = 1 - 0.95^{20} = 1 - 0.358 = 0.642\]

That is a 64.2% chance of at least one false positive. This is a massive problem in:

Genomics/GWAS: Testing millions of genetic variants simultaneously
Clinical trials with multiple endpoints: Primary, secondary, and exploratory outcomes
Subgroup analyses: Testing treatment effects in men vs. women, old vs. young, etc.
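The growth of the family-wise error rate is easy to tabulate. The sketch below assumes, as in the formula above, that the tests are independent and that every null hypothesis is true; the function name is illustrative.

```python
def familywise_error_rate(m, alpha=0.05):
    """Probability of at least one false positive across m independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20, 100):
    print(f"{m:>3} tests: FWER = {familywise_error_rate(m):.3f}")
```

With correlated tests, as is common for related clinical endpoints, the true rate is typically lower than this formula suggests, but the qualitative message is the same.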

4.7.2 Bonferroni Correction

The simplest fix: divide $\alpha$ by the number of tests.

\[\alpha_{\text{adjusted}} = \frac{\alpha}{m}\]

where $m$ is the number of tests. If you run 20 tests, you require $p < 0.05/20 = 0.0025$ for each test.

Pros: Simple, controls the family-wise error rate (FWER) – the probability of any false positives.

Cons: Very conservative. As $m$ grows, it becomes extremely difficult to detect real effects. With 1 million SNPs in a GWAS, you need $p < 5 \times 10^{-8}$.

4.7.3 False Discovery Rate (FDR) Control

The Benjamini-Hochberg procedure (1995) controls the false discovery rate – the expected proportion of rejected hypotheses that are false positives. This is less stringent than Bonferroni and more appropriate when you expect many true positives among your tests.

Steps:

Rank all $m$ p-values from smallest to largest: $p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(m)}$
Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m} \times q$, where $q$ is the desired FDR (e.g., 0.05)
Reject all hypotheses with $p \leq p_{(k)}$

In genomics, FDR control at $q = 0.05$ means: among all the genes we declare “significant,” we expect about 5% or fewer, on average, to be false positives. This is often a more useful guarantee than Bonferroni’s, which limits the probability of even a single false positive to $\alpha$ at the cost of missing many real effects.
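The steps of the Benjamini-Hochberg procedure can be sketched in plain Python, alongside Bonferroni for comparison. This is a teaching sketch, not a replacement for library routines such as R's `p.adjust` or statsmodels' `multipletests`; the ten p-values are made up purely for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is below alpha / m."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure; returns one reject flag per p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank  # remember the largest rank that clears its threshold
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True  # every hypothesis ranked at or below k_max is rejected
    return reject

# Ten made-up p-values, purely for illustration.
p_values = [0.30, 0.001, 0.019, 0.62, 0.012, 0.045, 0.014, 0.81, 0.47, 0.09]

print("Bonferroni rejections:", sum(bonferroni(p_values)))
print("BH rejections:        ", sum(benjamini_hochberg(p_values)))
```

In this example the second-smallest p-value (0.012) exceeds its own threshold of $2/10 \times 0.05 = 0.01$, yet it is still rejected because larger p-values further down the ranking clear theirs. That "step-up" behavior is what makes the procedure less conservative than Bonferroni.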

4.7.4 When Does Multiple Testing Matter?

Scenario	Correction needed?
Pre-specified primary endpoint in a clinical trial	No (single test)
One primary + several secondary endpoints	Yes
Subgroup analyses	Yes (or treat as exploratory)
Genome-wide association study	Yes (millions of tests)
Exploratory data analysis	No formal correction, but be transparent

4.8 Exercises

Exercise 2.1: Confidence Interval Simulation

Simulate 100 trials, each with n=50 per group, true mean difference = 5, SD = 10. Compute a 95% CI for each trial. What proportion of CIs contain the true value of 5?

Code (R)

set.seed(42)
n_sims <- 100
n_per_group <- 50
true_diff <- 5
sd_val <- 10

contains_true <- numeric(n_sims)

for (i in 1:n_sims) {
  group1 <- rnorm(n_per_group, mean = 0, sd = sd_val)
  group2 <- rnorm(n_per_group, mean = true_diff, sd = sd_val)

  test_result <- t.test(group2, group1)
  ci <- test_result$conf.int

  contains_true[i] <- (ci[1] <= true_diff & ci[2] >= true_diff)
}

cat("Proportion of CIs containing true value:", mean(contains_true), "\n")

# Bonus: visualize the CIs
library(ggplot2)
ci_data <- data.frame(
  sim = 1:n_sims,
  lower = numeric(n_sims),
  upper = numeric(n_sims),
  contains = logical(n_sims)
)

set.seed(42)
for (i in 1:n_sims) {
  group1 <- rnorm(n_per_group, mean = 0, sd = sd_val)
  group2 <- rnorm(n_per_group, mean = true_diff, sd = sd_val)
  test_result <- t.test(group2, group1)
  ci_data$lower[i] <- test_result$conf.int[1]
  ci_data$upper[i] <- test_result$conf.int[2]
  ci_data$contains[i] <- (ci_data$lower[i] <= true_diff & ci_data$upper[i] >= true_diff)
}

ggplot(ci_data, aes(x = sim, y = (lower + upper) / 2, color = contains)) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.3) +
  geom_hline(yintercept = true_diff, linetype = "dashed", color = "blue") +
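  # Named values keep the mapping correct even if every interval contains the true value
  scale_color_manual(values = c("FALSE" = "red", "TRUE" = "darkgreen")) +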
  labs(x = "Simulation", y = "Mean Difference",
       title = "95% Confidence Intervals from 100 Simulated Trials") +
  theme_minimal()

Code (Python)

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

np.random.seed(42)
n_sims = 100
n_per_group = 50
true_diff = 5
sd_val = 10

contains_true = []
lower_bounds = []
upper_bounds = []

for i in range(n_sims):
    group1 = np.random.normal(0, sd_val, n_per_group)
    group2 = np.random.normal(true_diff, sd_val, n_per_group)

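    diff = np.mean(group2) - np.mean(group1)
    # With equal group sizes this SE equals the pooled-variance SE, so df = 2n - 2 applies.
    # (R's t.test defaults to Welch's test, so its intervals can differ very slightly.)
    se = np.sqrt(np.var(group1, ddof=1)/n_per_group + np.var(group2, ddof=1)/n_per_group)
    t_crit = stats.t.ppf(0.975, df=n_per_group*2 - 2)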

    lower = diff - t_crit * se
    upper = diff + t_crit * se

    lower_bounds.append(lower)
    upper_bounds.append(upper)
    contains_true.append(lower <= true_diff <= upper)

print(f"Proportion of CIs containing true value: {np.mean(contains_true):.2f}")

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
for i in range(n_sims):
    color = "green" if contains_true[i] else "red"
    ax.plot([i, i], [lower_bounds[i], upper_bounds[i]], color=color, linewidth=0.8)

ax.axhline(y=true_diff, color="blue", linestyle="--", label="True difference")
ax.set_xlabel("Simulation")
ax.set_ylabel("Mean Difference")
ax.set_title("95% Confidence Intervals from 100 Simulated Trials")
ax.legend()
plt.tight_layout()
plt.show()

Solution

You should find that approximately 95 out of 100 CIs contain the true value of 5. The exact number will vary due to randomness, but it should be close to 95. The red intervals in the plot are the ones that “missed” – they do not contain the true value. This is the correct interpretation of a 95% CI: the procedure captures the truth 95% of the time.

Exercise 2.2: P-value Interpretation

A clinical trial reports: “The new drug reduced 30-day mortality from 12% to 9% (p = 0.04).” For each statement below, indicate whether it is TRUE or FALSE and explain why.

There is a 96% probability that the drug truly reduces mortality.
There is a 4% probability the result is due to chance.
If the drug had no effect, we would see a difference this large or larger about 4% of the time.
The drug reduces mortality by 3 percentage points.

Code (R)

# This is a conceptual exercise, but let's verify the numbers:
p_control <- 0.12
p_treatment <- 0.09

risk_difference <- p_control - p_treatment
risk_ratio <- p_treatment / p_control
nnt <- 1 / risk_difference

cat("Risk Difference:", risk_difference, "\n")
cat("Risk Ratio:", round(risk_ratio, 3), "\n")
cat("NNT:", round(nnt, 1), "\n")

Code (Python)

# This is a conceptual exercise, but let's verify the numbers:
p_control = 0.12
p_treatment = 0.09

risk_difference = p_control - p_treatment
risk_ratio = p_treatment / p_control
nnt = 1 / risk_difference

print(f"Risk Difference: {risk_difference}")
print(f"Risk Ratio: {risk_ratio:.3f}")
print(f"NNT: {nnt:.1f}")

Solution

FALSE. The p-value does not give the probability that the drug works. It does not tell you the probability of any hypothesis being true or false.
FALSE. The p-value is not the probability that the result is “due to chance.” It is calculated assuming chance is the only explanation.
TRUE. This is the correct definition of a p-value: if the null hypothesis (no difference) were true, we would observe a result this extreme or more extreme approximately 4% of the time.
TRUE (as a point estimate). The absolute risk reduction is 12% - 9% = 3 percentage points. The NNT = 1/0.03 = 33.3, meaning you need to treat about 34 patients to prevent one death. Whether this is clinically meaningful depends on the drug’s cost, side effects, and the patient population.

Exercise 2.3: Multiple Testing Correction

You are analyzing gene expression data and have tested 1,000 genes for differential expression between a treatment and control group. You find 60 genes with p < 0.05. Apply both Bonferroni correction and BH-FDR correction. How many genes remain significant under each approach?

Code (R)

# Simulate 1000 p-values: 950 null (uniform), 50 truly differentially expressed
set.seed(123)
n_genes <- 1000
n_true <- 50

# Null p-values are uniformly distributed
p_null <- runif(n_genes - n_true, 0, 1)

# True effects: p-values will tend to be small
p_true <- rbeta(n_true, 1, 20)  # skewed toward 0

p_values <- c(p_null, p_true)

# How many nominally significant?
cat("Nominally significant (p < 0.05):", sum(p_values < 0.05), "\n")

# Bonferroni correction
p_bonferroni <- p.adjust(p_values, method = "bonferroni")
cat("Significant after Bonferroni:", sum(p_bonferroni < 0.05), "\n")

# BH-FDR correction
p_fdr <- p.adjust(p_values, method = "BH")
cat("Significant after BH-FDR:", sum(p_fdr < 0.05), "\n")

# Compare
cat("\nBonferroni is much more conservative.\n")
cat("FDR retains more discoveries while controlling the false discovery rate.\n")

Code (Python)

import numpy as np
from statsmodels.stats.multitest import multipletests

np.random.seed(123)
n_genes = 1000
n_true = 50

# Null p-values (uniform) and true effect p-values (small)
p_null = np.random.uniform(0, 1, n_genes - n_true)
p_true = np.random.beta(1, 20, n_true)

p_values = np.concatenate([p_null, p_true])

print(f"Nominally significant (p < 0.05): {np.sum(p_values < 0.05)}")

# Bonferroni correction
reject_bonf, pvals_bonf, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
print(f"Significant after Bonferroni: {np.sum(reject_bonf)}")

# BH-FDR correction
reject_fdr, pvals_fdr, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
print(f"Significant after BH-FDR: {np.sum(reject_fdr)}")

print("\nBonferroni is much more conservative.")
print("FDR retains more discoveries while controlling the false discovery rate.")

Solution

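Exact counts depend on the seed and on the language: R and Python use different random number generators, so seed = 123 does not produce matching draws. From the simulation design, you should expect approximately:

Nominally significant: around 80 genes, typically 65-95 (about 48 false positives expected from the 950 nulls, plus about 32 of the 50 true effects)
Bonferroni: 0 or 1 genes (the threshold is $0.05/1000 = 5 \times 10^{-5}$, and Beta(1, 20) p-values almost never fall that low)
BH-FDR: 0 or very few genes (too few p-values are small enough to clear the step-up thresholds)

The simulated true effects are weak: a Beta(1, 20) p-value averages about 0.048, so neither correction has much to find. To see the two methods separate, rerun the simulation with stronger effects, for example by drawing p_true from Beta(1, 2000). Bonferroni should then keep only a handful of genes while BH-FDR keeps most of the 50 true effects.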

The Bonferroni correction controls the family-wise error rate (probability of any false positives), while BH-FDR controls the expected proportion of false positives among rejections. In genomics, where you expect many true positives, FDR is usually more appropriate because Bonferroni’s stringency causes you to miss too many real discoveries.

Exercise 2.4: Power Analysis

You are planning a trial to test whether a lifestyle intervention reduces HbA1c in type 2 diabetes patients. You expect a 0.4% reduction (SD = 1.2%). Compute the required sample size per group for 80% power at alpha = 0.05. Then recompute for 90% power. How does the sample size change?

Code (R)

# Using the power.t.test function
# 80% power
result_80 <- power.t.test(
  delta = 0.4,       # expected difference
  sd = 1.2,          # standard deviation
  sig.level = 0.05,
  power = 0.80,
  type = "two.sample",
  alternative = "two.sided"
)
cat("Sample size per group (80% power):", ceiling(result_80$n), "\n")

# 90% power
result_90 <- power.t.test(
  delta = 0.4,
  sd = 1.2,
  sig.level = 0.05,
  power = 0.90,
  type = "two.sample",
  alternative = "two.sided"
)
cat("Sample size per group (90% power):", ceiling(result_90$n), "\n")

cat("\nIncreasing power from 80% to 90% requires about",
    round((ceiling(result_90$n) / ceiling(result_80$n) - 1) * 100),
    "% more participants per group.\n")

Code (Python)

import numpy as np  # needed for np.ceil below
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Cohen's d = delta / sd = 0.4 / 1.2
effect_size = 0.4 / 1.2

# 80% power
n_80 = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.80,
                            alternative='two-sided')
print(f"Sample size per group (80% power): {int(np.ceil(n_80))}")

# 90% power
n_90 = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.90,
                            alternative='two-sided')
print(f"Sample size per group (90% power): {int(np.ceil(n_90))}")

print(f"\nIncreasing from 80% to 90% power requires about "
      f"{int(round((n_90/n_80 - 1) * 100))}% more participants per group.")

Solution

With an effect size of 0.4% and SD of 1.2% (Cohen’s d = 0.33):

80% power: approximately 143 per group (286 total)
90% power: approximately 191 per group (382 total)

Going from 80% to 90% power requires roughly one-third more participants (about 34% with these numbers). This illustrates the diminishing returns of increasing power – each additional percentage point of power costs more participants. Most trials settle for 80% power as a reasonable compromise.

4.9 Key Takeaways

Point estimates are always uncertain. Report confidence intervals, not just point estimates.
Confidence intervals describe the precision of your estimate. They tell you the range of plausible values for the population parameter.
P-values measure compatibility with the null hypothesis. They are not the probability that the null is true, nor the probability of a chance finding.
Statistical significance is not clinical significance. Always report and interpret effect sizes.
NNT is the most clinician-friendly effect size for binary outcomes.
Underpowered studies produce inconclusive results, not evidence that a treatment is ineffective.
Multiple testing inflates false positives. Use Bonferroni (conservative) or FDR (less conservative) corrections as appropriate.

4.10 References and Further Reading

Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129-133.
Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond “p < 0.05.” The American Statistician, 73(sup1), 1-19.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.
Vittinghoff, E., Glidden, D. V., Shiboski, S. C., & McCulloch, C. E. (2012). Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models (2nd ed.). Springer.
Steyerberg, E. W. (2019). Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating (2nd ed.). Springer.
Smits, L. J. M., et al. (2026). Recommendations for clinical prediction model development, validation, and updating. BMJ.
Greenland, S., Senn, S. J., Rothman, K. J., et al. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337-350.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289-300.
Altman, D. G., & Bland, J. M. (1995). Absence of evidence is not evidence of absence. BMJ, 311(7003), 485.

# Statistical Inference: What We Can and Cannot Conclude {#sec-inference} ## Introduction Imagine you run a clinical trial comparing a new antihypertensive drug to placebo. After 12 weeks, the treatment group's mean systolic blood pressure dropped by 8.3 mmHg more than the placebo group. But how confident should you be in that number? Would you see the same result if you repeated the trial? Could the difference be explained by chance alone? And even if the effect is "real," is 8.3 mmHg enough to matter for patient health? These are the questions that **statistical inference** answers. Inference is the bridge between the data you *observed* (your sample) and the truth you *want to know* (the population). This chapter will equip you to cross that bridge carefully, avoiding the many pitfalls that trip up even experienced researchers. We will use a single running example throughout: a randomized controlled trial (RCT) of **Drug X** versus placebo for systolic blood pressure reduction. This trial enrolled 200 participants (100 per arm), measured blood pressure at baseline and 12 weeks, and found a mean difference of 8.3 mmHg (SD = 14.2 mmHg) favoring Drug X. ## Point Estimates and Sampling Variability ### What Is a Point Estimate? A **point estimate** is a single number calculated from your sample that serves as your best guess for an unknown population parameter. Common examples: | What you want to know | Point estimate | |---|---| | Population mean blood pressure reduction | Sample mean ($\bar{x}$) | | Population proportion who respond to treatment | Sample proportion ($\hat{p}$) | | Population odds ratio for treatment effect | Sample odds ratio ($\widehat{OR}$) | In our trial, the point estimate for the mean difference in blood pressure reduction is **8.3 mmHg**. This is a single number -- our best guess -- but it comes with uncertainty. 
### Sampling Variability If we ran the same trial again with 200 new participants from the same population, we would almost certainly get a *different* point estimate. Maybe 7.1 mmHg, or 9.8 mmHg, or 6.5 mmHg. This natural fluctuation from sample to sample is called **sampling variability**. The key insight: **every point estimate is "wrong" in the sense that it almost never equals the exact population parameter.** The question is *how wrong* it might be. ### The Standard Error The **standard error (SE)** quantifies how much a point estimate is expected to vary across repeated samples. For a sample mean: $$SE(\bar{x}) = \frac{s}{\sqrt{n}}$$ where $s$ is the sample standard deviation and $n$ is the sample size. For the difference between two independent means: $$SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$ In our trial, assuming equal SDs of 14.2 mmHg in each group: $$SE = \sqrt{\frac{14.2^2}{100} + \frac{14.2^2}{100}} = \sqrt{2.016 + 2.016} = \sqrt{4.033} \approx 2.01 \text{ mmHg}$$ This tells us: if we repeated this trial many times, the sample mean difference would typically vary by about 2 mmHg from trial to trial. ::: {.callout-important} **Standard deviation vs. standard error:** The standard deviation (SD = 14.2) describes variability among *individual patients*. The standard error (SE = 2.01) describes variability among *sample means*. They answer different questions. The SD is always larger than the SE because means are less variable than individual observations. ::: ## Confidence Intervals: What They Actually Mean ### Constructing a 95% Confidence Interval A **95% confidence interval (CI)** for our mean difference is: $$\bar{x} \pm t^* \times SE = 8.3 \pm 1.97 \times 2.01 = (4.34, \ 12.26) \text{ mmHg}$$ where $t^*$ is the critical value from the $t$-distribution (approximately 1.97 for large samples). ### The Correct Interpretation Here is one of the most commonly misunderstood concepts in statistics. 
::: {.callout-warning title="Common Misconception"} **Wrong:** "There is a 95% probability that the true mean difference lies between 4.34 and 12.26 mmHg." **Right:** "If we repeated this trial many times and computed a 95% CI each time, approximately 95% of those intervals would contain the true mean difference." ::: Why does this distinction matter? The true population parameter is a fixed (unknown) number -- it does not have a probability of being "in" or "out" of an interval. The *interval* is the random quantity, not the parameter. Each time you run a new study, you get a new interval. Some of those intervals capture the truth; some do not. A 95% CI means that the *procedure* works 95% of the time. In practice, a useful way to think about it: the confidence interval gives you a range of plausible values for the population parameter, given the data you observed. Values inside the interval are reasonably compatible with your data; values outside are not. ### What the Width Tells You The width of a confidence interval reflects **precision**. A narrow CI means you have a precise estimate; a wide CI means substantial uncertainty. Width depends on: 1. **Sample size** -- larger $n$ gives narrower CIs 2. **Variability** -- less variable data gives narrower CIs 3. **Confidence level** -- 99% CIs are wider than 95% CIs (more confidence requires more hedging) In our trial, the CI of (4.34, 12.26) is moderately wide -- the true effect could be as small as about 4 mmHg or as large as 12 mmHg. This matters clinically: a 4 mmHg reduction and a 12 mmHg reduction may lead to very different treatment decisions. ## P-values: What They Are, What They Are Not ### Definition A **p-value** is the probability of observing a test statistic as extreme as (or more extreme than) the one you calculated, *assuming the null hypothesis is true*. For our trial, the null hypothesis is $H_0: \mu_{\text{Drug X}} - \mu_{\text{Placebo}} = 0$ (no difference). 
The test statistic is:

$$t = \frac{8.3 - 0}{2.01} = 4.13$$

With 198 degrees of freedom, $p < 0.001$.

### What a P-value Is

- The p-value measures the **compatibility between your data and the null hypothesis**
- A small p-value means your data would be unlikely if the null hypothesis were true
- It quantifies evidence against $H_0$, not evidence *for* any particular alternative

### What a P-value Is NOT

The American Statistical Association published a landmark statement in 2016 (Wasserstein & Lazar, 2016) because p-values are so widely misinterpreted. Here are the key points:

::: {.callout-warning title="P-value Misconceptions -- From the ASA Statement"}
1. **A p-value is NOT the probability that the null hypothesis is true.** A p-value of 0.03 does *not* mean there is a 3% chance the drug doesn't work.

2. **A p-value is NOT the probability that the result was produced by chance.** It is calculated *assuming* chance is the only explanation.

3. **Statistical significance does NOT mean clinical or practical importance.** A tiny, meaningless effect can have $p < 0.001$ with a large enough sample.

4. **A large p-value does NOT mean the null hypothesis is true.** It means you failed to find strong evidence against it -- absence of evidence is not evidence of absence.

5. **The p-value does NOT measure the size of an effect.** It mixes effect size with sample size. Always report the effect size and confidence interval alongside the p-value.
:::

### The "Statistical Significance" Problem in Medicine

The threshold of $p < 0.05$ is deeply entrenched in medical research. But there is nothing magical about 0.05. It is an arbitrary convention, originally proposed as a convenient rule of thumb by Ronald Fisher, not as a rigid decision boundary.

**Problems with dichotomizing at $p = 0.05$:**

- A result with $p = 0.049$ is treated as fundamentally different from $p = 0.051$, even though they are essentially identical in evidential strength.
- Journals preferentially publish "significant" results, creating **publication bias**.
- Researchers may unconsciously (or consciously) engage in **p-hacking** -- trying multiple analyses until one crosses the 0.05 threshold.
- In large clinical databases, nearly everything is "statistically significant" because of enormous sample sizes, regardless of clinical relevance.

**A better approach:** Report the point estimate, confidence interval, and p-value. Interpret them together. Judge results by their clinical importance, not just their p-value. As the ASA statement concludes: "No single index should substitute for scientific reasoning."

## Effect Sizes That Matter Clinically

### Why Effect Sizes Matter

Statistical significance tells you whether an effect is likely to be non-zero. It says nothing about whether the effect is *large enough to matter*. In medicine, this distinction is critical: a drug that lowers blood pressure by 0.5 mmHg might achieve $p < 0.001$ in a trial of 50,000 people, but no clinician would prescribe it based on that effect.

### Cohen's d (Standardized Mean Difference)

Cohen's $d$ expresses the difference between two means in standard deviation units:

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}}$$

For our trial: $d = 8.3 / 14.2 = 0.58$, which is conventionally considered a "medium" effect.

| Cohen's $d$ | Interpretation |
|---|---|
| 0.2 | Small |
| 0.5 | Medium |
| 0.8 | Large |

::: {.callout-note}
These benchmarks (from Cohen, 1988) are rough guidelines, not rigid rules. A "small" effect by Cohen's standards might be highly clinically meaningful (e.g., a small reduction in mortality), while a "large" effect might be trivial in another context. Always interpret effect sizes in the clinical context.
:::

### Odds Ratios and Risk Ratios

For binary outcomes (e.g., death, hospitalization, disease incidence), effect sizes are typically reported as:

**Risk Ratio (Relative Risk):**

$$RR = \frac{P(\text{event} \mid \text{treatment})}{P(\text{event} \mid \text{control})}$$

An RR of 0.75 means the treatment group had 75% the risk of the control group, i.e., a 25% relative risk reduction.

**Odds Ratio:**

$$OR = \frac{\text{odds of event in treatment}}{\text{odds of event in control}} = \frac{p_1 / (1 - p_1)}{p_2 / (1 - p_2)}$$

Odds ratios are the natural output of logistic regression. When the outcome is rare (< 10%), the OR approximates the RR. When the outcome is common, the OR exaggerates the effect compared to the RR.

**Risk Difference (Absolute Risk Reduction):**

$$RD = P(\text{event} \mid \text{control}) - P(\text{event} \mid \text{treatment})$$

If 15% of control patients are hospitalized vs. 10% of treatment patients, the RD = 0.05 (or 5 percentage points).

### Number Needed to Treat (NNT)

The NNT is perhaps the most clinician-friendly effect size:

$$NNT = \frac{1}{RD} = \frac{1}{|p_1 - p_2|}$$

Using our example: $NNT = 1 / 0.05 = 20$. This means you need to treat 20 patients to prevent one additional hospitalization.

A low NNT is desirable. For reference:

- Statins for secondary prevention of MI: NNT around 30-80
- Aspirin for acute MI: NNT around 40
- Antibiotics for strep pharyngitis to prevent rheumatic fever: NNT around 4,000

### Statistically Significant but Clinically Meaningless

Consider a large pragmatic trial (n = 20,000) of a new diabetes drug that lowers HbA1c by 0.1% compared to the current standard of care, with $p = 0.002$.

Is this clinically meaningful? Almost certainly not. A 0.1% reduction in HbA1c is below the threshold most endocrinologists consider clinically relevant (typically 0.3-0.5%). The p-value is tiny because the sample size is enormous -- even a trivial true effect becomes "significant" with enough data.
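The dependence of the p-value on sample size can be made concrete with a minimal sketch. It holds the 0.1-point HbA1c difference fixed and varies only the number of participants per group. The SD of 1.2 percentage points, the equal allocation between arms, and the normal approximation to the test are all assumptions made for illustration (the example above does not state them), so the output will not reproduce the quoted $p = 0.002$; the helper name `two_sided_p` is hypothetical.

```python
import math

def two_sided_p(diff, sd, n_per_group):
    """Two-sided p-value for a difference in means (normal approximation)."""
    # SE of the difference, assuming equal SDs and equal group sizes
    se = sd * math.sqrt(2.0 / n_per_group)
    z = diff / se
    # P(|Z| >= |z|) under the null hypothesis of no difference
    return math.erfc(abs(z) / math.sqrt(2.0))

diff = 0.1  # HbA1c difference in percentage points (from the example)
sd = 1.2    # assumed SD, not given in the example

for n in (100, 1_000, 10_000):
    p = two_sided_p(diff, sd, n)
    print(f"n per group = {n:>6}: p = {p:.2g}")
```

The effect is identical in every row; only the sample size changes, yet the p-value moves from clearly non-significant to vanishingly small.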
This is why you should **always** report and interpret effect sizes (with confidence intervals), not just p-values.

## Type I and Type II Errors

### The Two Kinds of Mistakes

When making a decision based on a hypothesis test, there are two ways to be wrong:

| | $H_0$ is true (no effect) | $H_0$ is false (real effect) |
|---|---|---|
| **Reject $H_0$** | Type I error ($\alpha$) | Correct (Power) |
| **Fail to reject $H_0$** | Correct | Type II error ($\beta$) |

- **Type I error (false positive):** You conclude the drug works, but it actually doesn't. Probability = $\alpha$ (typically set at 0.05).
- **Type II error (false negative):** You conclude the drug doesn't work, but it actually does. Probability = $\beta$.

### Statistical Power

**Power** = $1 - \beta$ = the probability of correctly detecting a real effect.

Power depends on:

1. **Effect size** -- larger effects are easier to detect
2. **Sample size** -- more data provides more power
3. **Significance level ($\alpha$)** -- higher $\alpha$ gives more power (but more false positives)
4. **Variability** -- less noise makes signals easier to detect

In clinical trials, we typically aim for **80% power** (i.e., $\beta = 0.20$). This means we accept a 20% chance of missing a true effect -- a remarkably high false-negative rate that is often underappreciated.

::: {.callout-tip title="Clinical Implication"}
Many clinical trials are **underpowered** -- they have too few participants to detect realistic effect sizes. A "negative" trial (one that fails to reach $p < 0.05$) does not necessarily mean the treatment is ineffective. It may simply mean the trial was too small. Always check the confidence interval: if it includes both clinically meaningful and null effects, the trial was inconclusive, not negative.
:::

### A Power Calculation Example

Suppose you are planning a trial of Drug X and expect a 5 mmHg difference (SD = 15 mmHg). How many participants per group do you need for 80% power at $\alpha = 0.05$?
$$n = \frac{2(z_{\alpha/2} + z_\beta)^2 \sigma^2}{\Delta^2} = \frac{2(1.96 + 0.84)^2 \times 15^2}{5^2} = \frac{2 \times 7.84 \times 225}{25} \approx 141.1 \text{ per group}$$

Sample sizes are rounded up, so you need 142 participants per group, or 284 total.

## Multiple Testing

### The Problem

When you perform multiple statistical tests, the probability of at least one false positive increases rapidly. If you test 20 independent hypotheses at $\alpha = 0.05$, the probability of at least one Type I error is:

$$1 - (1 - 0.05)^{20} = 1 - 0.95^{20} = 1 - 0.358 = 0.642$$

That is a **64.2% chance** of at least one false positive. This is a massive problem in:

- **Genomics/GWAS:** Testing millions of genetic variants simultaneously
- **Clinical trials with multiple endpoints:** Primary, secondary, and exploratory outcomes
- **Subgroup analyses:** Testing treatment effects in men vs. women, old vs. young, etc.

### Bonferroni Correction

The simplest fix: divide $\alpha$ by the number of tests.

$$\alpha_{\text{adjusted}} = \frac{\alpha}{m}$$

where $m$ is the number of tests. If you run 20 tests, you require $p < 0.05/20 = 0.0025$ for each test.

**Pros:** Simple, controls the **family-wise error rate** (FWER) -- the probability of *any* false positives.

**Cons:** Very conservative. As $m$ grows, it becomes extremely difficult to detect real effects. With 1 million SNPs in a GWAS, you need $p < 5 \times 10^{-8}$.

### False Discovery Rate (FDR) Control

The **Benjamini-Hochberg procedure** (1995) controls the **false discovery rate** -- the expected proportion of rejected hypotheses that are false positives. This is less stringent than Bonferroni and more appropriate when you expect many true positives among your tests.

Steps:

1. Rank all $m$ p-values from smallest to largest: $p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(m)}$
2. Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m} \times q$, where $q$ is the desired FDR (e.g., 0.05)
3. Reject all hypotheses with $p \leq p_{(k)}$

In genomics, FDR control at $q = 0.05$ means: among all the genes we declare "significant," we expect no more than 5% to be false positives. This is often a more useful guarantee than Bonferroni's, which caps the probability of *any* false positive at $\alpha$.

### When Does Multiple Testing Matter?

| Scenario | Correction needed? |
|---|---|
| Pre-specified primary endpoint in a clinical trial | No (single test) |
| One primary + several secondary endpoints | Yes |
| Subgroup analyses | Yes (or treat as exploratory) |
| Genome-wide association study | Yes (millions of tests) |
| Exploratory data analysis | No formal correction, but be transparent |

## Exercises

::: {.callout-tip title="Exercise 2.1: Confidence Interval Simulation"}
Simulate 100 trials, each with n=50 per group, true mean difference = 5, SD = 10. Compute a 95% CI for each trial. What proportion of CIs contain the true value of 5?

::: {.panel-tabset}

## R

```{r}
#| eval: false
set.seed(42)
n_sims <- 100
n_per_group <- 50
true_diff <- 5
sd_val <- 10

contains_true <- numeric(n_sims)

for (i in 1:n_sims) {
  group1 <- rnorm(n_per_group, mean = 0, sd = sd_val)
  group2 <- rnorm(n_per_group, mean = true_diff, sd = sd_val)
  test_result <- t.test(group2, group1)
  ci <- test_result$conf.int
  contains_true[i] <- (ci[1] <= true_diff & ci[2] >= true_diff)
}

cat("Proportion of CIs containing true value:", mean(contains_true), "\n")

# Bonus: visualize the CIs
library(ggplot2)

ci_data <- data.frame(
  sim = 1:n_sims,
  lower = numeric(n_sims),
  upper = numeric(n_sims),
  contains = logical(n_sims)
)

# Reset the seed so the plotted intervals match the ones counted above
set.seed(42)
for (i in 1:n_sims) {
  group1 <- rnorm(n_per_group, mean = 0, sd = sd_val)
  group2 <- rnorm(n_per_group, mean = true_diff, sd = sd_val)
  test_result <- t.test(group2, group1)
  ci_data$lower[i] <- test_result$conf.int[1]
  ci_data$upper[i] <- test_result$conf.int[2]
  ci_data$contains[i] <- (ci_data$lower[i] <= true_diff &
                            ci_data$upper[i] >= true_diff)
}

ggplot(ci_data, aes(x = sim, y = (lower + upper) / 2, color = contains)) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.3) +
  geom_hline(yintercept = true_diff, linetype = "dashed", color = "blue") +
  # Named values keep the colors correct even if every interval covers the truth
  scale_color_manual(values = c("FALSE" = "red", "TRUE" = "darkgreen")) +
  labs(x = "Simulation", y = "Mean Difference",
       title = "95% Confidence Intervals from 100 Simulated Trials") +
  theme_minimal()
```

## Python

```{python}
#| eval: false
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

np.random.seed(42)
n_sims = 100
n_per_group = 50
true_diff = 5
sd_val = 10

contains_true = []
lower_bounds = []
upper_bounds = []

for i in range(n_sims):
    group1 = np.random.normal(0, sd_val, n_per_group)
    group2 = np.random.normal(true_diff, sd_val, n_per_group)

    # Point estimate and standard error of the difference in means
    diff = np.mean(group2) - np.mean(group1)
    se = np.sqrt(np.var(group1, ddof=1)/n_per_group +
                 np.var(group2, ddof=1)/n_per_group)

    # 95% CI using the t critical value with n1 + n2 - 2 degrees of freedom
    t_crit = stats.t.ppf(0.975, df=n_per_group*2 - 2)
    lower = diff - t_crit * se
    upper = diff + t_crit * se

    lower_bounds.append(lower)
    upper_bounds.append(upper)
    contains_true.append(lower <= true_diff <= upper)

print(f"Proportion of CIs containing true value: {np.mean(contains_true):.2f}")

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
for i in range(n_sims):
    color = "green" if contains_true[i] else "red"
    ax.plot([i, i], [lower_bounds[i], upper_bounds[i]], color=color, linewidth=0.8)
ax.axhline(y=true_diff, color="blue", linestyle="--", label="True difference")
ax.set_xlabel("Simulation")
ax.set_ylabel("Mean Difference")
ax.set_title("95% Confidence Intervals from 100 Simulated Trials")
ax.legend()
plt.tight_layout()
plt.show()
```

:::

::: {.callout-note collapse="true" title="Solution"}
You should find that approximately 95 out of 100 CIs contain the true value of 5. The exact number will vary due to randomness, but it should be close to 95. The red intervals in the plot are the ones that "missed" -- they do not contain the true value. This is the correct interpretation of a 95% CI: the *procedure* captures the truth 95% of the time.
:::
:::

::: {.callout-tip title="Exercise 2.2: P-value Interpretation"}
A clinical trial reports: "The new drug reduced 30-day mortality from 12% to 9% (p = 0.04)."

For each statement below, indicate whether it is TRUE or FALSE and explain why.

1. There is a 96% probability that the drug truly reduces mortality.
2. There is a 4% probability the result is due to chance.
3. If the drug had no effect, we would see a difference this large or larger about 4% of the time.
4. The drug reduces mortality by 3 percentage points.

::: {.panel-tabset}

## R

```{r}
#| eval: false
# This is a conceptual exercise, but let's verify the numbers:
p_control <- 0.12
p_treatment <- 0.09

risk_difference <- p_control - p_treatment
risk_ratio <- p_treatment / p_control
nnt <- 1 / risk_difference

cat("Risk Difference:", risk_difference, "\n")
cat("Risk Ratio:", round(risk_ratio, 3), "\n")
cat("NNT:", round(nnt, 1), "\n")
```

## Python

```{python}
#| eval: false
# This is a conceptual exercise, but let's verify the numbers:
p_control = 0.12
p_treatment = 0.09

risk_difference = p_control - p_treatment
risk_ratio = p_treatment / p_control
nnt = 1 / risk_difference

print(f"Risk Difference: {risk_difference}")
print(f"Risk Ratio: {risk_ratio:.3f}")
print(f"NNT: {nnt:.1f}")
```

:::

::: {.callout-note collapse="true" title="Solution"}
1. **FALSE.** The p-value does not give the probability that the drug works. It does not tell you the probability of any hypothesis being true or false.

2. **FALSE.** The p-value is not the probability that the result is "due to chance." It is calculated *assuming* chance is the only explanation.

3. **TRUE.** This is the correct definition of a p-value: if the null hypothesis (no difference) were true, we would observe a result this extreme or more extreme approximately 4% of the time.

4. **TRUE (as a point estimate).** The absolute risk reduction is 12% - 9% = 3 percentage points.
The NNT = 1/0.03 = 33.3, meaning you need to treat about 34 patients to prevent one death. Whether this is clinically meaningful depends on the drug's cost, side effects, and the patient population.
:::
:::

::: {.callout-tip title="Exercise 2.3: Multiple Testing Correction"}
You are analyzing gene expression data and have tested 1,000 genes for differential expression between a treatment and control group. You find 60 genes with p < 0.05.

Apply both Bonferroni correction and BH-FDR correction. How many genes remain significant under each approach?

::: {.panel-tabset}

## R

```{r}
#| eval: false
# Simulate 1000 p-values: 950 null (uniform), 50 truly differentially expressed
set.seed(123)
n_genes <- 1000
n_true <- 50

# Null p-values are uniformly distributed
p_null <- runif(n_genes - n_true, 0, 1)

# True effects: p-values will tend to be small
p_true <- rbeta(n_true, 1, 20)  # skewed toward 0

p_values <- c(p_null, p_true)

# How many nominally significant?
cat("Nominally significant (p < 0.05):", sum(p_values < 0.05), "\n")

# Bonferroni correction
p_bonferroni <- p.adjust(p_values, method = "bonferroni")
cat("Significant after Bonferroni:", sum(p_bonferroni < 0.05), "\n")

# BH-FDR correction
p_fdr <- p.adjust(p_values, method = "BH")
cat("Significant after BH-FDR:", sum(p_fdr < 0.05), "\n")

# Compare
cat("\nBonferroni is much more conservative.\n")
cat("FDR retains more discoveries while controlling the false discovery rate.\n")
```

## Python

```{python}
#| eval: false
import numpy as np
from statsmodels.stats.multitest import multipletests

np.random.seed(123)
n_genes = 1000
n_true = 50

# Null p-values (uniform) and true effect p-values (small)
p_null = np.random.uniform(0, 1, n_genes - n_true)
p_true = np.random.beta(1, 20, n_true)
p_values = np.concatenate([p_null, p_true])

print(f"Nominally significant (p < 0.05): {np.sum(p_values < 0.05)}")

# Bonferroni correction
reject_bonf, pvals_bonf, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
print(f"Significant after Bonferroni: {np.sum(reject_bonf)}")

# BH-FDR correction
reject_fdr, pvals_fdr, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
print(f"Significant after BH-FDR: {np.sum(reject_fdr)}")

print("\nBonferroni is much more conservative.")
print("FDR retains more discoveries while controlling the false discovery rate.")
```

:::

::: {.callout-note collapse="true" title="Solution"}
The exact counts depend on the random draw, but with this simulation you should see approximately:

- **Nominally significant:** roughly 80 genes. About 48 of the 950 null genes are expected to fall below 0.05 by chance, plus about 32 of the 50 true effects, since $P(p < 0.05) = 1 - 0.95^{20} \approx 0.64$ for a Beta(1, 20) p-value.
- **Bonferroni:** few or none. The per-test threshold is $0.05 / 1000 = 5 \times 10^{-5}$, and p-values drawn from Beta(1, 20) almost never fall that low.
- **BH-FDR:** also few or none here, though never fewer than Bonferroni. The simulated true effects are too weak for the smallest p-values to clear the step-up thresholds $\frac{k}{1000} \times 0.05$.

The Bonferroni correction controls the family-wise error rate (probability of *any* false positives), while BH-FDR controls the expected proportion of false positives among rejections. In genomics, where you expect many true positives, FDR is usually more appropriate because Bonferroni's stringency causes you to miss too many real discoveries. To see the two methods separate clearly, rerun the simulation with stronger true effects (for example, a much larger second shape parameter in the Beta distribution).
:::
:::

::: {.callout-tip title="Exercise 2.4: Power Analysis"}
You are planning a trial to test whether a lifestyle intervention reduces HbA1c in type 2 diabetes patients. You expect a 0.4% reduction (SD = 1.2%).

Compute the required sample size per group for 80% power at alpha = 0.05. Then recompute for 90% power. How does the sample size change?
::: {.panel-tabset}

## R

```{r}
#| eval: false
# Using the power.t.test function

# 80% power
result_80 <- power.t.test(
  delta = 0.4,        # expected difference
  sd = 1.2,           # standard deviation
  sig.level = 0.05,
  power = 0.80,
  type = "two.sample",
  alternative = "two.sided"
)
cat("Sample size per group (80% power):", ceiling(result_80$n), "\n")

# 90% power
result_90 <- power.t.test(
  delta = 0.4,
  sd = 1.2,
  sig.level = 0.05,
  power = 0.90,
  type = "two.sample",
  alternative = "two.sided"
)
cat("Sample size per group (90% power):", ceiling(result_90$n), "\n")

cat("\nIncreasing power from 80% to 90% requires about",
    round((ceiling(result_90$n) / ceiling(result_80$n) - 1) * 100),
    "% more participants per group.\n")
```

## Python

```{python}
#| eval: false
import numpy as np  # needed for np.ceil below
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Cohen's d = delta / sd = 0.4 / 1.2
effect_size = 0.4 / 1.2

# 80% power
n_80 = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.80,
                            alternative='two-sided')
print(f"Sample size per group (80% power): {int(np.ceil(n_80))}")

# 90% power
n_90 = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.90,
                            alternative='two-sided')
print(f"Sample size per group (90% power): {int(np.ceil(n_90))}")

print(f"\nIncreasing from 80% to 90% power requires about "
      f"{int(round((n_90/n_80 - 1) * 100))}% more participants per group.")
```

:::

::: {.callout-note collapse="true" title="Solution"}
With an effect size of 0.4% and SD of 1.2% (Cohen's d = 0.33):

- **80% power:** approximately 143 per group (286 total)
- **90% power:** approximately 191 per group (382 total)

Going from 80% to 90% power requires roughly 33% more participants. This illustrates the diminishing returns of increasing power -- each additional percentage point of power costs more participants. Most trials settle for 80% power as a reasonable compromise.
:::
:::

## Key Takeaways

1. **Point estimates are always uncertain.** Report confidence intervals, not just point estimates.
2. **Confidence intervals describe the precision of your estimate.** They tell you the range of plausible values for the population parameter.
3. **P-values measure compatibility with the null hypothesis.** They are not the probability that the null is true, nor the probability of a chance finding.
4. **Statistical significance is not clinical significance.** Always report and interpret effect sizes.
5. **NNT is the most clinician-friendly effect size** for binary outcomes.
6. **Underpowered studies produce inconclusive results,** not evidence that a treatment is ineffective.
7. **Multiple testing inflates false positives.** Use Bonferroni (conservative) or FDR (less conservative) corrections as appropriate.

## References and Further Reading

- Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and purpose. *The American Statistician*, 70(2), 129-133.
- Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond "p < 0.05." *The American Statistician*, 73(sup1), 1-19.
- Cohen, J. (1988). *Statistical Power Analysis for the Behavioral Sciences* (2nd ed.). Lawrence Erlbaum Associates.
- Vittinghoff, E., Glidden, D. V., Shiboski, S. C., & McCulloch, C. E. (2012). *Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models* (2nd ed.). Springer.
- Steyerberg, E. W. (2019). *Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating* (2nd ed.). Springer.
- Smits, L. J. M., et al. (2026). Recommendations for clinical prediction model development, validation, and updating. *BMJ*.
- Greenland, S., Senn, S. J., Rothman, K. J., et al. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. *European Journal of Epidemiology*, 31(4), 337-350.
- Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. *Journal of the Royal Statistical Society: Series B*, 57(1), 289-300.
- Altman, D. G., & Bland, J. M. (1995). Absence of evidence is not evidence of absence. *BMJ*, 311(7003), 485.
