3 Probability, Distributions, and the Language of Uncertainty

3.1 Why Probability Matters in Health Sciences

Medicine is a discipline of uncertainty. A patient presents with chest pain — is it a heart attack, acid reflux, or a pulled muscle? A new drug shows a 15% reduction in tumor size — is this a real effect or a fluke of sampling? A screening test comes back positive — what is the chance the patient actually has the disease?

Probability gives us a formal language for reasoning about uncertainty. It allows us to quantify how likely different outcomes are, how confident we should be in our conclusions, and how we should update our beliefs when new evidence arrives. Every statistical method you will learn in this course — from logistic regression to Bayesian modeling to machine learning — is built on the foundation of probability theory.

This chapter introduces the key ideas you need. We will move slowly, use concrete clinical examples, and build intuition before introducing formulas.

3.2 What Is Probability?

At its simplest, a probability is a number between 0 and 1 that represents how likely an event is to occur. A probability of 0 means the event is impossible; a probability of 1 means it is certain. A probability of 0.5 means the event is equally likely to happen or not happen.

But where do probabilities come from? There are two major perspectives.

3.2.1 The Frequentist Interpretation

The frequentist interpretation says that the probability of an event is the long-run relative frequency with which it occurs if we repeat the same process many times. For example:

If we flip a fair coin many times, it will land heads about 50% of the time. We say $P(\text{heads}) = 0.5$.
If a particular surgery has a 3% mortality rate, this means that out of every 100 patients who undergo the procedure (under similar conditions), roughly 3 will die.

This interpretation works well when we can imagine repeating an experiment. Coin flips, blood draws from a population, random assignment in a clinical trial — all of these fit naturally into the frequentist framework.

3.2.2 The Subjective (Bayesian) Interpretation

The subjective or Bayesian interpretation says that probability represents a degree of belief. It is a measure of how confident you are that something is true, given what you currently know.

A cardiologist examining a 65-year-old male smoker with crushing chest pain might say: “I believe there is an 85% probability this is a myocardial infarction.” This is not based on repeating the exact same patient encounter thousands of times — it is an informed judgment.
Before seeing any data from a clinical trial, a researcher might believe there is a 30% chance a new drug is effective. After seeing positive results from the trial, they update that belief to 80%.

The Bayesian interpretation is powerful because it applies to one-time events and unique situations. You cannot repeat a patient’s life, but you can still reason probabilistically about their prognosis.

3.2.3 Which Interpretation Is Correct?

Both are useful. In this course, we will use frequentist methods (hypothesis tests, confidence intervals, maximum likelihood) and Bayesian methods (posterior distributions, credible intervals, prior updating). Understanding both perspectives will make you a more versatile analyst. The key insight is that probability is a tool for reasoning under uncertainty, regardless of your philosophical starting point.

3.3 Random Variables

A random variable is a quantity whose value is determined by a random process. In clinical research, random variables are everywhere:

The number of patients who respond to treatment (a count)
A patient’s systolic blood pressure (a measurement)
Whether a tumor is malignant or benign (a category)
Time from diagnosis to death (a duration)

We use capital letters like $X$, $Y$, or $Z$ to denote random variables, and lowercase letters like $x$, $y$, $z$ for specific observed values.

3.3.1 Discrete Random Variables

A discrete random variable takes on a countable number of values, typically whole numbers. Examples:

The number of adverse events in a clinical trial: 0, 1, 2, 3, …
The number of positive test results in a batch of 20 specimens: 0, 1, 2, …, 20
A patient’s pain score on a 0–10 scale: 0, 1, 2, …, 10

For a discrete random variable, we describe its behavior using a probability mass function (PMF), which gives the probability of each possible value:

\[P(X = x)\]

All probabilities must be non-negative, and they must sum to 1:

\[\sum_{\text{all } x} P(X = x) = 1\]

3.3.2 Continuous Random Variables

A continuous random variable can take on any value within a range (including decimals). Examples:

Body temperature: 36.2, 37.45, 38.1, …
Serum creatinine level: any positive real number
Time to event: any positive duration

For a continuous random variable, the probability of any exact value is technically zero (because there are infinitely many possible values). Instead, we talk about the probability of falling within a range, and we describe the distribution using a probability density function (PDF), denoted $f(x)$. The probability that $X$ falls between $a$ and $b$ is:

\[P(a \leq X \leq b) = \int_a^b f(x) \, dx\]

This is the area under the density curve between $a$ and $b$. The total area under the entire curve equals 1.

Do not worry if integrals make you nervous. The key idea is: for continuous variables, probability corresponds to area under a curve.

3.4 The Normal (Gaussian) Distribution

The normal distribution — also called the Gaussian distribution or the “bell curve” — is the most important probability distribution in statistics. It appears everywhere in nature and medicine.

3.4.1 Why Does It Matter?

Many biological measurements are approximately normal. Heights, weights, blood pressure readings, cholesterol levels, and many lab values follow roughly bell-shaped distributions in large populations.
The Central Limit Theorem (discussed later in this chapter) tells us that averages of measurements tend to follow a normal distribution, even when individual measurements do not.
Many statistical methods assume normality (or at least approximate normality) for valid inference.

3.4.2 Properties of the Normal Distribution

A normal distribution is completely described by two parameters:

Mean ($\mu$): The center of the distribution. The peak of the bell curve.
Standard deviation ($\sigma$): A measure of spread. About 68% of values fall within one standard deviation of the mean, about 95% within two, and about 99.7% within three. This is sometimes called the 68-95-99.7 rule.

The probability density function is:

\[f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)\]

You do not need to memorize this formula. What matters is understanding the shape and the role of $\mu$ and $\sigma$.

We write $X \sim \text{Normal}(\mu, \sigma^2)$ or $X \sim N(\mu, \sigma^2)$ to say that $X$ follows a normal distribution with mean $\mu$ and variance $\sigma^2$.

3.4.3 Clinical Example: Blood Pressure

Systolic blood pressure in healthy adults is approximately normally distributed with a mean of about 120 mmHg and a standard deviation of about 15 mmHg. That is, $\text{SBP} \sim N(120, 15^2)$.

Using the 68-95-99.7 rule:

About 68% of healthy adults have SBP between 105 and 135 mmHg (120 plus or minus 15).
About 95% have SBP between 90 and 150 mmHg (120 plus or minus 30).
About 99.7% have SBP between 75 and 165 mmHg (120 plus or minus 45).

If a patient has a systolic blood pressure of 160 mmHg, how unusual is that? We can answer this using Z-scores.

3.4.4 Z-Scores: Standardizing Values

A Z-score tells you how many standard deviations a value is from the mean:

\[Z = \frac{X - \mu}{\sigma}\]

For our patient with SBP = 160:

\[Z = \frac{160 - 120}{15} = \frac{40}{15} \approx 2.67\]

This patient’s blood pressure is about 2.67 standard deviations above the mean. Using a standard normal table or a computer, we find that only about 0.4% of the healthy population would have a blood pressure this high or higher. This is strong evidence that this patient may have hypertension.

The standard normal distribution is a normal distribution with $\mu = 0$ and $\sigma = 1$. Any normal distribution can be converted to the standard normal by computing Z-scores.

Let us visualize the normal distribution and Z-scores:

Code

library(tidyverse)

# Parameters for systolic blood pressure
mu <- 120
sigma <- 15

# Create a sequence of values for the x-axis
x <- seq(mu - 4 * sigma, mu + 4 * sigma, length.out = 300)
y <- dnorm(x, mean = mu, sd = sigma)

bp_data <- tibble(x = x, y = y)

ggplot(bp_data, aes(x = x, y = y)) +
  geom_line(color = "steelblue", linewidth = 1.2) +
  geom_area(data = bp_data |> filter(x >= 160),
            aes(x = x, y = y), fill = "firebrick", alpha = 0.4) +
  geom_vline(xintercept = 160, linetype = "dashed", color = "firebrick") +
  annotate("text", x = 168, y = 0.015,
           label = "Patient: 160 mmHg\n(Z = 2.67)", color = "firebrick",
           hjust = 0, size = 3.5) +
  labs(
    title = "Distribution of Systolic Blood Pressure in Healthy Adults",
    subtitle = paste0("Normal(", mu, ", ", sigma, "^2)"),
    x = "Systolic Blood Pressure (mmHg)",
    y = "Density"
  ) +
  theme_minimal()

Code

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

mu = 120
sigma = 15

x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 300)
y = stats.norm.pdf(x, loc=mu, scale=sigma)

fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(x, y, color="steelblue", linewidth=1.5)

# Shade the area above 160
x_fill = x[x >= 160]
y_fill = stats.norm.pdf(x_fill, loc=mu, scale=sigma)
ax.fill_between(x_fill, y_fill, color="firebrick", alpha=0.4)

ax.axvline(160, linestyle="--", color="firebrick")
ax.text(162, 0.015, "Patient: 160 mmHg\n(Z = 2.67)",
        color="firebrick", fontsize=9)

ax.set_title("Distribution of Systolic Blood Pressure in Healthy Adults")
ax.set_xlabel("Systolic Blood Pressure (mmHg)")
ax.set_ylabel("Density")
plt.tight_layout()
plt.show()

3.5 The Binomial Distribution

The binomial distribution models the number of “successes” in a fixed number of independent trials, where each trial has the same probability of success. Despite the name, a “success” does not have to be a good thing — it is simply the outcome we are counting.

3.5.1 Parameters

$n$: the number of trials (a fixed, known integer)
$p$: the probability of success on each trial (between 0 and 1)

We write $X \sim \text{Binomial}(n, p)$, and the PMF is:

\[P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, 2, \ldots, n\]

where $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ is the binomial coefficient (“n choose k”), which counts the number of ways to arrange $k$ successes among $n$ trials.

Mean: $E[X] = np$

Variance: $\text{Var}(X) = np(1-p)$

3.5.2 Clinical Examples

Disease prevalence. In a city, the prevalence of Type 2 diabetes among adults is 12%. If you randomly sample 50 adults, the number with diabetes follows a Binomial(50, 0.12) distribution. The expected number is $50 \times 0.12 = 6$.

Treatment response. A chemotherapy regimen achieves complete response in 35% of patients. If 20 patients are treated, the number who achieve complete response follows a Binomial(20, 0.35) distribution.

Key assumptions: The binomial distribution assumes that (1) each trial is independent (one patient’s outcome does not affect another’s), (2) the probability of success is the same for every trial, and (3) the number of trials is fixed in advance. In practice, these assumptions are sometimes approximate (e.g., disease probability may vary with age), but the binomial is still a useful starting model.

Code

library(tidyverse)

# Treatment response: 20 patients, 35% response rate
n <- 20
p <- 0.35

# Calculate probabilities for each possible outcome
outcomes <- tibble(
  k = 0:n,
  probability = dbinom(k, size = n, prob = p)
)

ggplot(outcomes, aes(x = k, y = probability)) +
  geom_col(fill = "steelblue", alpha = 0.8) +
  geom_vline(xintercept = n * p, linetype = "dashed", color = "firebrick") +
  annotate("text", x = n * p + 0.5, y = max(outcomes$probability),
           label = paste0("Expected value = ", n * p),
           hjust = 0, color = "firebrick") +
  labs(
    title = "Binomial Distribution: Treatment Response",
    subtitle = "n = 20 patients, p = 0.35 response rate",
    x = "Number of Responders",
    y = "Probability"
  ) +
  scale_x_continuous(breaks = 0:n) +
  theme_minimal()

Code

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

n = 20
p = 0.35

k = np.arange(0, n + 1)
probabilities = stats.binom.pmf(k, n, p)

fig, ax = plt.subplots(figsize=(9, 5))
ax.bar(k, probabilities, color="steelblue", alpha=0.8)
ax.axvline(n * p, linestyle="--", color="firebrick")
ax.text(n * p + 0.3, max(probabilities), f"Expected value = {n * p}",
        color="firebrick", fontsize=9)

ax.set_title("Binomial Distribution: Treatment Response")
ax.set_xlabel("Number of Responders")
ax.set_ylabel("Probability")
ax.set_xticks(k)
plt.tight_layout()
plt.show()

3.6 The Poisson Distribution

The Poisson distribution models the number of events occurring in a fixed interval of time or space, when events happen independently at a constant average rate. It is the go-to distribution for modeling rare events.

3.6.1 Parameters

$\lambda$ (lambda): the average number of events per interval. This is both the mean and the variance.

We write $X \sim \text{Poisson}(\lambda)$, and the PMF is:

\[P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots\]

Mean: $E[X] = \lambda$

Variance: $\text{Var}(X) = \lambda$

The fact that the mean equals the variance is a distinctive property of the Poisson distribution. When you see count data where the variance is much larger than the mean, the Poisson may not be a good fit (a condition called overdispersion).

3.6.2 Clinical Examples

Adverse drug reactions. A hospital reports an average of 2.5 serious adverse drug reactions per month. If these events are independent, the number in any given month follows approximately a Poisson(2.5) distribution. The probability of zero serious reactions in a month is $P(X = 0) = e^{-2.5} \approx 0.082$, or about 8.2%.

Hospital admissions. An emergency department sees an average of 40 patients per shift. The number of arrivals per shift can be modeled with a Poisson(40) distribution (though in practice, arrivals may cluster, violating independence).

Rare disease incidence. If a rare genetic disorder affects 1 in 100,000 people per year and a city has 500,000 people, we expect about 5 cases per year. The annual count follows approximately a Poisson(5) distribution.

Code

library(tidyverse)

# Adverse drug reactions: average 2.5 per month
lambda <- 2.5
k <- 0:12

adr_data <- tibble(
  k = k,
  probability = dpois(k, lambda = lambda)
)

ggplot(adr_data, aes(x = k, y = probability)) +
  geom_col(fill = "darkorange", alpha = 0.8) +
  geom_vline(xintercept = lambda, linetype = "dashed", color = "firebrick") +
  labs(
    title = "Poisson Distribution: Adverse Drug Reactions per Month",
    subtitle = expression(paste(lambda, " = 2.5 reactions/month")),
    x = "Number of Adverse Reactions",
    y = "Probability"
  ) +
  scale_x_continuous(breaks = k) +
  theme_minimal()

Code

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

lam = 2.5
k = np.arange(0, 13)
probs = stats.poisson.pmf(k, lam)

fig, ax = plt.subplots(figsize=(9, 5))
ax.bar(k, probs, color="darkorange", alpha=0.8)
ax.axvline(lam, linestyle="--", color="firebrick")
ax.set_title(f"Poisson Distribution: Adverse Drug Reactions per Month (lambda = {lam})")
ax.set_xlabel("Number of Adverse Reactions")
ax.set_ylabel("Probability")
ax.set_xticks(k)
plt.tight_layout()
plt.show()

3.7 The Central Limit Theorem

The Central Limit Theorem (CLT) is one of the most remarkable results in all of statistics. It states:

If you take sufficiently large random samples from any population (regardless of the population’s distribution), the distribution of the sample means will be approximately normal.

More precisely: if $X_1, X_2, \ldots, X_n$ are independent and identically distributed random variables with mean $\mu$ and standard deviation $\sigma$, then the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ has a distribution that approaches $N(\mu, \sigma^2/n)$ as $n$ grows large.

3.7.1 Why Is This Important?

The CLT is the reason the normal distribution is so central to statistics. It means that:

We can use normal-based methods (t-tests, confidence intervals, regression) even when the underlying data are not normally distributed, as long as our sample size is large enough.
Sample means are more predictable than individual observations. The standard deviation of the sample mean ($\sigma / \sqrt{n}$) shrinks as $n$ grows, meaning larger samples give more precise estimates.

3.7.2 How Large Is “Large Enough”?

It depends on how non-normal the underlying distribution is:

If the population is already roughly symmetric, $n = 15$–$30$ is often sufficient.
If the population is heavily skewed (e.g., income, hospital length of stay), you may need $n = 50$ or more.
For extremely skewed distributions, $n > 100$ may be needed.

3.7.3 Simulation Demonstration

Let us demonstrate the CLT visually. We will take samples from a right-skewed distribution (exponential, which looks nothing like a bell curve) and show that the distribution of sample means becomes increasingly normal as the sample size grows.

Code

library(tidyverse)

set.seed(42)

# Parameters
n_simulations <- 5000   # Number of samples to draw
sample_sizes <- c(1, 5, 30, 100)

# For each sample size, draw many samples and compute the mean of each
clt_data <- map_dfr(sample_sizes, function(n) {
  means <- replicate(n_simulations, mean(rexp(n, rate = 1)))
  tibble(
    sample_size = paste0("n = ", n),
    sample_mean = means
  )
})

# Convert to ordered factor so panels appear in the right order
clt_data$sample_size <- factor(clt_data$sample_size,
                                levels = paste0("n = ", sample_sizes))

ggplot(clt_data, aes(x = sample_mean)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 50, fill = "steelblue", alpha = 0.7) +
  geom_density(color = "firebrick", linewidth = 0.8) +
  facet_wrap(~sample_size, scales = "free") +
  labs(
    title = "Central Limit Theorem in Action",
    subtitle = "Sampling from an Exponential(1) distribution (mean = 1, right-skewed)",
    x = "Sample Mean",
    y = "Density"
  ) +
  theme_minimal()

Code

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

n_simulations = 5000
sample_sizes = [1, 5, 30, 100]

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes = axes.flatten()

for i, n in enumerate(sample_sizes):
    means = [np.mean(np.random.exponential(1, n)) for _ in range(n_simulations)]
    axes[i].hist(means, bins=50, density=True, color="steelblue", alpha=0.7)
    axes[i].set_title(f"n = {n}")
    axes[i].set_xlabel("Sample Mean")
    axes[i].set_ylabel("Density")

fig.suptitle(
    "Central Limit Theorem in Action\n"
    "Sampling from an Exponential(1) distribution (right-skewed)",
    fontsize=13
)
plt.tight_layout()
plt.show()

Notice how the distribution of sample means changes:

When $n = 1$, the distribution of “means” is just the original exponential distribution — heavily right-skewed.
At $n = 5$, the skewness is already reduced.
At $n = 30$, the distribution looks approximately bell-shaped.
At $n = 100$, the distribution is very close to a perfect normal curve, and it is much narrower (less variable).

This is the CLT in action. It does not matter that we started with a skewed distribution — the means converge to normality.

3.8 Bayes’ Theorem: Updating Beliefs with Evidence

3.8.1 The Idea

Bayes’ theorem provides a mathematically precise way to update your beliefs when you receive new evidence. In clinical medicine, one of the most important applications is interpreting diagnostic tests.

Suppose a patient tests positive for a disease. What is the probability they actually have the disease? Your intuition might say “very high,” but as we will see, the answer depends critically on how common the disease is in the population.

3.8.2 The Formula

\[P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}\]

In the context of diagnostic testing:

$P(\text{Disease} \mid \text{Positive test})$ = the probability the patient has the disease, given a positive test result. This is called the positive predictive value (PPV).
$P(\text{Positive test} \mid \text{Disease})$ = the probability of a positive test result if the patient truly has the disease. This is the test’s sensitivity (true positive rate).
$P(\text{Disease})$ = the probability of having the disease before the test, based on the population prevalence or clinical judgment. This is the prior probability or pre-test probability.
$P(\text{Positive test})$ = the overall probability of a positive test result, accounting for both true positives and false positives.

We can expand the denominator using the law of total probability:

\[P(\text{Positive test}) = P(\text{Positive} \mid \text{Disease}) \times P(\text{Disease}) + P(\text{Positive} \mid \text{No disease}) \times P(\text{No disease})\]

Note that $P(\text{Positive} \mid \text{No disease})$ is the false positive rate, which equals $1 - \text{specificity}$.

3.8.3 Key Diagnostic Test Terminology

Diagnostic test performance metrics
Term	Definition	Formula
Sensitivity	Probability of a positive test given disease is present	$P(+\mid D)$
Specificity	Probability of a negative test given disease is absent	$P(-\mid \bar{D})$
Positive Predictive Value (PPV)	Probability of disease given a positive test	$P(D\mid +)$
Negative Predictive Value (NPV)	Probability of no disease given a negative test	$P(\bar{D}\mid -)$
Prevalence	Proportion of the population with the disease	$P(D)$

3.8.4 Clinical Example: HIV Screening

Let us work through a concrete example. Consider an HIV screening test with the following characteristics:

Sensitivity: 99.7% (the test correctly identifies 99.7% of people who have HIV)
Specificity: 99.5% (the test correctly identifies 99.5% of people who do not have HIV)
Prevalence: 0.3% (the proportion of the general population with HIV)

These numbers look excellent. But what happens when a randomly selected person tests positive?

Using Bayes’ theorem:

\[PPV = P(D \mid +) = \frac{P(+ \mid D) \times P(D)}{P(+ \mid D) \times P(D) + P(+ \mid \bar{D}) \times P(\bar{D})}\]

\[= \frac{0.997 \times 0.003}{0.997 \times 0.003 + 0.005 \times 0.997}\]

\[= \frac{0.002991}{0.002991 + 0.004985}\]

\[= \frac{0.002991}{0.007976} \approx 0.375\]

The positive predictive value is only about 37.5%! This means that if a person from the general population tests positive, there is only about a 37.5% chance they actually have HIV. More than 6 out of 10 positive results are false positives.

This counterintuitive result arises because the disease is rare. Even with a highly accurate test, the large number of healthy people (99.7% of the population) generates many false positives that overwhelm the true positives from the small number of infected people.

This is why HIV screening programs use confirmatory testing: a second, different test is performed on all positive results. It is also why screening is more effective in high-prevalence populations (e.g., among people with known risk factors).

3.8.5 The Role of Prevalence

The PPV depends heavily on prevalence. Let us compute the PPV across a range of prevalence values:

Code

library(tidyverse)

sensitivity <- 0.997
specificity <- 0.995

prevalence <- seq(0.001, 0.30, by = 0.001)

ppv <- (sensitivity * prevalence) /
  (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))

npv <- (specificity * (1 - prevalence)) /
  (specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)

test_data <- tibble(
  prevalence = prevalence,
  PPV = ppv,
  NPV = npv
) |>
  pivot_longer(cols = c(PPV, NPV), names_to = "metric", values_to = "value")

ggplot(test_data, aes(x = prevalence, y = value, color = metric)) +
  geom_line(linewidth = 1.2) +
  scale_x_continuous(labels = scales::percent_format()) +
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) +
  labs(
    title = "How Prevalence Affects PPV and NPV",
    subtitle = "Sensitivity = 99.7%, Specificity = 99.5%",
    x = "Disease Prevalence",
    y = "Predictive Value",
    color = "Metric"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("PPV" = "steelblue", "NPV" = "firebrick"))

Code

import numpy as np
import matplotlib.pyplot as plt

sensitivity = 0.997
specificity = 0.995

prevalence = np.arange(0.001, 0.301, 0.001)

ppv = (sensitivity * prevalence) / (
    sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
)
npv = (specificity * (1 - prevalence)) / (
    specificity * (1 - prevalence) + (1 - sensitivity) * prevalence
)

fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(prevalence * 100, ppv * 100, color="steelblue", linewidth=1.5, label="PPV")
ax.plot(prevalence * 100, npv * 100, color="firebrick", linewidth=1.5, label="NPV")
ax.set_xlabel("Disease Prevalence (%)")
ax.set_ylabel("Predictive Value (%)")
ax.set_title("How Prevalence Affects PPV and NPV\nSensitivity = 99.7%, Specificity = 99.5%")
ax.legend()
ax.set_ylim(0, 105)
plt.tight_layout()
plt.show()

The plot shows a crucial pattern: when prevalence is very low, even an excellent test produces many false positives (low PPV). As prevalence increases, PPV rises. Conversely, NPV is very high when prevalence is low (a negative result is very reassuring) but decreases as disease becomes more common.

Clinical takeaway: Always consider the pre-test probability (prevalence in the relevant population) when interpreting a diagnostic test result. A positive mammogram in a 25-year-old woman with no risk factors means something very different from the same result in a 60-year-old woman with a family history of breast cancer.

3.9 Exercises

Exercise 1.1: Z-Scores and the Normal Distribution

A laboratory reports that fasting blood glucose levels in healthy adults follow a normal distribution with a mean of 90 mg/dL and a standard deviation of 10 mg/dL.

What is the Z-score for a patient with a fasting glucose of 115 mg/dL?
What proportion of healthy adults have a fasting glucose above 115 mg/dL?
What fasting glucose value corresponds to the 95th percentile?
Plot the normal distribution and shade the area above 115 mg/dL.

Code

mu <- 90
sigma <- 10
patient_value <- 115

# 1. Calculate the Z-score
z_score <- # YOUR CODE HERE

# 2. Proportion above 115 (upper tail probability)
prop_above <- # YOUR CODE HERE (hint: use pnorm())

# 3. 95th percentile
percentile_95 <- # YOUR CODE HERE (hint: use qnorm())

# 4. Plot (modify the blood pressure example from the chapter)
# YOUR CODE HERE

Code

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

mu = 90
sigma = 10
patient_value = 115

# 1. Calculate the Z-score
z_score = # YOUR CODE HERE

# 2. Proportion above 115
prop_above = # YOUR CODE HERE (hint: use stats.norm.sf() or 1 - stats.norm.cdf())

# 3. 95th percentile
percentile_95 = # YOUR CODE HERE (hint: use stats.norm.ppf())

# 4. Plot
# YOUR CODE HERE

Solution

Code

library(tidyverse)

mu <- 90
sigma <- 10
patient_value <- 115

# 1. Z-score
z_score <- (patient_value - mu) / sigma
cat("Z-score:", z_score, "\n")
# Z-score: 2.5

# 2. Proportion above 115
prop_above <- 1 - pnorm(patient_value, mean = mu, sd = sigma)
# Equivalently: pnorm(patient_value, mean = mu, sd = sigma, lower.tail = FALSE)
cat("Proportion above 115:", round(prop_above, 4), "\n")
# About 0.0062 or 0.62%

# 3. 95th percentile
percentile_95 <- qnorm(0.95, mean = mu, sd = sigma)
cat("95th percentile:", round(percentile_95, 1), "mg/dL\n")
# About 106.4 mg/dL

# 4. Plot
x <- seq(mu - 4 * sigma, mu + 4 * sigma, length.out = 300)
y <- dnorm(x, mean = mu, sd = sigma)
plot_data <- tibble(x = x, y = y)

ggplot(plot_data, aes(x = x, y = y)) +
  geom_line(color = "steelblue", linewidth = 1.2) +
  geom_area(data = plot_data |> filter(x >= patient_value),
            fill = "firebrick", alpha = 0.4) +
  geom_vline(xintercept = patient_value, linetype = "dashed", color = "firebrick") +
  annotate("text", x = 118, y = 0.02,
           label = paste0("Glucose = ", patient_value, " mg/dL\nZ = ", z_score),
           color = "firebrick", hjust = 0) +
  labs(
    title = "Distribution of Fasting Blood Glucose",
    x = "Fasting Blood Glucose (mg/dL)", y = "Density"
  ) +
  theme_minimal()

Code

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

mu = 90
sigma = 10
patient_value = 115

# 1. Z-score
z_score = (patient_value - mu) / sigma
print(f"Z-score: {z_score}")
# Z-score: 2.5

# 2. Proportion above 115
prop_above = 1 - stats.norm.cdf(patient_value, loc=mu, scale=sigma)
# Or equivalently: stats.norm.sf(patient_value, loc=mu, scale=sigma)
print(f"Proportion above 115: {prop_above:.4f}")
# About 0.0062 or 0.62%

# 3. 95th percentile
percentile_95 = stats.norm.ppf(0.95, loc=mu, scale=sigma)
print(f"95th percentile: {percentile_95:.1f} mg/dL")
# About 106.4 mg/dL

# 4. Plot
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 300)
y = stats.norm.pdf(x, loc=mu, scale=sigma)

fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(x, y, color="steelblue", linewidth=1.5)
x_fill = x[x >= patient_value]
y_fill = stats.norm.pdf(x_fill, loc=mu, scale=sigma)
ax.fill_between(x_fill, y_fill, color="firebrick", alpha=0.4)
ax.axvline(patient_value, linestyle="--", color="firebrick")
ax.text(117, 0.02, f"Glucose = {patient_value} mg/dL\nZ = {z_score}",
        color="firebrick")
ax.set_title("Distribution of Fasting Blood Glucose")
ax.set_xlabel("Fasting Blood Glucose (mg/dL)")
ax.set_ylabel("Density")
plt.tight_layout()
plt.show()

Exercise 1.2: Binomial Distribution — Vaccine Efficacy

A COVID-19 vaccine has an efficacy of 90%, meaning that among people exposed to the virus, a vaccinated person has a 90% chance of NOT getting symptomatic disease (equivalently, a 10% chance of a breakthrough infection).

In a group of 25 vaccinated individuals who are all exposed to the virus:

What is the expected number of breakthrough infections?
What is the probability of zero breakthrough infections?
What is the probability of 5 or more breakthrough infections?
Plot the full probability distribution.

Code

n <- 25
p <- 0.10  # probability of breakthrough infection

# 1. Expected number
expected <- # YOUR CODE HERE

# 2. P(X = 0)
prob_zero <- # YOUR CODE HERE (hint: dbinom())

# 3. P(X >= 5)
prob_five_or_more <- # YOUR CODE HERE (hint: use pbinom with lower.tail)

# 4. Plot the distribution
# YOUR CODE HERE

Code

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

n = 25
p = 0.10

# 1. Expected number
expected = # YOUR CODE HERE

# 2. P(X = 0)
prob_zero = # YOUR CODE HERE (hint: stats.binom.pmf())

# 3. P(X >= 5)
prob_five_or_more = # YOUR CODE HERE (hint: stats.binom.sf())

# 4. Plot
# YOUR CODE HERE

Solution

Code

library(tidyverse)

n <- 25
p <- 0.10

# 1. Expected number
expected <- n * p
cat("Expected breakthrough infections:", expected, "\n")
# 2.5

# 2. P(X = 0)
prob_zero <- dbinom(0, size = n, prob = p)
cat("P(X = 0):", round(prob_zero, 4), "\n")
# About 0.0718

# 3. P(X >= 5)
prob_five_or_more <- 1 - pbinom(4, size = n, prob = p)
# Equivalently: pbinom(4, size = n, prob = p, lower.tail = FALSE)
cat("P(X >= 5):", round(prob_five_or_more, 4), "\n")
# About 0.0980

# 4. Plot
outcomes <- tibble(
  k = 0:n,
  probability = dbinom(k, size = n, prob = p)
)

ggplot(outcomes, aes(x = k, y = probability)) +
  geom_col(aes(fill = k >= 5), alpha = 0.8) +
  scale_fill_manual(values = c("FALSE" = "steelblue", "TRUE" = "firebrick"),
                    guide = "none") +
  geom_vline(xintercept = expected, linetype = "dashed") +
  labs(
    title = "Breakthrough Infections in 25 Vaccinated Individuals",
    subtitle = "Binomial(25, 0.10); red bars = 5 or more infections",
    x = "Number of Breakthrough Infections", y = "Probability"
  ) +
  scale_x_continuous(breaks = seq(0, 25, 2)) +
  theme_minimal()

Code

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

n = 25
p = 0.10

# 1. Expected number
expected = n * p
print(f"Expected breakthrough infections: {expected}")
# 2.5

# 2. P(X = 0)
prob_zero = stats.binom.pmf(0, n, p)
print(f"P(X = 0): {prob_zero:.4f}")
# About 0.0718

# 3. P(X >= 5)
prob_five_or_more = stats.binom.sf(4, n, p)  # sf = survival function = 1 - cdf
print(f"P(X >= 5): {prob_five_or_more:.4f}")
# About 0.0980

# 4. Plot
k = np.arange(0, n + 1)
probs = stats.binom.pmf(k, n, p)
colors = ["firebrick" if ki >= 5 else "steelblue" for ki in k]

fig, ax = plt.subplots(figsize=(9, 5))
ax.bar(k, probs, color=colors, alpha=0.8)
ax.axvline(expected, linestyle="--", color="black")
ax.set_title("Breakthrough Infections in 25 Vaccinated Individuals\nBinomial(25, 0.10)")
ax.set_xlabel("Number of Breakthrough Infections")
ax.set_ylabel("Probability")
ax.set_xticks(range(0, 26, 2))
plt.tight_layout()
plt.show()

Exercise 1.3: Poisson Distribution — Emergency Department Visits

An emergency department receives an average of 8 trauma cases per day.

What is the probability of receiving exactly 8 trauma cases tomorrow?
What is the probability of receiving 12 or more?
On how many days per year (365 days) would you expect 0 trauma cases?
Create a bar plot of the Poisson(8) distribution from 0 to 20.

Code

lambda <- 8

# 1. P(X = 8)
prob_exactly_8 <- # YOUR CODE HERE (hint: dpois())

# 2. P(X >= 12)
prob_12_or_more <- # YOUR CODE HERE (hint: ppois with lower.tail)

# 3. Expected days with 0 cases in a year
prob_zero <- # YOUR CODE HERE
expected_zero_days <- # YOUR CODE HERE

# 4. Plot
# YOUR CODE HERE

Code

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

lam = 8

# 1. P(X = 8)
prob_exactly_8 = # YOUR CODE HERE (hint: stats.poisson.pmf())

# 2. P(X >= 12)
prob_12_or_more = # YOUR CODE HERE (hint: stats.poisson.sf())

# 3. Expected days with 0 cases
prob_zero = # YOUR CODE HERE
expected_zero_days = # YOUR CODE HERE

# 4. Plot
# YOUR CODE HERE

Solution

Code

library(tidyverse)

lambda <- 8

# 1. P(X = 8)
prob_exactly_8 <- dpois(8, lambda = lambda)
cat("P(X = 8):", round(prob_exactly_8, 4), "\n")
# About 0.1396

# 2. P(X >= 12)
prob_12_or_more <- ppois(11, lambda = lambda, lower.tail = FALSE)
cat("P(X >= 12):", round(prob_12_or_more, 4), "\n")
# About 0.1121

# 3. Expected days with 0 cases
prob_zero <- dpois(0, lambda = lambda)
expected_zero_days <- 365 * prob_zero
cat("P(X = 0):", format(prob_zero, scientific = TRUE), "\n")
cat("Expected zero-case days per year:", round(expected_zero_days, 3), "\n")
# P(0) is very small (~0.000335), so about 0.12 days/year

# 4. Plot
k <- 0:20
poisson_data <- tibble(
  k = k,
  probability = dpois(k, lambda = lambda)
)

ggplot(poisson_data, aes(x = k, y = probability)) +
  geom_col(fill = "darkorange", alpha = 0.8) +
  geom_vline(xintercept = lambda, linetype = "dashed", color = "firebrick") +
  labs(
    title = "Poisson Distribution: Trauma Cases per Day",
    subtitle = expression(paste(lambda, " = 8")),
    x = "Number of Trauma Cases", y = "Probability"
  ) +
  scale_x_continuous(breaks = k) +
  theme_minimal()

Code

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

lam = 8

# 1. P(X = 8)
prob_exactly_8 = stats.poisson.pmf(8, lam)
print(f"P(X = 8): {prob_exactly_8:.4f}")
# About 0.1396

# 2. P(X >= 12)
prob_12_or_more = stats.poisson.sf(11, lam)
print(f"P(X >= 12): {prob_12_or_more:.4f}")
# About 0.1121

# 3. Expected days with 0 cases
prob_zero = stats.poisson.pmf(0, lam)
expected_zero_days = 365 * prob_zero
print(f"P(X = 0): {prob_zero:.6f}")
print(f"Expected zero-case days per year: {expected_zero_days:.3f}")
# Very small, about 0.12 days per year

# 4. Plot
k = np.arange(0, 21)
probs = stats.poisson.pmf(k, lam)

fig, ax = plt.subplots(figsize=(9, 5))
ax.bar(k, probs, color="darkorange", alpha=0.8)
ax.axvline(lam, linestyle="--", color="firebrick")
ax.set_title(f"Poisson Distribution: Trauma Cases per Day (lambda = {lam})")
ax.set_xlabel("Number of Trauma Cases")
ax.set_ylabel("Probability")
ax.set_xticks(k)
plt.tight_layout()
plt.show()

Exercise 1.4: Bayes’ Theorem — Interpreting a Cancer Screening Test

A breast cancer screening mammogram has the following characteristics:

Sensitivity: 87% (detects 87% of actual cancers)
Specificity: 95% (correctly clears 95% of women without cancer)
Prevalence of breast cancer in women aged 50–59 undergoing screening: 2%

Calculate the positive predictive value (PPV): if a woman in this age group has a positive mammogram, what is the probability she actually has breast cancer?
Calculate the negative predictive value (NPV).
How does the PPV change if we consider a high-risk population where prevalence is 10%?
Create a plot showing PPV as a function of prevalence (from 0.1% to 30%).

Code

sensitivity <- 0.87
specificity <- 0.95
prevalence <- 0.02

# 1. PPV
ppv <- # YOUR CODE HERE

# 2. NPV
npv <- # YOUR CODE HERE

# 3. PPV with prevalence = 10%
ppv_high_risk <- # YOUR CODE HERE

# 4. Plot PPV vs. prevalence
# YOUR CODE HERE

Code

import numpy as np
import matplotlib.pyplot as plt

sensitivity = 0.87
specificity = 0.95
prevalence = 0.02

# 1. PPV
ppv = # YOUR CODE HERE

# 2. NPV
npv = # YOUR CODE HERE

# 3. PPV with prevalence = 10%
ppv_high_risk = # YOUR CODE HERE

# 4. Plot PPV vs. prevalence
# YOUR CODE HERE

Solution

Code

library(tidyverse)

sensitivity <- 0.87
specificity <- 0.95
prevalence <- 0.02

# 1. PPV
ppv <- (sensitivity * prevalence) /
  (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
cat("PPV (prevalence = 2%):", round(ppv, 4), "\n")
# About 0.2623 or 26.2%

# 2. NPV
npv <- (specificity * (1 - prevalence)) /
  (specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)
cat("NPV (prevalence = 2%):", round(npv, 4), "\n")
# About 0.9972 or 99.7%

# 3. PPV with high-risk prevalence
prev_high <- 0.10
ppv_high_risk <- (sensitivity * prev_high) /
  (sensitivity * prev_high + (1 - specificity) * (1 - prev_high))
cat("PPV (prevalence = 10%):", round(ppv_high_risk, 4), "\n")
# About 0.6588 or 65.9%

# 4. Plot
prev_range <- seq(0.001, 0.30, by = 0.001)
ppv_curve <- (sensitivity * prev_range) /
  (sensitivity * prev_range + (1 - specificity) * (1 - prev_range))

plot_data <- tibble(prevalence = prev_range, PPV = ppv_curve)

ggplot(plot_data, aes(x = prevalence, y = PPV)) +
  geom_line(color = "steelblue", linewidth = 1.2) +
  geom_point(data = tibble(prevalence = c(0.02, 0.10),
                            PPV = c(ppv, ppv_high_risk)),
             color = "firebrick", size = 3) +
  annotate("text", x = 0.04, y = ppv - 0.03,
           label = paste0("General pop (2%): PPV = ", round(ppv * 100, 1), "%"),
           hjust = 0, color = "firebrick", size = 3.5) +
  annotate("text", x = 0.12, y = ppv_high_risk - 0.03,
           label = paste0("High risk (10%): PPV = ", round(ppv_high_risk * 100, 1), "%"),
           hjust = 0, color = "firebrick", size = 3.5) +
  scale_x_continuous(labels = scales::percent_format()) +
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) +
  labs(
    title = "Mammography PPV Depends on Prevalence",
    subtitle = "Sensitivity = 87%, Specificity = 95%",
    x = "Prevalence",
    y = "Positive Predictive Value (PPV)"
  ) +
  theme_minimal()

Code

import numpy as np
import matplotlib.pyplot as plt

sensitivity = 0.87
specificity = 0.95
prevalence = 0.02

# 1. PPV
ppv = (sensitivity * prevalence) / (
    sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
)
print(f"PPV (prevalence = 2%): {ppv:.4f}")
# About 0.2623 or 26.2%

# 2. NPV
npv = (specificity * (1 - prevalence)) / (
    specificity * (1 - prevalence) + (1 - sensitivity) * prevalence
)
print(f"NPV (prevalence = 2%): {npv:.4f}")
# About 0.9972 or 99.7%

# 3. PPV with high-risk prevalence
prev_high = 0.10
ppv_high = (sensitivity * prev_high) / (
    sensitivity * prev_high + (1 - specificity) * (1 - prev_high)
)
print(f"PPV (prevalence = 10%): {ppv_high:.4f}")
# About 0.6588 or 65.9%

# 4. Plot
prev_range = np.arange(0.001, 0.301, 0.001)
ppv_curve = (sensitivity * prev_range) / (
    sensitivity * prev_range + (1 - specificity) * (1 - prev_range)
)

fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(prev_range * 100, ppv_curve * 100, color="steelblue", linewidth=1.5)
ax.plot([2, 10], [ppv * 100, ppv_high * 100], "o", color="firebrick", markersize=8)
ax.annotate(f"General pop (2%): PPV = {ppv*100:.1f}%",
            xy=(2, ppv*100), xytext=(5, ppv*100 - 5),
            color="firebrick", fontsize=9,
            arrowprops=dict(arrowstyle="->", color="firebrick"))
ax.annotate(f"High risk (10%): PPV = {ppv_high*100:.1f}%",
            xy=(10, ppv_high*100), xytext=(13, ppv_high*100 - 5),
            color="firebrick", fontsize=9,
            arrowprops=dict(arrowstyle="->", color="firebrick"))
ax.set_xlabel("Prevalence (%)")
ax.set_ylabel("Positive Predictive Value (%)")
ax.set_title("Mammography PPV Depends on Prevalence\nSensitivity = 87%, Specificity = 95%")
ax.set_ylim(0, 105)
plt.tight_layout()
plt.show()

Exercise 1.5: Central Limit Theorem — Hands-On Simulation

Demonstrate the Central Limit Theorem yourself by simulating sample means from a non-normal distribution.

Generate 10,000 individual observations from a Poisson distribution with $\lambda = 3$ (this distribution is right-skewed). Plot a histogram.
Now, for each of the following sample sizes ($n$ = 5, 15, 50, 200), draw 5,000 random samples of that size, compute the mean of each sample, and plot the distribution of sample means.
Overlay a normal curve on the $n = 50$ histogram to show how well the normal approximation fits.

Code

library(tidyverse)

set.seed(123)
lambda <- 3

# 1. Histogram of raw Poisson data
raw_data <- rpois(10000, lambda = lambda)
# YOUR CODE HERE: plot a histogram

# 2. CLT simulation for different sample sizes
sample_sizes <- c(5, 15, 50, 200)
# YOUR CODE HERE: for each n, draw 5000 samples, compute means, plot

# 3. Overlay normal curve on n = 50
# YOUR CODE HERE

Code

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(123)
lam = 3

# 1. Histogram of raw Poisson data
raw_data = np.random.poisson(lam, 10000)
# YOUR CODE HERE: plot a histogram

# 2. CLT simulation
sample_sizes = [5, 15, 50, 200]
# YOUR CODE HERE

# 3. Overlay normal curve
# YOUR CODE HERE

Solution

Code

library(tidyverse)

set.seed(123)
lambda <- 3
n_sim <- 5000

# 1. Raw Poisson data
raw_data <- rpois(10000, lambda = lambda)
ggplot(tibble(x = raw_data), aes(x = x)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.8) +
  labs(title = "Raw Poisson(3) Data (right-skewed)",
       x = "Value", y = "Count") +
  theme_minimal()

# 2. CLT simulation
sample_sizes <- c(5, 15, 50, 200)

clt_data <- map_dfr(sample_sizes, function(n) {
  means <- replicate(n_sim, mean(rpois(n, lambda = lambda)))
  tibble(n = paste0("n = ", n), sample_mean = means)
})

clt_data$n <- factor(clt_data$n, levels = paste0("n = ", sample_sizes))

ggplot(clt_data, aes(x = sample_mean)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 40, fill = "steelblue", alpha = 0.7) +
  facet_wrap(~n, scales = "free_y") +
  labs(title = "CLT: Distribution of Sample Means from Poisson(3)",
       x = "Sample Mean", y = "Density") +
  theme_minimal()

# 3. Overlay normal curve on n = 50
n50_means <- replicate(n_sim, mean(rpois(50, lambda = lambda)))
theoretical_sd <- sqrt(lambda / 50)

ggplot(tibble(x = n50_means), aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 40, fill = "steelblue", alpha = 0.7) +
  stat_function(fun = dnorm, args = list(mean = lambda, sd = theoretical_sd),
                color = "firebrick", linewidth = 1.2) +
  labs(title = "Sample Means (n=50) with Normal Approximation Overlay",
       subtitle = paste0("Theoretical: N(", lambda, ", ", round(theoretical_sd^2, 4), ")"),
       x = "Sample Mean", y = "Density") +
  theme_minimal()

Code

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(123)
lam = 3
n_sim = 5000

# 1. Raw Poisson data
raw_data = np.random.poisson(lam, 10000)
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(raw_data, bins=range(0, 15), color="steelblue", alpha=0.8, edgecolor="white")
ax.set_title("Raw Poisson(3) Data (right-skewed)")
ax.set_xlabel("Value")
ax.set_ylabel("Count")
plt.tight_layout()
plt.show()

# 2. CLT simulation
sample_sizes = [5, 15, 50, 200]
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes = axes.flatten()

for i, n in enumerate(sample_sizes):
    means = [np.mean(np.random.poisson(lam, n)) for _ in range(n_sim)]
    axes[i].hist(means, bins=40, density=True, color="steelblue", alpha=0.7)
    axes[i].set_title(f"n = {n}")
    axes[i].set_xlabel("Sample Mean")
    axes[i].set_ylabel("Density")

fig.suptitle("CLT: Distribution of Sample Means from Poisson(3)", fontsize=13)
plt.tight_layout()
plt.show()

# 3. Overlay normal curve on n = 50
n50_means = [np.mean(np.random.poisson(lam, 50)) for _ in range(n_sim)]
theoretical_sd = np.sqrt(lam / 50)

fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(n50_means, bins=40, density=True, color="steelblue", alpha=0.7, label="Simulated")
x_range = np.linspace(min(n50_means), max(n50_means), 200)
ax.plot(x_range, stats.norm.pdf(x_range, loc=lam, scale=theoretical_sd),
        color="firebrick", linewidth=1.5, label="Normal approx.")
ax.set_title("Sample Means (n=50) with Normal Approximation Overlay")
ax.set_xlabel("Sample Mean")
ax.set_ylabel("Density")
ax.legend()
plt.tight_layout()
plt.show()

3.10 Chapter Summary

In this chapter, we covered the foundational language of probability and statistics:

Probability quantifies uncertainty. The frequentist interpretation focuses on long-run frequencies; the Bayesian interpretation treats probability as a degree of belief. Both are useful.
Random variables can be discrete (countable values) or continuous (any value in a range).
The normal distribution is the most important distribution in statistics, characterized by its mean and standard deviation. Z-scores standardize values relative to the distribution.
The binomial distribution models the number of successes in a fixed number of independent trials — useful for modeling treatment response rates, disease prevalence, and more.
The Poisson distribution models counts of rare events occurring at a constant rate — useful for adverse events, hospital admissions, and rare disease incidence.
The Central Limit Theorem tells us that sample means are approximately normally distributed regardless of the underlying population distribution, as long as the sample size is large enough. This justifies the widespread use of normal-based statistical methods.
Bayes’ theorem provides a principled way to update probabilities in light of new evidence. In diagnostic testing, it reveals that the predictive value of a test depends critically on disease prevalence, not just sensitivity and specificity.

These concepts underpin everything that follows in this course. In the next chapter, we will build on this foundation to discuss statistical inference — how to draw conclusions about populations from samples.

3.11 References and Further Reading

Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. A rigorous but accessible introduction to probability and statistics. Chapters 1–3 cover the material in this chapter at a more mathematical level.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). CRC Press. The definitive textbook on Bayesian statistics. Chapter 1 provides an excellent introduction to Bayesian thinking.
Harrell, F. E. (2015). Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis (2nd ed.). Springer. Chapter 2 discusses the role of probability and distributions in the context of regression modeling for health sciences.
Altman, D. G., & Bland, J. M. (1994). “Diagnostic tests 1: Sensitivity and specificity.” BMJ, 308(6943), 1552. A classic short paper explaining sensitivity, specificity, and predictive values for a clinical audience.
Altman, D. G., & Bland, J. M. (1994). “Diagnostic tests 2: Predictive values.” BMJ, 309(6947), 102. The companion paper on PPV and NPV, with a clear explanation of how prevalence affects predictive values.
Gigerenzer, G., Gaissmaier, W., Kurz-Milcke, E., Schwartz, L. M., & Woloshin, S. (2007). “Helping doctors and patients make sense of health statistics.” Psychological Science in the Public Interest, 8(2), 53–96. An excellent discussion of how to communicate probabilistic information about diagnostic tests and treatments.
Cetinkaya-Rundel, M., & Hardin, J. (2024). Introduction to Modern Statistics (2nd ed.). OpenIntro. Freely available at https://openintro-ims2.netlify.app/. Chapters 3–4 provide an accessible introduction to probability and distributions with many worked examples.
Rice, J. A. (2007). Mathematical Statistics and Data Analysis (3rd ed.). Thomson Brooks/Cole. A thorough treatment of probability distributions for students comfortable with calculus.
Motulsky, H. (2018). Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking (4th ed.). Oxford University Press. A superb introduction to biostatistics for medical professionals, with clear explanations of probability, distributions, and Bayesian reasoning.

# Probability, Distributions, and the Language of Uncertainty {#sec-probability} ## Why Probability Matters in Health Sciences Medicine is a discipline of uncertainty. A patient presents with chest pain --- is it a heart attack, acid reflux, or a pulled muscle? A new drug shows a 15% reduction in tumor size --- is this a real effect or a fluke of sampling? A screening test comes back positive --- what is the chance the patient actually has the disease? Probability gives us a formal language for reasoning about uncertainty. It allows us to quantify how likely different outcomes are, how confident we should be in our conclusions, and how we should update our beliefs when new evidence arrives. Every statistical method you will learn in this course --- from logistic regression to Bayesian modeling to machine learning --- is built on the foundation of probability theory. This chapter introduces the key ideas you need. We will move slowly, use concrete clinical examples, and build intuition before introducing formulas. ## What Is Probability? At its simplest, a **probability** is a number between 0 and 1 that represents how likely an event is to occur. A probability of 0 means the event is impossible; a probability of 1 means it is certain. A probability of 0.5 means the event is equally likely to happen or not happen. But where do probabilities come from? There are two major perspectives. ### The Frequentist Interpretation The **frequentist** interpretation says that the probability of an event is the long-run relative frequency with which it occurs if we repeat the same process many times. For example: - If we flip a fair coin many times, it will land heads about 50% of the time. We say $P(\text{heads}) = 0.5$. - If a particular surgery has a 3% mortality rate, this means that out of every 100 patients who undergo the procedure (under similar conditions), roughly 3 will die. This interpretation works well when we can imagine repeating an experiment. Coin flips, blood draws from a population, random assignment in a clinical trial --- all of these fit naturally into the frequentist framework. ### The Subjective (Bayesian) Interpretation The **subjective** or **Bayesian** interpretation says that probability represents a *degree of belief*. It is a measure of how confident you are that something is true, given what you currently know. - A cardiologist examining a 65-year-old male smoker with crushing chest pain might say: "I believe there is an 85% probability this is a myocardial infarction." This is not based on repeating the exact same patient encounter thousands of times --- it is an informed judgment. - Before seeing any data from a clinical trial, a researcher might believe there is a 30% chance a new drug is effective. After seeing positive results from the trial, they update that belief to 80%. The Bayesian interpretation is powerful because it applies to one-time events and unique situations. You cannot repeat a patient's life, but you can still reason probabilistically about their prognosis. ### Which Interpretation Is Correct? Both are useful. In this course, we will use frequentist methods (hypothesis tests, confidence intervals, maximum likelihood) and Bayesian methods (posterior distributions, credible intervals, prior updating). Understanding both perspectives will make you a more versatile analyst. The key insight is that probability is a *tool for reasoning under uncertainty*, regardless of your philosophical starting point. ## Random Variables A **random variable** is a quantity whose value is determined by a random process. In clinical research, random variables are everywhere: - The number of patients who respond to treatment (a count) - A patient's systolic blood pressure (a measurement) - Whether a tumor is malignant or benign (a category) - Time from diagnosis to death (a duration) We use capital letters like $X$, $Y$, or $Z$ to denote random variables, and lowercase letters like $x$, $y$, $z$ for specific observed values. ### Discrete Random Variables A **discrete** random variable takes on a countable number of values, typically whole numbers. Examples: - The number of adverse events in a clinical trial: 0, 1, 2, 3, ... - The number of positive test results in a batch of 20 specimens: 0, 1, 2, ..., 20 - A patient's pain score on a 0--10 scale: 0, 1, 2, ..., 10 For a discrete random variable, we describe its behavior using a **probability mass function (PMF)**, which gives the probability of each possible value: $$P(X = x)$$ All probabilities must be non-negative, and they must sum to 1: $$\sum_{\text{all } x} P(X = x) = 1$$ ### Continuous Random Variables A **continuous** random variable can take on any value within a range (including decimals). Examples: - Body temperature: 36.2, 37.45, 38.1, ... - Serum creatinine level: any positive real number - Time to event: any positive duration For a continuous random variable, the probability of any *exact* value is technically zero (because there are infinitely many possible values). Instead, we talk about the probability of falling within a *range*, and we describe the distribution using a **probability density function (PDF)**, denoted $f(x)$. The probability that $X$ falls between $a$ and $b$ is: $$P(a \leq X \leq b) = \int_a^b f(x) \, dx$$ This is the area under the density curve between $a$ and $b$. The total area under the entire curve equals 1. Do not worry if integrals make you nervous. The key idea is: for continuous variables, probability corresponds to *area under a curve*. ## The Normal (Gaussian) Distribution The **normal distribution** --- also called the Gaussian distribution or the "bell curve" --- is the most important probability distribution in statistics. It appears everywhere in nature and medicine. ### Why Does It Matter? 1. **Many biological measurements are approximately normal.** Heights, weights, blood pressure readings, cholesterol levels, and many lab values follow roughly bell-shaped distributions in large populations. 2. **The Central Limit Theorem** (discussed later in this chapter) tells us that averages of measurements tend to follow a normal distribution, even when individual measurements do not. 3. **Many statistical methods assume normality** (or at least approximate normality) for valid inference. ### Properties of the Normal Distribution A normal distribution is completely described by two parameters: - **Mean ($\mu$)**: The center of the distribution. The peak of the bell curve. - **Standard deviation ($\sigma$)**: A measure of spread. About 68% of values fall within one standard deviation of the mean, about 95% within two, and about 99.7% within three. This is sometimes called the **68-95-99.7 rule**. The probability density function is: $$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$ You do not need to memorize this formula. What matters is understanding the shape and the role of $\mu$ and $\sigma$. We write $X \sim \text{Normal}(\mu, \sigma^2)$ or $X \sim N(\mu, \sigma^2)$ to say that $X$ follows a normal distribution with mean $\mu$ and variance $\sigma^2$. ### Clinical Example: Blood Pressure Systolic blood pressure in healthy adults is approximately normally distributed with a mean of about 120 mmHg and a standard deviation of about 15 mmHg. That is, $\text{SBP} \sim N(120, 15^2)$. Using the 68-95-99.7 rule: - About 68% of healthy adults have SBP between 105 and 135 mmHg (120 plus or minus 15). - About 95% have SBP between 90 and 150 mmHg (120 plus or minus 30). - About 99.7% have SBP between 75 and 165 mmHg (120 plus or minus 45). If a patient has a systolic blood pressure of 160 mmHg, how unusual is that? We can answer this using **Z-scores**. ### Z-Scores: Standardizing Values A **Z-score** tells you how many standard deviations a value is from the mean: $$Z = \frac{X - \mu}{\sigma}$$ For our patient with SBP = 160: $$Z = \frac{160 - 120}{15} = \frac{40}{15} \approx 2.67$$ This patient's blood pressure is about 2.67 standard deviations above the mean. Using a standard normal table or a computer, we find that only about 0.4% of the healthy population would have a blood pressure this high or higher. This is strong evidence that this patient may have hypertension. The **standard normal distribution** is a normal distribution with $\mu = 0$ and $\sigma = 1$. Any normal distribution can be converted to the standard normal by computing Z-scores. Let us visualize the normal distribution and Z-scores: ::: {.panel-tabset} ### R ```{r} #| eval: false library(tidyverse) # Parameters for systolic blood pressure mu <- 120 sigma <- 15 # Create a sequence of values for the x-axis x <- seq(mu - 4 * sigma, mu + 4 * sigma, length.out = 300) y <- dnorm(x, mean = mu, sd = sigma) bp_data <- tibble(x = x, y = y) ggplot(bp_data, aes(x = x, y = y)) + geom_line(color = "steelblue", linewidth = 1.2) + geom_area(data = bp_data |> filter(x >= 160), aes(x = x, y = y), fill = "firebrick", alpha = 0.4) + geom_vline(xintercept = 160, linetype = "dashed", color = "firebrick") + annotate("text", x = 168, y = 0.015, label = "Patient: 160 mmHg\n(Z = 2.67)", color = "firebrick", hjust = 0, size = 3.5) + labs( title = "Distribution of Systolic Blood Pressure in Healthy Adults", subtitle = paste0("Normal(", mu, ", ", sigma, "^2)"), x = "Systolic Blood Pressure (mmHg)", y = "Density" ) + theme_minimal() ``` ### Python ```{python} #| eval: false import numpy as np import matplotlib.pyplot as plt from scipy import stats mu = 120 sigma = 15 x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 300) y = stats.norm.pdf(x, loc=mu, scale=sigma) fig, ax = plt.subplots(figsize=(9, 5)) ax.plot(x, y, color="steelblue", linewidth=1.5) # Shade the area above 160 x_fill = x[x >= 160] y_fill = stats.norm.pdf(x_fill, loc=mu, scale=sigma) ax.fill_between(x_fill, y_fill, color="firebrick", alpha=0.4) ax.axvline(160, linestyle="--", color="firebrick") ax.text(162, 0.015, "Patient: 160 mmHg\n(Z = 2.67)", color="firebrick", fontsize=9) ax.set_title("Distribution of Systolic Blood Pressure in Healthy Adults") ax.set_xlabel("Systolic Blood Pressure (mmHg)") ax.set_ylabel("Density") plt.tight_layout() plt.show() ``` ::: ## The Binomial Distribution The **binomial distribution** models the number of "successes" in a fixed number of independent trials, where each trial has the same probability of success. Despite the name, a "success" does not have to be a good thing --- it is simply the outcome we are counting. ### Parameters - $n$: the number of trials (a fixed, known integer) - $p$: the probability of success on each trial (between 0 and 1) We write $X \sim \text{Binomial}(n, p)$, and the PMF is: $$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, 2, \ldots, n$$ where $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ is the binomial coefficient ("n choose k"), which counts the number of ways to arrange $k$ successes among $n$ trials. **Mean**: $E[X] = np$ **Variance**: $\text{Var}(X) = np(1-p)$ ### Clinical Examples **Disease prevalence.** In a city, the prevalence of Type 2 diabetes among adults is 12%. If you randomly sample 50 adults, the number with diabetes follows a Binomial(50, 0.12) distribution. The expected number is $50 \times 0.12 = 6$. **Treatment response.** A chemotherapy regimen achieves complete response in 35% of patients. If 20 patients are treated, the number who achieve complete response follows a Binomial(20, 0.35) distribution. **Key assumptions**: The binomial distribution assumes that (1) each trial is independent (one patient's outcome does not affect another's), (2) the probability of success is the same for every trial, and (3) the number of trials is fixed in advance. In practice, these assumptions are sometimes approximate (e.g., disease probability may vary with age), but the binomial is still a useful starting model. ::: {.panel-tabset} ### R ```{r} #| eval: false library(tidyverse) # Treatment response: 20 patients, 35% response rate n <- 20 p <- 0.35 # Calculate probabilities for each possible outcome outcomes <- tibble( k = 0:n, probability = dbinom(k, size = n, prob = p) ) ggplot(outcomes, aes(x = k, y = probability)) + geom_col(fill = "steelblue", alpha = 0.8) + geom_vline(xintercept = n * p, linetype = "dashed", color = "firebrick") + annotate("text", x = n * p + 0.5, y = max(outcomes$probability), label = paste0("Expected value = ", n * p), hjust = 0, color = "firebrick") + labs( title = "Binomial Distribution: Treatment Response", subtitle = "n = 20 patients, p = 0.35 response rate", x = "Number of Responders", y = "Probability" ) + scale_x_continuous(breaks = 0:n) + theme_minimal() ``` ### Python ```{python} #| eval: false import numpy as np import matplotlib.pyplot as plt from scipy import stats n = 20 p = 0.35 k = np.arange(0, n + 1) probabilities = stats.binom.pmf(k, n, p) fig, ax = plt.subplots(figsize=(9, 5)) ax.bar(k, probabilities, color="steelblue", alpha=0.8) ax.axvline(n * p, linestyle="--", color="firebrick") ax.text(n * p + 0.3, max(probabilities), f"Expected value = {n * p}", color="firebrick", fontsize=9) ax.set_title("Binomial Distribution: Treatment Response") ax.set_xlabel("Number of Responders") ax.set_ylabel("Probability") ax.set_xticks(k) plt.tight_layout() plt.show() ``` ::: ## The Poisson Distribution The **Poisson distribution** models the number of events occurring in a fixed interval of time or space, when events happen independently at a constant average rate. It is the go-to distribution for modeling rare events. ### Parameters - $\lambda$ (lambda): the average number of events per interval. This is both the mean and the variance. We write $X \sim \text{Poisson}(\lambda)$, and the PMF is: $$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots$$ **Mean**: $E[X] = \lambda$ **Variance**: $\text{Var}(X) = \lambda$ The fact that the mean equals the variance is a distinctive property of the Poisson distribution. When you see count data where the variance is much larger than the mean, the Poisson may not be a good fit (a condition called **overdispersion**). ### Clinical Examples **Adverse drug reactions.** A hospital reports an average of 2.5 serious adverse drug reactions per month. If these events are independent, the number in any given month follows approximately a Poisson(2.5) distribution. The probability of zero serious reactions in a month is $P(X = 0) = e^{-2.5} \approx 0.082$, or about 8.2%. **Hospital admissions.** An emergency department sees an average of 40 patients per shift. The number of arrivals per shift can be modeled with a Poisson(40) distribution (though in practice, arrivals may cluster, violating independence). **Rare disease incidence.** If a rare genetic disorder affects 1 in 100,000 people per year and a city has 500,000 people, we expect about 5 cases per year. The annual count follows approximately a Poisson(5) distribution. ::: {.panel-tabset} ### R ```{r} #| eval: false library(tidyverse) # Adverse drug reactions: average 2.5 per month lambda <- 2.5 k <- 0:12 adr_data <- tibble( k = k, probability = dpois(k, lambda = lambda) ) ggplot(adr_data, aes(x = k, y = probability)) + geom_col(fill = "darkorange", alpha = 0.8) + geom_vline(xintercept = lambda, linetype = "dashed", color = "firebrick") + labs( title = "Poisson Distribution: Adverse Drug Reactions per Month", subtitle = expression(paste(lambda, " = 2.5 reactions/month")), x = "Number of Adverse Reactions", y = "Probability" ) + scale_x_continuous(breaks = k) + theme_minimal() ``` ### Python ```{python} #| eval: false import numpy as np import matplotlib.pyplot as plt from scipy import stats lam = 2.5 k = np.arange(0, 13) probs = stats.poisson.pmf(k, lam) fig, ax = plt.subplots(figsize=(9, 5)) ax.bar(k, probs, color="darkorange", alpha=0.8) ax.axvline(lam, linestyle="--", color="firebrick") ax.set_title(f"Poisson Distribution: Adverse Drug Reactions per Month (lambda = {lam})") ax.set_xlabel("Number of Adverse Reactions") ax.set_ylabel("Probability") ax.set_xticks(k) plt.tight_layout() plt.show() ``` ::: ## The Central Limit Theorem The **Central Limit Theorem (CLT)** is one of the most remarkable results in all of statistics. It states: > If you take sufficiently large random samples from *any* population (regardless of the population's distribution), the distribution of the sample means will be approximately normal. More precisely: if $X_1, X_2, \ldots, X_n$ are independent and identically distributed random variables with mean $\mu$ and standard deviation $\sigma$, then the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ has a distribution that approaches $N(\mu, \sigma^2/n)$ as $n$ grows large. ### Why Is This Important? The CLT is the reason the normal distribution is so central to statistics. It means that: 1. **We can use normal-based methods** (t-tests, confidence intervals, regression) even when the underlying data are not normally distributed, as long as our sample size is large enough. 2. **Sample means are more predictable than individual observations.** The standard deviation of the sample mean ($\sigma / \sqrt{n}$) shrinks as $n$ grows, meaning larger samples give more precise estimates. ### How Large Is "Large Enough"? It depends on how non-normal the underlying distribution is: - If the population is already roughly symmetric, $n = 15$--$30$ is often sufficient. - If the population is heavily skewed (e.g., income, hospital length of stay), you may need $n = 50$ or more. - For extremely skewed distributions, $n > 100$ may be needed. ### Simulation Demonstration Let us demonstrate the CLT visually. We will take samples from a **right-skewed** distribution (exponential, which looks nothing like a bell curve) and show that the distribution of sample means becomes increasingly normal as the sample size grows. ::: {.panel-tabset} ### R ```{r} #| eval: false library(tidyverse) set.seed(42) # Parameters n_simulations <- 5000 # Number of samples to draw sample_sizes <- c(1, 5, 30, 100) # For each sample size, draw many samples and compute the mean of each clt_data <- map_dfr(sample_sizes, function(n) { means <- replicate(n_simulations, mean(rexp(n, rate = 1))) tibble( sample_size = paste0("n = ", n), sample_mean = means ) }) # Convert to ordered factor so panels appear in the right order clt_data$sample_size <- factor(clt_data$sample_size, levels = paste0("n = ", sample_sizes)) ggplot(clt_data, aes(x = sample_mean)) + geom_histogram(aes(y = after_stat(density)), bins = 50, fill = "steelblue", alpha = 0.7) + geom_density(color = "firebrick", linewidth = 0.8) + facet_wrap(~sample_size, scales = "free") + labs( title = "Central Limit Theorem in Action", subtitle = "Sampling from an Exponential(1) distribution (mean = 1, right-skewed)", x = "Sample Mean", y = "Density" ) + theme_minimal() ``` ### Python ```{python} #| eval: false import numpy as np import matplotlib.pyplot as plt np.random.seed(42) n_simulations = 5000 sample_sizes = [1, 5, 30, 100] fig, axes = plt.subplots(2, 2, figsize=(10, 8)) axes = axes.flatten() for i, n in enumerate(sample_sizes): means = [np.mean(np.random.exponential(1, n)) for _ in range(n_simulations)] axes[i].hist(means, bins=50, density=True, color="steelblue", alpha=0.7) axes[i].set_title(f"n = {n}") axes[i].set_xlabel("Sample Mean") axes[i].set_ylabel("Density") fig.suptitle( "Central Limit Theorem in Action\n" "Sampling from an Exponential(1) distribution (right-skewed)", fontsize=13 ) plt.tight_layout() plt.show() ``` ::: Notice how the distribution of sample means changes: - When $n = 1$, the distribution of "means" is just the original exponential distribution --- heavily right-skewed. - At $n = 5$, the skewness is already reduced. - At $n = 30$, the distribution looks approximately bell-shaped. - At $n = 100$, the distribution is very close to a perfect normal curve, and it is much narrower (less variable). This is the CLT in action. It does not matter that we started with a skewed distribution --- the means converge to normality. ## Bayes' Theorem: Updating Beliefs with Evidence {#sec-bayes-theorem} ### The Idea **Bayes' theorem** provides a mathematically precise way to update your beliefs when you receive new evidence. In clinical medicine, one of the most important applications is interpreting diagnostic tests. Suppose a patient tests positive for a disease. What is the probability they actually have the disease? Your intuition might say "very high," but as we will see, the answer depends critically on how common the disease is in the population. ### The Formula $$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}$$ In the context of diagnostic testing: - $P(\text{Disease} \mid \text{Positive test})$ = the probability the patient has the disease, given a positive test result. This is called the **positive predictive value (PPV)**. - $P(\text{Positive test} \mid \text{Disease})$ = the probability of a positive test result if the patient truly has the disease. This is the test's **sensitivity** (true positive rate). - $P(\text{Disease})$ = the probability of having the disease before the test, based on the population prevalence or clinical judgment. This is the **prior probability** or **pre-test probability**. - $P(\text{Positive test})$ = the overall probability of a positive test result, accounting for both true positives and false positives. We can expand the denominator using the **law of total probability**: $$P(\text{Positive test}) = P(\text{Positive} \mid \text{Disease}) \times P(\text{Disease}) + P(\text{Positive} \mid \text{No disease}) \times P(\text{No disease})$$ Note that $P(\text{Positive} \mid \text{No disease})$ is the **false positive rate**, which equals $1 - \text{specificity}$. ### Key Diagnostic Test Terminology | Term | Definition | Formula | |------|-----------|---------| | Sensitivity | Probability of a positive test given disease is present | $P(+\mid D)$ | | Specificity | Probability of a negative test given disease is absent | $P(-\mid \bar{D})$ | | Positive Predictive Value (PPV) | Probability of disease given a positive test | $P(D\mid +)$ | | Negative Predictive Value (NPV) | Probability of no disease given a negative test | $P(\bar{D}\mid -)$ | | Prevalence | Proportion of the population with the disease | $P(D)$ | : Diagnostic test performance metrics {.striped} ### Clinical Example: HIV Screening Let us work through a concrete example. Consider an HIV screening test with the following characteristics: - **Sensitivity**: 99.7% (the test correctly identifies 99.7% of people who have HIV) - **Specificity**: 99.5% (the test correctly identifies 99.5% of people who do not have HIV) - **Prevalence**: 0.3% (the proportion of the general population with HIV) These numbers look excellent. But what happens when a randomly selected person tests positive? Using Bayes' theorem: $$PPV = P(D \mid +) = \frac{P(+ \mid D) \times P(D)}{P(+ \mid D) \times P(D) + P(+ \mid \bar{D}) \times P(\bar{D})}$$ $$= \frac{0.997 \times 0.003}{0.997 \times 0.003 + 0.005 \times 0.997}$$ $$= \frac{0.002991}{0.002991 + 0.004985}$$ $$= \frac{0.002991}{0.007976} \approx 0.375$$ The positive predictive value is only about **37.5%**! This means that if a person from the general population tests positive, there is only about a 37.5% chance they actually have HIV. More than 6 out of 10 positive results are false positives. This counterintuitive result arises because the disease is rare. Even with a highly accurate test, the large number of healthy people (99.7% of the population) generates many false positives that overwhelm the true positives from the small number of infected people. This is why HIV screening programs use **confirmatory testing**: a second, different test is performed on all positive results. It is also why screening is more effective in high-prevalence populations (e.g., among people with known risk factors). ### The Role of Prevalence The PPV depends heavily on prevalence. Let us compute the PPV across a range of prevalence values: ::: {.panel-tabset} ### R ```{r} #| eval: false library(tidyverse) sensitivity <- 0.997 specificity <- 0.995 prevalence <- seq(0.001, 0.30, by = 0.001) ppv <- (sensitivity * prevalence) / (sensitivity * prevalence + (1 - specificity) * (1 - prevalence)) npv <- (specificity * (1 - prevalence)) / (specificity * (1 - prevalence) + (1 - sensitivity) * prevalence) test_data <- tibble( prevalence = prevalence, PPV = ppv, NPV = npv ) |> pivot_longer(cols = c(PPV, NPV), names_to = "metric", values_to = "value") ggplot(test_data, aes(x = prevalence, y = value, color = metric)) + geom_line(linewidth = 1.2) + scale_x_continuous(labels = scales::percent_format()) + scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) + labs( title = "How Prevalence Affects PPV and NPV", subtitle = "Sensitivity = 99.7%, Specificity = 99.5%", x = "Disease Prevalence", y = "Predictive Value", color = "Metric" ) + theme_minimal() + scale_color_manual(values = c("PPV" = "steelblue", "NPV" = "firebrick")) ``` ### Python ```{python} #| eval: false import numpy as np import matplotlib.pyplot as plt sensitivity = 0.997 specificity = 0.995 prevalence = np.arange(0.001, 0.301, 0.001) ppv = (sensitivity * prevalence) / ( sensitivity * prevalence + (1 - specificity) * (1 - prevalence) ) npv = (specificity * (1 - prevalence)) / ( specificity * (1 - prevalence) + (1 - sensitivity) * prevalence ) fig, ax = plt.subplots(figsize=(9, 5)) ax.plot(prevalence * 100, ppv * 100, color="steelblue", linewidth=1.5, label="PPV") ax.plot(prevalence * 100, npv * 100, color="firebrick", linewidth=1.5, label="NPV") ax.set_xlabel("Disease Prevalence (%)") ax.set_ylabel("Predictive Value (%)") ax.set_title("How Prevalence Affects PPV and NPV\nSensitivity = 99.7%, Specificity = 99.5%") ax.legend() ax.set_ylim(0, 105) plt.tight_layout() plt.show() ``` ::: The plot shows a crucial pattern: when prevalence is very low, even an excellent test produces many false positives (low PPV). As prevalence increases, PPV rises. Conversely, NPV is very high when prevalence is low (a negative result is very reassuring) but decreases as disease becomes more common. **Clinical takeaway**: Always consider the pre-test probability (prevalence in the relevant population) when interpreting a diagnostic test result. A positive mammogram in a 25-year-old woman with no risk factors means something very different from the same result in a 60-year-old woman with a family history of breast cancer. ## Exercises ::: {.callout-tip title="Exercise 1.1: Z-Scores and the Normal Distribution"} A laboratory reports that fasting blood glucose levels in healthy adults follow a normal distribution with a mean of 90 mg/dL and a standard deviation of 10 mg/dL. 1. What is the Z-score for a patient with a fasting glucose of 115 mg/dL? 2. What proportion of healthy adults have a fasting glucose above 115 mg/dL? 3. What fasting glucose value corresponds to the 95th percentile? 4. Plot the normal distribution and shade the area above 115 mg/dL. ::: {.panel-tabset} ## R ```{r} #| eval: false mu <- 90 sigma <- 10 patient_value <- 115 # 1. Calculate the Z-score z_score <- # YOUR CODE HERE # 2. Proportion above 115 (upper tail probability) prop_above <- # YOUR CODE HERE (hint: use pnorm()) # 3. 95th percentile percentile_95 <- # YOUR CODE HERE (hint: use qnorm()) # 4. Plot (modify the blood pressure example from the chapter) # YOUR CODE HERE ``` ## Python ```{python} #| eval: false from scipy import stats import numpy as np import matplotlib.pyplot as plt mu = 90 sigma = 10 patient_value = 115 # 1. Calculate the Z-score z_score = # YOUR CODE HERE # 2. Proportion above 115 prop_above = # YOUR CODE HERE (hint: use stats.norm.sf() or 1 - stats.norm.cdf()) # 3. 95th percentile percentile_95 = # YOUR CODE HERE (hint: use stats.norm.ppf()) # 4. Plot # YOUR CODE HERE ``` ::: ::: {.callout-note collapse="true" title="Solution"} ::: {.panel-tabset} ## R ```{r} #| eval: false library(tidyverse) mu <- 90 sigma <- 10 patient_value <- 115 # 1. Z-score z_score <- (patient_value - mu) / sigma cat("Z-score:", z_score, "\n") # Z-score: 2.5 # 2. Proportion above 115 prop_above <- 1 - pnorm(patient_value, mean = mu, sd = sigma) # Equivalently: pnorm(patient_value, mean = mu, sd = sigma, lower.tail = FALSE) cat("Proportion above 115:", round(prop_above, 4), "\n") # About 0.0062 or 0.62% # 3. 95th percentile percentile_95 <- qnorm(0.95, mean = mu, sd = sigma) cat("95th percentile:", round(percentile_95, 1), "mg/dL\n") # About 106.4 mg/dL # 4. Plot x <- seq(mu - 4 * sigma, mu + 4 * sigma, length.out = 300) y <- dnorm(x, mean = mu, sd = sigma) plot_data <- tibble(x = x, y = y) ggplot(plot_data, aes(x = x, y = y)) + geom_line(color = "steelblue", linewidth = 1.2) + geom_area(data = plot_data |> filter(x >= patient_value), fill = "firebrick", alpha = 0.4) + geom_vline(xintercept = patient_value, linetype = "dashed", color = "firebrick") + annotate("text", x = 118, y = 0.02, label = paste0("Glucose = ", patient_value, " mg/dL\nZ = ", z_score), color = "firebrick", hjust = 0) + labs( title = "Distribution of Fasting Blood Glucose", x = "Fasting Blood Glucose (mg/dL)", y = "Density" ) + theme_minimal() ``` ## Python ```{python} #| eval: false from scipy import stats import numpy as np import matplotlib.pyplot as plt mu = 90 sigma = 10 patient_value = 115 # 1. Z-score z_score = (patient_value - mu) / sigma print(f"Z-score: {z_score}") # Z-score: 2.5 # 2. Proportion above 115 prop_above = 1 - stats.norm.cdf(patient_value, loc=mu, scale=sigma) # Or equivalently: stats.norm.sf(patient_value, loc=mu, scale=sigma) print(f"Proportion above 115: {prop_above:.4f}") # About 0.0062 or 0.62% # 3. 95th percentile percentile_95 = stats.norm.ppf(0.95, loc=mu, scale=sigma) print(f"95th percentile: {percentile_95:.1f} mg/dL") # About 106.4 mg/dL # 4. Plot x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 300) y = stats.norm.pdf(x, loc=mu, scale=sigma) fig, ax = plt.subplots(figsize=(9, 5)) ax.plot(x, y, color="steelblue", linewidth=1.5) x_fill = x[x >= patient_value] y_fill = stats.norm.pdf(x_fill, loc=mu, scale=sigma) ax.fill_between(x_fill, y_fill, color="firebrick", alpha=0.4) ax.axvline(patient_value, linestyle="--", color="firebrick") ax.text(117, 0.02, f"Glucose = {patient_value} mg/dL\nZ = {z_score}", color="firebrick") ax.set_title("Distribution of Fasting Blood Glucose") ax.set_xlabel("Fasting Blood Glucose (mg/dL)") ax.set_ylabel("Density") plt.tight_layout() plt.show() ``` ::: ::: ::: ::: {.callout-tip title="Exercise 1.2: Binomial Distribution --- Vaccine Efficacy"} A COVID-19 vaccine has an efficacy of 90%, meaning that among people exposed to the virus, a vaccinated person has a 90% chance of NOT getting symptomatic disease (equivalently, a 10% chance of a breakthrough infection). In a group of 25 vaccinated individuals who are all exposed to the virus: 1. What is the expected number of breakthrough infections? 2. What is the probability of zero breakthrough infections? 3. What is the probability of 5 or more breakthrough infections? 4. Plot the full probability distribution. ::: {.panel-tabset} ## R ```{r} #| eval: false n <- 25 p <- 0.10 # probability of breakthrough infection # 1. Expected number expected <- # YOUR CODE HERE # 2. P(X = 0) prob_zero <- # YOUR CODE HERE (hint: dbinom()) # 3. P(X >= 5) prob_five_or_more <- # YOUR CODE HERE (hint: use pbinom with lower.tail) # 4. Plot the distribution # YOUR CODE HERE ``` ## Python ```{python} #| eval: false from scipy import stats import numpy as np import matplotlib.pyplot as plt n = 25 p = 0.10 # 1. Expected number expected = # YOUR CODE HERE # 2. P(X = 0) prob_zero = # YOUR CODE HERE (hint: stats.binom.pmf()) # 3. P(X >= 5) prob_five_or_more = # YOUR CODE HERE (hint: stats.binom.sf()) # 4. Plot # YOUR CODE HERE ``` ::: ::: {.callout-note collapse="true" title="Solution"} ::: {.panel-tabset} ## R ```{r} #| eval: false library(tidyverse) n <- 25 p <- 0.10 # 1. Expected number expected <- n * p cat("Expected breakthrough infections:", expected, "\n") # 2.5 # 2. P(X = 0) prob_zero <- dbinom(0, size = n, prob = p) cat("P(X = 0):", round(prob_zero, 4), "\n") # About 0.0718 # 3. P(X >= 5) prob_five_or_more <- 1 - pbinom(4, size = n, prob = p) # Equivalently: pbinom(4, size = n, prob = p, lower.tail = FALSE) cat("P(X >= 5):", round(prob_five_or_more, 4), "\n") # About 0.0980 # 4. Plot outcomes <- tibble( k = 0:n, probability = dbinom(k, size = n, prob = p) ) ggplot(outcomes, aes(x = k, y = probability)) + geom_col(aes(fill = k >= 5), alpha = 0.8) + scale_fill_manual(values = c("FALSE" = "steelblue", "TRUE" = "firebrick"), guide = "none") + geom_vline(xintercept = expected, linetype = "dashed") + labs( title = "Breakthrough Infections in 25 Vaccinated Individuals", subtitle = "Binomial(25, 0.10); red bars = 5 or more infections", x = "Number of Breakthrough Infections", y = "Probability" ) + scale_x_continuous(breaks = seq(0, 25, 2)) + theme_minimal() ``` ## Python ```{python} #| eval: false from scipy import stats import numpy as np import matplotlib.pyplot as plt n = 25 p = 0.10 # 1. Expected number expected = n * p print(f"Expected breakthrough infections: {expected}") # 2.5 # 2. P(X = 0) prob_zero = stats.binom.pmf(0, n, p) print(f"P(X = 0): {prob_zero:.4f}") # About 0.0718 # 3. P(X >= 5) prob_five_or_more = stats.binom.sf(4, n, p) # sf = survival function = 1 - cdf print(f"P(X >= 5): {prob_five_or_more:.4f}") # About 0.0980 # 4. Plot k = np.arange(0, n + 1) probs = stats.binom.pmf(k, n, p) colors = ["firebrick" if ki >= 5 else "steelblue" for ki in k] fig, ax = plt.subplots(figsize=(9, 5)) ax.bar(k, probs, color=colors, alpha=0.8) ax.axvline(expected, linestyle="--", color="black") ax.set_title("Breakthrough Infections in 25 Vaccinated Individuals\nBinomial(25, 0.10)") ax.set_xlabel("Number of Breakthrough Infections") ax.set_ylabel("Probability") ax.set_xticks(range(0, 26, 2)) plt.tight_layout() plt.show() ``` ::: ::: ::: ::: {.callout-tip title="Exercise 1.3: Poisson Distribution --- Emergency Department Visits"} An emergency department receives an average of 8 trauma cases per day. 1. What is the probability of receiving exactly 8 trauma cases tomorrow? 2. What is the probability of receiving 12 or more? 3. On how many days per year (365 days) would you expect 0 trauma cases? 4. Create a bar plot of the Poisson(8) distribution from 0 to 20. ::: {.panel-tabset} ## R ```{r} #| eval: false lambda <- 8 # 1. P(X = 8) prob_exactly_8 <- # YOUR CODE HERE (hint: dpois()) # 2. P(X >= 12) prob_12_or_more <- # YOUR CODE HERE (hint: ppois with lower.tail) # 3. Expected days with 0 cases in a year prob_zero <- # YOUR CODE HERE expected_zero_days <- # YOUR CODE HERE # 4. Plot # YOUR CODE HERE ``` ## Python ```{python} #| eval: false from scipy import stats import numpy as np import matplotlib.pyplot as plt lam = 8 # 1. P(X = 8) prob_exactly_8 = # YOUR CODE HERE (hint: stats.poisson.pmf()) # 2. P(X >= 12) prob_12_or_more = # YOUR CODE HERE (hint: stats.poisson.sf()) # 3. Expected days with 0 cases prob_zero = # YOUR CODE HERE expected_zero_days = # YOUR CODE HERE # 4. Plot # YOUR CODE HERE ``` ::: ::: {.callout-note collapse="true" title="Solution"} ::: {.panel-tabset} ## R ```{r} #| eval: false library(tidyverse) lambda <- 8 # 1. P(X = 8) prob_exactly_8 <- dpois(8, lambda = lambda) cat("P(X = 8):", round(prob_exactly_8, 4), "\n") # About 0.1396 # 2. P(X >= 12) prob_12_or_more <- ppois(11, lambda = lambda, lower.tail = FALSE) cat("P(X >= 12):", round(prob_12_or_more, 4), "\n") # About 0.1121 # 3. Expected days with 0 cases prob_zero <- dpois(0, lambda = lambda) expected_zero_days <- 365 * prob_zero cat("P(X = 0):", format(prob_zero, scientific = TRUE), "\n") cat("Expected zero-case days per year:", round(expected_zero_days, 3), "\n") # P(0) is very small (~0.000335), so about 0.12 days/year # 4. Plot k <- 0:20 poisson_data <- tibble( k = k, probability = dpois(k, lambda = lambda) ) ggplot(poisson_data, aes(x = k, y = probability)) + geom_col(fill = "darkorange", alpha = 0.8) + geom_vline(xintercept = lambda, linetype = "dashed", color = "firebrick") + labs( title = "Poisson Distribution: Trauma Cases per Day", subtitle = expression(paste(lambda, " = 8")), x = "Number of Trauma Cases", y = "Probability" ) + scale_x_continuous(breaks = k) + theme_minimal() ``` ## Python ```{python} #| eval: false from scipy import stats import numpy as np import matplotlib.pyplot as plt lam = 8 # 1. P(X = 8) prob_exactly_8 = stats.poisson.pmf(8, lam) print(f"P(X = 8): {prob_exactly_8:.4f}") # About 0.1396 # 2. P(X >= 12) prob_12_or_more = stats.poisson.sf(11, lam) print(f"P(X >= 12): {prob_12_or_more:.4f}") # About 0.1121 # 3. Expected days with 0 cases prob_zero = stats.poisson.pmf(0, lam) expected_zero_days = 365 * prob_zero print(f"P(X = 0): {prob_zero:.6f}") print(f"Expected zero-case days per year: {expected_zero_days:.3f}") # Very small, about 0.12 days per year # 4. Plot k = np.arange(0, 21) probs = stats.poisson.pmf(k, lam) fig, ax = plt.subplots(figsize=(9, 5)) ax.bar(k, probs, color="darkorange", alpha=0.8) ax.axvline(lam, linestyle="--", color="firebrick") ax.set_title(f"Poisson Distribution: Trauma Cases per Day (lambda = {lam})") ax.set_xlabel("Number of Trauma Cases") ax.set_ylabel("Probability") ax.set_xticks(k) plt.tight_layout() plt.show() ``` ::: ::: ::: ::: {.callout-tip title="Exercise 1.4: Bayes' Theorem --- Interpreting a Cancer Screening Test"} A breast cancer screening mammogram has the following characteristics: - Sensitivity: 87% (detects 87% of actual cancers) - Specificity: 95% (correctly clears 95% of women without cancer) - Prevalence of breast cancer in women aged 50--59 undergoing screening: 2% 1. Calculate the positive predictive value (PPV): if a woman in this age group has a positive mammogram, what is the probability she actually has breast cancer? 2. Calculate the negative predictive value (NPV). 3. How does the PPV change if we consider a high-risk population where prevalence is 10%? 4. Create a plot showing PPV as a function of prevalence (from 0.1% to 30%). ::: {.panel-tabset} ## R ```{r} #| eval: false sensitivity <- 0.87 specificity <- 0.95 prevalence <- 0.02 # 1. PPV ppv <- # YOUR CODE HERE # 2. NPV npv <- # YOUR CODE HERE # 3. PPV with prevalence = 10% ppv_high_risk <- # YOUR CODE HERE # 4. Plot PPV vs. prevalence # YOUR CODE HERE ``` ## Python ```{python} #| eval: false import numpy as np import matplotlib.pyplot as plt sensitivity = 0.87 specificity = 0.95 prevalence = 0.02 # 1. PPV ppv = # YOUR CODE HERE # 2. NPV npv = # YOUR CODE HERE # 3. PPV with prevalence = 10% ppv_high_risk = # YOUR CODE HERE # 4. Plot PPV vs. prevalence # YOUR CODE HERE ``` ::: ::: {.callout-note collapse="true" title="Solution"} ::: {.panel-tabset} ## R ```{r} #| eval: false library(tidyverse) sensitivity <- 0.87 specificity <- 0.95 prevalence <- 0.02 # 1. PPV ppv <- (sensitivity * prevalence) / (sensitivity * prevalence + (1 - specificity) * (1 - prevalence)) cat("PPV (prevalence = 2%):", round(ppv, 4), "\n") # About 0.2623 or 26.2% # 2. NPV npv <- (specificity * (1 - prevalence)) / (specificity * (1 - prevalence) + (1 - sensitivity) * prevalence) cat("NPV (prevalence = 2%):", round(npv, 4), "\n") # About 0.9972 or 99.7% # 3. PPV with high-risk prevalence prev_high <- 0.10 ppv_high_risk <- (sensitivity * prev_high) / (sensitivity * prev_high + (1 - specificity) * (1 - prev_high)) cat("PPV (prevalence = 10%):", round(ppv_high_risk, 4), "\n") # About 0.6588 or 65.9% # 4. Plot prev_range <- seq(0.001, 0.30, by = 0.001) ppv_curve <- (sensitivity * prev_range) / (sensitivity * prev_range + (1 - specificity) * (1 - prev_range)) plot_data <- tibble(prevalence = prev_range, PPV = ppv_curve) ggplot(plot_data, aes(x = prevalence, y = PPV)) + geom_line(color = "steelblue", linewidth = 1.2) + geom_point(data = tibble(prevalence = c(0.02, 0.10), PPV = c(ppv, ppv_high_risk)), color = "firebrick", size = 3) + annotate("text", x = 0.04, y = ppv - 0.03, label = paste0("General pop (2%): PPV = ", round(ppv * 100, 1), "%"), hjust = 0, color = "firebrick", size = 3.5) + annotate("text", x = 0.12, y = ppv_high_risk - 0.03, label = paste0("High risk (10%): PPV = ", round(ppv_high_risk * 100, 1), "%"), hjust = 0, color = "firebrick", size = 3.5) + scale_x_continuous(labels = scales::percent_format()) + scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) + labs( title = "Mammography PPV Depends on Prevalence", subtitle = "Sensitivity = 87%, Specificity = 95%", x = "Prevalence", y = "Positive Predictive Value (PPV)" ) + theme_minimal() ``` ## Python ```{python} #| eval: false import numpy as np import matplotlib.pyplot as plt sensitivity = 0.87 specificity = 0.95 prevalence = 0.02 # 1. PPV ppv = (sensitivity * prevalence) / ( sensitivity * prevalence + (1 - specificity) * (1 - prevalence) ) print(f"PPV (prevalence = 2%): {ppv:.4f}") # About 0.2623 or 26.2% # 2. NPV npv = (specificity * (1 - prevalence)) / ( specificity * (1 - prevalence) + (1 - sensitivity) * prevalence ) print(f"NPV (prevalence = 2%): {npv:.4f}") # About 0.9972 or 99.7% # 3. PPV with high-risk prevalence prev_high = 0.10 ppv_high = (sensitivity * prev_high) / ( sensitivity * prev_high + (1 - specificity) * (1 - prev_high) ) print(f"PPV (prevalence = 10%): {ppv_high:.4f}") # About 0.6588 or 65.9% # 4. Plot prev_range = np.arange(0.001, 0.301, 0.001) ppv_curve = (sensitivity * prev_range) / ( sensitivity * prev_range + (1 - specificity) * (1 - prev_range) ) fig, ax = plt.subplots(figsize=(9, 5)) ax.plot(prev_range * 100, ppv_curve * 100, color="steelblue", linewidth=1.5) ax.plot([2, 10], [ppv * 100, ppv_high * 100], "o", color="firebrick", markersize=8) ax.annotate(f"General pop (2%): PPV = {ppv*100:.1f}%", xy=(2, ppv*100), xytext=(5, ppv*100 - 5), color="firebrick", fontsize=9, arrowprops=dict(arrowstyle="->", color="firebrick")) ax.annotate(f"High risk (10%): PPV = {ppv_high*100:.1f}%", xy=(10, ppv_high*100), xytext=(13, ppv_high*100 - 5), color="firebrick", fontsize=9, arrowprops=dict(arrowstyle="->", color="firebrick")) ax.set_xlabel("Prevalence (%)") ax.set_ylabel("Positive Predictive Value (%)") ax.set_title("Mammography PPV Depends on Prevalence\nSensitivity = 87%, Specificity = 95%") ax.set_ylim(0, 105) plt.tight_layout() plt.show() ``` ::: ::: ::: ::: {.callout-tip title="Exercise 1.5: Central Limit Theorem --- Hands-On Simulation"} Demonstrate the Central Limit Theorem yourself by simulating sample means from a non-normal distribution. 1. Generate 10,000 individual observations from a **Poisson distribution** with $\lambda = 3$ (this distribution is right-skewed). Plot a histogram. 2. Now, for each of the following sample sizes ($n$ = 5, 15, 50, 200), draw 5,000 random samples of that size, compute the mean of each sample, and plot the distribution of sample means. 3. Overlay a normal curve on the $n = 50$ histogram to show how well the normal approximation fits. ::: {.panel-tabset} ## R ```{r} #| eval: false library(tidyverse) set.seed(123) lambda <- 3 # 1. Histogram of raw Poisson data raw_data <- rpois(10000, lambda = lambda) # YOUR CODE HERE: plot a histogram # 2. CLT simulation for different sample sizes sample_sizes <- c(5, 15, 50, 200) # YOUR CODE HERE: for each n, draw 5000 samples, compute means, plot # 3. Overlay normal curve on n = 50 # YOUR CODE HERE ``` ## Python ```{python} #| eval: false import numpy as np import matplotlib.pyplot as plt from scipy import stats np.random.seed(123) lam = 3 # 1. Histogram of raw Poisson data raw_data = np.random.poisson(lam, 10000) # YOUR CODE HERE: plot a histogram # 2. CLT simulation sample_sizes = [5, 15, 50, 200] # YOUR CODE HERE # 3. Overlay normal curve # YOUR CODE HERE ``` ::: ::: {.callout-note collapse="true" title="Solution"} ::: {.panel-tabset} ## R ```{r} #| eval: false library(tidyverse) set.seed(123) lambda <- 3 n_sim <- 5000 # 1. Raw Poisson data raw_data <- rpois(10000, lambda = lambda) ggplot(tibble(x = raw_data), aes(x = x)) + geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.8) + labs(title = "Raw Poisson(3) Data (right-skewed)", x = "Value", y = "Count") + theme_minimal() # 2. CLT simulation sample_sizes <- c(5, 15, 50, 200) clt_data <- map_dfr(sample_sizes, function(n) { means <- replicate(n_sim, mean(rpois(n, lambda = lambda))) tibble(n = paste0("n = ", n), sample_mean = means) }) clt_data$n <- factor(clt_data$n, levels = paste0("n = ", sample_sizes)) ggplot(clt_data, aes(x = sample_mean)) + geom_histogram(aes(y = after_stat(density)), bins = 40, fill = "steelblue", alpha = 0.7) + facet_wrap(~n, scales = "free_y") + labs(title = "CLT: Distribution of Sample Means from Poisson(3)", x = "Sample Mean", y = "Density") + theme_minimal() # 3. Overlay normal curve on n = 50 n50_means <- replicate(n_sim, mean(rpois(50, lambda = lambda))) theoretical_sd <- sqrt(lambda / 50) ggplot(tibble(x = n50_means), aes(x = x)) + geom_histogram(aes(y = after_stat(density)), bins = 40, fill = "steelblue", alpha = 0.7) + stat_function(fun = dnorm, args = list(mean = lambda, sd = theoretical_sd), color = "firebrick", linewidth = 1.2) + labs(title = "Sample Means (n=50) with Normal Approximation Overlay", subtitle = paste0("Theoretical: N(", lambda, ", ", round(theoretical_sd^2, 4), ")"), x = "Sample Mean", y = "Density") + theme_minimal() ``` ## Python ```{python} #| eval: false import numpy as np import matplotlib.pyplot as plt from scipy import stats np.random.seed(123) lam = 3 n_sim = 5000 # 1. Raw Poisson data raw_data = np.random.poisson(lam, 10000) fig, ax = plt.subplots(figsize=(8, 4)) ax.hist(raw_data, bins=range(0, 15), color="steelblue", alpha=0.8, edgecolor="white") ax.set_title("Raw Poisson(3) Data (right-skewed)") ax.set_xlabel("Value") ax.set_ylabel("Count") plt.tight_layout() plt.show() # 2. CLT simulation sample_sizes = [5, 15, 50, 200] fig, axes = plt.subplots(2, 2, figsize=(10, 8)) axes = axes.flatten() for i, n in enumerate(sample_sizes): means = [np.mean(np.random.poisson(lam, n)) for _ in range(n_sim)] axes[i].hist(means, bins=40, density=True, color="steelblue", alpha=0.7) axes[i].set_title(f"n = {n}") axes[i].set_xlabel("Sample Mean") axes[i].set_ylabel("Density") fig.suptitle("CLT: Distribution of Sample Means from Poisson(3)", fontsize=13) plt.tight_layout() plt.show() # 3. Overlay normal curve on n = 50 n50_means = [np.mean(np.random.poisson(lam, 50)) for _ in range(n_sim)] theoretical_sd = np.sqrt(lam / 50) fig, ax = plt.subplots(figsize=(8, 5)) ax.hist(n50_means, bins=40, density=True, color="steelblue", alpha=0.7, label="Simulated") x_range = np.linspace(min(n50_means), max(n50_means), 200) ax.plot(x_range, stats.norm.pdf(x_range, loc=lam, scale=theoretical_sd), color="firebrick", linewidth=1.5, label="Normal approx.") ax.set_title("Sample Means (n=50) with Normal Approximation Overlay") ax.set_xlabel("Sample Mean") ax.set_ylabel("Density") ax.legend() plt.tight_layout() plt.show() ``` ::: ::: ::: ## Chapter Summary In this chapter, we covered the foundational language of probability and statistics: - **Probability** quantifies uncertainty. The frequentist interpretation focuses on long-run frequencies; the Bayesian interpretation treats probability as a degree of belief. Both are useful. - **Random variables** can be discrete (countable values) or continuous (any value in a range). - The **normal distribution** is the most important distribution in statistics, characterized by its mean and standard deviation. Z-scores standardize values relative to the distribution. - The **binomial distribution** models the number of successes in a fixed number of independent trials --- useful for modeling treatment response rates, disease prevalence, and more. - The **Poisson distribution** models counts of rare events occurring at a constant rate --- useful for adverse events, hospital admissions, and rare disease incidence. - The **Central Limit Theorem** tells us that sample means are approximately normally distributed regardless of the underlying population distribution, as long as the sample size is large enough. This justifies the widespread use of normal-based statistical methods. - **Bayes' theorem** provides a principled way to update probabilities in light of new evidence. In diagnostic testing, it reveals that the predictive value of a test depends critically on disease prevalence, not just sensitivity and specificity. These concepts underpin everything that follows in this course. In the next chapter, we will build on this foundation to discuss statistical inference --- how to draw conclusions about populations from samples. ## References and Further Reading - Wasserman, L. (2004). *All of Statistics: A Concise Course in Statistical Inference*. Springer. A rigorous but accessible introduction to probability and statistics. Chapters 1--3 cover the material in this chapter at a more mathematical level. - Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). *Bayesian Data Analysis* (3rd ed.). CRC Press. The definitive textbook on Bayesian statistics. Chapter 1 provides an excellent introduction to Bayesian thinking. - Harrell, F. E. (2015). *Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis* (2nd ed.). Springer. Chapter 2 discusses the role of probability and distributions in the context of regression modeling for health sciences. - Altman, D. G., & Bland, J. M. (1994). "Diagnostic tests 1: Sensitivity and specificity." *BMJ*, 308(6943), 1552. A classic short paper explaining sensitivity, specificity, and predictive values for a clinical audience. - Altman, D. G., & Bland, J. M. (1994). "Diagnostic tests 2: Predictive values." *BMJ*, 309(6947), 102. The companion paper on PPV and NPV, with a clear explanation of how prevalence affects predictive values. - Gigerenzer, G., Gaissmaier, W., Kurz-Milcke, E., Schwartz, L. M., & Woloshin, S. (2007). "Helping doctors and patients make sense of health statistics." *Psychological Science in the Public Interest*, 8(2), 53--96. An excellent discussion of how to communicate probabilistic information about diagnostic tests and treatments. - Cetinkaya-Rundel, M., & Hardin, J. (2024). *Introduction to Modern Statistics* (2nd ed.). OpenIntro. Freely available at <https://openintro-ims2.netlify.app/>. Chapters 3--4 provide an accessible introduction to probability and distributions with many worked examples. - Rice, J. A. (2007). *Mathematical Statistics and Data Analysis* (3rd ed.). Thomson Brooks/Cole. A thorough treatment of probability distributions for students comfortable with calculus. - Motulsky, H. (2018). *Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking* (4th ed.). Oxford University Press. A superb introduction to biostatistics for medical professionals, with clear explanations of probability, distributions, and Bayesian reasoning.