Advanced Statistics and Machine Learning for Health Research

A Practical Course for Health Researchers

Authors

Mark Khurana

Neil Scheidwasser

Published

June 10, 2026

Welcome

This course teaches advanced statistics and machine learning to people who work in health research. It assumes you know the basics — means, standard deviations, maybe a t-test — and takes you from there to the methods you actually see in current medical journals: splines, penalised regression, prediction models, Bayesian analysis, and dimensionality reduction.

All examples use clinical data. Every exercise comes in both an R and a Python version; pick one or try both. Solutions for all exercises are in the Appendix.

0.1 What you will learn

By the end of this course, you will be able to:

Model non-linear relationships properly instead of categorising continuous variables
Build, validate, and report clinical prediction models to current standards (TRIPOD+AI)
Use machine learning methods (random forests, gradient boosting, deep learning) and know when they help and when they do not
Apply Bayesian methods including hierarchical models for multi-site studies
Reduce high-dimensional data with PCA, t-SNE, and UMAP

0.2 Structure

Part	What it covers
Pre-Course	Probability, inference, regression refresher
I	Splines, penalised regression, survival analysis
II	Bayesian inference and hierarchical models
III	ML foundations, trees, ensembles, neural networks, ROC curves, calibration, decision curves
IV	PCA, UMAP, clustering
V	Clinical prediction models, validation, reporting
Advanced Research Toolkit	Causal inference, meta-analysis, journal-ready analysis

The Pre-Course material should be completed before Day 1 if you are taking this in person. If you are self-studying, work through everything in order.

The Advanced Research Toolkit is for after the course — use it when you are doing your own research and need guidance on causal inference, meta-analysis, or producing a journal-ready manuscript.

0.3 Am I ready?

Skim Chapters 1-3. If the material feels familiar, skip ahead to Part I. If it feels new, work through those chapters carefully first — they are the foundation everything else builds on.

Rough time estimates per chapter, listed in the order the parts are taught (so chapter numbers are not sequential):

Chapter	Topic	Estimated time
0	Environment setup	1-2 hrs
1	Probability and distributions	2-3 hrs
2	Inference and estimation	2-3 hrs
3	Regression foundations	3-4 hrs
4	Splines and non-linear modelling	3-4 hrs
5	Penalised regression	2-3 hrs
6	Survival analysis	3-4 hrs
13	Bayesian inference	3-4 hrs
14	Applied Bayesian modelling	3-4 hrs
7	ML foundations	2-3 hrs
8	Trees and ensembles	2-3 hrs
8b	Introduction to neural networks	2-3 hrs
9	Model evaluation	2-3 hrs
15	Dimensionality reduction	2-3 hrs
16	Clustering	2-3 hrs
10	Building prediction models	3-4 hrs
11	Performance and validation	3-4 hrs
12	Reporting and TRIPOD+AI	2-3 hrs

0.4 Datasets

All exercises use real or realistically simulated clinical data:

Framingham Heart Study — cardiovascular risk
Wisconsin Breast Cancer — diagnostic classification
PBC (Mayo Clinic) — liver disease survival
NHANES — national health survey
Simulated meningitis — based on Lopez-Ayala et al. (BMJ 2025)

Details in the Dataset Codebook.

0.5 Key sources

This course draws heavily on:

Smits, van Kuijk & Wynants, Improving Health Care with Clinical Prediction Models (2026, CC BY 4.0)
Van Calster et al., Lancet Digital Health (2025) — performance measures for prediction models
Lopez-Ayala et al., BMJ (2025) — handling continuous variables and splines
Collins et al., BMJ (2024) — TRIPOD+AI reporting guidelines
Harrell, Regression Modeling Strategies (2015)
McElreath, Statistical Rethinking (2020)

Full references appear at the bottom of each chapter.

0.6 Licence

Creative Commons Attribution 4.0 (https://creativecommons.org/licenses/by/4.0/). Use it, adapt it, share it; just give credit.
