Advanced Statistics and Machine Learning for Health Research

A Practical Course for Health Researchers

Authors: Mark Khurana, Neil Scheidwasser

Published: June 10, 2026

Welcome

This course teaches advanced statistics and machine learning to people who work in health research. It assumes you know the basics — means, standard deviations, maybe a t-test — and takes you from there to the methods you actually see in current medical journals: splines, penalised regression, prediction models, Bayesian analysis, and dimensionality reduction.

Everything uses clinical data. Every exercise works in R and Python. Pick one or try both. Solutions for all exercises are in the Appendix.

0.1 What you will learn

By the end of this course, you will be able to:

  • Model non-linear relationships properly instead of categorising continuous variables
  • Build, validate, and report clinical prediction models to current standards (TRIPOD+AI)
  • Use machine learning methods (random forests, gradient boosting, deep learning) and know when they help and when they do not
  • Apply Bayesian methods including hierarchical models for multi-site studies
  • Reduce high-dimensional data with PCA, t-SNE, and UMAP

0.2 Structure

Part | What it covers
Pre-Course | Probability, inference, regression refresher
I | Splines, penalised regression, survival analysis
II | Bayesian inference and hierarchical models
III | ML foundations, trees, ensembles, neural networks, ROC curves, calibration, decision curves
IV | PCA, UMAP, clustering
V | Clinical prediction models, validation, reporting
Advanced Toolkit | Causal inference, meta-analysis, journal-ready analysis

The Pre-Course material should be completed before Day 1 if you are taking this in person. If you are self-studying, work through everything in order.

The Advanced Research Toolkit is for after the course — use it when you are doing your own research and need guidance on causal inference, meta-analysis, or producing a journal-ready manuscript.

0.3 Am I ready?

Skim Chapters 1-3. If the material feels familiar, skip ahead to Part I. If it feels new, work through those chapters carefully first — they are the foundation everything else builds on.

Rough time estimates per chapter, listed in the order the parts are taught rather than by chapter number:

Chapter | Topic | Estimated time
0 | Environment setup | 1-2 hrs
1 | Probability and distributions | 2-3 hrs
2 | Inference and estimation | 2-3 hrs
3 | Regression foundations | 3-4 hrs
4 | Splines and non-linear modelling | 3-4 hrs
5 | Penalised regression | 2-3 hrs
6 | Survival analysis | 3-4 hrs
13 | Bayesian inference | 3-4 hrs
14 | Applied Bayesian modelling | 3-4 hrs
7 | ML foundations | 2-3 hrs
8 | Trees and ensembles | 2-3 hrs
8b | Introduction to neural networks | 2-3 hrs
9 | Model evaluation | 2-3 hrs
15 | Dimensionality reduction | 2-3 hrs
16 | Clustering | 2-3 hrs
10 | Building prediction models | 3-4 hrs
11 | Performance and validation | 3-4 hrs
12 | Reporting and TRIPOD+AI | 2-3 hrs

0.4 Datasets

All exercises use real or realistically simulated clinical data:

  • Framingham Heart Study — cardiovascular risk
  • Wisconsin Breast Cancer — diagnostic classification
  • PBC (Mayo Clinic) — liver disease survival
  • NHANES — national health survey
  • Simulated meningitis — based on Lopez-Ayala et al. (BMJ 2025)

Details in the Dataset Codebook.

0.5 Key sources

This course draws heavily on:

  • Smits, van Kuijk & Wynants, Improving Health Care with Clinical Prediction Models (2026, CC BY 4.0)
  • Van Calster et al., Lancet Digital Health (2025) — performance measures for prediction models
  • Lopez-Ayala et al., BMJ (2025) — handling continuous variables and splines
  • Collins et al., BMJ (2024) — TRIPOD+AI reporting guidelines
  • Harrell, Regression Modeling Strategies (2015)
  • McElreath, Statistical Rethinking (2020)

Full references appear at the bottom of each chapter.

0.6 Licence

Creative Commons Attribution 4.0. Use it, adapt it, share it — just give credit.