Advanced Statistics and Machine Learning for Health Research
A Practical Course for Health Researchers
Welcome
This course teaches advanced statistics and machine learning to people who work in health research. It assumes you know the basics — means, standard deviations, maybe a t-test — and takes you from there to the methods you actually see in current medical journals: splines, penalised regression, prediction models, Bayesian analysis, and dimensionality reduction.
Everything uses clinical data. Every exercise can be done in either R or Python: pick one or try both. Solutions for all exercises are in the Appendix.
0.1 What you will learn
By the end of this course, you will be able to:
- Model non-linear relationships properly instead of categorising continuous variables
- Build, validate, and report clinical prediction models to current standards (TRIPOD+AI)
- Use machine learning methods (random forests, gradient boosting, deep learning) and know when they help and when they do not
- Apply Bayesian methods including hierarchical models for multi-site studies
- Reduce high-dimensional data with PCA, t-SNE, and UMAP
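To give a flavour of the first outcome above, here is a minimal sketch, not taken from the course materials, comparing a median split of a continuous predictor with a flexible spline fit. Everything in it is an assumption made for illustration: the data are simulated (a U-shaped relationship between a hypothetical `age` variable and a continuous `outcome`), the helper names `residual_sum_of_squares` and `cubic_spline_basis` are invented, and the truncated-power cubic basis is a simplified stand-in for the spline methods covered later in the course.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated (not real) data: a U-shaped relationship between a
# continuous predictor and a continuous outcome, plus noise.
n = 500
age = rng.uniform(20, 80, size=n)
outcome = 0.01 * (age - 50) ** 2 + rng.normal(0, 1, size=n)


def residual_sum_of_squares(design, y):
    """Fit ordinary least squares and return the residual sum of squares."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return float(residuals @ residuals)


def cubic_spline_basis(x, knots):
    """Truncated-power cubic spline basis (illustrative, hypothetical helper)."""
    columns = [np.ones_like(x), x, x**2, x**3]
    columns += [np.clip(x - k, 0, None) ** 3 for k in knots]
    return np.column_stack(columns)


# Approach 1: dichotomise age at the median (the practice to avoid).
above_median = (age > np.median(age)).astype(float)
X_cat = np.column_stack([np.ones(n), above_median])

# Approach 2: keep age continuous and model it with a cubic spline.
# Age is rescaled first so the cubic terms stay numerically well behaved.
z = (age - 50) / 10
knots = np.quantile(z, [0.25, 0.5, 0.75])
X_spline = cubic_spline_basis(z, knots)

rss_cat = residual_sum_of_squares(X_cat, outcome)
rss_spline = residual_sum_of_squares(X_spline, outcome)

print(f"RSS, dichotomised age: {rss_cat:.1f}")
print(f"RSS, cubic spline:     {rss_spline:.1f}")
```

On this simulated pattern the spline fit should leave a much smaller residual sum of squares, because a median split cannot represent a U-shape at all. The spline also uses more parameters, so an in-sample comparison like this one flatters it; a fair comparison needs proper validation.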
0.2 Structure
| Part | What it covers |
|---|---|
| Pre-Course | Probability, inference, regression refresher |
| I | Splines, penalised regression, survival analysis |
| II | Bayesian inference and hierarchical models |
| III | ML foundations, trees, ensembles, neural networks, ROC curves, calibration, decision curves |
| IV | PCA, t-SNE, UMAP, clustering |
| V | Clinical prediction models, validation, reporting |
| Advanced Research Toolkit | Causal inference, meta-analysis, journal-ready analysis |
If you are taking the course in person, complete the Pre-Course material before Day 1. If you are self-studying, work through everything in order.
The Advanced Research Toolkit is for after the course — use it when you are doing your own research and need guidance on causal inference, meta-analysis, or producing a journal-ready manuscript.
0.3 Am I ready?
Skim Chapters 1-3. If the material feels familiar, skip ahead to Part I. If it feels new, work through those chapters carefully first — they are the foundation everything else builds on.
Rough time estimates per chapter. Chapters are listed in the order of the parts in 0.2, so the chapter numbers are not sequential:
| Chapter | Topic | Estimated time |
|---|---|---|
| 0 | Environment setup | 1-2 hrs |
| 1 | Probability and distributions | 2-3 hrs |
| 2 | Inference and estimation | 2-3 hrs |
| 3 | Regression foundations | 3-4 hrs |
| 4 | Splines and non-linear modelling | 3-4 hrs |
| 5 | Penalised regression | 2-3 hrs |
| 6 | Survival analysis | 3-4 hrs |
| 13 | Bayesian inference | 3-4 hrs |
| 14 | Applied Bayesian modelling | 3-4 hrs |
| 7 | ML foundations | 2-3 hrs |
| 8 | Trees and ensembles | 2-3 hrs |
| 8b | Introduction to neural networks | 2-3 hrs |
| 9 | Model evaluation | 2-3 hrs |
| 15 | Dimensionality reduction | 2-3 hrs |
| 16 | Clustering | 2-3 hrs |
| 10 | Building prediction models | 3-4 hrs |
| 11 | Performance and validation | 3-4 hrs |
| 12 | Reporting and TRIPOD+AI | 2-3 hrs |
0.4 Datasets
All exercises use real or realistically simulated clinical data:
- Framingham Heart Study — cardiovascular risk
- Wisconsin Breast Cancer — diagnostic classification
- PBC (Mayo Clinic) — liver disease survival
- NHANES — national health survey
- Simulated meningitis — based on Lopez-Ayala et al. (BMJ 2025)
Details are in the Dataset Codebook.
0.5 Key sources
This course draws heavily on:
- Smits, van Kuijk & Wynants, Improving Health Care with Clinical Prediction Models (2026, CC BY 4.0)
- Van Calster et al., Lancet Digital Health (2025) — performance measures for prediction models
- Lopez-Ayala et al., BMJ (2025) — handling continuous variables and splines
- Collins et al., BMJ (2024) — TRIPOD+AI reporting guidelines
- Harrell, Regression Modeling Strategies (2015)
- McElreath, Statistical Rethinking (2020)
Full references appear at the bottom of each chapter.
0.6 Licence
Creative Commons Attribution 4.0. Use it, adapt it, share it — just give credit.