13 Introduction to Deep Learning for Health Research

13.1 When Does Deep Learning Make Sense?

In the previous chapter, you learned that gradient-boosted trees are powerful, flexible, and remarkably effective for clinical prediction on structured data. A natural question follows: when should you reach for something more complex?

The short answer: deep learning excels when your data has spatial, sequential, or linguistic structure that tabular methods cannot exploit. If your data lives in a spreadsheet — rows of patients, columns of lab values — gradient-boosted trees will usually match or beat a neural network, with less effort and better interpretability. But if your data is a chest X-ray, a clinical note, an ECG tracing, or a pathology slide, deep learning is the tool that unlocked performance previously thought impossible.

13.1.1 Common Misconceptions

“Deep learning is always better than traditional ML.” False. For tabular clinical data, a 2022 NeurIPS benchmark by Grinsztajn et al. showed that tree-based models consistently outperform neural networks on typical tabular datasets. A 2026 clinical benchmark confirmed that TabPFN — a transformer designed specifically for tabular data — exceeded the best traditional ML model in only 17% of clinical prediction tasks. Start with logistic regression or XGBoost; reach for deep learning only when the data type demands it.
“Deep learning requires millions of examples.” Misleading. Transfer learning — starting from a model pretrained on millions of images and fine-tuning on your hundreds — has made deep learning practical even with small clinical datasets. Many successful medical imaging studies use fewer than 5,000 labelled examples.
“Neural networks are uninterpretable black boxes.” Partially true, but increasingly addressable. Techniques such as Grad-CAM (which highlights the image regions driving a prediction) and attention visualization (which shows which words or time steps a model focuses on) provide clinically meaningful explanations. They are not as clean as a regression coefficient, but they are far from opaque.
“I need a GPU cluster to do deep learning.” Not necessarily. Transfer learning with pretrained models can be done on a single consumer GPU or even in the cloud (Google Colab offers free GPU access). Training a model from scratch on large datasets is another matter entirely.

13.2 How Neural Networks Learn

A neural network is, at its core, a series of matrix multiplications, each followed by a non-linear transformation. If you have worked through logistic regression, you already understand the basic idea: take a weighted sum of inputs and pass it through a sigmoid function to produce a probability. A neural network does the same thing in each neuron, but stacks many such layers on top of each other, allowing it to learn increasingly abstract representations.

13.2.1 The Building Blocks

Neurons (nodes): Each neuron computes a weighted sum of its inputs, adds a bias, and applies a non-linear function (called an activation function):

\[a = f\left(\sum_{i=1}^{n} w_i x_i + b\right)\]

where $x_i$ are the inputs, $w_i$ are the weights, $b$ is the bias, and $f$ is the activation function.
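As a minimal sketch of the formula above, the following plain-Python snippet computes the output of a single neuron. The input values, weights, and bias are made-up numbers chosen only for illustration, and the helper names (`neuron`, `relu`, `sigmoid`) are hypothetical rather than part of any library.

```python
import math

def relu(z):
    """ReLU activation: passes positive values, zeroes out negatives."""
    return max(0.0, z)

def sigmoid(z):
    """Sigmoid activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    """Weighted sum of inputs plus bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative (made-up) values: three inputs, three weights, one bias
x = [0.5, -1.2, 2.0]
w = [0.8, 0.1, -0.4]
b = 0.3

relu_out = neuron(x, w, b, relu)        # weighted sum is about -0.22, so ReLU gives 0.0
sigmoid_out = neuron(x, w, b, sigmoid)  # sigmoid of about -0.22 is roughly 0.445

print(relu_out)
print(round(sigmoid_out, 3))
```

The same weighted sum produces different outputs depending on the activation, which is why the choice of $f$ matters.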

Layers: Neurons are organized into layers:

Input layer: receives the raw data (pixel values, lab values, word embeddings).
Hidden layers: intermediate layers that learn increasingly abstract features. A network with many hidden layers is called “deep” — hence “deep learning.”
Output layer: produces the final prediction (a probability, a class label, a continuous value).

Activation functions: The non-linear functions that give neural networks their power. Without them, stacking layers would be pointless — the composition of linear functions is still linear. Common choices include:

ReLU (Rectified Linear Unit): $f(x) = \max(0, x)$. The default for hidden layers. Simple, fast, and effective.
Sigmoid: $f(x) = \frac{1}{1+e^{-x}}$. Used in the output layer for binary classification. You already know this from logistic regression.
Softmax: Generalizes sigmoid to multiple classes. Used in the output layer for multi-class classification.
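To make the point about linearity concrete, here is a small sketch in plain Python. It shows that two stacked layers without an activation give the same output as a single layer whose weight matrix is the product of the two, while inserting a ReLU between them changes the result. The weight matrices and input are arbitrary illustrative values, and the helper functions are hypothetical names written for this example.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, z) for z in v]

def softmax(v):
    """Softmax: turns a vector of scores into probabilities that sum to 1."""
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in v]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative (made-up) weights for a 2 -> 2 -> 1 network, and one input
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0]]
x = [1.0, 2.0]

stacked_linear = matvec(W2, matvec(W1, x))   # two layers, no activation
single_linear = matvec(matmul(W2, W1), x)    # one layer with weights W2 @ W1
with_relu = matvec(W2, relu(matvec(W1, x)))  # ReLU between the layers

print(stacked_linear, single_linear, with_relu)
print(softmax([2.0, 1.0, 0.1]))
```

The first two results are identical, so the extra layer added nothing; only the version with ReLU computes something a single linear layer could not.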

13.2.2 Training: Gradient Descent and Backpropagation

Neural networks learn by adjusting their weights to minimize a loss function — a measure of how wrong the predictions are. The process works as follows:

Forward pass: Feed data through the network to produce a prediction.
Compute loss: Compare the prediction to the true label using a loss function (cross-entropy for classification, mean squared error for regression).
Backward pass (backpropagation): Compute the gradient of the loss with respect to every weight in the network, using the chain rule of calculus.
Update weights: Adjust each weight in the direction that reduces the loss, scaled by a learning rate $\eta$:

\[w \leftarrow w - \eta \frac{\partial \mathcal{L}}{\partial w}\]

This process repeats over many epochs (complete passes through the training data). The learning rate is critical: too large, and the model oscillates; too small, and it converges painfully slowly.

You Already Know This Pattern

Gradient descent is not unique to deep learning. It is the same optimization strategy used in logistic regression and many other statistical models. The difference is scale: a logistic regression with 10 predictors has 11 parameters; a deep neural network may have millions.
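The training loop can be sketched for the simplest possible case: a logistic regression with one predictor, fitted by gradient descent in plain Python. The four data points, the learning rate of 0.1, and the 200 epochs are arbitrary choices for illustration. For a sigmoid output with cross-entropy loss, the gradient with respect to each weight reduces to the mean of (prediction minus label) times the input, which is what the backward-pass lines compute.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy (made-up) data: one predictor, binary outcome
xs = [0.5, 1.5, 2.5, 3.5]
ys = [0, 0, 1, 1]

w, b = 0.0, 0.0   # initial weight and bias
eta = 0.1         # learning rate
losses = []

for epoch in range(200):
    # 1. Forward pass: predicted probabilities
    preds = [sigmoid(w * x + b) for x in xs]

    # 2. Compute loss: mean binary cross-entropy
    loss = -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(ys, preds)
    ) / len(xs)
    losses.append(loss)

    # 3. Backward pass: gradients of the loss w.r.t. w and b
    grad_w = sum((p - y) * x for x, y, p in zip(xs, ys, preds)) / len(xs)
    grad_b = sum(p - y for y, p in zip(ys, preds)) / len(xs)

    # 4. Update weights: step against the gradient
    w -= eta * grad_w
    b -= eta * grad_b

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
print(f"w = {w:.2f}, b = {b:.2f}")
```

A deep network follows the same loop; backpropagation simply applies the chain rule to obtain the gradients for every layer instead of deriving them by hand.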

13.2.3 Regularization: Preventing Overfitting

Deep networks have enormous capacity and will happily memorize the training data if you let them. The strategies for preventing this parallel what you learned in Chapter 7, but with some additions:

Dropout: During training, randomly “turn off” a fraction of neurons in each layer. This prevents co-adaptation and acts as an implicit ensemble. Typical dropout rates range from 0.2 to 0.5.
Weight decay (L2 regularization): Penalize large weights, exactly as in ridge regression.
Early stopping: Monitor performance on a validation set and stop training when performance begins to deteriorate.
Data augmentation: Artificially expand the training set by applying random transformations (rotations, flips, crops, colour jitter). Especially important in medical imaging, where labelled data is scarce.
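Of the strategies above, early stopping is the easiest to express in code. The sketch below is a simplified, framework-free version of the logic: it tracks the best validation loss seen so far and stops once a fixed number of epochs (the patience) pass without improvement. The function name and the validation losses are hypothetical; real libraries add options such as a minimum improvement threshold.

```python
def early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs fail to improve on
    the best validation loss; the weights from `best_epoch` would be restored.
    Epochs are numbered from 0.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch

    # Ran out of epochs before patience was exhausted
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: improvement stalls after epoch 3
val_losses = [0.70, 0.62, 0.58, 0.57, 0.59, 0.60, 0.61, 0.55]
stop_epoch, best_epoch = early_stopping(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 6 3
```

Note that training stops before the late improvement at epoch 7 is ever seen, which is the trade-off a small patience value makes.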

13.3 Clinical Example: Neural Network for Readmission Prediction

To connect these concepts to what you already know, let us apply a simple neural network to the hospital readmission data from Chapter 12. This is not a scenario where deep learning is the right tool — gradient-boosted trees will likely perform as well or better on tabular data — but it illustrates the mechanics.

Code (R)

library(tidyverse)  # tibble(), select(), and the %>% pipe
library(keras3)

# Simulate readmission data (same structure as Chapter 12)
set.seed(42)
n <- 1000

readmit_data <- tibble(
  age = rnorm(n, 68, 12),
  length_of_stay = rpois(n, 5) + 1,
  num_comorbidities = rpois(n, 3),
  prior_admissions = rpois(n, 1),
  discharge_hgb = rnorm(n, 11, 2),
  discharge_creatinine = rlnorm(n, 0.2, 0.5),
  has_diabetes = rbinom(n, 1, 0.35),
  has_chf = rbinom(n, 1, 0.25)
)

readmit_prob <- plogis(-3 + 0.02 * (readmit_data$age - 68) +
                         0.15 * readmit_data$prior_admissions +
                         0.1 * readmit_data$num_comorbidities +
                         0.3 * readmit_data$has_chf -
                         0.1 * readmit_data$discharge_hgb)
readmit_data$readmitted <- rbinom(n, 1, readmit_prob)

# Prepare data: split into train/test, then scale with training statistics only
x <- readmit_data %>% select(-readmitted) %>% as.matrix()
y <- readmit_data$readmitted

# Train/test split
set.seed(123)
train_idx <- sample(n, 800)
y_train <- y[train_idx]
y_test  <- y[-train_idx]

# Standardize using the training-set mean and SD so that no
# information from the test set leaks into preprocessing
x_mean <- apply(x[train_idx, ], 2, mean)
x_sd   <- apply(x[train_idx, ], 2, sd)
x_train <- scale(x[train_idx, ], center = x_mean, scale = x_sd)
x_test  <- scale(x[-train_idx, ], center = x_mean, scale = x_sd)

# Define a simple feedforward neural network
model <- keras_model_sequential(input_shape = ncol(x_train)) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.001),
  loss = "binary_crossentropy",
  metrics = "AUC"
)

# Train with early stopping
history <- model %>% fit(
  x_train, y_train,
  epochs = 50,
  batch_size = 32,
  validation_split = 0.2,
  callbacks = list(
    callback_early_stopping(
      patience = 5, restore_best_weights = TRUE
    )
  ),
  verbose = 0
)

# Evaluate on test set
results <- model %>% evaluate(x_test, y_test, verbose = 0)
cat("Test loss:", round(results[[1]], 3), "\n")
cat("Test AUC:", round(results[[2]], 3), "\n")

Code (Python)

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Simulate readmission data (same structure as Chapter 12)
np.random.seed(42)
n = 1000

age = np.random.normal(68, 12, n)
length_of_stay = np.random.poisson(5, n) + 1
num_comorbidities = np.random.poisson(3, n)
prior_admissions = np.random.poisson(1, n)
discharge_hgb = np.random.normal(11, 2, n)
discharge_creatinine = np.random.lognormal(0.2, 0.5, n)
has_diabetes = np.random.binomial(1, 0.35, n)
has_chf = np.random.binomial(1, 0.25, n)

X = np.column_stack([age, length_of_stay, num_comorbidities,
                     prior_admissions, discharge_hgb,
                     discharge_creatinine, has_diabetes, has_chf])

prob = 1 / (1 + np.exp(-(-3 + 0.02 * (age - 68) +
                          0.15 * prior_admissions +
                          0.1 * num_comorbidities +
                          0.3 * has_chf -
                          0.1 * discharge_hgb)))
y = np.random.binomial(1, prob)

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define a simple feedforward neural network
# (an explicit Input layer replaces the deprecated input_shape argument)
model = models.Sequential([
    layers.Input(shape=(X_train.shape[1],)),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid")
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.AUC(name="auc")]
)

# Train with early stopping
history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    callbacks=[
        callbacks.EarlyStopping(
            patience=5, restore_best_weights=True
        )
    ],
    verbose=0
)

loss, auc = model.evaluate(X_test, y_test, verbose=0)
print(f"Test loss: {loss:.3f}")
print(f"Test AUC:  {auc:.3f}")

Compare This to Chapter 12

Run the XGBoost model from Chapter 12 on the same data and compare the AUC. You will likely find that XGBoost matches or exceeds the neural network — with less code, less tuning, and immediate access to variable importance. This is the norm for tabular clinical data.

13.4 Key Architectures for Clinical Research

You do not need to understand every layer and parameter to use deep learning effectively. But you do need to understand which architecture suits which data type and why.

13.4.1 Convolutional Neural Networks (CNNs): Learning from Images

CNNs learn spatial hierarchies of features from images. Early layers detect edges and textures; deeper layers combine these into complex patterns (tumour boundaries, retinal vessel structures, skin lesion shapes).

Instead of connecting every neuron to every input, CNNs use convolutional filters — small windows that slide across the image, detecting local patterns. A $3 \times 3$ filter might learn to detect a horizontal edge; stacking many filters across many layers builds up a rich feature hierarchy.
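The sliding-window idea can be sketched in a few lines of plain Python. The example applies a hand-written $3 \times 3$ horizontal-edge filter to a tiny made-up image whose top two rows are dark (0) and bottom three rows are bright (1). In a real CNN the filter values are learned rather than fixed, and the function name `conv2d` here is a hypothetical helper, not a library call.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the
    map of filter responses. Both arguments are lists of rows."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    output = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            # Element-wise product of the kernel and the image patch, summed
            response = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(k)
                for b in range(k)
            )
            row.append(response)
        output.append(row)
    return output

# Made-up 5x5 image: dark region on top, bright region below
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
]

# Hand-written filter that responds to a dark-to-bright horizontal edge
horizontal_edge = [
    [-1, -1, -1],
    [ 0,  0,  0],
    [ 1,  1,  1],
]

feature_map = conv2d(image, horizontal_edge)
for row in feature_map:
    print(row)
# [3, 3, 3]
# [3, 3, 3]
# [0, 0, 0]
```

The filter responds strongly wherever its window straddles the edge and gives zero inside the uniform bright region, which is the "local pattern detector" behaviour described above.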

Landmark clinical applications:

Application	Key Study	Architecture	Performance
Chest X-ray diagnosis	Rajpurkar et al., 2017	DenseNet-121	Radiologist-level pneumonia detection
Diabetic retinopathy	Gulshan et al., JAMA 2016	Inception-v3	AUC 0.991 for referable DR
Skin cancer classification	Esteva et al., Nature 2017	Inception-v3	Dermatologist-level performance
Pathology	Campanella et al., Nature Medicine 2019	ResNet	Weakly supervised cancer detection in whole-slide images

From CNNs to Vision Transformers

Classic CNN architectures (ResNet, DenseNet) dominated medical imaging from 2015 to 2021. Since then, Vision Transformers (ViTs) have emerged as strong alternatives in research. ViTs split images into patches, treat each patch as a “token” (analogous to a word in NLP), and use self-attention to model relationships between patches. However, there is an “architectural gap” between research and clinical deployment: as of early 2026, nearly all FDA-cleared radiology AI devices still use CNNs, not transformers or foundation models (Lancet Digital Health, 2026). In the research literature, CNNs and ViTs achieve comparable performance on many tasks, and hybrid architectures are increasingly common.

13.4.2 Recurrent Networks and Transformers: Learning from Sequences

Recurrent Neural Networks (RNNs) were designed for sequential data — time series, text, and any data where order matters. The network maintains a hidden state updated at each time step, allowing it to retain information from earlier in the sequence.

Long Short-Term Memory (LSTM) networks address the tendency of standard RNNs to lose information over long sequences, a consequence of gradients vanishing during training. They use gating mechanisms to control what information to retain, forget, and output at each step. LSTMs were the dominant architecture for clinical time series from roughly 2015 to 2022.

Transformers have largely superseded RNNs and LSTMs. Instead of processing sequences one element at a time, transformers use self-attention to relate every element to every other element simultaneously. This parallelism makes them faster to train, and the attention mechanism captures long-range dependencies more effectively. Medformer (NeurIPS 2024) is a recent transformer architecture designed specifically for medical time series classification.
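Self-attention is easier to grasp with a small numerical sketch. The plain-Python example below implements scaled dot-product self-attention on a made-up sequence of three time steps with two features each. As a simplifying assumption, it uses the inputs directly as queries, keys, and values; a real transformer first passes them through separate learned projection matrices and uses several attention heads.

```python
import math

def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in v]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with identity projections.

    X is a list of time steps, each a list of d features.
    Returns (attention_weights, outputs).
    """
    d = len(X[0])
    weights = []
    for query in X:
        # Similarity of this time step to every time step in the sequence
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in X
        ]
        weights.append(softmax(scores))

    # Each output is a weighted average of all time steps (the "values")
    outputs = [
        [sum(w * value[j] for w, value in zip(row, X)) for j in range(d)]
        for row in weights
    ]
    return weights, outputs

# Made-up sequence: 3 time steps, 2 features per step
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn, out = self_attention(X)

for row in attn:
    print([round(w, 3) for w in row])
```

Every row of the attention matrix sums to 1, and every time step attends to all others in a single operation, with no step-by-step recurrence.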

Clinical applications of sequential architectures:

Data Type	Example	Current Preferred Architecture
ECG tracings	Arrhythmia detection, digital biomarkers	Transformers
EEG signals	Seizure detection, ICU monitoring	Transformers with attention-based interpretability
ICU vital signs	Mortality prediction, clinical deterioration	Transformers, temporal CNNs
Wearable sensor data	Activity recognition, gait analysis	Temporal CNNs, hybrid models

13.4.3 Large Language Models: Learning from Clinical Text

The transformer architecture is also the foundation of large language models (LLMs). In clinical research, LLMs and their smaller predecessors have been applied to:

Named entity recognition: Identifying diseases, medications, procedures, and lab values in clinical notes. Models like ClinicalBERT and PubMedBERT (~110M parameters) remain workhorses for structured extraction.
Information extraction: Pulling structured data from pathology and radiology reports.
Report generation: Drafting radiology or pathology reports from imaging studies.
Clinical question answering: Med-PaLM 2 achieved 86.5% on MedQA; Med-Gemini reached 91.1%. The MedHELM benchmark (Nature Medicine, 2025) tested 9 frontier LLMs across 121 medical tasks, finding scores of 0.73–0.85 for clinical note generation but only 0.56–0.72 for clinical decision support.

LLMs Are Not Clinical Decision-Making Tools (Yet)

LLMs can hallucinate — generating plausible-sounding but factually incorrect medical information. They lack the ability to reason causally about individual patients. Current evidence supports their use as assistants (drafting notes, extracting structured data, literature search) rather than as autonomous clinical decision-makers. Always verify LLM outputs against primary sources.

13.4.4 Deep Learning for Survival Analysis

If you worked through Chapter 8, you know that time-to-event data requires special handling. Deep learning has extended survival analysis beyond the Cox model:

Model	Approach	Key Feature
DeepSurv (Katzman et al., 2018)	Neural network within Cox PH framework	Learns non-linear risk functions
DeepHit (Lee et al., 2018)	Custom loss; no parametric assumptions	Handles competing risks directly
SurvTRACE (Wang and Sun, 2022)	Transformer-based	Models competing events with attention
DySurv (JAMIA, 2025)	Conditional variational autoencoder	Dynamic risk from longitudinal EHR data

For most clinical survival analysis, Cox regression and random survival forests remain the appropriate starting point. Deep survival models become relevant with high-dimensional inputs (imaging, genomics) or complex temporal patterns.
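To show what "a neural network within the Cox PH framework" means in practice, the sketch below computes the negative Cox partial log-likelihood, the quantity a DeepSurv-style model minimizes, in plain Python. The risk scores stand in for the network's outputs. This is a simplified illustration under stated assumptions: it ignores tied event times and regularization, and the function name and toy data are hypothetical.

```python
import math

def cox_neg_log_partial_likelihood(risk_scores, times, events):
    """Average negative Cox partial log-likelihood.

    risk_scores: predicted log-risk for each patient (e.g. network output)
    times:       observed follow-up time for each patient
    events:      1 if the event was observed, 0 if censored
    """
    n_events = sum(events)
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i == 0:
            continue  # censored patients contribute only through risk sets
        # Risk set: everyone still under observation at time t_i
        risk_set_sum = sum(
            math.exp(h) for h, t in zip(risk_scores, times) if t >= t_i
        )
        total += risk_scores[i] - math.log(risk_set_sum)
    return -total / n_events

# Toy (made-up) data: three patients, all with observed events
times = [1.0, 2.0, 3.0]
events = [1, 1, 1]

good = cox_neg_log_partial_likelihood([2.0, 1.0, 0.0], times, events)
bad = cox_neg_log_partial_likelihood([0.0, 1.0, 2.0], times, events)
print(f"higher risk for earlier events: {good:.3f}")
print(f"higher risk for later events:   {bad:.3f}")
```

The loss is lower when patients who fail earlier receive higher risk scores, which is the ordering the network is trained to produce. Replacing a linear predictor with a network output is what lets the model learn non-linear risk functions.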

13.5 Transfer Learning and Foundation Models

Transfer learning is the single most important practical technique in deep learning for health research. The idea is simple: take a model that has already learned useful features from a large dataset, and adapt it to your specific task with a much smaller dataset.

13.5.1 How Transfer Learning Works

Start with a pretrained model: A CNN pretrained on ImageNet (about 1.3 million labelled natural images in the standard 1,000-class subset of the full 14-million-image collection) has learned to detect edges, textures, shapes, and objects. These features transfer surprisingly well to medical images.
Replace the output layer: Swap the final classification head with one that predicts your clinical outcome.
Fine-tune: Train the modified model on your clinical dataset. You can freeze the pretrained layers (fast, works with very small datasets) or fine-tune all layers with a small learning rate (better performance, requires more data).

13.5.2 Foundation Models in Medicine

The field is moving beyond ImageNet toward domain-specific foundation models pretrained on medical data:

Model	Domain	Training Data	Published
MedSAM	Image segmentation	1.57M image-mask pairs, 10 modalities	Nature Communications, 2024
BiomedCLIP	Vision-language	15M biomedical image-text pairs	Microsoft, 2023
RETFound	Ophthalmology	1.6M retinal images, self-supervised	Nature, 2023
UNI	Pathology	100K+ whole-slide images	Nature Medicine, 2024
Virchow	Pathology	1.5M whole-slide images	Paige/Microsoft, 2024

Practical Advice for Clinical Researchers

If you want to apply deep learning to medical images, do not train from scratch. Use a pretrained foundation model (MedSAM for segmentation, BiomedCLIP for classification) and fine-tune on your labelled data. With even a few hundred annotated images, you can achieve strong performance. This is the practical path for most clinical research groups.

13.5.3 Transfer Learning in Practice

The following example shows the complete workflow for fine-tuning a pretrained ResNet50 for binary image classification. In practice, you would replace the data loading step with your own clinical images.

Code (R)

library(keras3)

# Load pretrained ResNet50 (trained on ImageNet, without classification head)
base_model <- application_resnet50(
  weights = "imagenet",
  include_top = FALSE,
  input_shape = c(224, 224, 3)
)
freeze_weights(base_model)

# Add augmentation, ResNet preprocessing, and a new head for the binary outcome.
# The random layers are active only during training, so validation images
# are not augmented.
inputs <- keras_input(shape = c(224, 224, 3))
x <- inputs %>%
  layer_random_flip(mode = "horizontal") %>%
  layer_random_rotation(factor = 20 / 360)  # rotate by up to +/- 20 degrees
# Apply the ImageNet preprocessing that ResNet50 expects (not a 1/255 rescale)
x <- application_preprocess_inputs(base_model, x)
outputs <- base_model(x, training = FALSE) %>%
  layer_global_average_pooling_2d() %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 1, activation = "sigmoid")
model <- keras_model(inputs, outputs)

model %>% compile(
  optimizer = optimizer_adam(learning_rate = 1e-4),
  loss = "binary_crossentropy",
  metrics = "AUC"
)

# Point to your image directory organized as: images/class_0/ and images/class_1/
# keras3 loads images with image_dataset_from_directory(); using the same seed
# for both calls keeps the training and validation subsets disjoint
train_ds <- image_dataset_from_directory(
  "path/to/images/", label_mode = "binary",
  image_size = c(224, 224), batch_size = 32,
  validation_split = 0.2, subset = "training", seed = 123
)

val_ds <- image_dataset_from_directory(
  "path/to/images/", label_mode = "binary",
  image_size = c(224, 224), batch_size = 32,
  validation_split = 0.2, subset = "validation", seed = 123
)

# Train the new head with early stopping
history <- model %>% fit(
  train_ds, epochs = 20,
  validation_data = val_ds,
  callbacks = list(
    callback_early_stopping(
      patience = 3, restore_best_weights = TRUE
    )
  )
)

Code (Python)

import tensorflow as tf
from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, models, callbacks
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Load pretrained ResNet50 (trained on ImageNet, without classification head)
base_model = ResNet50(weights="imagenet", include_top=False,
                      input_shape=(224, 224, 3))
base_model.trainable = False

# Add new classification head for binary outcome
model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid")
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])

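# ResNet50's ImageNet weights expect their own preprocessing (not a 1/255 rescale)
preprocess = tf.keras.applications.resnet50.preprocess_input

# Data augmentation (rotation, horizontal flipping) for training images only
train_datagen = ImageDataGenerator(
    preprocessing_function=preprocess, rotation_range=20,
    horizontal_flip=True, validation_split=0.2
)
# Validation images get the same preprocessing and split, but no augmentation
val_datagen = ImageDataGenerator(
    preprocessing_function=preprocess, validation_split=0.2
)

# Point to your image directory organized as: images/class_0/ and images/class_1/
train_gen = train_datagen.flow_from_directory(
    "path/to/images/", target_size=(224, 224),
    batch_size=32, class_mode="binary", subset="training"
)

val_gen = val_datagen.flow_from_directory(
    "path/to/images/", target_size=(224, 224),
    batch_size=32, class_mode="binary", subset="validation"
)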

# Fine-tune with early stopping
history = model.fit(
    train_gen, epochs=20, validation_data=val_gen,
    callbacks=[
        callbacks.EarlyStopping(
            patience=3, restore_best_weights=True
        )
    ]
)

13.6 Practical Considerations

13.6.1 When to Use (and Not Use) Deep Learning

Scenario	Recommendation
Tabular EHR data, <50 features	Logistic regression or XGBoost
Tabular data combined with images or text	DL for the unstructured component; consider multimodal fusion
Medical images (X-ray, CT, MRI, pathology)	DL with transfer learning
Clinical notes, radiology reports	Transformer-based models
ECG, EEG, ICU time series	DL is appropriate; consider transformers or temporal CNNs
Survival analysis, standard covariates	Cox regression or random survival forests first

13.6.2 Data Requirements

Approach	Typical Data Needed
Training from scratch	Tens of thousands of labelled examples
Transfer learning (fine-tuning)	Hundreds to low thousands
Foundation model adaptation	As few as 50–100 examples for simple tasks
Self-supervised pretraining	Large amounts of unlabelled data

13.6.3 Compute

Approach	Hardware
Fine-tuning a pretrained model	Single GPU; Google Colab (free) works
Training a moderate model from scratch	1–4 GPUs; cloud instance ~$2–4/hour
Training a foundation model	GPU cluster; not realistic for most research groups
Running inference	CPU is often sufficient

13.6.4 Reporting Deep Learning Studies

If you publish research involving deep learning, several reporting guidelines apply:

TRIPOD+AI (Collins et al., BMJ 2024): 27-item checklist for prediction models using regression or ML. Applies to all prediction model studies.
CLAIM (Mongan, Moy, and Kahn, Radiology: AI 2020): Checklist for AI in Medical Imaging. Covers study design, data, model, evaluation, and discussion.
CONSORT-AI / SPIRIT-AI (Liu et al. and Rivera et al., Nature Medicine 2020): Extensions for reporting randomised trials and protocols involving AI.
MINIMAR (Hernandez-Boussard et al., JAMIA 2020): Minimum information for medical AI reporting.

13.6.5 Regulatory Context

As of early 2026, the FDA has authorized over 1,350 AI-enabled medical devices, with 76% in radiology. Nearly all are Class II devices cleared via the 510(k) pathway. The EU AI Act, which entered into force in August 2024, classifies AI-enabled medical devices as high-risk; most of its provisions apply from August 2026, and the high-risk obligations for AI embedded in regulated products such as medical devices apply from August 2027. If you are developing a model intended for clinical deployment, regulatory awareness is essential from the outset.

13.7 Challenges and Limitations

The External Validation Problem

A systematic review of 86 deep learning algorithms in radiology found that 81% exhibited decreased accuracy on external datasets, with nearly a quarter experiencing a substantial drop of 0.10 or greater in AUC. Domain shift — differences in patient demographics, imaging equipment, acquisition protocols, and disease prevalence between development and deployment sites — remains the greatest obstacle to clinical translation.
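The comparison behind these figures is simple to reproduce for your own model: compute the AUC on the internal test set and again on an external dataset, then report the difference. The sketch below does this in plain Python using the pairwise definition of AUC (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case). The labels and scores are made-up toy values, not results from any study, and `auc` is a hypothetical helper written for this example.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive case has the higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))

# Toy (made-up) predictions on an internal test set and an external site
internal_labels = [0, 0, 1, 1]
internal_scores = [0.10, 0.40, 0.35, 0.80]

external_labels = [0, 0, 1, 1]
external_scores = [0.30, 0.60, 0.20, 0.70]

internal_auc = auc(internal_labels, internal_scores)
external_auc = auc(external_labels, external_scores)
drop = internal_auc - external_auc

print(f"Internal AUC: {internal_auc:.2f}")
print(f"External AUC: {external_auc:.2f}")
print(f"Drop:         {drop:.2f}")
```

With real data you would use a tested library implementation and report confidence intervals, but the quantity of interest is the same difference.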

Fairness and bias: A 2024 study in Nature Medicine demonstrated that even if a model is optimized for fairness at a single site, fairness does not transfer to out-of-distribution datasets. Subgroup analysis across age, sex, and ethnicity is essential.

Interpretability: A systematic review of 67 studies (2019–2024) found that the addition of explainability methods (Grad-CAM, saliency maps) provided no statistically significant improvement in diagnostic accuracy beyond the AI prediction itself. Interpretability tools are valuable for debugging and trust-building, but they are not a substitute for rigorous external validation.

Temporal degradation: Clinical AI models can degrade over time as clinical practice, coding conventions, and patient populations change. Post-deployment monitoring is essential but rarely implemented.

13.8 Exercises

13.8.1 Exercise 1: Neural Network vs XGBoost on Tabular Data

Using the readmission dataset from this chapter, fit both a neural network and an XGBoost model. Compare their performance using 5-fold cross-validated AUC.

Code (R)

library(keras3)
library(tidymodels)
library(xgboost)

# Use the readmit_data from above (or re-simulate it)
set.seed(42)
n <- 1000

readmit_data <- tibble(
  age = rnorm(n, 68, 12),
  length_of_stay = rpois(n, 5) + 1,
  num_comorbidities = rpois(n, 3),
  prior_admissions = rpois(n, 1),
  discharge_hgb = rnorm(n, 11, 2),
  discharge_creatinine = rlnorm(n, 0.2, 0.5),
  has_diabetes = rbinom(n, 1, 0.35),
  has_chf = rbinom(n, 1, 0.25)
)

readmit_prob <- plogis(-3 + 0.02 * (readmit_data$age - 68) +
                         0.15 * readmit_data$prior_admissions +
                         0.1 * readmit_data$num_comorbidities +
                         0.3 * readmit_data$has_chf -
                         0.1 * readmit_data$discharge_hgb)
readmit_data$readmitted <- factor(rbinom(n, 1, readmit_prob),
                                   labels = c("No", "Yes"))

# Your code:
# 1. Set up a tidymodels XGBoost workflow with 5-fold CV
# 2. Train a Keras neural network with 5-fold CV (use a loop over folds)
# 3. Compare AUC from both approaches
# 4. Which model performs better? Does this surprise you?

Code (Python)

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import xgboost as xgb
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks
from sklearn.metrics import roc_auc_score

np.random.seed(42)
n = 1000

# Re-simulate readmission data (same as above)
# Your code:
# 1. Run 5-fold CV with XGBClassifier using cross_val_score
# 2. Run 5-fold CV with a Keras model (manual loop)
# 3. Compare mean AUC across folds
# 4. Which model performs better? Does this surprise you?

13.8.2 Exercise 2: Architecture Matching

For each of the following clinical tasks, identify (a) the most appropriate deep learning architecture, (b) whether deep learning is likely to outperform gradient-boosted trees, and (c) the most relevant reporting guideline. Write 2–3 sentences justifying each answer.

Predicting 30-day mortality from 15 structured EHR variables (age, labs, vitals, comorbidities).
Classifying skin lesions as benign or malignant from dermoscopy images.
Detecting atrial fibrillation from 12-lead ECG tracings.
Extracting medication names from unstructured discharge summaries.
Predicting length of stay from a combination of structured EHR data and a chest X-ray at admission.

13.8.3 Exercise 3: Critical Appraisal of a Deep Learning Study

Find a recent (2024 or later) paper that applies deep learning to a clinical task in your area of interest. Evaluate it against the CLAIM or TRIPOD+AI checklist:

Was the model externally validated? If so, how did performance compare to internal validation?
Were subgroup analyses reported (by age, sex, ethnicity)?
Was the model compared to a simpler baseline (e.g., logistic regression)?
Were the training data, code, and model weights made available?
Based on your assessment, how close is this model to clinical deployment? What would you want to see before trusting it with patient care?

13.9 Summary

Concept	Key Takeaway
When to use deep learning	Images, text, time series, and sequences — not tabular data
Neural network basics	Weighted sums + non-linear activations, stacked in layers
CNNs	Learn spatial features from images via convolutional filters
Transformers	Use self-attention to model relationships in sequences; dominant for NLP and increasingly for imaging
LLMs in medicine	Strong for extraction and generation; not reliable for autonomous decision-making
Transfer learning	Start from pretrained weights; fine-tune on your data
Foundation models	Domain-specific pretrained models (MedSAM, RETFound, UNI) dramatically reduce data requirements
External validation	81% of radiology DL models degrade on external data
Reporting	Use TRIPOD+AI, CLAIM, CONSORT-AI, or MINIMAR as appropriate

13.10 References and Further Reading

Goodfellow IJ, Bengio Y, Courville A. Deep Learning. MIT Press, 2016. Freely available at deeplearningbook.org. The definitive theoretical reference. Chapters 6 (deep feedforward networks), 9 (CNNs), and 10 (RNNs) are directly relevant.

Zhang A, Lipton ZC, Li M, Smola AJ. Dive into Deep Learning. Cambridge University Press, 2023. Freely available at d2l.ai. An interactive textbook with executable code in PyTorch, TensorFlow, and JAX. More practical and accessible than Goodfellow et al. Adopted at over 500 universities.

Howard J, Gugger S. Deep Learning for Coders with fastai and PyTorch. O’Reilly, 2020. Companion to the free fast.ai course. Top-down, coding-first approach ideal for researchers who want to apply deep learning without years of theory.

Grinsztajn L, Oyallon E, Varoquaux G. Why do tree-based models still outperform deep learning on typical tabular data? NeurIPS 2022. The benchmark study demonstrating that XGBoost consistently outperforms neural networks on tabular datasets. Essential reading before choosing deep learning for structured clinical data.

Ma J, He Y, Li F, et al. Segment anything in medical images. Nature Communications 2024;15:654. The MedSAM paper: a foundation model for universal medical image segmentation trained on 1.57 million image-mask pairs across 10 modalities.

Zhou Y, Chia MA, Wagner SK, et al. A foundation model for generalizable disease detection from retinal images. Nature 2023. RETFound: self-supervised pretraining on 1.6 million retinal images, with strong downstream performance from minimal fine-tuning.

Collins GS, et al. TRIPOD+AI statement. BMJ 2024;385:e078378. The current reporting standard for prediction models using regression or machine learning.

Bedi S, Cui H, Fuentes M, et al. MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks. Nature Medicine 2025. The most comprehensive LLM evaluation for clinical tasks, testing 9 frontier models across 121 tasks.

Wiegrebe S, et al. Deep learning for survival analysis: a review. Artificial Intelligence Review, Springer, 2024. Comprehensive taxonomy covering DeepSurv, DeepHit, neural ODEs, and transformer-based approaches.

For hands-on practice, Stanford CS231n (computer vision) and CS224n (NLP) offer free lecture materials. MIT 6.S191 (Introduction to Deep Learning) at introtodeeplearning.com provides accessible video lectures updated annually.
