Introductory Machine Learning in Biostatistics

Machine learning for health data, clinical prediction and biostatistical modelling.

A structured course for students who want to understand prediction modelling, validation, overfitting, calibration, clinical usefulness and responsible machine learning in medical research.

The course combines statistical thinking, R-based modelling, applied interpretation and health-data examples so students learn not only how models are fitted, but how they should be judged.

Course aim

Prediction, validation and interpretation for health data.

The course is built around medical machine learning as a disciplined biostatistical workflow, not a collection of algorithms.

Course snapshot

Core modules

Structured lessons

Browser-based practice

Applied case studies

Biostatistical prediction thinking

The course does not treat machine learning as button-clicking. It explains what a prediction target is, when predictors are measured, how outcomes are defined and why validation must match the clinical question.
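The point about predictor timing can be made concrete with a minimal sketch. It is written in Python only to keep it self-contained (the course itself works in R), and every predictor name and measurement day below is a hypothetical illustration, not part of the course dataset. The idea is that a predictor may enter the model only if it is already known at the moment the prediction is made.

```python
# Hypothetical predictors with the day each one is measured, relative to the
# prediction time (day 0 = the moment the model would be used in practice).
measured_on_day = {
    "age": 0,
    "baseline_glucose": 0,
    "bmi": -30,                  # measured a month before the prediction
    "discharge_diagnosis": 14,   # only recorded after the prediction is needed
    "followup_hba1c": 90,        # only recorded after the prediction is needed
}


def usable_predictors(measured_on_day, prediction_day=0):
    """Keep only predictors that exist at or before the prediction time."""
    return sorted(
        name for name, day in measured_on_day.items() if day <= prediction_day
    )


print(usable_predictors(measured_on_day))
# ['age', 'baseline_glucose', 'bmi']
```

Variables recorded after the prediction time would leak future information into the model, so they are excluded before any fitting begins.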

Validation before complexity

Students learn why a simple validated model can be more useful than a complex model that leaks information, overfits, or performs poorly on new patients.
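A small simulation can illustrate why this matters. The sketch below is in Python rather than the course's R, and it uses entirely simulated, hypothetical data: one biomarker, a true rule of "positive when above 0.5", and labels flipped 20% of the time. It compares a memorising model (1-nearest neighbour) with a single-cutoff rule on both training data and new patients.

```python
import random


def simulate(n, rng, noise=0.2):
    """Hypothetical data: one biomarker x in [0, 1]. The true rule is x > 0.5,
    but each label is flipped with probability `noise`."""
    rows = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        rows.append((x, y))
    return rows


def accuracy(rows, predict):
    """Share of rows where the prediction matches the observed label."""
    return sum(predict(x) == y for x, y in rows) / len(rows)


def predict_1nn(train, x):
    """Memorising model: copy the label of the closest training patient."""
    return min(train, key=lambda row: abs(row[0] - x))[1]


def fit_cutoff(train):
    """Simple model: pick the single cutoff with the best training accuracy."""
    candidates = [i / 20 for i in range(1, 20)]
    return max(candidates, key=lambda c: accuracy(train, lambda x: int(x > c)))


rng = random.Random(2024)
train = simulate(200, rng)
test = simulate(1000, rng)   # "new patients" never seen during fitting

cutoff = fit_cutoff(train)
knn_train = accuracy(train, lambda x: predict_1nn(train, x))
knn_test = accuracy(test, lambda x: predict_1nn(train, x))
rule_train = accuracy(train, lambda x: int(x > cutoff))
rule_test = accuracy(test, lambda x: int(x > cutoff))

print(f"1-NN   train={knn_train:.2f}  test={knn_test:.2f}")
print(f"cutoff train={rule_train:.2f}  test={rule_test:.2f}")
```

The memorising model reproduces every training label, noise included, so its training accuracy says nothing about how it will behave on new patients. Only the test figures are a fair basis for comparing the two models.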

R-based practical learning

Selected lessons include R and WebR-style practice so students can connect theory with real modelling workflows while still focusing on interpretation.

Course structure

Five modules from foundations to applied medical ML.

Start with the language of prediction, then move through supervised learning, model evaluation, regularisation, ensembles and applied health-data case studies.

Available · 5 lessons

Foundations of Machine Learning in Biostatistics

Build the language of prediction, supervised learning, train/test validation, overfitting, leakage, thresholds and responsible biostatistical ML workflow.

Preparing · 5 lessons

Supervised Learning for Clinical and Health Data

Study regression as prediction, logistic classification, k-nearest neighbours, decision trees and clinical modelling pipelines.

Preparing · 5 lessons

Model Evaluation, Validation and Performance

Learn train/test splits, cross-validation, bootstrap validation, classification metrics, calibration, leakage and reproducibility.
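As a taste of what this module covers, here is a minimal sketch of k-fold cross-validation. It is written in Python for self-containment (the course uses R), the outcomes are simulated, and the "model" is deliberately trivial: it predicts the training-fold prevalence for everyone. The function name `k_fold_indices` is a hypothetical helper, not a library call.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs so that every observation is used
    for testing exactly once across the k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Simulated binary outcomes with roughly 30% prevalence (hypothetical data).
rng = random.Random(7)
outcomes = [int(rng.random() < 0.3) for _ in range(100)]

fold_brier = []
for train_idx, test_idx in k_fold_indices(len(outcomes), k=5):
    # "Fit" on the training folds only: estimate the outcome prevalence.
    prevalence = sum(outcomes[j] for j in train_idx) / len(train_idx)
    # Evaluate on the held-out fold with the Brier score (mean squared error
    # between predicted probability and observed outcome).
    brier = sum((prevalence - outcomes[j]) ** 2 for j in test_idx) / len(test_idx)
    fold_brier.append(brier)

mean_brier = sum(fold_brier) / len(fold_brier)
print(f"Cross-validated Brier score: {mean_brier:.3f}")
```

Everything that is estimated from data, here only the prevalence, is computed inside the loop from the training folds alone. Estimating it once on the full dataset would let the held-out fold influence the model, which is a form of leakage.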

Preparing · 5 lessons

Regularisation, Ensembles and Modern Prediction Models

Move into ridge, lasso, elastic net, random forests, gradient boosting, support vector machines and responsible model comparison.

Preparing · 5 lessons

Applied Biostatistical ML Case Studies

Apply the full workflow to risk prediction, survival outcomes, omics, missing data, imbalance, fairness and transparent reporting.

What makes this course different

Designed for responsible prediction, not shortcuts.

Clinical prediction rather than generic machine learning

Validation, calibration and usefulness explained carefully

Overfitting and data leakage treated as central topics

R-based workflow with interpretation-first teaching

Case studies based on health-data-style modelling questions

Clear links between statistics, biostatistics and ML

Case studies

Applied medical ML reports.

Case studies turn the modelling workflow into report-style interpretation with figures, metrics, clinical judgement, limitations and transparent conclusions.

Course assets

Data, scripts and figures.

The course includes a shared dataset, local R scripts, selected WebR lesson labs and generated figures used inside lessons and case-study pages.

Downloads: shared diabetes CSV; R scripts for Lessons 1.1, 1.2, 1.3, 1.4 and 1.5; R script for Case Study 1.

Start the course

Begin with Module 1: Foundations of Machine Learning in Biostatistics.

Start with prediction thinking, predictor timing, validation, overfitting, leakage and the responsible reporting workflow.
