Biostatistical prediction thinking
The course does not treat machine learning as button-clicking. It explains what a prediction target is, when predictors are measured, how outcomes are defined and why validation must match the clinical question.
Introductory Machine Learning in Biostatistics
A structured course for students who want to understand prediction modelling, validation, overfitting, calibration, clinical usefulness and responsible machine learning in medical research.
The course combines statistical thinking, R-based modelling, applied interpretation and health-data examples so students learn not only how models are fitted, but how they should be judged.
Course aim
The course is built around medical machine learning as a disciplined biostatistical workflow, not a collection of algorithms.
Course snapshot
5 core modules
25 structured lessons
R: browser-based practice
5 applied case studies
Students learn why a simple, validated model can be more useful than a complex model that relies on leaked information, overfits the training data, or performs poorly on new patients.
Selected lessons include R and WebR-style practice so students can connect theory with real modelling workflows while still focusing on interpretation.
Course structure
Start with the language of prediction, then move through supervised learning, model evaluation, regularisation, ensembles and applied health-data case studies.
Module 01: Build the language of prediction, supervised learning, train/test validation, overfitting, leakage, thresholds and responsible biostatistical ML workflow.
Module 02: Study regression as prediction, logistic classification, k-nearest neighbours, decision trees and clinical modelling pipelines.
Module 03: Learn train/test splits, cross-validation, bootstrap validation, classification metrics, calibration, leakage and reproducibility.
Module 04: Move into ridge, lasso, elastic net, random forests, gradient boosting, support vector machines and responsible model comparison.
Module 05: Apply the full workflow to risk prediction, survival outcomes, omics, missing data, imbalance, fairness and transparent reporting.
What makes this course different
Clinical prediction rather than generic machine learning
Validation, calibration and usefulness explained carefully
Overfitting and data leakage treated as central topics
R-based workflow with interpretation-first teaching
Case studies based on health-data-style modelling questions
Clear links between statistics, biostatistics and ML
Case studies
Case studies turn the modelling workflow into report-style interpretation with figures, metrics, clinical judgement, limitations and transparent conclusions.
Course assets
The course includes a shared dataset, local R scripts, selected WebR lesson labs and generated figures used inside lessons and case-study pages.
Start the course
Start with prediction thinking, predictor timing, validation, overfitting, leakage and the responsible reporting workflow.