Resampling
- Create new samples from an existing dataset to estimate statistics (mean, variance) and their variability.
- Evaluate model generalization and compare models using repeated data splits.
- Two primary approaches: bootstrapping (sampling with replacement) and cross-validation (k-fold splits).
Definition
Resampling is a statistical method used to analyze and understand a dataset by generating new samples from it. It allows for the estimation of statistical properties, such as mean and variance, and can be used to test hypotheses, assess model performance, and select the best models for a given dataset.
Explanation
Resampling produces new samples from an observed dataset to draw inferences about statistical properties or model performance without collecting additional data. The two main approaches described are:
- Bootstrapping: Randomly sample with replacement from the original dataset to create new samples (bootstrap samples). Repeating this process many times yields a collection of bootstrap samples whose sample statistics (for example, mean and standard deviation) form empirical distributions used to estimate the corresponding properties of the original dataset.
- Cross-validation: Divide the dataset into k equal-sized folds. Train the model on k−1 folds and evaluate it on the remaining fold; repeat this process k times so each fold serves once as the evaluation set. Average performance metrics (for example, accuracy and F1 score) across the k iterations to estimate the model’s generalization performance.
Examples
Bootstrapping example
Consider a dataset of 10 individuals with their respective heights (in inches). The sample mean height is 67 inches and the sample standard deviation is 4 inches. To generate a bootstrap sample, randomly select 10 heights from the original dataset with replacement and calculate the mean and standard deviation of the new sample. If this process is repeated 100 times, there will be 100 bootstrap samples with corresponding means and standard deviations. The empirical distribution of those bootstrap means and standard deviations estimates the sampling variability of the original statistics, for example how much the sample mean would fluctuate across repeated samples.
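The example above can be sketched in a few lines of Python using only the standard library. The heights below are hypothetical values chosen for illustration; they are not the actual data behind the stated mean of 67 and standard deviation of 4.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is reproducible

# Hypothetical heights (inches) for 10 individuals.
heights = [61, 64, 65, 66, 67, 67, 68, 70, 71, 75]

n_boot = 100
boot_means = []
for _ in range(n_boot):
    # Sample with replacement, same size as the original dataset.
    sample = random.choices(heights, k=len(heights))
    boot_means.append(statistics.mean(sample))

# The spread of the 100 bootstrap means estimates the
# sampling variability of the original sample mean.
boot_mean = statistics.mean(boot_means)
boot_se = statistics.stdev(boot_means)
```

Sorting or plotting `boot_means` gives the empirical distribution described above; `boot_se` is the bootstrap estimate of the standard error of the mean.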
Cross-validation example
Suppose a dataset of 1000 observations is used to build a classification model to predict whether an individual will have heart disease. Using 10-fold cross-validation, divide the dataset into 10 equal-sized folds, with 100 observations in each fold. Train the model on the first 9 folds and evaluate it on the 10th fold, then repeat so each of the 10 folds serves once as the evaluation set. Calculate the accuracy for each iteration and average the results to obtain an overall estimate of the model’s performance. This assesses the model’s ability to generalize to new data rather than only its performance on the training data.
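A minimal sketch of this 10-fold procedure, again using only the standard library: the data are randomly generated stand-ins for the hypothetical heart-disease dataset, and the "model" is deliberately trivial (it predicts the majority class in the training folds) so the focus stays on the fold logic.

```python
import random

random.seed(1)

# Stand-in dataset: 1000 observations of (feature, binary label).
data = [(random.random(), random.randint(0, 1)) for _ in range(1000)]

k = 10
random.shuffle(data)
fold_size = len(data) // k  # 100 observations per fold
folds = [data[i * fold_size:(i + 1) * fold_size] for i in range(k)]

accuracies = []
for i in range(k):
    # Fold i is held out for evaluation; the other k-1 folds are for training.
    test_fold = folds[i]
    train = [row for j, fold in enumerate(folds) if j != i for row in fold]

    # Trivial stand-in "model": predict the majority class of the training folds.
    ones = sum(label for _, label in train)
    majority = 1 if ones * 2 >= len(train) else 0

    correct = sum(1 for _, label in test_fold if label == majority)
    accuracies.append(correct / len(test_fold))

# Average accuracy across the k held-out folds estimates generalization performance.
cv_accuracy = sum(accuracies) / k
```

In practice the training step would fit a real classifier (for example with scikit-learn, whose `KFold` and `cross_val_score` utilities implement this loop), but the splitting, rotation, and averaging are exactly what this sketch shows.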
Use cases
- Estimating statistical properties (mean, variance) of a dataset.
- Testing hypotheses using empirical sampling distributions.
- Assessing machine learning model performance and selecting among models.
Notes or pitfalls
- Bootstrapping depends on the assumption that the original sample is representative of the population; if that assumption fails, bootstrap estimates may be unreliable.
- Cross-validation can be less effective for small datasets, where splitting into k folds leaves too few observations per fold for stable training and evaluation; leave-one-out cross-validation is sometimes used in that setting.
Related terms
- Bootstrapping
- Bootstrap sample
- Cross-validation
- k-fold cross-validation