Bootstrap

  • Approximate a statistic’s sampling distribution by repeatedly resampling with replacement from the original sample.
  • Use the resulting distribution of statistics (e.g., means) to compute standard errors, confidence intervals, or p-values when the population distribution is unknown.
  • Applicable to many statistics (mean, median, standard deviation) and can be combined with hypothesis testing or regression analysis.

Bootstrap is a statistical method for estimating the sampling distribution of a statistic by resampling. It involves repeatedly sampling with replacement from the observed dataset, calculating the statistic of interest for each resample, and then using the resulting collection of statistics to estimate the sampling distribution.

Bootstrapping generates an empirical approximation of a statistic’s sampling distribution by drawing repeated samples with replacement from the observed data. For each bootstrap sample, the statistic of interest is computed. Repeating this process many times (for example, 1000 iterations) produces a distribution of bootstrap statistics. Summary measures of that distribution (such as its mean and standard deviation) can be used to form confidence intervals, calculate standard errors, or obtain p-values. A key advantage is that bootstrapping relies only on the sample data, so it can estimate sampling distributions even when the underlying population distribution is unknown or complex.
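
As a rough sketch of this procedure, the following Python example (using NumPy, with a small made-up data array and a helper function named `bootstrap_statistics` introduced purely for illustration) implements the resample-compute-repeat loop and uses the spread of the bootstrap statistics as a standard-error estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_statistics(data, statistic, n_boot=1000):
    """Return n_boot bootstrap replicates of `statistic`, each computed on a
    sample drawn with replacement from `data`."""
    data = np.asarray(data)
    n = len(data)
    return np.array([
        statistic(rng.choice(data, size=n, replace=True))
        for _ in range(n_boot)
    ])

# Illustrative data (not from the article); bootstrap the sample mean
sample = np.array([4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.2])
boot_means = bootstrap_statistics(sample, np.mean)

# The standard deviation of the bootstrap means estimates the standard error
print("bootstrap standard error of the mean:", boot_means.std(ddof=1))
```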

Estimating a population mean from a sample of heights

Suppose we are interested in estimating the mean height of a population of students. We take a sample of 10 students and measure their heights, finding a sample mean of 170 cm. We can use bootstrapping to estimate the sampling distribution of the sample mean, which can then be used to construct confidence intervals or perform hypothesis tests.

To do this, we first create a bootstrap sample by sampling with replacement from the original sample of 10 students. This means that we randomly select one of the 10 heights, record it, and then put it back in the sample so it can be selected again. We repeat this process a large number of times (e.g. 1000), creating a new sample of 10 heights for each iteration.

For each bootstrap sample, we calculate the sample mean. This results in a sample of 1000 sample means, which we can use to estimate the sampling distribution of the sample mean. For example, we can calculate the mean and standard deviation of the bootstrap sample means, which can be used to construct a confidence interval for the population mean.
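
One way this example could look in Python (again using NumPy; the ten height values below are hypothetical, chosen only so that the sample mean matches the 170 cm mentioned above) is:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sample of 10 student heights in cm (sample mean = 170.0)
heights = np.array([165, 172, 168, 175, 170, 169, 173, 166, 171, 171])

n_boot = 1000
boot_means = np.array([
    rng.choice(heights, size=heights.size, replace=True).mean()
    for _ in range(n_boot)
])

# Summaries of the bootstrap distribution of the sample mean
se = boot_means.std(ddof=1)                  # bootstrap standard error
ci = np.percentile(boot_means, [2.5, 97.5])  # 95% percentile confidence interval
print(f"SE ~ {se:.2f} cm, 95% CI ~ ({ci[0]:.1f} cm, {ci[1]:.1f} cm)")
```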

Suppose we are interested in testing the hypothesis that the population mean height is equal to 175 cm. We can use bootstrapping to approximate the sampling distribution of the sample mean under the null hypothesis and then use this distribution to calculate a p-value. One common approach is to shift the sample so that its mean equals the hypothesized value, draw bootstrap samples from the shifted data, and take the p-value to be the proportion of bootstrap sample means that are at least as extreme as the observed sample mean.
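
A sketch of this test in Python follows; it reuses the hypothetical heights from above and the "shift to the null" construction, which is one of several ways to bootstrap a p-value.

```python
import numpy as np

rng = np.random.default_rng(7)

heights = np.array([165, 172, 168, 175, 170, 169, 173, 166, 171, 171])
mu0 = 175.0                       # null hypothesis: population mean is 175 cm
observed_mean = heights.mean()    # 170 cm

# Shift the sample so its mean equals mu0, i.e. so the null hypothesis holds,
# then bootstrap the sample mean from the shifted data
shifted = heights - observed_mean + mu0
n_boot = 10_000
null_means = np.array([
    rng.choice(shifted, size=shifted.size, replace=True).mean()
    for _ in range(n_boot)
])

# Two-sided p-value: fraction of bootstrap means at least as far from mu0
# as the observed sample mean is
p_value = np.mean(np.abs(null_means - mu0) >= abs(observed_mean - mu0))
print("bootstrap p-value:", p_value)
```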

Suppose we have a dataset with two variables, x and y, and we want to fit a linear regression model to predict y based on x. We can use bootstrapping to estimate the sampling distribution of the regression coefficients, which can then be used to construct confidence intervals for the coefficients or perform hypothesis tests.
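
A minimal sketch of this idea, assuming simulated (x, y) data and ordinary least squares via np.polyfit (the article does not specify a dataset or fitting routine), resamples (x, y) pairs together and collects the fitted coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data standing in for the (x, y) dataset described above
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=50)

n_boot = 1000
coefs = np.empty((n_boot, 2))                     # columns: slope, intercept
for b in range(n_boot):
    idx = rng.integers(0, x.size, size=x.size)    # resample pairs with replacement
    coefs[b] = np.polyfit(x[idx], y[idx], deg=1)  # least-squares fit of y on x

slope_ci = np.percentile(coefs[:, 0], [2.5, 97.5])
intercept_ci = np.percentile(coefs[:, 1], [2.5, 97.5])
print("95% CI for slope:", slope_ci)
print("95% CI for intercept:", intercept_ci)
```

Resampling (x, y) pairs, as above, is one common variant; resampling residuals around the fitted line is another.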

  • Estimating the sampling distribution of statistics such as the mean, median, or standard deviation.
  • Constructing confidence intervals and obtaining standard errors when the population distribution is unknown or complex.
  • Performing hypothesis tests (e.g., computing p-values) using the empirical bootstrap distribution.
  • Estimating sampling distributions of regression coefficients in regression analysis.
  • A key advantage of bootstrapping is that it relies only on the sample data rather than assumptions about the population, so it can be used when the underlying population distribution is unknown or complex.
  • Bootstrapping is a powerful and flexible statistical method that can provide more accurate and robust estimates than alternative methods in many situations.

Related terms

  • Sampling distribution
  • Resampling
  • Confidence interval
  • Hypothesis testing
  • Regression analysis
  • Sample mean
  • Median
  • Standard deviation