Imputation

TL;DR

Replaces missing values with estimates so more observations can be used.
Helps avoid bias and unreliable results that arise from missing data.
Common methods include mean substitution and multiple imputation.

Definition

Imputation is the process of replacing missing data with estimated values so that incomplete observations can stay in the analysis, preserving the sample size and reducing the bias that missing data can cause.

Explanation

Missing data can lead to biased and unreliable results, particularly in statistical analyses. Imputation fills in those missing values with estimates so that analysts can include all available observations rather than excluding cases with missing entries. Different imputation methods produce different estimated values, so the choice of method affects the accuracy and reliability of the results.
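
To make the cost of excluding incomplete cases concrete, the following minimal sketch counts how many records remain when every row with a missing entry is dropped (often called listwise deletion). The records, the field names, and the helper `complete_cases` are hypothetical and chosen only for illustration.

```python
from typing import Optional

# Hypothetical survey records; None marks a missing entry.
records: list[dict[str, Optional[float]]] = [
    {"age": 34.0, "salary": 52000.0},
    {"age": 41.0, "salary": None},
    {"age": None, "salary": 61000.0},
    {"age": 29.0, "salary": 58000.0},
]


def complete_cases(rows: list[dict[str, Optional[float]]]) -> list[dict[str, Optional[float]]]:
    """Keep only rows in which every field is observed (listwise deletion)."""
    return [row for row in rows if all(v is not None for v in row.values())]


kept = complete_cases(records)
print(f"{len(kept)} of {len(records)} records survive listwise deletion")
# prints "2 of 4 records survive listwise deletion"
```

Half of the records are lost here even though each dropped record is missing only one field, which is the situation imputation is meant to address.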

Two common methods are:

Mean substitution: replace missing values with the mean of the non-missing values for the same variable.
Multiple imputation: create several imputed datasets with different estimated values for the missing data and combine results across those datasets to produce a single estimate.

Examples

Mean substitution

Replace missing values with the mean of the non-missing values in the same variable. For instance, if a researcher is studying the salaries of employees in a company and some salaries are missing, they can fill in each missing salary with the average salary of the employees whose salaries are known. This allows the researcher to include all employees in the analysis and reduces the bias that excluding the incomplete records would cause.
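
The salary example can be sketched in a few lines. This is a minimal illustration using only the Python standard library. The salary figures and the function name `mean_substitute` are hypothetical, and `None` is assumed to mark a missing entry.

```python
from statistics import mean
from typing import Optional


def mean_substitute(values: list[Optional[float]]) -> list[float]:
    """Replace each None with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical salaries; None marks a missing entry.
salaries = [52000.0, None, 61000.0, 58000.0, None]
imputed = mean_substitute(salaries)
print(imputed)
# → [52000.0, 57000.0, 61000.0, 58000.0, 57000.0]
```

Every missing entry receives the same value, so the imputed variable shows less spread than the real salaries would. That is one reason mean substitution is usually treated as a simple baseline.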

Multiple imputation

Use multiple sets of estimated values to fill in the missing data. This involves creating several imputed datasets, each with different plausible values for the missing entries, running the same analysis on each dataset, and then combining the results into a single estimate that reflects the uncertainty about the missing values. For instance, a researcher studying the relationship between education level and income may have some missing data on education level. Using multiple imputation, the researcher can create multiple imputed datasets where the missing education levels are replaced with different estimated values. These values may be based on the individual’s income, occupation, or other relevant factors. The results from each dataset are then combined to produce a single estimate of the relationship between education and income.
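
The create, analyze, and combine steps can be sketched as follows. This is a deliberately simplified illustration, not a full multiple-imputation procedure. It assumes missing values can be drawn from a normal distribution fitted to the observed values, whereas a realistic model would condition on other variables such as income or occupation. It also pools only the point estimates, by averaging them. The data and the names `impute_once` and `multiple_imputation_mean` are hypothetical.

```python
import random
from statistics import mean, stdev
from typing import Optional


def impute_once(values: list[Optional[float]], rng: random.Random) -> list[float]:
    """Fill each None with a random draw from a normal distribution
    fitted to the observed values (a simplifying assumption)."""
    observed = [v for v in values if v is not None]
    mu, sigma = mean(observed), stdev(observed)
    return [rng.gauss(mu, sigma) if v is None else v for v in values]


def multiple_imputation_mean(
    values: list[Optional[float]], m: int = 5, seed: int = 0
) -> tuple[float, list[float]]:
    """Create m imputed datasets, estimate the mean in each one,
    and pool the m estimates by averaging them."""
    rng = random.Random(seed)
    estimates = [mean(impute_once(values, rng)) for _ in range(m)]
    return mean(estimates), estimates


# Hypothetical years of education; None marks a missing entry.
education_years = [12.0, 16.0, None, 14.0, None, 18.0]
pooled, estimates = multiple_imputation_mean(education_years, m=5)
print("per-dataset estimates:", estimates)
print("pooled estimate:", pooled)
```

The per-dataset estimates differ because each dataset receives different random draws. The variation between them indicates how much the missing values contribute to the uncertainty of the pooled result.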

Use cases

Handling missing data in statistical analyses so that all available observations can be included and the bias from excluding incomplete cases is reduced.

Notes or pitfalls

Missing data can lead to biased and unreliable results.
Excluding observations with missing values can introduce bias; imputation seeks to avoid that by filling in estimated values.
