Multicollinearity
- Highly correlated predictor variables make it hard to separate each variable’s unique effect in a regression.
- Produces unstable coefficient estimates, inflated standard errors and p-values, and ambiguous interpretation.
- Common in regression models with multiple similar indicators; can be addressed with principal component analysis or by selecting a subset of predictors.
Definition
Multicollinearity refers to a situation in which two or more predictor variables in a regression model are highly correlated with each other. This can lead to unstable, imprecise coefficient estimates, as well as problems with model interpretation.
Explanation
When predictor variables are highly correlated, the regression model struggles to estimate the unique contribution of each predictor. As a result, coefficient estimates become unstable and sensitive to the exact model specification: adding or dropping a predictor, or refitting on a slightly different sample, can change them substantially. Multicollinearity also inflates standard errors, which in turn raises p-values and makes it harder to identify statistically significant predictors. Because each predictor's unique contribution is unclear, it becomes difficult to tell which factors actually drive the response, increasing the risk of incorrect conclusions and poor decision-making.
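A small simulation makes this concrete. The sketch below (a minimal illustration in Python, assuming NumPy and statsmodels are installed; all variable names are hypothetical) fits the same kind of model twice, once with two nearly collinear predictors and once with independent ones, and compares the estimated standard errors.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

# Two nearly collinear predictors: x2 is x1 plus a little noise.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

X_collinear = sm.add_constant(np.column_stack([x1, x2]))
fit_collinear = sm.OLS(y, X_collinear).fit()

# Same setup, but with an independent second predictor.
x2_indep = rng.normal(size=n)
y_indep = 1.0 + 2.0 * x1 + 0.5 * x2_indep + rng.normal(size=n)
X_indep = sm.add_constant(np.column_stack([x1, x2_indep]))
fit_indep = sm.OLS(y_indep, X_indep).fit()

print("SEs with collinear predictors:  ", fit_collinear.bse[1:])
print("SEs with independent predictors:", fit_indep.bse[1:])
```

On a typical run the standard errors for the collinear pair come out far larger than for the independent pair, even though the data-generating coefficients are the same, which is exactly the inflation described above.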
Examples
Income indicators
Using multiple indicators of income in a regression model, such as annual salary, hourly wage, and overtime pay, can produce multicollinearity. These variables are likely to be highly correlated (individuals with higher hourly wages tend to have higher salaries and earn more overtime pay), so the model may have difficulty estimating each predictor's unique contribution, leading to unstable coefficient estimates and interpretation difficulties.
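A standard way to diagnose this is the variance inflation factor (VIF). The sketch below is illustrative only: the income data is simulated, and its correlation structure is assumed rather than taken from any real dataset. VIF values above roughly 5 to 10 are conventionally read as a warning sign.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 500

# Simulated, highly correlated income indicators (illustrative only).
hourly_wage = rng.normal(loc=25, scale=5, size=n)
annual_salary = hourly_wage * 2000 + rng.normal(scale=2000, size=n)
overtime_pay = hourly_wage * 100 + rng.normal(scale=300, size=n)

X = pd.DataFrame({
    "annual_salary": annual_salary,
    "hourly_wage": hourly_wage,
    "overtime_pay": overtime_pay,
})
X = sm.add_constant(X)

# VIF for each predictor (skip the constant at column 0).
for i, name in enumerate(X.columns[1:], start=1):
    print(f"{name}: VIF = {variance_inflation_factor(X.values, i):.1f}")
```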
Education indicators
Using multiple indicators of education in a regression model, such as years of education, level of degree, and type of institution attended, can produce multicollinearity. These variables are likely to be highly correlated (individuals with more years of education tend to hold more advanced degrees and to have attended more selective institutions), so the model may have difficulty estimating each predictor's unique contribution, leading to unstable coefficient estimates and interpretation difficulties.
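One remedy named in the summary is to select a subset of the correlated predictors. Below is a minimal sketch, assuming the VIF helper from statsmodels and a rule-of-thumb threshold of 10 (a conventional choice, not a fixed standard), that iteratively drops the predictor with the highest VIF. The example data is fabricated purely for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Iteratively drop the column with the highest VIF until all fall below the threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        exog = sm.add_constant(X).values
        # VIFs for the non-constant columns (the constant is column 0).
        vifs = [variance_inflation_factor(exog, i) for i in range(1, exog.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] < threshold:
            break
        X = X.drop(columns=X.columns[worst])
    return X

# Example: three nearly interchangeable education indicators (simulated).
rng = np.random.default_rng(3)
base = rng.normal(size=200)
X = pd.DataFrame({
    "years_of_education": base,
    "degree_level": base + rng.normal(scale=0.1, size=200),
    "institution_type": base + rng.normal(scale=0.1, size=200),
})
print(drop_high_vif(X).columns.tolist())
```

Which variable to keep is ultimately a modeling decision; when several indicators are nearly interchangeable, domain knowledge should guide the choice rather than the drop order alone.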
Use cases
- Regression modeling: multicollinearity is a common problem encountered when building regression models with multiple related predictors.
Notes or pitfalls
- Multicollinearity can lead to unstable coefficient estimates that vary with model specification.
- It can inflate standard errors and p-values, making it harder to detect significant predictors.
- It obscures the unique contribution of individual predictors, complicating interpretation and risking incorrect conclusions.
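The other remedy named in the summary is principal component analysis: replace the correlated predictors with a smaller number of uncorrelated components and regress on those. A minimal sketch, assuming scikit-learn is available (the data is simulated and the single-component choice is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n = 300

# Three highly correlated predictors built from one latent factor.
latent = rng.normal(size=n)
X = np.column_stack([latent + rng.normal(scale=0.1, size=n) for _ in range(3)])
y = 3.0 * latent + rng.normal(size=n)

# Standardize, keep the leading principal component, then regress on it.
model = make_pipeline(StandardScaler(), PCA(n_components=1), LinearRegression())
model.fit(X, y)
print("R^2 on the single component:", model.score(X, y))
```

The components are orthogonal by construction, so the collinearity disappears, at the cost of coefficients that no longer map directly onto the original variables.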
Related terms
- Regression model
- Predictor variables
- Coefficient estimates
- Standard errors
- P-values
- Principal component analysis
- Variable selection (choosing a subset of predictors)