Least Squares Cross Validation
- Split data into training and validation sets; fit the model on training data and evaluate on validation data.
- Use the least squares error (LSE) on the validation set to compare models or parameter choices.
- Select the model or parameters that yield the smallest LSE on the validation set.
Definition
Least squares cross-validation is a technique used in machine learning to evaluate a model’s performance by finding the model that minimizes the least squares error between predicted values and actual values on a validation set.
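In symbols, for a validation set of n points with actual values y_i and predicted values ŷ_i (notation introduced here for illustration), the least squares error is the sum of squared differences:

```latex
\mathrm{LSE} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```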
Explanation
Least squares cross-validation proceeds by partitioning the dataset into at least two parts: a training set and a validation set. The training set is used to fit the model (for example, to estimate parameters such as the slope and intercept in a linear regression). The validation set is used to evaluate the fitted model by comparing predicted values to actual values and computing the least squares error (LSE), which sums the squared differences between predicted and actual values to quantify error. Different candidate models or parameter values are compared by their LSE on the validation set, and the candidate with the smallest LSE is selected as the best model. In practice, this approach can be extended with a separate test set (training / validation / test), where the test set is used to assess the performance of the final selected model on unseen data.
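A minimal sketch of this procedure in Python; the synthetic data, the 24/6 split, and the polynomial-degree candidates are illustrative assumptions rather than part of the definition:

```python
import numpy as np

# Synthetic data (an assumption for illustration): y is roughly linear in x with noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=30)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=30)

# Partition into a training set (24 points) and a validation set (6 points).
idx = rng.permutation(len(x))
train_idx, val_idx = idx[:24], idx[24:]

def lse(y_true, y_pred):
    """Least squares error: the sum of squared differences."""
    return float(np.sum((y_true - y_pred) ** 2))

# Candidate models: polynomials of degree 1 and 2, each fit on the training set only.
val_errors = {}
for degree in (1, 2):
    coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
    val_errors[degree] = lse(y[val_idx], np.polyval(coeffs, x[val_idx]))

# Select the candidate with the smallest LSE on the validation set.
best_degree = min(val_errors, key=val_errors.get)
print("validation LSE per degree:", val_errors)
print("selected degree:", best_degree)
```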
Examples
Linear regression example
Consider a simple linear regression where the line is defined by slope and intercept. Split the dataset into a training set and a validation set. Fit the line to the training set to estimate slope and intercept. Compute the least squares error on the validation set; try different slope and intercept values and select the pair that yields the smallest LSE.
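One way this might look in code, assuming a small grid of candidate slopes and intercepts around the training-set fit (the data values and grid below are made up for illustration):

```python
import numpy as np

# Illustrative training and validation data (values are assumptions).
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y_train = np.array([2.9, 5.1, 7.2, 8.8, 11.1, 13.0, 15.2, 16.9])
x_val = np.array([9.0, 10.0])
y_val = np.array([19.1, 21.0])

# Fit the line on the training set to get a starting slope and intercept.
slope_fit, intercept_fit = np.polyfit(x_train, y_train, 1)

# Try candidate (slope, intercept) pairs near the fit and score each by validation LSE.
best = None
for slope in np.linspace(slope_fit - 0.5, slope_fit + 0.5, 11):
    for intercept in np.linspace(intercept_fit - 0.5, intercept_fit + 0.5, 11):
        y_pred = slope * x_val + intercept
        error = float(np.sum((y_val - y_pred) ** 2))
        if best is None or error < best[0]:
            best = (error, slope, intercept)

print("smallest validation LSE:", best[0])
print("selected slope and intercept:", best[1], best[2])
```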
Small dataset split example
Given a dataset with 10 data points, split it into a training set with 8 data points and a validation set with 2 data points. Fit the model on the training set, calculate the LSE for the validation set, try different parameter values, and select the model with the smallest LSE.
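A compact sketch of that 8/2 split, with made-up values for the 10 points:

```python
import numpy as np

# Ten illustrative data points (values are made up).
x = np.arange(10, dtype=float)
y = 3.0 * x + 2.0 + np.array([0.2, -0.1, 0.3, -0.4, 0.1, 0.0, -0.2, 0.3, -0.1, 0.2])

# Shuffle the indices, then split: 8 points for training, 2 for validation.
rng = np.random.default_rng(42)
idx = rng.permutation(10)
train_idx, val_idx = idx[:8], idx[8:]

# Fit a line on the 8 training points and score it on the 2 validation points.
slope, intercept = np.polyfit(x[train_idx], y[train_idx], 1)
y_pred = slope * x[val_idx] + intercept
print("validation LSE:", float(np.sum((y[val_idx] - y_pred) ** 2)))
```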
Training / validation / test example
When evaluating a machine learning algorithm on a dataset, split the data into three parts: the training set to fit the model, the validation set to evaluate performance and select the best model, and the test set to evaluate the performance of the final model on unseen data.
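A sketch of that three-way workflow, assuming a 60/20/20 split and polynomial-degree candidates (both assumptions made for illustration):

```python
import numpy as np

# Illustrative dataset; the 60/20/20 split proportions are an assumption.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 1.5 * x - 0.5 + rng.normal(0.0, 1.0, size=50)

idx = rng.permutation(len(x))
train, val, test = idx[:30], idx[30:40], idx[40:]

def lse(y_true, y_pred):
    return float(np.sum((y_true - y_pred) ** 2))

# Fit each candidate on the training set and score it on the validation set.
val_errors = {
    degree: lse(y[val], np.polyval(np.polyfit(x[train], y[train], degree), x[val]))
    for degree in (1, 2, 3)
}
best_degree = min(val_errors, key=val_errors.get)

# Evaluate the selected model once on the held-out test set.
coeffs = np.polyfit(x[train], y[train], best_degree)
print("selected degree:", best_degree)
print("test LSE:", lse(y[test], np.polyval(coeffs, x[test])))
```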
Use cases
- Selecting optimal values for a model’s parameters.
- Choosing the best model among candidates based on validation LSE.
- Evaluating model performance and improving the accuracy of predictions.
Related terms
- least squares error (LSE)
- training set
- validation set
- test set
- linear regression