Least Squares Estimation
- Finds the best-fit linear model (line or plane) for data by adjusting parameters to reduce prediction errors.
- Measures error as the sum of the squares of vertical distances between points and the fit.
- Parameters (e.g., slope and intercept) can be found in closed form via the normal equations, or by iterative optimization such as gradient descent.
Definition
Least squares estimation is a statistical technique used to find the line or plane of best fit for a given set of data by minimizing the sum of the squares of the vertical distances between the data points and the line or plane.
Explanation
The method defines a parametric linear function (a line for one predictor, a plane for multiple predictors) and chooses the parameter values that minimize the sum of squared vertical distances from the observed data points to the fitted function. For a single predictor, the fitted line is written as y = mx + b,
where y is the response, x is the predictor, m is the slope, and b is the y-intercept. For two predictors (a plane), the fitted surface is written as y = ax + bz + c,
where y is the response, x and z are predictors, a and b are directional slopes, and c is the intercept. Optimization of the parameters may be performed by iterative methods such as gradient descent, which updates parameter values to progressively reduce the sum of squared vertical distances.
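The gradient-descent update described above can be sketched in plain Python for the single-predictor case. The function name, data, learning rate, and iteration count below are illustrative choices, not part of the original text:

```python
def fit_line(xs, ys, lr=0.01, steps=5000):
    """Fit y = m*x + b by minimizing the sum of squared vertical
    distances with gradient descent."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the sum of squared errors w.r.t. m and b,
        # averaged over the data points.
        grad_m = sum(-2 * x * (y - (m * x + b)) for x, y in zip(xs, ys)) / n
        grad_b = sum(-2 * (y - (m * x + b)) for x, y in zip(xs, ys)) / n
        m -= lr * grad_m
        b -= lr * grad_b
    return m, b

# Points that lie exactly on y = 2x + 1, so the fit should recover
# m close to 2 and b close to 1.
m, b = fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
```

Because the error surface is a convex paraboloid in (m, b), gradient descent with a small enough learning rate converges to the unique minimum.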
Examples
Example 1: Heights and weights
A dataset contains heights (in inches) and weights (in pounds) for a group of individuals. Plotting these points on a scatter plot, the line of best fit is defined as y = mx + b,
where y is weight, x is height, m is slope, and b is the y-intercept. The values of m and b are chosen to minimize the sum of the squares of the vertical distances between the data points and the line; gradient descent, for example, iteratively updates m and b to reduce this sum. The resulting line represents the relationship between height and weight that best fits the data.
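For this one-predictor case the minimizing m and b can also be computed directly from the normal equations, without iteration. The heights and weights below are hypothetical data invented for illustration:

```python
# Hypothetical height (inches) / weight (pounds) data.
heights = [60, 62, 65, 68, 70, 72]
weights = [115, 125, 140, 155, 165, 180]

n = len(heights)
mean_x = sum(heights) / n
mean_y = sum(weights) / n

# Closed-form least squares: slope = Sxy / Sxx, intercept from the means.
m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(heights, weights))
     / sum((x - mean_x) ** 2 for x in heights))
b = mean_y - m * mean_x

# Residuals of a least-squares line always sum to (numerically) zero.
residual_sum = sum(y - (m * x + b) for x, y in zip(heights, weights))
```

This closed form and gradient descent minimize the same sum of squared vertical distances, so they agree on m and b for the same data.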
Example 2: Scores on two exams
A dataset contains students’ scores on two different exams, together with a response to be predicted from them. Plotting these points, the plane of best fit is defined as y = ax + bz + c, where x and z are the scores on the two exams, y is the response being predicted, a and b are the slopes in the x- and z-directions, and c is the intercept. The values of a, b, and c are chosen to minimize the sum of the squares of the vertical distances between the data points and the plane. An optimization technique similar to that used for the one-dimensional case finds the optimal parameters. The resulting plane represents the relationship between the exam scores and the response that best fits the data.
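The two-predictor fit can be sketched with the same gradient-descent idea, treating the two exam scores as predictors x and z of a response y. Centering the data first (a standard trick, assumed here for faster convergence) lets the intercept be recovered from the means; the function name and exam data are hypothetical:

```python
def fit_plane(xs, zs, ys, lr=0.003, steps=5000):
    """Fit y = a*x + b*z + c by gradient descent on centered data."""
    n = len(xs)
    mx, mz, my = sum(xs) / n, sum(zs) / n, sum(ys) / n
    dx = [x - mx for x in xs]
    dz = [z - mz for z in zs]
    dy = [y - my for y in ys]
    a, b = 0.0, 0.0
    for _ in range(steps):
        # Residuals on the centered data; slopes updated as in the 1-D case.
        res = [y - (a * x + b * z) for x, z, y in zip(dx, dz, dy)]
        a -= lr * sum(-2 * x * r for x, r in zip(dx, res)) / n
        b -= lr * sum(-2 * z * r for z, r in zip(dz, res)) / n
    c = my - a * mx - b * mz  # intercept recovered from the means
    return a, b, c

# Hypothetical scores constructed to lie exactly on y = 0.4x + 0.5z + 10,
# so the fit should recover a ~ 0.4, b ~ 0.5, c ~ 10.
a, b, c = fit_plane([70, 80, 90, 85, 75],
                    [65, 85, 95, 80, 70],
                    [70.5, 84.5, 93.5, 84.0, 75.0])
```

As in the one-predictor case, the sum of squared vertical distances is convex in (a, b, c), so gradient descent with a suitable learning rate reaches the unique best-fit plane.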
Related terms
- Line of best fit
- Plane of best fit
- Gradient descent