Gradient Descent
- Iterative method that adjusts parameters to reduce a cost function using partial derivatives (gradients).
- Starts from an initial guess and repeatedly moves the parameters opposite the gradient, scaled by a small step size called the learning rate, until a minimum is reached.
- Commonly used in machine learning to find optimal model weights (for example, in linear regression).
Definition
Gradient descent is an optimization algorithm used to find the values of parameters that minimize a given cost function. It is commonly used in machine learning to find the optimal weights for a model.
Explanation
Begin with an initial guess for the parameters. Compute the partial derivatives of the cost function with respect to each parameter; these derivatives indicate the directions in which the cost increases most. Move the parameters in the opposite direction of those partial derivatives by a small step (the learning rate) to reduce the cost. Repeat this process iteratively until a minimum value of the cost function is reached.
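The loop above can be written as a minimal Python sketch. The one-parameter cost function f(x) = (x − 3)², its derivative, the learning rate, and the step count are illustrative assumptions chosen only to show the update rule, not part of the original text.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly move x opposite the derivative of the cost function."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)  # step opposite the gradient
    return x

# Illustrative cost: f(x) = (x - 3)^2, whose derivative is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # approaches 3, the minimizer of the cost
```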
The method is illustrated with a simple two-variable example: a cost function depending on x and y (for instance, the sum of the squares of the differences between predicted and actual values). The goal is to find the values of x and y that minimize that cost function by repeatedly computing partial derivatives and updating x and y in the opposite direction of the derivatives.
Examples
Two-variable example
A cost function depending on two variables, x and y, can be the sum of the squares of the differences between predicted values and actual values. The goal is to find x and y that minimize this cost function by computing partial derivatives with respect to x and y and moving opposite those derivatives using a learning rate.
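A minimal sketch of this example in Python follows. The specific cost function (squared differences between the predicted values x, y and illustrative actual values 3, 5), the learning rate, and the iteration count are assumptions made for demonstration.

```python
def cost(x, y):
    """Sum of squared differences between predicted values (x, y) and actual values (3, 5)."""
    return (x - 3) ** 2 + (y - 5) ** 2

def gradients(x, y):
    """Partial derivatives of the cost with respect to x and y."""
    return 2 * (x - 3), 2 * (y - 5)

x, y = 0.0, 0.0       # initial guess
learning_rate = 0.1   # small step size
for _ in range(200):
    dx, dy = gradients(x, y)
    x -= learning_rate * dx  # move opposite the partial derivative in x
    y -= learning_rate * dy  # move opposite the partial derivative in y

print(x, y, cost(x, y))  # x and y approach 3 and 5; the cost approaches 0
```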
Linear regression example
In linear regression, fit a line to data by minimizing the sum of the squares of differences between predicted values and actual values. The cost function can be written as:
J(w) = Σᵢ (ŷᵢ − yᵢ)²
Here, w is the vector of weights to find, ŷᵢ is the value predicted from w for the i-th data point, and yᵢ is the corresponding actual value. Use gradient descent by starting with an initial guess for w, computing partial derivatives of J with respect to each weight, and updating w in the opposite direction of those partial derivatives by the learning rate until a minimum is reached.
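A sketch of this procedure in Python using NumPy is shown below. The toy data set (points on the line y = 1 + 2x), the learning rate, and the iteration count are illustrative assumptions rather than values from the original.

```python
import numpy as np

# Toy data: a column of 1s for the intercept weight plus one feature column.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])  # actual values (generated from y = 1 + 2x)

def cost(w):
    """J(w): sum of squared differences between predictions X @ w and actual values y."""
    residuals = X @ w - y
    return residuals @ residuals

w = np.zeros(2)        # initial guess for the weights
learning_rate = 0.01   # small step size
for _ in range(5000):
    grad = 2 * X.T @ (X @ w - y)  # partial derivatives of J with respect to each weight
    w -= learning_rate * grad     # update w opposite the gradient

print(w, cost(w))  # w approaches [1.0, 2.0]; the cost approaches 0
```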
Use cases
- Finding optimal weights for machine learning models.
- Applying to linear regression to minimize the sum of squared errors.
Notes or pitfalls
- Requires an initial guess for the parameters.
- Updates use a learning rate, a small step size that determines how far parameters move opposite the gradient.
- The update-and-evaluate process is repeated until the cost function reaches a minimum value.
Related terms
- Cost function
- Partial derivatives
- Learning rate
- Linear regression
- Weights (model parameters)