
Cost Function

  • Quantifies the prediction error of a learning algorithm averaged over the dataset.
  • Different cost functions suit different tasks (e.g., regression vs classification).
  • The learning algorithm seeks parameters that minimize the cost, typically via gradient descent.

A cost function measures how well a learning algorithm predicts the correct output values for given inputs: it quantifies the error of the algorithm's predictions. The goal of any learning algorithm is to find the set of parameters that minimizes the cost function.

  • Many different cost functions exist and are chosen according to the specific problem.
  • In practice the cost function is typically computed as the average of the errors for each example in the dataset, because the objective is to minimize overall error, not only the error for a single example.
  • Minimizing the cost function is usually performed with an optimization algorithm such as gradient descent, which iteratively updates parameters to reduce the cost (a minimal sketch follows this list).
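As a concrete illustration, here is a minimal sketch of gradient descent minimizing an averaged squared-error cost for a simple linear model y ≈ w·x + b. The model, the names (w, b, learning_rate, steps), and the hyperparameter values are all illustrative assumptions, not taken from the text above.

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.01, steps=1000):
    """Fit y ≈ w * x + b by repeatedly stepping against the MSE gradient."""
    w, b = 0.0, 0.0                 # illustrative starting parameters
    n = len(x)
    for _ in range(steps):
        y_pred = w * x + b          # current predictions
        error = y_pred - y          # per-example prediction error
        # Gradients of the cost (average of squared errors) w.r.t. w and b.
        grad_w = (2.0 / n) * np.dot(error, x)
        grad_b = (2.0 / n) * error.sum()
        # Move each parameter a small step in the direction that lowers the cost.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# On data generated by y = 2x, this should recover w ≈ 2 and b ≈ 0.
w, b = gradient_descent(np.array([1.0, 2.0, 3.0, 4.0]),
                        np.array([2.0, 4.0, 6.0, 8.0]))
```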

Mean squared error (MSE) is commonly used in regression problems, where the goal is to predict a continuous value. It is calculated as the average of the squared differences between the predicted values and the true values. Mathematically, it is represented as:

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

where y_i is the true value, ŷ_i is the predicted value for the i-th example, and n is the total number of examples.
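As a quick check of the formula, here is a small NumPy sketch; the function name mse and the sample values are illustrative.

```python
import numpy as np

def mse(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# (0.25 + 0.25 + 0.0) / 3 ≈ 0.1667
print(mse([3.0, -0.5, 2.0], [2.5, 0.0, 2.0]))
```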

Cross-entropy is commonly used in classification problems, where the goal is to predict a class label. It is calculated as the average negative log-likelihood of the true labels under the predicted class probabilities. For binary classification, it is represented as:

\text{CE} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]

where y_i is the true class label (0 or 1), ŷ_i is the predicted probability that the i-th example belongs to the positive class, and n is the total number of examples.
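A corresponding sketch for the binary cross-entropy above; the clipping constant eps is an illustrative safeguard so log(0) is never evaluated.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average negative log-likelihood of true labels under predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predicted probabilities away from exactly 0 and 1 to keep log() finite.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))

# -(log 0.9 + log 0.9 + log 0.8) / 3 ≈ 0.1446
print(binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8]))
```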

  • Mean squared error: regression problems (predicting continuous values).
  • Cross-entropy: classification problems (predicting class labels or class probabilities).
  • MSE penalizes large errors more than small errors, so optimization will prioritize reducing large deviations (see the numeric example after this list).
  • Cross-entropy measures the distance between true class probabilities and predicted class probabilities, encouraging predicted probabilities to match true probabilities.
  • Cost functions are typically computed as averages over the dataset to reflect overall performance.
  • Gradient descent is a common method used to find parameter values that minimize the cost function.
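To make the quadratic penalty concrete, here is an illustrative comparison: two error patterns with the same total absolute error, where MSE penalizes the pattern containing one large error twice as heavily.

```python
import numpy as np

errors_even  = np.array([1.0, 1.0])   # two moderate misses, total |error| = 2
errors_spiky = np.array([0.0, 2.0])   # one large miss, same total |error| = 2

print(np.mean(errors_even ** 2))    # 1.0
print(np.mean(errors_spiky ** 2))   # 2.0 -- the single large error dominates
```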