Underfitting
- A model that is too simple or trained on too little data can fail to learn the true patterns in the dataset.
- Underfitting causes poor predictive performance, especially on unseen data.
- Typical remedies are using a more complex model or increasing the amount of training data.
Definition
Underfitting occurs when a machine learning model is not able to capture the underlying trend in the data. It can happen for a variety of reasons, including having a model that is too simple for the data or not having enough data to train the model. Underfitting leads to poor performance, as the model cannot make accurate predictions even on the data it was trained on, let alone on unseen data.
Explanation
Underfitting means the model’s hypothesis is too limited to represent the relationships present in the data. When a model underfits, it fails to capture important factors or patterns that determine the target variable, so its predictions are inaccurate on both the training data and unseen data. Common causes are choosing a model that is too simple for the complexity of the data or providing insufficient training data; the corresponding remedies are using a more complex model or increasing the amount of training data.
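The effect of a too-limited hypothesis can be demonstrated numerically. The sketch below (synthetic toy data and hand-rolled least-squares fits, no ML libraries assumed) fits three model families of increasing capacity to data with a quadratic trend; the two simplest families underfit and keep a large error even on the data they were fitted to.

```python
# Synthetic data: the true relationship is y = x**2, so any model
# limited to a constant or a straight line underfits it.
xs = [i / 10 for i in range(-20, 21)]
ys = [x * x for x in xs]
n = len(xs)

# Model 1: predict the mean of y regardless of x (maximally simple).
mean_y = sum(ys) / n
mse_constant = sum((y - mean_y) ** 2 for y in ys) / n

# Model 2: least-squares line y = a*x + b. Still too simple: the data
# is symmetric, so the best-fit slope is ~0 and the line barely helps.
mean_x = sum(xs) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x
mse_line = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / n

# Model 3: least-squares fit of y = c*x**2. The hypothesis now matches
# the data-generating process, so the error drops to essentially zero.
c = sum((x * x) * y for x, y in zip(xs, ys)) / sum(x ** 4 for x in xs)
mse_quad = sum((y - c * x * x) ** 2 for x, y in zip(xs, ys)) / n

print(f"constant: {mse_constant:.3f}  line: {mse_line:.3f}  quadratic: {mse_quad:.6f}")
```

Note that the simple models fail on their own training data, which distinguishes underfitting from overfitting, where training error is low but test error is high.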
Examples
Example 1: Predicting Housing Prices
You build a model to predict housing prices using features such as size, number of bedrooms, and location, and choose a linear regression model because it is simple and well understood. After training and testing on unseen data, the model is not very accurate: it consistently underpredicts prices in some neighborhoods and overpredicts in others.
Closer examination reveals that housing prices are influenced by many factors, such as school quality, proximity to amenities, and neighborhood desirability, that are not well represented by a straight-line relationship. A plain linear regression model can only express a linear relationship between the features and the price, so it cannot capture these nonlinear effects and therefore underfits the data.
Example 2: Classifying Email Spam
You train a decision tree classifier to label emails as spam or not spam. After testing on unseen emails, the model is not very accurate and frequently misclassifies spam as not spam, and vice versa.
On inspection, the model cannot capture the subtle distinctions that separate spam from non-spam; for example, spam often contains combinations of words or phrases such as “earn money fast” or “double your income.” A decision tree that is kept too shallow, testing only one feature before it stops splitting, cannot capture these combinations and therefore underfits the data.
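A tiny hand-built illustration of the depth problem (the data and word choices are hypothetical): if spam is defined by a conjunction of two keywords, a depth-1 "stump" that tests a single word misclassifies some emails no matter which word it picks, while one more level of splitting classifies every example correctly.

```python
# Toy spam data: an email (a set of words) is spam (label 1) only if
# it contains BOTH "free" and "click" (a hypothetical rule).
emails = [
    ({"free", "click", "offer"}, 1),
    ({"free", "click", "now"}, 1),
    ({"free", "report"}, 0),
    ({"click", "agenda"}, 0),
    ({"meeting", "notes"}, 0),
    ({"free", "click"}, 1),
    ({"click", "here", "later"}, 0),
    ({"free", "lunch"}, 0),
]

def accuracy(predict):
    """Fraction of emails the rule predict(words) labels correctly."""
    return sum(predict(ws) == label for ws, label in emails) / len(emails)

# Depth-1 "stump": classify on a single word. Either choice gets some
# emails wrong, because one word alone cannot express the conjunction.
acc_free = accuracy(lambda ws: int("free" in ws))    # 0.75
acc_click = accuracy(lambda ws: int("click" in ws))  # 0.75

# Depth-2 rule: split on "free", then on "click". The second level
# captures the interaction and classifies every example correctly.
acc_deeper = accuracy(lambda ws: int("free" in ws and "click" in ws))  # 1.0
```

In libraries such as scikit-learn the same idea is controlled with a maximum-depth setting: too small a depth underfits, while an unlimited depth can swing to the opposite problem of overfitting.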
In both examples, the models were too simple to capture the underlying trends in the data and therefore underfit, leading to poor performance and inaccurate predictions.
Notes or pitfalls
- Common causes: a model too simple for the data, or insufficient training data.
- Consequence: poor model performance and inability to accurately predict unseen data.
- Suggested remedies: use a more complex model or increase the amount of training data.
Related terms
- Linear regression (mentioned as an example of a simple model)
- Decision tree classifier (can underfit when kept too shallow)
- Training
- Unseen data
- Predictor variable