One Hot Encoding
- Converts a categorical variable into multiple binary dummy variables, one per unique category.
- Lets machine learning algorithms that cannot handle categorical inputs directly use categorical data, and avoids implying ordinal relationships between categories.
- Can increase dimensionality and produce sparse matrices, which may affect model complexity and efficiency.
Definition
One hot encoding is a technique used to represent categorical variables in a machine learning model. It creates a new dummy variable for each unique category in the categorical variable, and assigns a “1” to the dummy variable corresponding to the category the observation belongs to, and “0” to all other dummy variables.
Explanation
One hot encoding transforms a single categorical column into multiple binary columns (dummy variables), each indicating membership in one specific category. This encoding allows algorithms that require numeric input to use categorical information without introducing an artificial numeric ordering among categories. Each observation has a “1” in exactly one dummy variable, the one corresponding to its category, and “0” in the rest.
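The transformation described above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function name `one_hot_encode` and the color values are hypothetical, and sorting the categories is only an assumption made here to keep the column order deterministic.

```python
def one_hot_encode(values):
    """Return (categories, rows), where each row is a list of 0/1 flags."""
    # Sort the unique categories so the column order is deterministic
    # (an illustrative choice, not a requirement of the technique).
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}

    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the single matching category
        rows.append(row)
    return categories, rows


# Hypothetical input column with three unique categories.
categories, rows = one_hot_encode(["red", "green", "blue", "green"])
print(categories)  # ['blue', 'green', 'red']
for row in rows:
    print(row)
```

Each printed row contains exactly one 1, in the column that matches that observation's category.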
Examples
Example 1
Suppose we have a dataset of animals that records the species of each one. The species variable is categorical with three categories: “Dog”, “Cat”, and “Bird”. One hot encoding creates three dummy variables: “species_Dog”, “species_Cat”, and “species_Bird”. For a dog, “species_Dog” is “1” and the other two are “0”. For a cat, “species_Cat” is “1” and the other two are “0”. For a bird, “species_Bird” is “1” and the other two are “0”.
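The species example can be written out as a short sketch that builds the three named dummy variables. The sample rows in `animals` are made up for illustration. Only the category names and the `species_` column names come from the example above.

```python
animals = ["Dog", "Cat", "Bird", "Dog"]  # hypothetical sample rows
categories = ["Dog", "Cat", "Bird"]      # the three species from the example

# One dict per animal, with one 0/1 entry per dummy variable.
encoded = [
    {f"species_{category}": int(animal == category) for category in categories}
    for animal in animals
]

for animal, row in zip(animals, encoded):
    print(animal, row)
```

The first row, for a dog, comes out as `{'species_Dog': 1, 'species_Cat': 0, 'species_Bird': 0}`.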
Example 2
Suppose we have a dataset of customers at a store that records each customer's gender. The gender variable is categorical with two categories: “Male” and “Female”. One hot encoding creates two dummy variables: “gender_Male” and “gender_Female”. For a male customer, “gender_Male” is “1” and “gender_Female” is “0”. For a female customer, “gender_Female” is “1” and “gender_Male” is “0”.
Use cases
- Preparing categorical variables for machine learning models that cannot handle non-numeric inputs.
- Preventing models from treating category labels as ordered or as differing in importance (for example, by avoiding an encoding such as “Male”=1 and “Female”=2).
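To make the ordering concern concrete, the sketch below compares distances between categories under an integer labeling and under one hot encoding. The integer assignment is arbitrary and hypothetical, and the species names are reused from Example 1 because three categories show the effect more clearly than two.

```python
import math

# Integer labels: an arbitrary assignment that a model may read as an ordering.
labels = {"Dog": 1, "Cat": 2, "Bird": 3}

# One hot vectors for the same three categories.
one_hot = {"Dog": (1, 0, 0), "Cat": (0, 1, 0), "Bird": (0, 0, 1)}


def label_distance(a, b):
    """Absolute difference between two integer labels."""
    return abs(labels[a] - labels[b])


def one_hot_distance(a, b):
    """Euclidean distance between two one hot vectors."""
    return math.dist(one_hot[a], one_hot[b])


pairs = [("Dog", "Cat"), ("Cat", "Bird"), ("Dog", "Bird")]
for a, b in pairs:
    print(a, b, label_distance(a, b), round(one_hot_distance(a, b), 3))
```

With integer labels, "Dog" and "Bird" appear twice as far apart as "Dog" and "Cat", which is an artifact of the numbering. With one hot vectors every pair is the same distance apart (the square root of 2), so no ordering is implied.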
Notes or pitfalls
- One hot encoding can produce a large number of dummy variables when a categorical variable has many unique categories, increasing model complexity and reducing interpretability.
- It can create a sparse matrix where most entries are “0”, which may be inefficient for some algorithms.
- Where these issues are problematic, alternative encoding techniques may be more appropriate, such as ordinal encoding (when the categories have a natural order) or binary encoding.
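The growth in columns and zeros noted above can be estimated with simple arithmetic. The sketch below is a rough comparison under one assumption: binary encoding is taken to mean writing each category's zero-based index in base 2. Specific libraries may number categories differently and produce a slightly different column count. All function names are hypothetical.

```python
def one_hot_width(n_categories):
    """One hot encoding uses one column per unique category."""
    return n_categories


def binary_width(n_categories):
    """Bits needed to write indices 0..n_categories-1 in base 2 (assumed scheme)."""
    return max(1, (n_categories - 1).bit_length())


def one_hot_zero_fraction(n_categories):
    """Each one hot row holds a single 1, so the rest of the row is zeros."""
    return (n_categories - 1) / n_categories


for k in (3, 50, 1000):
    print(k, one_hot_width(k), binary_width(k), round(one_hot_zero_fraction(k), 3))
```

For 1000 categories this gives 1000 one hot columns, of which 99.9% of the entries in each row are zero, against 10 columns under the assumed binary scheme.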
Related terms
- Dummy variable
- Ordinal encoding
- Binary encoding