Exploratory Data Analysis

  • Early-stage, iterative analysis to understand and summarize a dataset.
  • Uses visualizations and statistical tests to reveal patterns, relationships, anomalies, and outliers.
  • Guides further analysis and can inform the development of predictive models.

Exploratory data analysis (EDA) is a step in the data science process that uses a range of techniques and tools to understand and summarize the characteristics of a dataset. Its goals are to identify patterns, trends, and relationships, and to detect anomalies and outliers.

EDA applies visualization methods and statistical procedures to reveal the distributional properties of variables, relationships between variables, and unexpected data issues. It is a flexible, iterative process: as analysts explore the data, new insights and questions often arise. EDA also highlights potential issues or biases and helps determine directions for subsequent analysis or model development.
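As a rough illustration of the three things this paragraph names (distributional properties, relationships between variables, and unexpected data issues), the following is a minimal sketch using only the Python standard library. The `records` data and the helper names `missing_count`, `summarize`, `iqr_outliers`, and `pearson_r` are hypothetical and invented for this example. The 1.5 × IQR outlier rule is one common convention, assumed here rather than taken from the text.

```python
import math
import statistics

# Hypothetical records: (age, income in thousands). None marks a missing value.
records = [
    (23, 32), (27, 35), (31, 38), (35, 41), (38, 44),
    (42, 47), (46, 52), (51, 58), (55, 61), (60, 250), (29, None),
]


def missing_count(rows):
    """Count rows that have at least one missing field."""
    return sum(1 for row in rows if any(v is None for v in row))


def summarize(values):
    """Basic distributional summary of a numeric variable."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": median,
        "q1": q1,
        "q3": q3,
    }


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed outlier rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two numeric variables."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Drop incomplete rows before computing statistics.
complete = [row for row in records if None not in row]
ages = [age for age, _ in complete]
incomes = [income for _, income in complete]

print("rows with missing values:", missing_count(records))
print("income summary:", summarize(incomes))
print("income outliers:", iqr_outliers(incomes))
print("age/income correlation:", round(pearson_r(ages, incomes), 3))
```

In this made-up data the single very large income pulls the mean well above the median, which is the kind of discrepancy that usually prompts a closer look during exploration.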

Common techniques, each with an example use:

  • Histogram: Quickly visualize the distribution of a numerical variable (for example, income).
  • Scatter plot: Inspect the relationship between two numerical variables (for example, age and income).
  • Box plot: Compare distributions across multiple groups (for example, different income brackets).
  • t-test: Compare the means of two groups (for example, the incomes of men and women).
  • Chi-square test: Test for association between two categorical variables (for example, education level and income bracket).
  • ANOVA: Compare means across three or more groups (for example, the incomes of different age groups).

EDA helps to:

  • Provide a better understanding of the data.
  • Identify potential issues or biases in the dataset.
  • Guide the direction of further analysis.
  • Inform the development of predictive models.

Points to keep in mind:

  • EDA should be treated as flexible and iterative; new insights and questions commonly emerge during exploration.
  • A central aim of EDA is to detect anomalies and outliers that may affect later analysis.

Related terms:

  • Histogram
  • Scatter plot
  • Box plot
  • t-test
  • Chi-square test
  • ANOVA
  • Predictive models
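
The t-test, chi-square test, and ANOVA named above each reduce to a short test statistic. The sketch below computes those statistics with the Python standard library only. The function names `welch_t`, `chi_square`, and `anova_f` and all sample data are hypothetical. The t-test shown is Welch's unequal-variance variant, which is an assumption, since the text does not say which variant is meant. Turning a statistic into a p-value requires the corresponding reference distribution (t, chi-square, or F), which this sketch leaves out.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    std_err = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / std_err


def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def anova_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = statistics.mean([v for g in groups for v in g])
    ss_between = sum(
        len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups
    )
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical incomes (in thousands) for two groups.
group_a = [41, 45, 52, 38, 47]
group_b = [39, 50, 44, 42, 55]
print("Welch t:", welch_t(group_a, group_b))

# Hypothetical counts: rows are education levels, columns are income brackets.
counts = [[30, 15, 5], [20, 25, 15], [10, 20, 30]]
print("chi-square:", chi_square(counts))

# Hypothetical incomes for three age groups.
age_groups = [[32, 35, 38], [44, 47, 52], [58, 61, 65]]
print("ANOVA F:", anova_f(age_groups))
```

A larger statistic in absolute value means the observed data sit further from what the null hypothesis of no difference or no association would predict. In exploratory work these numbers are usually read as prompts for further investigation rather than as final conclusions.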