EDA
- Uses visualization and summary approaches to reveal trends, relationships, and outliers in datasets.
- Helps determine distributions and relationships between variables before further modeling.
- Supports better decision making and predictions by improving understanding of the data.
Definition
Exploratory data analysis (EDA) is an approach to analyzing and understanding datasets in order to gain insights and identify patterns. It involves visualizing and summarizing data to uncover trends and relationships.
Explanation
EDA is typically an early step in the data science process, carried out before formal modeling. It focuses on plotting and summarizing data to reveal the shape of distributions, detect outliers or clusters, and examine relationships between variables. By exploring data visually and through summaries, EDA helps analysts understand the dataset and surface potential insights that inform subsequent modeling or decision making.
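The "summaries" side of EDA can be sketched with a few descriptive statistics. The snippet below is a minimal illustration using only Python's standard library; the `summarize` helper and the `scores` values are hypothetical and exist only for this example.

```python
import statistics

def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }

# Hypothetical data: made-up test scores, used only for illustration.
scores = [52, 55, 61, 60, 68, 74, 73, 81, 97]

for name, value in summarize(scores).items():
    print(f"{name:>6}: {value}")
```

Comparing the mean with the median, or the quartiles with the minimum and maximum, gives a first hint of skew or extreme values before any plot is drawn.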
Examples
Histograms for continuous variables
A histogram divides the range of a variable into intervals, called bins, and displays how many observations fall into each bin. By plotting a histogram, you can see the shape of the distribution, identify any outliers, and understand the spread of the data. For example, plotting a histogram of students’ heights can show whether the data looks roughly normally distributed or skewed and highlight any students who are significantly taller or shorter than the average.
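To make the idea of bins concrete, here is a minimal sketch that counts values into equal-width bins and prints a text histogram. It uses only the standard library; the `bin_counts` helper, the simulated heights, and the choice of 10 bins are assumptions made for illustration. In practice a plotting library would usually draw the chart.

```python
import random

def bin_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        # All values are identical, so a single bin holds everything.
        return [lo, hi], [len(values)]
    counts = [0] * num_bins
    for v in values:
        # Clamp the index so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts

# Hypothetical data: 200 simulated student heights in centimetres.
rng = random.Random(0)
heights = [rng.gauss(170, 8) for _ in range(200)]

edges, counts = bin_counts(heights, num_bins=10)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f} | {'#' * count}")
```

The number of bins changes how the distribution looks: too few bins hide structure, and too many make the plot noisy, so it is worth trying several values.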
Scatter plots for relationships between two variables
A scatter plot displays the relationship between two numeric variables by plotting each observation as a point on a two-dimensional coordinate system. A scatter plot lets you see whether there is a linear or nonlinear relationship between the variables and identify any outliers or clusters. For example, plotting test scores versus study hours can indicate whether studying more is associated with higher test scores or whether other factors may be influencing outcomes.
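A scatter plot is often paired with a numeric summary of the relationship. The sketch below computes the Pearson correlation coefficient by hand with the standard library; the `pearson_r` helper and the study-hours data are hypothetical. Note that this coefficient only measures linear association, so it complements the plot rather than replacing it.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: made-up study hours and test scores for eight students.
study_hours = [1, 2, 3, 4, 5, 6, 7, 8]
test_scores = [52, 55, 61, 60, 68, 74, 73, 81]

r = pearson_r(study_hours, test_scores)
print(f"Pearson r = {r:.2f}")
```

A value near 1 or -1 suggests a strong linear association, while a value near 0 can still hide a curved relationship or clusters that only the scatter plot would reveal.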
Related terms
- Histogram
- Scatter plot
- Data visualization