Empirical Distribution Function

TL;DR

Estimates a dataset’s underlying probability distribution without assuming a specific parametric form.
Constructed by ordering data and plotting the cumulative percentage at or below each value.
Useful for large, complex, or non-normal datasets and can be updated as new data arrive.

Definition

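The empirical distribution function, also known as the empirical cumulative distribution function, is a step function that estimates the cumulative distribution function of the population from which a dataset was drawn. Its value at any point is the proportion of observations at or below that point.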
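
In symbols, the standard definition can be written as follows. The notation is introduced here for illustration and is not part of the original entry: x_1, ..., x_n are the observations and n is their count.

```latex
\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_i \le x\}
```

Here the indicator \mathbf{1}\{x_i \le x\} equals 1 when x_i is at or below x and 0 otherwise, so \hat{F}_n(x) is the fraction of observations at or below x.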

Explanation

The empirical distribution function is constructed by ordering the observed data from lowest to highest and plotting, for each value, the cumulative proportion of observations at or below it. The result is a step function that rises from 0 to 1, jumping by 1/n at each of the n observations (and by more where values tie). It estimates the distribution directly from the data, without assuming a specific parametric form. Because it depends only on the observed values, it can be updated easily as new observations become available.
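
The construction described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation; the function name `ecdf` and the sample values are made up for the example.

```python
from bisect import bisect_right


def ecdf(data):
    """Return a function F where F(x) is the share of observations <= x."""
    ordered = sorted(data)  # order the data from lowest to highest
    n = len(ordered)
    if n == 0:
        raise ValueError("ecdf needs at least one observation")

    def F(x):
        # bisect_right gives the number of ordered values that are <= x
        return bisect_right(ordered, x) / n

    return F


sample = [3.1, 1.2, 2.5, 2.5, 4.0]  # made-up values
F = ecdf(sample)
print(F(2.5))  # 0.6, since three of the five values are at or below 2.5
```

Sorting once and using binary search keeps each evaluation fast even when the dataset is large.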

Examples

Stock prices

Suppose we have a dataset of daily closing prices for a particular stock over the past year. Ordering the prices from lowest to highest and plotting the cumulative percentage of days that closed at or below each price gives the empirical distribution function. Its value at a given price estimates the probability that the stock will close at or below that price on a future day, assuming future closes follow the same distribution as the past year's. That assumption is a strong one for prices, which tend to drift over time.
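
As a rough sketch of this example, the snippet below computes the share of closes at or below a chosen price. The prices, the threshold, and the helper name `share_at_or_below` are all invented for illustration. A real series would hold roughly 250 trading days.

```python
from bisect import bisect_right

# Made-up closing prices, standing in for a year of daily data.
closing_prices = [101.5, 99.8, 102.3, 100.1, 98.7,
                  103.4, 100.9, 99.2, 101.0, 102.8]

ordered_prices = sorted(closing_prices)


def share_at_or_below(price):
    """Fraction of observed closes at or below the given price."""
    return bisect_right(ordered_prices, price) / len(ordered_prices)


threshold = 100.0  # hypothetical price level of interest
print(share_at_or_below(threshold))  # 0.3: three of ten closes are <= 100.0
```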

Exam scores

Suppose we have a dataset of exam scores for all students in a class. Ordering the scores from lowest to highest and plotting the cumulative percentage of students who scored at or below each value gives the empirical distribution function. For this exam, its value at a given score is the exact share of the class at or below that score. It can also serve as an estimate of the probability that a randomly selected student will score at or below that value on a future exam of comparable difficulty.
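
The points that would be plotted for this example can be listed directly. The sketch below uses made-up scores and a hypothetical helper, `ecdf_points`, which returns one (score, cumulative percentage) pair per distinct score.

```python
from collections import Counter

scores = [72, 85, 85, 90, 64, 72, 85, 98]  # made-up exam scores


def ecdf_points(values):
    """Return (value, cumulative percentage) pairs, one per distinct value."""
    counts = Counter(values)
    n = len(values)
    points = []
    running = 0
    for value in sorted(counts):  # lowest to highest
        running += counts[value]  # tied scores raise the step together
        points.append((value, 100 * running / n))
    return points


for score, pct in ecdf_points(scores):
    print(score, pct)
```

With these scores the pair for 85 is (85, 75.0), meaning six of the eight students scored 85 or lower.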

Use cases

Estimating a dataset’s distribution when the exact probability distribution is unknown or difficult to derive.
Working with large datasets where parametric assumptions are undesirable.
Analyzing complex or non-normal datasets without specifying a distributional form.
Updating distribution estimates as new data become available.
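
The last use case, updating the estimate as data arrive, might look like the following sketch. The class name `RunningECDF` is hypothetical. It keeps the observations in sorted order so that each new value can be inserted without sorting everything again.

```python
from bisect import bisect_right, insort


class RunningECDF:
    """Empirical distribution function that accepts new observations."""

    def __init__(self):
        self._ordered = []  # observations, kept sorted at all times

    def add(self, x):
        insort(self._ordered, x)  # insert x at its sorted position

    def __call__(self, x):
        if not self._ordered:
            raise ValueError("no observations yet")
        return bisect_right(self._ordered, x) / len(self._ordered)


running = RunningECDF()
for value in [5.0, 1.0, 3.0]:  # made-up observations
    running.add(value)

before = running(3.0)  # 2 of 3 values are at or below 3.0
running.add(2.0)       # a new observation arrives
after = running(3.0)   # now 3 of 4 values are at or below 3.0
print(before, after)
```

Inserting into a Python list shifts later elements, so each update costs time proportional to the number of stored values. For very large streams a different data structure may be a better fit.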

Empirical cumulative distribution function (alias)