dplyr

TL;DR

An R package (part of the tidyverse) for manipulating and analyzing tabular data.
Provides concise functions to filter/subset rows and to group and summarise data.
A common workflow chains verbs such as filter(), group_by(), and summarise() with the pipe (%>%).

Definition

dplyr is an R package for data manipulation and analysis. It is part of the tidyverse, a collection of packages designed for data science in R. dplyr provides a small set of functions ("verbs") for filtering, grouping, and summarizing data frames, which makes it a standard tool for data analysis in R.

Explanation

filter() keeps the rows of a data frame that satisfy one or more logical conditions. For example, to select the rows for houses with 3 bedrooms:

housing_prices %>%
  filter(bedrooms == 3)

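This returns a new data frame containing only the rows where bedrooms == 3; housing_prices itself is not modified. Rows where bedrooms is NA are dropped as well, because the condition does not evaluate to TRUE for them.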
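To make the filtering behavior concrete, here is a minimal runnable sketch. The housing_prices data frame below is made up for illustration (its columns and values are assumptions, not data from the original example).

```r
library(dplyr)

# Hypothetical stand-in for housing_prices; the values are invented.
housing_prices <- data.frame(
  price    = c(250000, 310000, 295000, 420000),
  bedrooms = c(3, 4, 3, NA)
)

# Keep only the rows where the condition is TRUE.
three_bed <- housing_prices %>%
  filter(bedrooms == 3)

nrow(three_bed)        # 2: the row with bedrooms = NA is not kept
nrow(housing_prices)   # 4: the input data frame is unchanged
```

Assigning the result to a new name (three_bed here) keeps both the subset and the full data available for later steps.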

group_by() splits a data frame into groups defined by one or more columns, and summarise() then collapses each group into a single row of summary statistics. For example, to compute the total amount spent by each customer:

customer_transactions %>%
  group_by(customer_id) %>%
  summarise(total_spent = sum(amount))

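This produces a data frame with one row per customer_id and a total_spent column, enabling comparison of spending across customers. If amount contains missing values, sum() returns NA for that customer unless na.rm = TRUE is passed.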
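The grouping step can be sketched on a small, made-up customer_transactions data frame (the IDs and amounts are assumptions for illustration). The sketch also passes na.rm = TRUE so that a missing amount does not turn a customer's total into NA, and uses n() to count rows per group.

```r
library(dplyr)

# Hypothetical stand-in for customer_transactions; the values are invented.
customer_transactions <- data.frame(
  customer_id = c("a", "a", "b", "b", "c"),
  amount      = c(10, 15, 20, NA, 5)
)

totals <- customer_transactions %>%
  group_by(customer_id) %>%
  summarise(
    total_spent = sum(amount, na.rm = TRUE),  # ignore missing amounts
    n_purchases = n()                         # rows per customer, NA row included
  )

totals  # one row per customer: a = 25, b = 20, c = 5
```

Without na.rm = TRUE, customer "b" would get a total_spent of NA, which is sometimes the preferred signal that the data is incomplete.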

Examples

Filtering example

housing_prices %>%
  filter(bedrooms == 3)

Grouping and summarising example

customer_transactions %>%
  group_by(customer_id) %>%
  summarise(total_spent = sum(amount))
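Combined example

The verbs can also be chained in one pipeline, with filter() applied before grouping. The sketch below uses an invented customer_transactions data frame and an arbitrary threshold of 100; both are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical data; the values are invented.
customer_transactions <- data.frame(
  customer_id = c("a", "a", "b", "b", "c"),
  amount      = c(10, 150, 200, 30, 5)
)

large_totals <- customer_transactions %>%
  filter(amount >= 100) %>%              # keep transactions of 100 or more
  group_by(customer_id) %>%              # one group per remaining customer
  summarise(total_spent = sum(amount))   # total of the large transactions

large_totals  # a = 150, b = 200
```

Customer "c" has no transaction at or above the threshold, so it does not appear in the result at all: groups are formed only from the rows that survive the filter.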

tidyverse
R
filter()
group_by()
summarise()