How to Use dplyr in R: Data Manipulation Examples

Learn data manipulation in R using dplyr! Step-by-step tutorial with practical examples for effective data analysis. Boost your skills today!

Key points:

  • Dplyr is a package that provides a grammar of data manipulation in R, consisting of verbs that help you perform everyday data manipulation tasks.
  • You can use dplyr to count values that meet certain conditions, find descriptive statistics, apply functions to each data frame row, and join two tables.
  • You can install and load dplyr with install.packages("dplyr") and library(dplyr).
  • You can chain multiple dplyr functions using the pipe operator %>%, which reads as "then".

How to Use dplyr in R: A Tutorial on Data Manipulation with Examples

Hi, I'm Zubair Goraya, a data analyst and a writer for Data Analysis, a website that offers tutorials on Rstudio. In this article, I will show you how to use dplyr, a powerful package for data manipulation in R. 

You will learn to count values that meet certain conditions, find descriptive statistics, and apply functions to each data frame row. You will also learn how to install and load dplyr and its main features and benefits.

R Data Manipulation with dplyr: A Hands-On Guide

Dplyr is a package that provides a grammar of data manipulation, consisting of verbs that help you perform everyday data manipulation tasks. These functions are:

  • filter(): selects rows based on their values.
  • slice(): selects rows based on their position.
  • arrange(): changes the order of the rows.
  • select(): chooses columns based on their names.
  • rename(): changes the names of columns.
  • mutate(): modifies the values of columns and creates new ones.
  • relocate(): changes the order of the columns.
  • summarise(): reduces multiple values to a single summary.
  • group_by(): groups rows by one or more variables.
  • Dplyr also provides several functions for joining two tables, such as inner_join(), left_join(), right_join(), and full_join().

Dplyr works with data frames and tibbles, which are modern reimaginations of data frames. Tibbles are particularly useful for large datasets because they only print the first few rows and have nice formatting. 

You can convert data frames to tibbles with as_tibble().

Dplyr also supports other data sources, such as databases, Apache Spark, and Arrow, through the dplyr, sparklyr, and arrow packages.

How to install and load dplyr?

To install dplyr, you can use the following code:
# Install the whole tidyverse package, which includes dplyr, and other valuable packages
install.packages("tidyverse")
# Alternatively, install only dplyr
install.packages("dplyr")
To load dplyr, you can use the following code:
# Load dplyr
library(dplyr)

How to use dplyr?

To illustrate how to use dplyr, we will use the starwars dataset, which contains information about 87 characters from Star Wars movies. The dataset comes from the Star Wars API and is included in dplyr. You can see its structure with str(starwars) or its documentation with ?starwars.

str(starwars)
?starwars

starwars data set using dplyr


All dplyr functions take a data frame (or tibble) as the first argument. You can then chain multiple functions together using the pipe operator %>%, which reads as "then". 

For example, x %>% f(y) is equivalent to f(x, y). This way, you can write your code clearly and readably.

How do you count values that meet certain conditions?

One of the most common tasks in data analysis is to count how many values in a column satisfy a specific condition. 

For example, you may want to know how many characters in the starwars dataset are droids, or how many have a mass greater than 10 or less than 10.

You can use the filter() function to count values that meet certain conditions, which selects rows based on their values. You can then use the nrow() function to get the number of rows in the filtered data frame. For example:

# Count how many characters are droids
nrow(filter(starwars, species == "Droid"))
# Count how many characters have a mass greater than 10
nrow(filter(starwars, mass > 10))
# Count how many characters have a mass less than 10
nrow(filter(starwars, mass < 10))
filter() function to count values that meet certain conditions

Alternatively, you can use the count() function, which combines filtering and counting in one step. You can also specify multiple conditions with logical operators such as & (and), | (or), and ! (not)

For example:

# Count how many characters are droids
count(starwars, species == "Droid")
# Count how many characters have a mass greater than 10 or less than 10
count(starwars, mass > 10 | mass < 10)
# Count how many characters are not droids and have a mass greater than 10
count(starwars, !species == "Droid" & mass > 10)
Count how many characters

How to find descriptive statistics?

Another common data analysis task is finding descriptive statistics, such as a column's 
  • mean, 
  • median, 
  • standard deviation, 
  • minimum, and 
  • maximum. 

For example, you should know the average height or mass of the characters in the starwars dataset.

You can use the summarise() function to find descriptive statistics, which reduces multiple values to a single summary. You can also use the group_by() function to perform the summarisation by groups. For example:

# Find the mean height of all characters
starwars %>% 
  summarise( mean_height = mean(height, na.rm = TRUE))

# Find the mean height of each species
starwars %>%group_by(species) %>%  
  summarise(mean_height = mean(height, na.rm = TRUE))

# Find the mean and standard deviation of mass for each gender
starwars %>%group_by(gender) %>%  
  summarise(mean_mass = mean(mass, na.rm = TRUE), sd_mass = sd(mass, na.rm = TRUE))


summarise() function to find descriptive statistics

Note that we use the argument na.rm = TRUE to remove missing values before calculating the statistics.

How to apply functions to each row of a data frame?

You can apply a function to each data frame row and create a new column with the results. 

For example, using the formula, calculate each character's body mass index (BMI) in the starwars dataset.

BMI=mass/ (height/100)2​

To apply functions to each data frame row, you can use the mutate() function, which modifies the values of columns and creates new ones. 

For conditional statements, you can use other dplyr functions inside mutate(), such as if_else(). 

For example:

# Calculate the BMI of each character and create a new column
starwars %>% select(mass, height) %>% 
  mutate(bmi = mass / ((height / 100) ^ 2)) %>% print(n=20)

Calculate the BMI of each character and create a new column

How to join two tables with dplyr?

Another helpful feature of dplyr is the ability to join two tables based on one or more common variables. For example, combine the starwars dataset with another dataset containing additional information about the characters' birth planets.

To join two tables with dplyr, you can use one of the following functions, depending on how you want to handle the rows and columns that do not match:

  • inner_join(): keeps only the rows that match in both tables and all the columns from both tables.
  • left_join(): keeps all the rows from the left table and all the columns from both tables. The rows that do not match in the right table are filled with missing values.
  • right_join(): keeps all the rows from the right table and all the columns from both tables. The rows that do not match in the left table are filled with missing values.
  • full_join(): keeps all the rows and columns from both tables. The rows that do not match are filled with missing values.

Conclusion

In this article, you learned how to use dplyr, a powerful package for data manipulation in R. You learned how to count values that meet certain conditions, how to find descriptive statistics, how to apply functions to each row of a data frame, and how to join two tables. 

You also learned how to install and load dplyr and its main features and benefits. Dplyr is an excellent tool for data analysis because it provides a grammar of data manipulation that is consistent, expressive, and efficient. It also works well with other packages in the tidyverse ecosystem, such as ggplot2 for data visualization.

To learn more about dplyr, check out its documentation or cheat sheet here. 

I hope you enjoyed this article and found it useful. If you have any questions or feedback, please leave a comment below. If you need help with your data analysis projects, contact me at info@rstudiodatalab.com or hire me. Thank you for reading! 😊

FAQs:

Q: What are the benefits of using dplyr?

A: Dplyr is an excellent tool for data analysis because it provides a grammar of data manipulation that is consistent, expressive, and efficient. It also works well with other packages in the tidyverse ecosystem, such as ggplot2 for data visualization.

Q: How can I filter rows based on their values?

A: You can use the filter() function, which selects rows based on their values. You can also use the count() function, which combines filtering and counting in one step.

Q: How can I reduce multiple values to a single summary?

A: You can use the summarise() function, which reduces multiple values to a single summary. You can also use the group_by() function to perform the summarisation by groups.

Q: How can I modify the values of columns and create new ones?

A: You can use the mutate() function, which modifies the values of columns and creates new ones. For conditional statements, you can use other dplyr functions inside mutate(), such as if_else().

Q: How can I join two tables based on one or more common variables?

A: You can use one of the following functions, depending on how you want to handle the rows and columns that do not match:

  • inner_join(): keeps only the rows that match in both tables and all the columns from both tables.
  • left_join(): keeps all the rows from the left table and all the columns from both tables. The rows that do not match in the right table are filled with missing values.
  • right_join(): keeps all the rows from the right table and all the columns from both tables. The rows that do not match in the left table are filled with missing values.
  • full_join(): keeps all the rows and columns from both tables. The rows that do not match are filled with missing values.


About the author

Zubair Goraya
Ph.D. Scholar | Certified Data Analyst | Blogger | Completed 5000+ data projects | Passionate about unravelling insights through data.

Post a Comment

Ad blocker detected!

We have detected that you are using adblocking plugin in your browser.
The revenue we earn by the advertisements is used to manage this website, we request you to whitelist our website in your adblocking plugin.