Remove Outliers and Perform Data Cleaning in R

Outliers and data cleaning are core steps in data science. Outliers are values that differ markedly from the rest of a dataset and can distort summary statistics. Data cleaning corrects errors and handles missing values.

Key Points

  • Outliers are data points that are significantly different from the rest of the data and can affect the results of statistical tests and machine learning models.
  • There are different ways to detect outliers, such as graphical methods (boxplots and histograms) and statistical methods (z-scores, interquartile range, Dixon’s test, and Rosner’s test).
  • There are different ways to remove outliers from a dataset, such as using logical operators and subsetting, using the subset() function, or using the filter() function from the dplyr package.
  • There are different ways to impute missing values in a dataset, such as mean, median, or mode imputation, multiple imputations by chained equations (MICE), or K-nearest neighbours (KNN) imputation.
  • There are different ways to encode categorical variables in a dataset, such as label encoding, one-hot encoding, or ordinal encoding.

Description of Functions and Packages

Function/Package | Description
boxplot() | Creates a boxplot for a numeric variable
hist() | Creates a histogram for a numeric variable
scale() | Calculates z-scores for a numeric variable
IQR() | Calculates the interquartile range of a numeric variable
dixon.test() from the outliers package | Performs Dixon’s test for one outlier in a small dataset (3 to 30 observations)
rosnerTest() from the EnvStats package | Performs Rosner’s test for up to k outliers in a larger dataset
subset() | Creates a new dataset that contains only the rows that meet a certain criterion
filter() from the dplyr package | Creates a new dataset that keeps only the rows that satisfy a certain condition
na_mean() from the imputeTS package | Performs mean, median, or mode imputation for missing values (selected with the option argument)
mice() from the mice package | Performs multiple imputation by chained equations (MICE) for missing values
knnImputation() from the DMwR2 package | Performs K-nearest neighbours (KNN) imputation for missing values
as.numeric() or as.factor() | Converts a variable into numeric or factor type, respectively
model.matrix() | Creates a design matrix with dummy variables for each category of a factor variable
factor() | Creates a factor variable with specified levels (an ordered factor when ordered = TRUE)

I’m Zubair Goraya, a Ph.D. Scholar, Certified data analyst, and Freelancer, and I love to share my knowledge and experience with R programming. In this article, I’m going to show you how to deal with outliers in data using R.

Outliers are data points that are significantly different from the rest of the data and can affect the results of statistical tests and machine learning models. Therefore, it is important to identify and remove outliers before performing any analysis on the data.

Here are the main topics that I will cover in this article:

  • What is an outlier, and how do we detect it using boxplots and histograms?
  • How to find outliers using z-scores, interquartile range, Dixon’s and Rosner’s test?
  • How do we remove outliers from data using R functions?
  • How do we impute missing values and handle categorical variables in data?
  • How do we check the presence of outliers after data cleaning?

By the end of this article, you will have a clear understanding of how to perform outlier analysis in R and improve the quality of your data. You will also learn some useful tips and tricks for data science and machine learning projects. So, let’s get started!

What is an outlier?

An outlier is a value that is very different from the other values in a dataset. For example, in a dataset of adult heights, a value of 2 metres or 50 centimetres could be considered an outlier. Outliers can be caused by various factors such as measurement errors, data entry errors, natural variability, or rare events.

Outlier Detection

One way to detect outliers is to use graphical methods such as boxplots and histograms. A boxplot is a type of plot that shows the distribution of a numeric variable using five summary statistics: minimum, first quartile, median, third quartile, and maximum. The box represents the middle 50% of the data, while the whiskers extend to the most extreme values within 1.5 times the interquartile range (IQR). Any value beyond the whiskers is considered an outlier and is marked with a dot or a circle.

A histogram is another type of plot that shows the frequency of values in a numeric variable using bars. The height of each bar represents the number of observations in each bin or interval. A histogram can also help to identify outliers by showing the shape and spread of the data. If the data is skewed or has long tails, it may indicate the presence of outliers.

How do we detect it using boxplots and histograms?

To create boxplots and histograms in R, you can use the boxplot() and hist() functions, respectively. For example, let’s use the mtcars dataset that comes with R and create boxplots and histograms for the hp variable.

# Load the mtcars dataset
data(mtcars)
# Create boxplots for every variable in mtcars
boxplot(mtcars,
        main = "Boxplot for mtcars data set",
        xlab = "Variable",
        ylab = "Value")

[Figure: Boxplot for outlier detection for the mtcars data set]

# Create a boxplot for hp
boxplot(mtcars$hp,
        main = "Boxplot for hp",
        xlab = "hp",
        ylab = "Horsepower")

[Figure: Boxplot for hp using mtcars data]

# Create a histogram for hp
hist(mtcars$hp,
     main = "Histogram for hp",
     xlab = "hp",
     ylab = "Frequency")

[Figure: Histogram for hp]

From these plots, we can see that there is one outlier in the hp variable: a very high value of 335. This value is far away from the rest of the data and may affect the mean and standard deviation of the variable.
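To read the flagged points off as numbers instead of judging them by eye, base R's boxplot.stats() returns the same statistics that boxplot() draws. The following is a minimal sketch; the variable name hp_stats is only illustrative.

```r
# Statistics behind the hp boxplot (coef = 1.5 is the default whisker length)
hp_stats <- boxplot.stats(mtcars$hp)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
hp_stats$stats

# Observations that fall beyond the whiskers
hp_stats$out

# Cars that those observations belong to
outlier_cars <- rownames(mtcars)[mtcars$hp %in% hp_stats$out]
outlier_cars
```

Note that boxplot.stats() works with hinges, which can differ slightly from the quartiles returned by quantile() on small samples.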

How do we find outliers using statistical methods?

Another way to find outliers in R is to use statistical methods that calculate a measure of deviation or distance from the center or average of the data. Some common methods are:

  • Z-scores: A z-score is a standardized score that measures how many standard deviations a value is away from the mean. A value with a z-score greater than 3 or less than -3 is usually considered an outlier.
  • Interquartile range (IQR): The IQR is the difference between the third quartile (Q3) and the first quartile (Q1) of a variable. It represents the middle 50% of the data. A value that lies more than 1.5 times the IQR above Q3 or below Q1 is usually considered an outlier.
  • Dixon’s test: It is a method that tests one suspected outlier at a time in a small dataset (3 to 30 observations). It compares the ratio of the gap between the suspected outlier and its nearest neighbour to the range of the data. If this ratio exceeds a critical value based on the sample size and significance level, then the value is an outlier.
  • Rosner’s test: It is a method that tests for several outliers at once in a larger dataset (more than 30 observations). You specify the maximum number of suspected outliers, k. The test then repeatedly removes the value farthest from the mean, recomputes the mean and standard deviation, and compares each resulting test statistic with a critical value.

To apply these methods in R, you can use the following functions:

  • scale() to calculate z-scores
  • IQR() to calculate the interquartile range
  • dixon.test() from the outliers package to perform Dixon’s test
  • rosnerTest() from the EnvStats package to perform Rosner’s test

Find Outliers in RStudio

For example, let’s use these functions to find outliers in the hp variable of the mtcars dataset. If the packages are not installed, install them first and then load the libraries. If you are not sure how, see How to Import and Install Packages in R: A Comprehensive Guide.

Outlier Detection using Z-scores

Z-scores, also known as standard scores, z-values, normal scores, or standardized values, measure how many standard deviations away a value is from the mean of a distribution. They are useful for comparing data with different units, scales, or ranges. They can also help us assess a dataset's normality, find outliers, and calculate probabilities. Read more: Did You Know How to Calculate Z-Score in R?

# Install EnvStats once if needed, then load it (used later for Rosner's test)
# install.packages("EnvStats")
library(EnvStats)
# Calculate z-scores for hp
z_scores <- as.vector(scale(mtcars$hp))
z_scores
# Find values with z-scores greater than 2 or less than -2
z_outliers <- mtcars$hp[abs(z_scores) > 2]
# Print z_outliers
z_outliers
The output of this command is:
[1] 335

This means that the value 335 is an outlier according to the z-score method with a cutoff of 2. Its z-score is about 2.75, so the stricter cutoff of 3 would not flag any hp value.
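Because the result depends on the cutoff, it can help to wrap the calculation in a small function and compare cutoffs side by side. This is a minimal sketch; flag_z_outliers() is a hypothetical helper written for this example, not part of any package.

```r
# Return the values of x whose absolute z-score exceeds the threshold
flag_z_outliers <- function(x, threshold = 3) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  x[!is.na(z) & abs(z) > threshold]
}

# Cutoff of 2, as used in this tutorial
flag_z_outliers(mtcars$hp, threshold = 2)

# Conventional cutoff of 3
flag_z_outliers(mtcars$hp, threshold = 3)

# Largest absolute z-score in hp, for reference
max_abs_z <- max(abs(as.vector(scale(mtcars$hp))))
max_abs_z
```

In small samples such as mtcars, z-scores rarely reach 3, because a single extreme value also inflates the standard deviation it is measured against.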

Inter Quartile Range (IQR)

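# Inter Quartile Range (IQR)
# Calculate IQR scores for hp: distance from the median in units of the IQR
iqr_scores <- (mtcars$hp - median(mtcars$hp)) / IQR(mtcars$hp)
# Find values more than 1.5 IQRs above or below the median
iqr_outliers <- mtcars$hp[abs(iqr_scores) > 1.5]
# Print iqr_outliers
iqr_outliers
The output of this command is:
[1] 264 335

This means that 264 and 335 are flagged by this median-centred variant of the IQR method. It is stricter than the boxplot rule, which measures 1.5 times the IQR from Q1 and Q3 rather than from the median and flags only 335 for hp.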
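For comparison, the quartile-based rule described earlier (1.5 times the IQR below Q1 or above Q3) can be sketched as follows. The variable names are illustrative, and quantile() is used with its default method.

```r
hp <- mtcars$hp

# First and third quartiles, and the IQR between them
q <- quantile(hp, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Fences 1.5 IQRs outside the quartiles
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Values outside the fences
fence_outliers <- hp[hp < lower_fence | hp > upper_fence]
fence_outliers
```

Measuring from the quartiles instead of the median widens the accepted range, so this rule flags fewer values than a median-centred score.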

Outlier Detection Using Dixon’s test 

Dixon’s test has a limitation: the dixon.test() function only handles sample sizes from 3 to 30 and does not work on larger samples. The mtcars dataset has 32 rows, so for this tutorial we subset the data and take only the first 30 observations.

# Load the outliers package
# install.packages("outliers")
library(outliers)
# Perform Dixon's test for hp on the first 30 observations
dixon_test <- outliers::dixon.test(mtcars$hp[1:30])
# Print dixon_test
print(dixon_test)
The output of this command is:

[Figure: Dixon test for outliers]

Dixon’s test examines the most extreme value in the subset, which is 264 here because dropping the last two rows removes the 335 observation. That value counts as an outlier at a significance level of 0.05 only if the reported p-value is below 0.05. The result is not reliable for the full dataset, because the sample had to be truncated to fit the test’s size limit.
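The gap-to-range ratio behind Dixon’s test is easy to compute by hand for a very small sample. The sketch below uses the simplest form of the ratio, which is assumed here to be appropriate only for 3 to 7 observations; dixon_q() is a hypothetical helper, and the resulting ratio still has to be compared with a tabulated critical value (or with the p-value that dixon.test() reports).

```r
# Gap-to-range ratio for the lowest and the highest value of a small sample
dixon_q <- function(x) {
  x <- sort(x)
  n <- length(x)
  stopifnot(n >= 3, n <= 7)  # assumption: simplest ratio, very small samples only
  rng <- x[n] - x[1]
  c(low  = (x[2] - x[1]) / rng,       # gap below the second-smallest value
    high = (x[n] - x[n - 1]) / rng)   # gap above the second-largest value
}

# First seven hp values of mtcars
small_sample <- mtcars$hp[1:7]
q_ratio <- dixon_q(small_sample)
q_ratio
```

A ratio close to 1 means the suspect value sits far from its neighbour relative to the spread of the whole sample.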

Outlier Detection Using Rosner's test

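# Perform Rosner's test for hp, testing for up to k = 2 outliers
rosner_test <- rosnerTest(mtcars$hp, k = 2)
# Print rosner_test
rosner_test
The output of this command is:

[Figure: Rosner's Test for Outliers]

The test examines the two most extreme values, 335 and 264. Each one counts as an outlier at a significance level of 0.05 only if its test statistic exceeds the corresponding critical value; the Outlier column of rosner_test$all.stats shows the decision for each value.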
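To see what the test statistic and critical value look like, the procedure can be sketched in base R using the generalized extreme Studentized deviate formulas. This is an illustrative sketch under the assumption of no missing values, not a replacement for rosnerTest(); rosner_sketch() and its column names are made up for this example.

```r
rosner_sketch <- function(x, k = 2, alpha = 0.05) {
  n <- length(x)
  steps <- data.frame(i = seq_len(k), value = NA_real_,
                      R = NA_real_, lambda = NA_real_)
  remaining <- x
  for (i in seq_len(k)) {
    # Most extreme remaining value and its studentized deviation
    dev <- abs(remaining - mean(remaining))
    idx <- which.max(dev)
    steps$value[i] <- remaining[idx]
    steps$R[i] <- dev[idx] / sd(remaining)
    # Critical value for step i
    p <- 1 - alpha / (2 * (n - i + 1))
    t_crit <- qt(p, df = n - i - 1)
    steps$lambda[i] <- (n - i) * t_crit /
      sqrt((n - i - 1 + t_crit^2) * (n - i + 1))
    # Drop the value before the next step
    remaining <- remaining[-idx]
  }
  # Number of outliers: the largest step at which R exceeds lambda
  exceed <- which(steps$R > steps$lambda)
  list(steps = steps,
       n_outliers = if (length(exceed) > 0) max(exceed) else 0)
}

res <- rosner_sketch(mtcars$hp, k = 2)
res$steps
res$n_outliers
```

The first statistic equals the largest absolute z-score in the data, so it is worth comparing res$steps with the R and lambda columns printed by rosnerTest().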

How do we remove outliers from data?

Once you have identified the outliers in your dataset, remove them before performing any further analysis of the data. There are different ways to remove outliers from data, such as:

Using logical operators and subsetting

You can use logical operators such as <, >, ==, !=, etc., to create a condition that is TRUE for the rows you want to keep, and then use subsetting to select only those rows.

Using the subset() function

You can use the subset() function to create a new dataset that contains only the rows that meet a certain criterion, which excludes the outliers.

Using the filter() function from the dplyr package

You can use the filter() function from the dplyr package to create a new dataset that keeps only the rows that satisfy a certain condition and drops the rest of the data.

For example, let’s use these methods to remove the outlier from the hp variable of the mtcars dataset.

# Load the dplyr package
library(dplyr)
# Remove the outlier using logical operators and subsetting (keep rows with hp below 300)
mtcars_no_outliers1 <- mtcars[mtcars$hp < 300, ]
mtcars_no_outliers1
# Remove the outlier using the subset() function
mtcars_no_outliers2 <- subset(mtcars, hp < 300)
mtcars_no_outliers2
# Remove the outlier using the filter() function from the dplyr package
mtcars_no_outliers3 <- dplyr::filter(mtcars, hp < 300)
mtcars_no_outliers3

[Figure: Remove outliers using logical operators and subsetting, the subset() function, and the filter() function from the dplyr package]

The output of each command is a new dataset that contains 31 rows and excludes the row with hp = 335.
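Instead of hard-coding a cutoff, the limits can be derived from the data, and the cleaned variable can be checked again afterwards. The sketch below assumes the quartile-based 1.5 × IQR rule; remove_iqr_outliers() is a hypothetical helper, not a function from any package.

```r
# Keep rows whose value in `column` lies within coef * IQR of the quartiles
remove_iqr_outliers <- function(data, column, coef = 1.5) {
  x <- data[[column]]
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- coef * (q[2] - q[1])
  keep <- is.na(x) | (x >= q[1] - fence & x <= q[2] + fence)  # missing values are kept
  data[keep, , drop = FALSE]
}

mtcars_clean <- remove_iqr_outliers(mtcars, "hp")

# Number of rows that were dropped
rows_dropped <- nrow(mtcars) - nrow(mtcars_clean)
rows_dropped

# Re-check the cleaned variable for remaining boxplot outliers
remaining_outliers <- boxplot.stats(mtcars_clean$hp)$out
remaining_outliers
```

Re-checking matters because removing one extreme value changes the quartiles, which can expose new points beyond the whiskers.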

How to impute missing values and handle categorical variables in a dataset?

Another common issue that you may encounter in your dataset is missing values. Missing values are values that are not recorded or available for some reason. R represents them as NA, while raw data files may use other markers such as NULL, ?, or an empty field. Missing values can affect the accuracy and validity of your analysis and may introduce bias or errors in your results.

One way to deal with this is to impute them. Imputation is a process of replacing missing values with plausible values based on some criteria or assumptions.

There are different methods, such as:

Mean or median imputation

This method replaces missing values with the mean or median of the variable. It is simple and easy to implement, but it may reduce the variability and distort the distribution of the data.
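The loss of variability is easy to demonstrate in base R. This sketch uses simulated data; the vector names and the choice of 40 missing values out of 200 are assumptions made for illustration.

```r
set.seed(42)
x <- rnorm(200, mean = 50, sd = 10)
x[sample(200, 40)] <- NA          # make 40 of the 200 values missing

# Replace missing values with the mean or the median of the observed values
x_mean   <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
x_median <- ifelse(is.na(x), median(x, na.rm = TRUE), x)

# Standard deviation before and after imputation
sd_observed <- sd(x, na.rm = TRUE)
sd_mean_imp <- sd(x_mean)
c(observed = sd_observed, mean_imputed = sd_mean_imp, median_imputed = sd(x_median))
```

Every imputed value sits at the centre of the distribution, so the spread of the completed variable shrinks as the share of missing values grows.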

Mode or most frequent imputation

This method replaces missing values with the mode or most frequent value of the variable. It is suitable for categorical variables, but it may introduce bias and overrepresent some categories.

Regression imputation

This method replaces missing values with predicted values based on a regression model that uses other variables as predictors. It is more sophisticated and realistic, but it may increase the complexity and uncertainty of the model.
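A minimal sketch of regression imputation with base R's lm() and predict() follows. The simulated data frame dat, the linear relationship between x and z, and the column name x_imputed are all assumptions made for this example.

```r
set.seed(1)
dat <- data.frame(z = runif(100, min = 0, max = 100))
dat$x <- 20 + 0.5 * dat$z + rnorm(100, sd = 5)   # x depends linearly on z
dat$x[sample(100, 15)] <- NA                     # make 15 values of x missing

# Fit the model on complete rows (lm() drops rows with missing x by default)
fit <- lm(x ~ z, data = dat)

# Predict x for the rows where it is missing
missing_rows <- is.na(dat$x)
dat$x_imputed <- dat$x
dat$x_imputed[missing_rows] <- predict(fit, newdata = dat[missing_rows, , drop = FALSE])

sum(is.na(dat$x_imputed))
```

The predictions fall exactly on the regression line, so this approach understates the natural scatter of x; multiple imputation methods such as MICE add random variation to address that.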

K-nearest neighbours (KNN) imputation

This method replaces missing values with the average or weighted average of the k nearest neighbours of the observation based on some distance metric. It is more flexible and adaptive, but it may be computationally expensive and sensitive to outliers.
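The idea can be sketched in a few lines of base R for one numeric variable. This is an illustrative sketch, not the algorithm used by any particular package: knn_impute() is a hypothetical helper, it assumes the predictor columns are complete, and it uses an unweighted mean over Euclidean neighbours.

```r
# Fill missing values of `target` with the mean of its k nearest observed neighbours
knn_impute <- function(target, predictors, k = 5) {
  predictors <- scale(as.matrix(predictors))   # put predictors on a common scale
  d <- as.matrix(dist(predictors))             # Euclidean distances between rows
  observed <- which(!is.na(target))            # only originally observed rows act as neighbours
  for (i in which(is.na(target))) {
    nearest <- observed[order(d[i, observed])][seq_len(k)]
    target[i] <- mean(target[nearest])
  }
  target
}

# Hide three hp values, then impute them from engine displacement and weight
hp_missing <- mtcars$hp
hidden <- c(3, 10, 20)
hp_missing[hidden] <- NA
hp_filled <- knn_impute(hp_missing, mtcars[, c("disp", "wt")], k = 5)

# Compare imputed values with the true ones
data.frame(true = mtcars$hp[hidden], imputed = hp_filled[hidden])
```

Scaling the predictors first keeps a variable with large units, such as disp, from dominating the distance calculation.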

To perform imputation in R, you can use various functions and packages, such as:

  • na_mean() from the imputeTS package to perform mean, median, or mode imputation (selected with its option argument)
  • mice() from the mice package to perform multiple imputation by chained equations (MICE), which is a general method that can handle different types of variables and models
  • knnImputation() from the DMwR2 package to perform KNN imputation

Imputation of Missing Values in R

For example, let’s use these functions to impute missing values in a simulated dataset that contains numeric and categorical variables.
# Load the imputeTS, mice, and DMwR2 packages
#install.packages("imputeTS")
library(imputeTS)
#install.packages("mice")
library(mice)
#install.packages("DMwR2")
library(DMwR2)
# Create a simulated dataset with numeric and categorical variables
set.seed(123)
df <- data.frame(
  x = rnorm(100, mean = 50, sd = 10),
  y = sample(c("A", "B", "C"), 100, replace = TRUE),
  z = runif(100, min = 0, max = 100)
)
# Introduce some missing values randomly
df[sample(1:100, 10), "x"] <- NA
df[sample(1:100, 10), "y"] <- NA
df[sample(1:100, 10), "z"] <- NA
# Check which columns contain missing values
colSums(is.na(df))
The output of this code is:

 x  y  z 
10 10 10 

As you can see, each of the three variables x, y, and z now has 10 missing values.

Impute Missing values

Let’s use the imputeTS package to perform mean imputation for the numeric variables, and a small custom function to perform mode imputation for the categorical variable.

# Load the imputeTS package
library(imputeTS)

# Perform mean imputation for x and z variables
df$x <- na_mean(df$x)
df$z <- na_mean(df$z)

# Perform mode imputation for y variable
# Custom function to impute the mode of a character or factor vector
impute_mode <- function(x) {
  table_x <- table(x)                             # counts per category; NA is excluded
  mode_val <- names(table_x)[which.max(table_x)]  # category with the highest count
  x[is.na(x)] <- mode_val
  return(x)
}

# Impute mode for the 'y' variable
df$y <- impute_mode(df$y)
# Print df after imputation
head(df,10)
# Check which columns contain missing values
colSums(is.na(df))
The output of this code is:

[Figure: Mean and mode imputation for the numeric and categorical variables]

As you can see, this dataset has no more missing values in any of the variables: x, y, and z.

Handle categorical variables in R

Another issue that you may face in your dataset is the presence of categorical variables. Categorical variables are variables that have a finite number of possible values or categories, such as gender, colour, or type of car.

Categorical variables can be either nominal or ordinal.

  • Nominal variables are variables that have no inherent order or ranking among the categories, such as gender or color.
  • Ordinal variables are variables that have a natural order or ranking among the categories, such as education level or satisfaction rating.

One way to deal with categorical variables is to encode them into numeric values that can be used for analysis and modelling.

There are different methods of encoding categorical variables, such as:

  • Label encoding: This method assigns a unique integer value to each category of the variable, starting from zero or one.
  • One-hot encoding: This method creates a new binary variable for each category of the variable, with a value of one if the observation belongs to that category and zero otherwise.
  • Ordinal encoding: This method assigns an integer value to each category of the variable based on the order or ranking of the categories.

Encoding categorical variables in R

To perform encoding in R, you can use various functions and packages, such as:

  • as.numeric() or as.factor() to convert a variable into numeric or factor type, respectively.
  • model.matrix() to create a design matrix with dummy variables for each category of a factor variable.
  • factor() to create an ordered factor variable with specified levels.

For example, let’s use these functions to encode the y variable of the df dataset that we created earlier.

# Encode y variable using label encoding (A = 0, B = 1, C = 2)
df$y <- as.factor(df$y)
df$y_label <- as.numeric(df$y) - 1
# Encode y variable using one-hot encoding (one 0/1 column per category)
y_onehot <- model.matrix(~ y - 1, data = df)
colnames(y_onehot) <- paste0("y_onehot", levels(df$y))
df <- cbind(df, y_onehot)
# Encode y variable using ordinal encoding
df$y_ordinal <- factor(df$y, levels = c("A", "B", "C"), ordered = TRUE)
# Print df after encoding
df

[Figure: Encode y variable using label encoding]

I have encoded the y variable using three different methods:

  1. Label Encoding
  2. One-hot encoding
  3. Ordinal encoding

Label encoding

It assigns a unique integer value to each category of the variable, starting from zero or one. For example, category A is encoded as zero, B as one, and C as two.

[Figure: Label encoding using R]

One-hot encoding

It creates a new binary variable for each category of the variable, with a value of one if the observation belongs to that category and zero otherwise. For example, the category A is encoded as a vector of (1,0,0), B as (0,1,0), and C as (0,0,1).
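A quick way to confirm that an encoding really is one-hot is to check that every row contains exactly one 1. The sketch below uses a hypothetical colour variable rather than the tutorial's data.

```r
colour <- factor(c("red", "green", "blue", "green", "red"))

# "- 1" removes the intercept so that every level gets its own column
onehot <- model.matrix(~ colour - 1)
colnames(onehot) <- levels(colour)
onehot

# Each row should contain exactly one 1
rowSums(onehot)

# With the intercept kept, the first level becomes the reference and loses its column
with_intercept <- model.matrix(~ colour)
colnames(with_intercept)
```

The intercept form is the one regression functions such as lm() build internally, which is why one category never appears in their coefficient tables.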

Ordinal encoding

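It assigns an integer value to each category of the variable based on the order or ranking of the categories. For example, category A is encoded as one, B as two, and C as three. In R, the ordered factor stores this ranking (A < B < C), and as.integer(df$y_ordinal) returns the integer codes.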
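The levels argument matters most when the natural order is not alphabetical. The following sketch uses a hypothetical education variable to show the difference.

```r
education <- c("bachelor", "high school", "master", "bachelor", "high school")

# Explicit levels give the intended ranking: high school < bachelor < master
education_ord <- factor(education,
                        levels = c("high school", "bachelor", "master"),
                        ordered = TRUE)
ord_codes <- as.integer(education_ord)
ord_codes    # 2 1 3 2 1

# Without explicit levels, R sorts the categories alphabetically
alpha_codes <- as.integer(factor(education))
alpha_codes  # 1 2 3 1 2

# Ordered factors also support comparisons
education_ord < "master"
```

Setting the levels explicitly keeps the integer codes aligned with the real ranking instead of with the spelling of the labels.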

You can see the results of each encoding method in the new columns that I have added to the dataset: y_label, y_onehotA, y_onehotB, y_onehotC, and y_ordinal.

Encoding categorical variables can help to transform them into numeric values that can be used for analysis and modelling. However, you should take care to choose the method that suits your data and your purpose.

Advantages and Disadvantages of each method

Some advantages and disadvantages of each method are:

  • Label encoding is simple and easy to implement, but it may imply a false sense of order or magnitude among the categories that may not exist in reality.
  • One-hot encoding is more expressive and avoids the problem of order or magnitude, but it may create a large number of new variables that may increase the dimensionality and sparsity of the data.
  • Ordinal encoding is suitable for ordinal variables that have a natural order or ranking among the categories, but it may not work well for nominal variables that have no inherent order or ranking.

Conclusion

This article shows you how to perform outlier analysis and imputation in R using various methods and functions. You have learned how to identify and remove outliers, how to replace missing values with plausible values, and how to transform categorical variables into numeric values. These steps can help you to improve the quality of your data and prepare it for further analysis and modelling.

If you want to learn more about R programming and data analysis, you can check out our latest R posts on our website: Data Analysis. You can also contact us at info@rstudiodatalab.com or hire us at Order Now if you need any help with your data science or machine learning projects.

Thank you for reading, and happy coding! 

Frequently Asked Questions (FAQs)

What is an outlier? 

An outlier is a data point that is significantly different from the rest of the data and can affect the results of statistical tests and machine learning models.

How can I detect outliers using boxplots? 

A boxplot is a type of plot that shows the distribution of a numeric variable using five summary statistics: minimum, first quartile, median, third quartile, and maximum. The box represents the middle 50% of the data, while the whiskers extend to the most extreme values within 1.5 times the interquartile range (IQR). Any value beyond the whiskers is considered an outlier and is marked with a dot or a circle.

How can I detect outliers using z-scores?

A z-score is a standardized score that measures how many standard deviations a value is away from the mean. A value with a z-score greater than 3 or less than -3 is usually considered an outlier.

How can I remove outliers from a dataset using logical operators and subsetting?

You can use logical operators such as <, >, ==, !=, etc., to create a condition that is TRUE for the rows you want to keep, and then use subsetting to select only those rows. For example, if you want to remove outliers from the mpg variable of the mtcars dataset, you can use this command:

mtcars_no_outliers <- mtcars[mtcars$mpg > 10 & mtcars$mpg < 34, ]

How can I remove outliers from a dataset using the subset() function?

You can use the subset() function to create a new dataset that contains only the rows that meet a certain criterion and exclude the outliers. For example, if you want to remove outliers from the mpg variable of the mtcars dataset, you can use this command:

mtcars_no_outliers <- subset(mtcars, mpg > 10 & mpg < 34)

How can I remove outliers from a dataset using the filter() function from the dplyr package?

You can use the filter() function from the dplyr package to create a new dataset that keeps only the rows that satisfy a certain condition and drops the rest of the data. For example, if you want to remove outliers from the mpg variable of the mtcars dataset, you can use this command:

library(dplyr)
mtcars_no_outliers <- filter(mtcars, mpg > 10, mpg < 34)

What is imputation?

Imputation is a process of replacing missing values with plausible values based on some criteria or assumptions.

How can I impute missing values using mean, median, or mode imputation?

You can use the na_mean() function from the imputeTS package, whose option argument selects mean, median, or mode imputation for missing values. For example, if you want to impute missing values in the x variable of the df dataset using mean imputation, you can use this command:

library(imputeTS)
df$x <- na_mean(df$x)

How can I impute missing values using multiple imputations by chained equations (MICE)?

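You can use the mice() function from the mice package to perform multiple imputation by chained equations (MICE) for missing values. MICE is a general method that can handle different types of variables and models. The mice() function returns the imputations as a mids object, and complete() extracts a filled-in dataset from it. Character columns such as y should be converted to factors first so that they can be imputed. For example, if you want to impute missing values in the df dataset using MICE, you can use these commands:

library(mice)
df$y <- as.factor(df$y)                                # categorical columns must be factors
imp <- mice(df, m = 5, seed = 123, printFlag = FALSE)  # creates 5 imputed datasets
df_imputed <- complete(imp, 1)                         # extract the first completed dataset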

How can I encode categorical variables into numeric values?

You can use various methods to encode categorical variables into numeric values, such as label encoding, one-hot encoding, or ordinal encoding. For example, if you want to encode the y variable of the df dataset using one-hot encoding, you can use this command:

df$y_onehot <- model.matrix(~ y -1, data = df)


About the author

Zubair Goraya
Ph.D. Scholar | Certified Data Analyst | Blogger | Completed 5000+ data projects | Passionate about unravelling insights through data.
