
Remove Outliers and Perform Data Cleaning in R

Outliers and data cleaning are core concerns in data science. Outliers are values that differ markedly from the rest of a dataset and can distort summary statistics; data cleaning deals with such values, as well as with errors and missing data.

Key Points

  • Outliers are data points that are significantly different from the rest of the data and can affect the results of statistical tests and machine learning models.
  • There are different ways to detect outliers, such as graphical methods (boxplots and histograms) and statistical methods (z-scores, interquartile range, Dixon’s test, and Rosner’s test).
  • There are different ways to remove outliers from a dataset, such as using logical operators and subsetting, using the subset() function, or using the filter() function from the dplyr package.
  • There are different ways to impute missing values in a dataset, such as mean, median, or mode imputation, multiple imputations by chained equations (MICE), or K-nearest neighbours (KNN) imputation.
  • There are different ways to encode categorical variables in a dataset, such as label encoding, one-hot encoding, or ordinal encoding.


Description of Functions and Packages

Function/Package Description
boxplot() Creates a boxplot for a numeric variable
hist() Creates a histogram for a numeric variable
scale() Calculates z-scores for a numeric variable
IQR() Calculates interquartile range for a numeric variable
dixon.test() from the outliers package Performs Dixon’s test for one outlier in a small dataset (3 to 30 observations)
rosnerTest() from the EnvStats package Performs Rosner’s test for multiple outliers in a large dataset
subset() Creates a new dataset that contains only the rows that meet a certain criterion
filter() from the dplyr package Creates a new dataset that excludes the rows that match a certain condition
na_mean() from the imputeTS package Performs mean, median, or mode imputation for missing values (via its option argument)
mice() from the mice package Performs multiple imputation by chained equations (MICE) for missing values
knnImputation() from the DMwR2 package Performs K-nearest neighbours (KNN) imputation for missing values
as.numeric() or as.factor() Converts a variable into numeric or factor type, respectively
model.matrix() Creates a design matrix with dummy variables for each category of a factor variable
factor() Creates an ordered factor variable with specified levels

I’m Zubair Goraya, a Ph.D. Scholar, Certified data analyst, and Freelancer, and I love to share my knowledge and experience with R programming. In this article, I’m going to show you how to deal with outliers in data using R.

Outliers are data points that are significantly different from the rest of the data and can affect the results of statistical tests and machine learning models. Therefore, it is important to identify and remove outliers before performing any analysis on the data.

Here are the main topics that I will cover in this article:

  • What is an outlier, and how do we detect it using boxplots and histograms?
  • How to find outliers using z-scores, interquartile range, Dixon’s and Rosner’s test?
  • How do we remove outliers from data using R functions?
  • How do we impute missing values and handle categorical variables in data?
  • How do we check the presence of outliers after data cleaning?

By the end of this article, you will have a clear understanding of how to perform outlier analysis in R and improve the quality of your data. You will also learn some useful tips and tricks for data science and machine learning projects. So, let’s get started!

What is an outlier?

An outlier is a value that is very different from the other values in a dataset. For example, in a dataset of people's heights, a value of 2 metres or 50 centimetres would be considered an outlier. Outliers can be caused by various factors such as measurement errors, data entry errors, natural variability, or rare events.

Outlier Detection

One way to detect outliers is to use graphical methods such as boxplots and histograms. A boxplot is a type of plot that shows the distribution of a numeric variable using five summary statistics: minimum, first quartile, median, third quartile, and maximum. The box represents the middle 50% of the data, while the whiskers extend to the most extreme values within 1.5 times the interquartile range (IQR). Any value beyond the whiskers is considered an outlier and is marked with a dot or a circle.
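The whisker rule described above can also be applied without drawing the plot: base R's boxplot.stats() returns the points a boxplot would mark as dots. A minimal sketch using the mtcars data that this article works with:

```r
# boxplot.stats() applies the 1.5 * IQR whisker rule and returns the
# flagged points in its $out component
data(mtcars)
out <- boxplot.stats(mtcars$hp)$out
out
# -> 335 (the only hp value beyond the upper whisker)
```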

A histogram is another type of plot that shows the frequency of values in a numeric variable using bars. The height of each bar represents the number of observations in each bin or interval. A histogram can also help to identify outliers by showing the shape and spread of the data. If the data is skewed or has long tails, it may indicate the presence of outliers.

How do we detect outliers using boxplots and histograms?

To create boxplots and histograms in R, you can use the boxplot() and hist() functions, respectively. For example, let’s use the mtcars dataset that comes with R and create a boxplot for the whole dataset, followed by a boxplot and a histogram for the hp variable.

# Load the mtcars dataset
data(mtcars)
# Create a boxplot for every variable in the dataset
boxplot(mtcars,
        main = "Boxplot for the mtcars Data Set")

Boxplot for Outlier Detection for mtcars data set
boxplot(mtcars$hp,
        main = "Boxplot for hp",
        ylab = "hp")

Boxplot for hp using mtcars data
hist(mtcars$hp,
     main = "Histogram for hp",
     xlab = "hp",
     ylab = "Frequency")
Histogram for hp

From these plots, we can see that there is one outlier in the hp variable: a very high value (335). This value is far away from the rest of the data and may inflate the mean and standard deviation of the variable.

How do we find outliers using statistical methods?

Another way to find outliers in R is to use statistical methods that calculate a measure of deviation or distance from the center or average of the data. Some common methods are:

  • Z-scores: A z-score is a standardized score that measures how many standard deviations a value lies from the mean. A value with a z-score greater than 3 or less than -3 is usually considered an outlier.
  • Interquartile range (IQR): The IQR is the difference between the third quartile (Q3) and the first quartile (Q1) of a variable and represents the middle 50% of the data. A value more than 1.5 times the IQR above Q3 or below Q1 is usually considered an outlier.
  • Dixon’s test: A method that detects one outlier at a time in a small dataset (3 to 30 observations). It compares the ratio of the gap between the suspect value and its nearest neighbour to the range of the data. If this ratio exceeds a critical value based on the sample size and significance level, the value is declared an outlier.
  • Rosner’s test: A method that detects multiple outliers in a larger dataset (roughly 25 or more observations). It is an iterative (generalized extreme Studentized deviate) procedure that tests the most extreme value, removes it if flagged, and repeats until no more outliers are found.
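As a quick illustration of the IQR fence rule described above, here is a sketch using base R only (the article's own IQR code below uses a slightly different, median-based score):

```r
# Classic IQR fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR
data(mtcars)
q1 <- quantile(mtcars$hp, 0.25)
q3 <- quantile(mtcars$hp, 0.75)
iqr <- IQR(mtcars$hp)
lower <- q1 - 1.5 * iqr   # lower fence
upper <- q3 + 1.5 * iqr   # upper fence
mtcars$hp[mtcars$hp < lower | mtcars$hp > upper]
# -> 335 (the only value outside the fences)
```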

To apply these methods in R, you can use the following functions:

  • scale() to calculate z-scores
  • IQR() to calculate IQR scores
  • dixon.test() from the outliers package to perform Dixon’s test
  • rosnerTest() from the EnvStats package to perform Rosner’s test

Find Outliers in RStudio

For example, let’s use these functions to find outliers in the hp variable of the mtcars dataset. If the packages are not installed, install them first and then load the libraries. If you don’t know how, see How to Import and Install Packages in R: A Comprehensive Guide.

Outlier Detection using Z-scores

Z-scores, also known as standard scores, z-values, normal scores, or standardized values, measure how many standard deviations a value lies from the mean of a distribution. They are useful for comparing data with different units, scales, or ranges. They can also help us test a dataset's normality, find outliers, and calculate probabilities. Read more: Did You Know How to Calculate Z-Score in R?

# Install (if needed) and load the car and EnvStats packages
# install.packages("car")
# install.packages("EnvStats")
library(car)
library(EnvStats)
# Calculate z-scores for hp
z_scores <- scale(mtcars$hp)
z_scores
# Find values with z-scores greater than 2 or less than -2
# (a cut-off of 2 is used instead of 3 because this dataset is small)
z_outliers <- mtcars$hp[abs(z_scores) > 2]
# Print z_outliers
z_outliers
The output of this command is:
[1] 335

This means that the value 335 is an outlier according to the z-score method.

Inter Quartile Range (IQR)

# Inter Quartile Range (IQR)
# Calculate IQR scores for hp (distance from the median, in IQR units)
iqr_scores <- (mtcars$hp - median(mtcars$hp)) / IQR(mtcars$hp)
# Find values with IQR scores greater than 1.5 or less than -1.5
iqr_outliers <- mtcars$hp[abs(iqr_scores) > 1.5]
# Print iqr_outliers
iqr_outliers
The output of this command is:
[1] 264 335

This means that the values 264 and 335 are outliers according to the IQR method (the z-score method flagged only 335).

Outlier Detection Using Dixon’s test 

Dixon’s test has a limitation: the function only handles sample sizes of 3 to 30. If the sample is larger, it throws an error. So, for this tutorial, we subset the dataset and take only the first 30 observations.

# Load the outliers package
library(outliers)
# Perform Dixon's test for hp (first 30 observations only)
dixon_test <- outliers::dixon.test(mtcars$hp[1:30])
# Print dixon_test
print(dixon_test)
The output of this command is:

Dixon test for outliers

This means that the value 264 is an outlier according to Dixon’s test at a significance level of 0.05. Keep in mind that we only tested the first 30 observations, so this result says nothing about the rest of the dataset (which contains the even more extreme value 335).

Outlier Detection Using Rosner's test

# Perform Rosner's test for hp
rosner_test <- rosnerTest(mtcars$hp, k = 2)
# Print rosner_test
rosner_test
The output of this command is:
Rosner's Test for Outliers

This means that the values of 335 and 264 are outliers at a significance level of 0.05.

How do we remove outliers from data?

Once you have identified the outliers in your dataset, you can remove them before performing any further analysis of the data. There are different ways to remove outliers from data, such as:

Using logical operators and subsetting

You can use logical operators such as <, >, ==, !=, etc., to create a condition that filters out the outliers from your dataset and then uses subsetting to select only the rows that satisfy the condition.

Using the subset() function

You can use the subset() function to create a new dataset that contains only the rows that meet a certain criterion and excludes the outliers.

Using the filter() function from the dplyr package

You can use the filter() function from the dplyr package to create a new dataset that excludes the rows that match a certain condition and keeps the rest of the data.

For example, let’s use these methods to remove the outliers from the hp variable of the mtcars dataset.

# Load the dplyr package
library(dplyr)
# Remove outliers using logical operators and subsetting
mtcars_no_outliers1 <- mtcars[mtcars$hp < 264, ]
mtcars_no_outliers1
# Remove outliers using the subset() function
mtcars_no_outliers2 <- subset(mtcars, hp < 264)
mtcars_no_outliers2
# Remove outliers using the filter() function from the dplyr package
mtcars_no_outliers3 <- filter(mtcars, hp < 264)
mtcars_no_outliers3

Remove outliers using logical operators and subsetting, subset() function, the filter() function from the dplyr package

The output of these commands is a new dataset that contains 30 rows and excludes the two outliers (hp values of 264 and 335).

How to impute missing values and handle categorical variables in a dataset?

Another common issue that you may encounter in your dataset is missing values. Missing values are values that are not recorded or available for some reason. They can be represented by symbols such as NA, NULL, or ?. Missing values can affect the accuracy and validity of your analysis and may introduce bias or errors into your results.

One way to deal with this is to impute them. Imputation is a process of replacing missing values with plausible values based on some criteria or assumptions.

There are different methods, such as:

Mean or median imputation

This method replaces missing values with the mean or median of the variable. It is simple and easy to implement, but it may reduce the variability and distort the distribution of the data.

Mode or most frequent imputation

This method replaces missing values with the mode or most frequent value of the variable. It is suitable for categorical variables, but it may introduce bias and overrepresent some categories.

Regression imputation

This method replaces missing values with predicted values based on a regression model that uses other variables as predictors. It is more sophisticated and realistic, but it may increase the complexity and uncertainty of the model.
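A minimal sketch of regression imputation in base R (the data frame d and its columns a and b are made up for this illustration, not part of the article's dataset):

```r
# Regression imputation sketch (d, a, and b are hypothetical names)
set.seed(1)
d <- data.frame(a = rnorm(50))
d$b <- 2 * d$a + rnorm(50, sd = 0.5)   # b depends on a
d$b[sample(1:50, 5)] <- NA             # knock out five values

fit <- lm(b ~ a, data = d)             # fitted on the complete rows only
miss <- is.na(d$b)
d$b[miss] <- predict(fit, newdata = d[miss, , drop = FALSE])
sum(is.na(d$b))                        # -> 0
```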

K-nearest neighbours (KNN) imputation

This method replaces missing values with the average or weighted average of the k nearest neighbours of the observation based on some distance metric. It is more flexible and adaptive, but it may be computationally expensive and sensitive to outliers.
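To make the idea concrete, here is a hand-rolled, single-column sketch of KNN imputation (the data frame m is made up for illustration; the knnImputation() function mentioned below does this properly, with scaling and distance weighting across all columns):

```r
# Hand-rolled KNN imputation for one missing value
# (m and its columns are hypothetical names)
set.seed(123)
m <- data.frame(a = rnorm(20), b = rnorm(20))
m$a[3] <- NA                            # one missing value to fill

i <- which(is.na(m$a))                  # row with the missing value
obs <- which(!is.na(m$a))               # rows with observed values
d_b <- abs(m$b[obs] - m$b[i])           # distance on the observed column
nn <- obs[order(d_b)][1:5]              # 5 nearest neighbours
m$a[i] <- mean(m$a[nn])                 # impute with their average
sum(is.na(m$a))                         # -> 0
```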

To perform imputation in R, you can use various functions and packages, such as:

  • na_mean() from the imputeTS package (with option = "mean", "median", or "mode") to perform mean, median, or mode imputation
  • mice() from the mice package to perform multiple imputation by chained equations (MICE), which is a general method that can handle different types of variables and models
  • knnImputation() from the DMwR2 package to perform KNN imputation

Imputation of Missing Values in R

For example, let’s use these functions to impute missing values in a simulated dataset that contains numeric and categorical variables.
# Load the imputeTS, mice, and DMwR packages
#install.packages("imputeTS")
library(imputeTS)
#install.packages("mice")
library(mice)
#install.packages("DMwR2")
library(DMwR2)
# Create a simulated dataset with numeric and categorical variables
set.seed(123)
df <- data.frame(
  x = rnorm(100, mean = 50, sd = 10),
  y = sample(c("A", "B", "C"), 100, replace = TRUE),
  z = runif(100, min = 0, max = 100)
)
# Introduce some missing values randomly
df[sample(1:100, 10), "x"] <- NA
df[sample(1:100, 10), "y"] <- NA
df[sample(1:100, 10), "z"] <- NA
# Check which columns contain missing values
colSums(is.na(df))
The output of this code is:

Create a simulated dataset with numeric and categorical variables

As you can see, this dataset has 10 missing values in each of the three variables: x, y, and z.

Impute Missing values

Let’s use the imputeTS package to perform mean imputation for the numeric variables, and a small custom function to perform mode imputation for the categorical variable (imputeTS is built for numeric time series, so it cannot impute factors directly).

# Load the imputeTS package
library(imputeTS)

# Perform mean imputation for x and z variables
df$x <- na_mean(df$x)
df$z <- na_mean(df$z)

# Perform mode imputation for y variable
# Custom function to impute the mode for a vector
impute_mode <- function(x) {
  tab <- table(x)                      # counts of non-missing values
  mode_val <- names(tab)[which.max(tab)]
  x[is.na(x)] <- mode_val
  return(x)
}

# Impute mode for the 'y' variable
df$y <- impute_mode(df$y)
# Print df after imputation
head(df,10)
# Check which columns contain missing values
colSums(is.na(df))
The output of this code is:

mode imputation for the numeric and categorical variables

As you can see, this dataset has no more missing values in any of the variables: x, y, and z.
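The mice package loaded earlier was not needed for this simple example, but for completeness, here is a sketch of how MICE imputation could look on a similar simulated dataset (assuming the mice package is installed; printFlag = FALSE just silences the progress log, and the data frame sim is made up for this illustration):

```r
# MICE imputation sketch (sim is a hypothetical data frame)
library(mice)
set.seed(123)
sim <- data.frame(
  x = rnorm(100, mean = 50, sd = 10),
  z = runif(100, min = 0, max = 100)
)
sim[sample(1:100, 10), "x"] <- NA
imp <- mice(sim, m = 5, printFlag = FALSE)  # builds 5 completed datasets
sim_complete <- complete(imp, 1)            # extract the first one
sum(is.na(sim_complete$x))                  # -> 0
```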

Handle categorical variables in R

Another issue that you may face in your dataset is the presence of categorical variables. Categorical variables are variables that have a finite number of possible values or categories, such as gender, colour, or type of car.

Categorical variables can be either nominal or ordinal.

  • Nominal variables are variables that have no inherent order or ranking among the categories, such as gender or color.
  • Ordinal variables are variables that have a natural order or ranking among the categories, such as education level or satisfaction rating.

One way to deal with categorical variables is to encode them into numeric values that can be used for analysis and modelling.

There are different methods of encoding categorical variables, such as:

  • Label encoding: This method assigns a unique integer value to each category of the variable, starting from zero or one.
  • One-hot encoding: This method creates a new binary variable for each category of the variable, with a value of one if the observation belongs to that category and zero otherwise.
  • Ordinal encoding: This method assigns an integer value to each category of the variable based on the order or ranking of the categories.

Encoding categorical variables in R

To perform encoding in R, you can use various functions and packages, such as:

  • as.numeric() or as.factor() to convert a variable into numeric or factor type, respectively.
  • model.matrix() to create a design matrix with dummy variables for each category of a factor variable.
  • factor() to create an ordered factor variable with specified levels.

For example, let’s use these functions to encode the y variable of the df dataset that we created earlier.

# Encode y variable using label encoding
df$y<-as.factor(df$y)
df$y_label <- as.numeric(df$y) -1
# Encode y variable using one-hot encoding
df$y_onehot <- model.matrix(~ y -1, data = df)
# Encode y variable using ordinal encoding
df$y_ordinal <- factor(df$y, levels = c("A", "B", "C"), ordered = TRUE)
# Print df after encoding
df

Encode y variable using label encoding

I have encoded the y variable using three different methods:

  1. Label Encoding
  2. One-hot encoding
  3. Ordinal encoding

Label encoding

It assigns a unique integer value to each category of the variable, starting from zero or one. For example, category A is encoded as zero, B as one, and C as two.

Label Encoding using R

One-hot encoding

It creates a new binary variable for each category of the variable, with a value of one if the observation belongs to that category and zero otherwise. For example, the category A is encoded as a vector of (1,0,0), B as (0,1,0), and C as (0,0,1).

Ordinal encoding

It assigns an integer value to each category of the variable based on the order or ranking of the categories. For example, category A is encoded as one, B as two, and C as three.

You can see the results of each encoding method in the new columns that I have added to the dataset: y_label, the one-hot matrix column y_onehot (which prints as y_onehot.yA, y_onehot.yB, and y_onehot.yC), and y_ordinal.

Encoding categorical variables can help to transform them into numeric values that can be used for analysis and modelling. However, you should be careful to choose the appropriate method for your data and your purpose.

Advantages and Disadvantages of each method

Some advantages and disadvantages of each method are:

  • Label encoding is simple and easy to implement, but it may imply a false sense of order or magnitude among the categories that may not exist in reality.
  • One-hot encoding is more expressive and avoids the problem of order or magnitude, but it may create a large number of new variables that may increase the dimensionality and sparsity of the data.
  • Ordinal encoding is suitable for ordinal variables that have a natural order or ranking among the categories, but it may not work well for nominal variables that have no inherent order or ranking.

Conclusion

This article shows you how to perform outlier analysis and imputation in R using various methods and functions. You have learned how to identify and remove outliers, how to replace missing values with plausible values, and how to transform categorical variables into numeric values. These steps can help you to improve the quality of your data and prepare it for further analysis and modelling.

If you want to learn more about R programming and data analysis, you can check out our latest R posts on our website: Data Analysis. You can also contact us at info@rstudiodatalab.com or hire us at Order Now if you need any help with your data science or machine learning projects.

Thank you for reading, and happy coding! 

Frequently Asked Questions (FAQs)

What is an outlier? 

An outlier is a data point that is significantly different from the rest of the data and can affect the results of statistical tests and machine learning models.

How can I detect outliers using boxplots? 

A boxplot is a type of plot that shows the distribution of a numeric variable using five summary statistics: minimum, first quartile, median, third quartile, and maximum. The box represents the middle 50% of the data, while the whiskers extend to the most extreme values within 1.5 times the interquartile range (IQR). Any value beyond the whiskers is considered an outlier and is marked with a dot or a circle.

How can I detect outliers using z-scores?

A z-score is a standardized score that measures how many standard deviations a value is away from the mean. A value with a z-score greater than 3 or less than -3 is usually considered an outlier.
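You can check this rule directly on the mtcars data used in this article. Note that no hp value actually exceeds the 3-standard-deviation cut-off, which is why the code earlier in the article used a cut-off of 2:

```r
# z-scores for hp, computed by hand
z <- (mtcars$hp - mean(mtcars$hp)) / sd(mtcars$hp)
mtcars$hp[abs(z) > 3]
# -> numeric(0): no value is more than 3 standard deviations from the mean
```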

How can I remove outliers from a dataset using logical operators and subsetting?

You can use logical operators such as <, >, ==, !=, etc., to create a condition that filters out the outliers from your dataset, and then use subsetting to select only the rows that satisfy the condition. For example, if you want to remove extreme mpg values from the mtcars dataset, you can use this command:

mtcars_no_outliers <- mtcars[mtcars$mpg > 10 & mtcars$mpg < 34, ]

How can I remove outliers from a dataset using the subset() function?

You can use the subset() function to create a new dataset that contains only the rows that meet a certain criterion and exclude the outliers. For example, if you want to remove outliers from the mpg variable of the mtcars dataset, you can use this command:

mtcars_no_outliers <- subset(mtcars, mpg > 10 & mpg < 34)

How can I remove outliers from a dataset using the filter() function from the dplyr package?

You can use the filter() function from the dplyr package to create a new dataset that excludes the rows that match a certain condition and keeps the rest of the data. For example, if you want to remove outliers from the mpg variable of the mtcars dataset, you can use this command:

library(dplyr)
mtcars_no_outliers <- filter(mtcars, mpg > 10, mpg < 34)

What is imputation?

Imputation is a process of replacing missing values with plausible values based on some criteria or assumptions.

How can I impute missing values using mean, median, or mode imputation?

You can use the na_mean() function from the imputeTS package (with option = "mean", "median", or "mode") to perform mean, median, or mode imputation for missing values. For example, if you want to impute missing values in the x variable of the df dataset using mean imputation, you can use this command:

library(imputeTS)
df$x <- na_mean(df$x)

How can I impute missing values using multiple imputations by chained equations (MICE)?

You can use the mice() function from the mice package to perform multiple imputation by chained equations (MICE) for missing values. MICE is a general method that can handle different types of variables and models. Note that mice() returns an object containing several completed datasets, so you extract one with complete(). For example, if you want to impute missing values in the df dataset using MICE, you can use these commands:

library(mice)
imp <- mice(df)
df_imputed <- complete(imp)

How can I encode categorical variables into numeric values?

You can use various methods to encode categorical variables into numeric values, such as label encoding, one-hot encoding, or ordinal encoding. For example, if you want to encode the y variable of the df dataset using one-hot encoding, you can use this command:

df$y_onehot <- model.matrix(~ y -1, data = df)


About the Author

Ph.D. Scholar | Certified Data Analyst | Blogger | Completed 5000+ data projects | Passionate about unravelling insights through data.
