Shapiro-Wilk Normality Test | shapiro.test in R

Master the Shapiro-Wilk test for normality in R with our step-by-step guide. Learn to perform the shapiro-wilk test and analyze your data effectively!

Are you confident in your data analysis? 

The Shapiro-Wilk test in R is a standard way to check whether your data are consistent with a normal distribution, but how well do you understand its mechanics and implications? Knowing how the test works, and where it falls short, can make your research findings more reliable. This guide walks through the test step by step.

The Shapiro-Wilk test is a widely utilized statistical method for assessing the normality of data distributions. It is used in various fields due to its robustness and effectiveness, especially when dealing with small sample sizes. The test was developed by Samuel Shapiro and Martin Wilk in 1965 and has since become a standard tool in statistical analysis. 


The primary function of the Shapiro-Wilk test is to evaluate the null hypothesis that a given sample is drawn from a normally distributed population. This is crucial in many statistical analyses, as many parametric tests assume the normality of the data. In practical applications, the Shapiro-Wilk test is often employed in studies that require validation of the normality assumption before conducting further statistical analyses.

# Perform the Shapiro-Wilk test
shapiro.test(mtcars$mpg)

Application of Shapiro-Wilk test

In a study of wine tourism apps, researchers used the Shapiro-Wilk test to confirm that their data were normally distributed before running the subsequent chi-square analysis [1]. In a study of water quality in a river system, the test was part of the checks that the assumptions of normality and homogeneity of variances were met before other statistical methods were applied [2]. The test is particularly useful with small samples, where it has been shown to outperform other normality tests.

The Shapiro-Wilk test is a statistical test used to determine if a sample of data comes from a normal distribution.

Samuel Shapiro and Martin Wilk

For example, Baharum emphasizes that the Shapiro-Wilk test is suitable for sample sizes below 50, making it a preferred choice in many research scenarios [3]. Studies have also shown that the test effectively detects deviations from normality, especially in small datasets [4, 5]. Moreover, the test is not limited to univariate data: recent work has extended it to assess multivariate normality, which is crucial in complex data analyses. This adaptability makes it a valuable tool in both academic and practical research settings.

Shapiro-Wilk test using software

In terms of statistical software implementation, the Shapiro-Wilk test is readily available in various statistical packages, including R, SPSS, and Python, facilitating its use among researchers. The ease of access to this test through software tools has contributed to its widespread adoption in empirical research.

For example, in veterinary studies, the Shapiro-Wilk test is frequently applied to assess the normality of data distributions before conducting further analyses [7].  Interpreting the Shapiro-Wilk test results involves examining the test statistic (W) and the associated p-value. A p-value less than the significance level (commonly set at 0.05) indicates a rejection of the null hypothesis, suggesting that the data do not follow a normal distribution. This interpretation is critical for researchers, as it guides the choice between parametric and non-parametric statistical methods.


For instance, in a study comparing the performance of various normality tests, the Shapiro-Wilk test was highlighted for its superior power in detecting non-normality compared to other tests such as the Kolmogorov-Smirnov test [8]. Furthermore, the Shapiro-Wilk test is often complemented by graphical methods such as Q-Q plots and histograms, which provide visual insight into the data distribution. These visual tools, alongside the Shapiro-Wilk test, make normality assessments more robust and allow researchers to make more informed decisions about their data analysis strategies [9].

| Normality Test | Parametric/Non-parametric | Best for Sample Size | Sensitivity | Strengths | Weaknesses | Suitable Data Type | Test Purpose |
|---|---|---|---|---|---|---|---|
| Shapiro-Wilk Test | Parametric | < 5000 | High | Accurate for small datasets | Sensitive to large sample sizes | Numeric | Testing if data is normally distributed |
| Kolmogorov-Smirnov Test | Non-parametric | Large | Medium | Works for any distribution | Less powerful for small datasets | Numeric | Comparing data to any theoretical distribution |
| Anderson-Darling Test | Non-parametric | Small to Medium | High | More weight in the tails | More complex calculation | Numeric | Testing normality, focus on tails |
| Lilliefors Test | Non-parametric | Medium | Medium | Extension of KS test | Assumes mean and variance unknown | Numeric | Testing normality for unknown parameters |
| Jarque-Bera Test | Parametric | Large | Low | Easy to use | Not suitable for small datasets | Numeric | Testing skewness and kurtosis for normality |
| D'Agostino's K-squared Test | Parametric | Medium to Large | Medium | Tests skewness and kurtosis | Requires large sample size | Numeric | Testing normality through skewness and kurtosis |
| Cramer-von Mises Test | Non-parametric | Small to Medium | High | Alternative to Anderson-Darling | Less common | Numeric | Testing goodness-of-fit |
| Pearson's Chi-Square Test | Non-parametric | Large | Medium | Works with categorical data | Requires binning for continuous data | Categorical/Numeric | Testing goodness-of-fit |

Understanding the Shapiro-Wilk Test

The Shapiro-Wilk test checks whether your data are consistent with a normal distribution. Many statistical tests, like t-tests or ANOVA, assume approximate normality (strictly, of the residuals or of the values within each group) to work properly. Using the Shapiro-Wilk test helps you check whether your data meet this assumption.

Null and Alternative Hypotheses

The Shapiro-Wilk test is based on two hypotheses. 

Null hypothesis (H0): the data are drawn from a normally distributed population, meaning the sample shows no significant departure from a normal curve.

Alternative hypothesis (H1): the data are not drawn from a normally distributed population.

You decide between the hypotheses by looking at the p-value. If the p-value is less than 0.05, you reject the null hypothesis and conclude that the data are not normally distributed. If the p-value is greater than 0.05, you do not reject the null hypothesis: the data are consistent with a normal distribution, although this does not prove normality. Understanding these hypotheses will help you determine whether your data are ready for other statistical tests that assume a normal distribution.
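
The decision rule can be sketched in a few lines of R. This is a minimal illustration, assuming a significance level of 0.05; the names `alpha` and `decision` are chosen here for the example and are not part of any package.

```r
# Run the test on the mpg column of the built-in mtcars dataset
res <- shapiro.test(mtcars$mpg)

# Significance level (an assumption; 0.05 is a common convention)
alpha <- 0.05

# Compare the p-value with alpha to choose between the hypotheses
decision <- if (res$p.value < alpha) {
  "Reject H0: evidence that the data are not normal"
} else {
  "Fail to reject H0: no evidence against normality"
}

print(decision)
```

Keep in mind that failing to reject H0 is not proof of normality; it only means this sample gave no strong evidence against it.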

Test Statistic (W) and P-Value

The Shapiro-Wilk test gives you the W statistic, which ranges between 0 and 1. The closer W is to 1, the more closely your data resemble a normal distribution. Along with the W statistic, you also get a p-value that helps you decide whether the departure from normality is statistically significant. If the p-value is above 0.05, you do not reject the null hypothesis, which means the data are consistent with normality. If the p-value is below 0.05, the data deviate significantly from normality, and you might need to consider other ways to analyze them. By looking at both the W statistic and the p-value, you can understand how close your data are to being normally distributed.
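
To see how W and the p-value behave, it can help to compare a sample that really is normal with one that clearly is not. The sketch below uses simulated data; the sample size, distributions, and seed are illustrative assumptions, not values from any study.

```r
set.seed(123)  # fixed seed so the simulated samples are reproducible

# Two simulated samples of the same size (illustrative choices)
normal_sample <- rnorm(500, mean = 10, sd = 2)  # drawn from a normal distribution
skewed_sample <- rexp(500, rate = 1)            # strongly right-skewed

res_normal <- shapiro.test(normal_sample)
res_skewed <- shapiro.test(skewed_sample)

# The W statistic and p-value are stored in the returned object
w_values <- c(normal = unname(res_normal$statistic),
              skewed = unname(res_skewed$statistic))
p_values <- c(normal = res_normal$p.value,
              skewed = res_skewed$p.value)

print(round(w_values, 3))
print(signif(p_values, 3))
```

You should see W noticeably further from 1 for the skewed sample, together with a very small p-value.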


Performing the Shapiro-Wilk Test in R

Load Necessary Packages

You can run the Shapiro-Wilk test with the shapiro.test() function, which is part of R's stats package. This package ships with R and is attached by default in every session, so you do not need to install anything extra. Loading it explicitly is optional, but harmless:
# Load the stats package
library(stats)
The stats package is an essential part of R and is used for many kinds of statistical analysis. It helps you run different tests, make models, and visualize data.

Prepare Your Data for the Test

Before running the Shapiro-Wilk test, make sure your data is in a numeric vector format. We can use R's built-in dataset, mtcars, a popular dataset used by data scientists for regression analysis in R, and focus on the miles per gallon (mpg) column. Getting your data ready in the right format is important for obtaining correct results.
# Load the mtcars dataset
mtcars_data <- mtcars
# Select the mpg variable
mpg_data <- mtcars_data$mpg

The mtcars dataset includes various car statistics, making it a good example. Preparing your data well helps ensure the test results are accurate and easy to understand.
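
If your own data are messier than mtcars, a small check before the test can save you from confusing errors. The helper below is a hypothetical sketch (the name `prepare_for_shapiro` is made up for this example); it assumes you want to drop missing values and stop early when the input cannot be tested.

```r
# Hypothetical helper: checks a vector before it is passed to shapiro.test()
prepare_for_shapiro <- function(x) {
  if (!is.numeric(x)) {
    stop("x must be a numeric vector")
  }
  x <- x[!is.na(x)]  # drop missing values
  n <- length(x)
  if (n < 3 || n > 5000) {
    stop("shapiro.test() needs between 3 and 5000 non-missing values")
  }
  if (length(unique(x)) == 1) {
    stop("all values are identical, so the test cannot be computed")
  }
  x
}

# Example: the mpg column with one missing value added for illustration
mpg_with_na <- c(mtcars$mpg, NA)
clean_mpg <- prepare_for_shapiro(mpg_with_na)
length(clean_mpg)
```

The cleaned vector can then be passed straight to shapiro.test().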


Apply the Shapiro-Wilk Test

To use the Shapiro-Wilk test, you call the shapiro.test() function on your dataset. Here's an example using the mpg column from the mtcars dataset.
# Perform the Shapiro-Wilk test
shapiro_test_result <- shapiro.test(mpg_data)
# Display the results
print(shapiro_test_result)
This code will give you a W statistic and a p-value to help you see if the mpg values are normally distributed. By understanding these numbers, you can decide if the data is ready for more analysis or if adjustments are needed.
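
When you need to check several variables at once, you can apply the same function to each column. The following is a minimal sketch; the three columns are picked only for illustration.

```r
# Columns chosen for illustration
cols <- c("mpg", "hp", "wt")

# Apply shapiro.test() to each column and keep W and the p-value
results <- t(sapply(mtcars[cols], function(x) {
  res <- shapiro.test(x)
  c(W = unname(res$statistic), p_value = res$p.value)
}))

print(round(results, 4))
```

Each row holds the result for one variable. If you test many variables this way, remember that some p-values will fall below 0.05 by chance alone.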

Limitations of the Shapiro-Wilk Test

One of the biggest challenges with the Shapiro-Wilk test is that it is sensitive to the size of your dataset. If your dataset is large, even minor departures from normality can result in a low p-value.


The p-value might make you think your data are non-normal when the differences are too small to matter in practice. On the other hand, with a very small dataset the test has little power to detect non-normality, and R's shapiro.test() does not run at all with fewer than 3 or more than 5000 non-missing values. Because of this, you should be careful when using this test and consider the size of your dataset when interpreting the results.
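
A short simulation makes the sample-size effect concrete. This is only a sketch: the Student's t distribution with 10 degrees of freedom stands in for "nearly normal" data, and the seed and sample sizes are arbitrary choices.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Mildly heavy-tailed data: Student's t with 10 degrees of freedom
x_large <- rt(5000, df = 10)
x_small <- x_large[1:30]  # a small subset of the same data

p_large <- shapiro.test(x_large)$p.value
p_small <- shapiro.test(x_small)$p.value

# Outside the supported range shapiro.test() stops with an error,
# which is captured here as a message instead of halting the script
too_few  <- tryCatch(shapiro.test(c(1.2, 3.4)),
                     error = function(e) conditionMessage(e))
too_many <- tryCatch(shapiro.test(rnorm(5001)),
                     error = function(e) conditionMessage(e))

print(c(large_sample = p_large, small_sample = p_small))
print(too_few)
print(too_many)
```

The same underlying distribution can look "significantly non-normal" in the large sample while the small subset often passes, which is why the p-value should always be read together with the sample size.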


Recommendations to Address Limitations

To deal with the limitations of the Shapiro-Wilk test, you can use visual tools in addition to the test: 

  • Histograms 
  • Q-Q plots 
These visual tools give you a better look at how your data is distributed and make it easier to understand what is happening. 

Histogram for Normality test

A histogram allows you to see the overall shape of your data distribution and quickly spot deviations from normality. 

# Histogram of mpg
hist(mpg_data, main="Histogram of MPG", xlab="Miles per Gallon", col="lightblue")
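
A histogram is easier to judge against normality when a normal curve is drawn on top of it. The sketch below is one way to do that, assuming you are happy to use the sample mean and standard deviation for the reference curve.

```r
mpg <- mtcars$mpg

# Draw the histogram on the density scale so a density curve can be overlaid
h <- hist(mpg, freq = FALSE, main = "Histogram of MPG with Normal Curve",
          xlab = "Miles per Gallon", col = "lightblue")

# Overlay a normal density using the sample mean and standard deviation
curve(dnorm(x, mean = mean(mpg), sd = sd(mpg)),
      add = TRUE, col = "red", lwd = 2)
```

Bars that sit far above or below the red curve point to where the data depart from a normal shape.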

Q-Q plot for Normality test

A Q-Q plot (quantile-quantile plot) is another useful visual method that compares the quantiles of your dataset to the quantiles of a normal distribution. If the points in the Q-Q plot fall close to a straight line, your data are likely to be approximately normally distributed.
# Q-Q plot of mpg
qqnorm(mpg_data)
qqline(mpg_data, col="red")
Using histograms and Q-Q plots helps you visually verify the results of the Shapiro-Wilk test. If your dataset is very large, these visual methods can help you determine if any deviations are significant or simply due to sample size. Combining the Shapiro-Wilk test with these visual tools ensures that your assessment of normality is well-rounded and more reliable.

You can also support your results using other normality tests, like the Kolmogorov-Smirnov test (ks.test() in base R) or the Anderson-Darling test (available through add-on packages such as nortest). Using a mix of these methods gives you a more complete view of your data and helps ensure your findings are accurate.
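
As a rough sketch of how a second test can sit next to Shapiro-Wilk, the code below runs ks.test() from base R on the same variable. It assumes the normal distribution's mean and standard deviation are estimated from the sample, which is convenient but means the Kolmogorov-Smirnov p-value is not strictly valid. The Anderson-Darling part runs only if the add-on package nortest happens to be installed.

```r
mpg <- mtcars$mpg

# Kolmogorov-Smirnov test against a normal distribution whose mean and sd
# are estimated from the same sample. Caveat: estimating the parameters
# this way makes the reported p-value only approximate.
# mpg contains tied values, so ks.test() warns about ties; the warning
# is suppressed here to keep the output short.
ks_res <- suppressWarnings(
  ks.test(mpg, "pnorm", mean = mean(mpg), sd = sd(mpg))
)
sw_res <- shapiro.test(mpg)

p_values <- c(shapiro_wilk = sw_res$p.value,
              kolmogorov_smirnov = ks_res$p.value)

# Optional: Anderson-Darling test, only if the nortest package is available
if (requireNamespace("nortest", quietly = TRUE)) {
  p_values["anderson_darling"] <- nortest::ad.test(mpg)$p.value
}

print(round(p_values, 4))
```

If the tests disagree, look at the Q-Q plot before deciding which result to trust.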

Conclusion

The Shapiro-Wilk test is a helpful tool for checking if your data is normally distributed, which is essential for many types of statistical analysis. By understanding the null and alternative hypotheses, using the suitable R functions, and recognizing the limitations of the test, you can ensure your data analysis is accurate. Using visual tools alongside the Shapiro-Wilk test is also helpful to get the best picture of your data. This way, you can be confident that your data meets the requirements for more advanced tests and that your analysis will be accurate and meaningful.

Frequently Asked Questions

How can you use the Shapiro-Wilk test in R to check if your data is suitable for further statistical analysis?

To use the Shapiro-Wilk test in R, call the shapiro.test() function on a numeric vector. It will produce a W statistic and a p-value. If the p-value is greater than 0.05, there is no significant evidence against normality, so the data can reasonably be used with tests such as ANOVA or t-tests that assume normality.

What are the main differences between the Shapiro test in R and the Kolmogorov-Smirnov test in R for testing normality?

The Shapiro test in R is generally better for smaller datasets (n < 5000) and specifically tests for normality. In contrast, the Kolmogorov-Smirnov test compares your data to any distribution and works better for larger datasets but is less sensitive for small sample sizes when testing for normality.

Why is it essential to use the Shapiro-Wilk normality test before applying statistical tests like ANOVA?

ANOVA assumes that the residuals (the values within each group, after removing the group mean) are approximately normally distributed. The Shapiro-Wilk normality test helps you check this assumption, which supports the conclusions you draw from an ANOVA by reducing the chances of incorrect interpretations.
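
Because the ANOVA assumption is about the residuals, one common approach is to fit the model first and then test its residuals. The example below is a minimal sketch using mtcars; the model (mpg by number of cylinders) is chosen only for illustration.

```r
# One-way ANOVA: does mpg differ by number of cylinders? (illustrative model)
fit <- aov(mpg ~ factor(cyl), data = mtcars)

# The normality assumption concerns the residuals, so test those
res_check <- shapiro.test(residuals(fit))
print(res_check)

# A Q-Q plot of the residuals complements the test
qqnorm(residuals(fit))
qqline(residuals(fit), col = "red")
```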

When should you consider using a different normality test in R other than the Shapiro-Wilk test?

R's shapiro.test() accepts at most 5000 observations, and even below that limit the test becomes very sensitive in large samples, producing significant p-values for minor deviations. For larger datasets, consider the Kolmogorov-Smirnov test or the Anderson-Darling test, together with visual checks such as Q-Q plots.

How can visual tools, such as histograms and Q-Q plots, support the results of the Shapiro-Wilk test for normality?

Visual tools like histograms and Q-Q plots help you visually inspect your data's distribution. Histograms in R programming can be useful for visualizing the data's distribution and showing its overall shape. Q-Q plots help determine how closely your data fits a normal distribution. Using these tools and the Shapiro-Wilk test for normality allows for a more comprehensive assessment.

What key factors affect the Shapiro-Wilk test interpretation, and how do you determine if your data is normal?

The interpretation of the Shapiro-Wilk test depends on the p-value. The data can be treated as consistent with normality if the p-value is greater than 0.05. Sample size also affects interpretation: very small datasets may give non-significant results simply because the test lacks power, while very large datasets can make the test overly sensitive to trivial deviations. Therefore, always consider the context, sample size, and visual plots alongside the Shapiro-Wilk test.

In which scenarios would the Shapiro-Wilk test in R be more suitable than the Kolmogorov-Smirnov test?

The Shapiro-Wilk test in R is more suitable for smaller datasets and is specifically designed to test for normality. It is more powerful and reliable for this purpose than the Kolmogorov-Smirnov test, which is better for larger datasets or for testing against distributions other than the normal.

How do you perform the Shapiro-Wilk test in R, and what does the output tell you about your dataset?

To perform the Shapiro-Wilk test in R, use shapiro.test() with your data. The output includes a W statistic and a p-value. A p-value greater than 0.05 means you cannot reject the null hypothesis, suggesting that your data are consistent with a normal distribution. If it is below 0.05, the data deviate significantly from normality.

What is the relationship between the Shapiro-Wilk test and other normality tests when determining the normality of data?

The Shapiro-Wilk test is considered one of the most powerful tests for small sample sizes. Other normality tests, like the Kolmogorov-Smirnov or Anderson-Darling tests, can support the Shapiro-Wilk test findings or verify normality in larger datasets. Using multiple tests provides a fuller picture of your data’s distribution.

How do you determine if your dataset is ready for ANOVA by using the Shapiro-Wilk test in R and interpreting the results correctly?

To determine if your dataset is ready for ANOVA, run the Shapiro-Wilk test in R using shapiro.test(). If the p-value exceeds 0.05, there is no significant evidence against normality, and the data can be treated as suitable for ANOVA. Additionally, visual inspections like histograms or Q-Q plots should be used to confirm that the data reasonably fit a normal distribution.

Reference:
[1] D. Dimitrovski, V. Joukes, S. Rachão, & M. Tibério, "Wine tourism apps as wine destination branding instruments: content and functionality analysis," Journal of Hospitality and Tourism Technology, vol. 10, no. 2, p. 136-152, 2019. https://doi.org/10.1108/jhtt-10-2017-0115

[2] K. Nyakeya, "Trends in water quality in a tropical kenyan river-estuary system: responses to anthropogenic activities", Asian Journal of Biology, vol. 20, no. 6, p. 34-51, 2024. https://doi.org/10.9734/ajob/2024/v20i6413

[3] Z. Baharum, "The critical factors for built-up edge formation in stainless steel milling", International Journal of Advanced Trends in Computer Science and Engineering, vol. 9, no. 1.4, p. 282-288, 2020. https://doi.org/10.30534/ijatcse/2020/4291.42020

[4] A. Owusu, M. Asare, & R. Owusu, "Using gis to understand cervical cancer screening behaviors among women living with HIV (with) in Ghana," Asian Pacific Journal of Environment and Cancer, vol. 5, no. 1, p. 17-23, 2022. https://doi.org/10.31557/apjec.2022.5.1.17-23

[5] R. Rahmalia, "Digital transformation on financial performance: unleashing corporate excellence through mobile banking adoption in Malaysia's public listed banks", International Journal of Academic Research in Business and Social Sciences, vol. 14, no. 1, 2024. https://doi.org/10.6007/ijarbss/v14-i1/20576

[6] Z. Meng and Z. Jiang, "Cauchy combination omnibus test for normality," Plos One, vol. 18, no. 8, p. e0289498, 2023. https://doi.org/10.1371/journal.pone.0289498

[7] R. Evans, "Verifying model assumptions and testing normality," Veterinary Surgery, vol. 53, no. 1, p. 17-17, 2023. https://doi.org/10.1111/vsu.14034

[8] A. Jo, G. Bm, & F. George, "Performances of several univariate tests of normality: an empirical study", Journal of Biometrics & Biostatistics, vol. 07, no. 04, 2016. https://doi.org/10.4172/2155-6180.1000322

[9] R. Souza, "Teaching descriptive statistics and hypothesis tests measuring water density", Journal of Chemical Education, vol. 100, no. 11, p. 4438-4448, 2023. https://doi.org/10.1021/acs.jchemed.3c00402

Session info:

sessionInfo()

R version 4.4.1 (2024-06-14 ucrt)

Platform: x86_64-w64-mingw32/x64

Running under: Windows 11 x64 (build 22631)

Matrix products: default

tzcode source: internal

attached base packages:

[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):

[1] compiler_4.4.1 tools_4.4.1




About the author

Zubair Goraya
Ph.D. Scholar | Certified Data Analyst | Blogger | Completed 5000+ data projects | Passionate about unravelling insights through data.
