Stepwise Logistic Regression in R: A Complete Guide

Learn stepwise logistic regression in R for streamlined model building: how it works, how to implement it, and best practices.

Key points

  • Stepwise logistic regression is a technique for building a logistic model that iteratively selects or deselects predictors based on their statistical significance.
  • Stepwise logistic regression can reduce model complexity and improve model performance by removing irrelevant or redundant variables; nevertheless, it has significant drawbacks and limitations, such as sensitivity to the order of variable entry, biased estimates, and blindness to interactions or nonlinear effects.
  • Stepwise logistic regression can be performed in R using the stepAIC function from the MASS package, which allows choosing the direction of the stepwise procedure, either "both," "backward," or "forward."
  • Stepwise logistic regression should be interpreted and evaluated using various criteria, such as AIC, deviance, coefficients, p-values, odds ratios, confidence intervals, accuracy, precision, recall, F1-score, ROC curve, AUC, cross-validation, bootstrap, or hold-out test set.
  • Stepwise logistic regression should be used cautiously and supplemented with other variable selection methods, such as domain knowledge, exploratory data analysis, correlation analysis, or regularization techniques.

Hello, this is Zubair Goraya, a data analyst and a writer for Data Analysis, a website that provides tutorials related to RStudio. This article will discuss Stepwise Logistic regression in R, a powerful technique for modeling binary outcomes.


Logistic Regression is a popular method for predicting binary outcomes, such as whether or not a client would purchase a product. 

However, when you have many potential predictors, how do you choose the best ones for your model? 

One way to do this is by using stepwise logistic regression, a procedure that iteratively adds and removes variables based on their statistical significance and predictive power.

In this article, you will learn:

  • What is stepwise logistic regression, and why use it
  • How to perform stepwise logistic regression in R using the stepAIC function
  • How to compare different stepwise methods, such as forward, backward, and both-direction selection
  • How to interpret and evaluate the results of stepwise logistic regression
  • What are the advantages and disadvantages of stepwise logistic regression
  • How to avoid some common pitfalls and challenges of stepwise logistic regression
By the end of this article, you will have a solid understanding of logistic regression in R and how to apply it to your data analysis projects. You will also learn some tricks and tips to improve your logistic regression skills and avoid common pitfalls.

What is Stepwise Logistic Regression, and Why Use It?

Stepwise logistic regression is a variable selection technique that aims to find the optimal subset of predictors for a logistic regression model. It does this by starting with an initial model, either with no predictors (forward selection) or with all predictors (backward elimination), and then adding or removing variables one at a time based on a criterion such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC).
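As a quick illustration of the AIC criterion that drives the search, the sketch below fits a toy logistic model on R's built-in mtcars data (purely illustrative, not part of the diabetes example that follows) and verifies that AIC equals twice the number of estimated parameters minus twice the log-likelihood:

# AIC = 2k - 2*logLik(model), where k is the number of estimated parameters
toy.fit <- glm(am ~ mpg, data = mtcars, family = binomial)
k <- attr(logLik(toy.fit), "df")  # number of estimated parameters
manual.aic <- 2 * k - 2 * as.numeric(logLik(toy.fit))
all.equal(manual.aic, AIC(toy.fit))  # TRUE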

Stepwise logistic regression can avoid overfitting, multicollinearity, and high variance and increase interpretability and generalizability. However, stepwise logistic regression also has some drawbacks and limitations, such as:
  • It can be sensitive to the order of variable entry or removal, which can lead to different final models depending on the starting point and direction of the procedure.
  • Due to the multiple testing and data snooping involved, it can produce biased estimates of the coefficients and standard errors, along with p-values that are too optimistic and confidence intervals that are too narrow.
  • It can ignore meaningful interactions or nonlinear effects among the variables, as well as potential confounding or moderating factors.
  • It can be computationally intensive and time-consuming, especially when dealing with large data sets or many potential predictors.
Therefore, stepwise logistic regression should be used cautiously and supplemented with other variable selection methods, such as domain knowledge, exploratory data analysis, correlation analysis, or regularization techniques.

How to Perform Stepwise Logistic Regression in R using the stepAIC Function

One of the easiest ways to perform stepwise logistic regression in R is using the stepAIC function from the MASS package. This function performs model selection by AIC and allows you to specify the direction of the stepwise procedure, either "both," "backward," or "forward."

To use the stepAIC function, you must have two models: 

  • A base model, which defines the initial set of variables in the procedure 
  • A scope model, which defines the range of variables that can be added to or removed from the base model.

Using Stepwise Logistic Regression to Predict if a Patient Has Diabetes! 

Suppose you want to use stepwise logistic regression to predict whether a patient has diabetes based on several clinical variables. For this purpose, you can use the PimaIndiansDiabetes2 data set from the mlbench package. 

The original data set has 768 rows; after removing rows with missing values (as done below), 392 complete observations remain on 9 variables:

  • diabetes: Factor indicating whether the patient has diabetes (pos) or not (neg)
  • pregnant: Number of times pregnant
  • glucose: Plasma glucose concentration
  • pressure: Diastolic blood pressure
  • triceps: Triceps skin fold thickness
  • insulin: 2-Hour serum insulin
  • mass: Body mass index
  • pedigree: Diabetes pedigree function
  • age: Age in years

Data Loading and Preprocessing

You can load the data set and remove any missing values as follows:

# Load the data and remove rows with missing values
library(mlbench)
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
str(PimaIndiansDiabetes2)  # Inspect the data

Split the data set

Next, you can split the data into training and test sets using the createDataPartition function from the caret package (the dplyr package supplies the %>% pipe used below). This function ensures that the proportion of the outcome variable is preserved in both sets, and setting a random seed makes the split reproducible. If caret is not yet installed, run install.packages("caret") once first. Learn more in "How to Import and Install Packages in R: A Comprehensive Guide."
# Split the data into training and test set
#install.packages("caret")
library(caret)
library(dplyr)
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
dim(train.data)
dim(test.data)

Base and Scope models 

Now, you can define the base and scope models for the stepwise procedure. For the base model, you can use either an intercept-only model or a model with one or more essential or relevant predictors for the outcome. For the scope model, you can use either a complete model with all predictors or a model with a subset of predictors that you want to consider for the procedure.

For example, you can use the following models:

# Define the base model (intercept-only)
base.model <- glm(diabetes ~ 1, data = train.data, family = binomial)
# Define the scope model (full model)
scope.model <- glm(diabetes ~ ., data = train.data, family = binomial)

Perform stepwise logistic regression

Then, you can use the stepAIC function to perform the stepwise logistic regression. You need to specify the base model, the direction of the procedure, and the scope model as arguments. You can also set trace = FALSE to suppress the output of each step.
# Perform stepwise logistic regression
library(MASS)
step.model <- stepAIC(base.model, direction = "both", 
                      scope = scope.model, trace = FALSE)
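One caveat worth knowing: when the scope is supplied as a fitted model whose formula is written as diabetes ~ ., the "." can be re-interpreted relative to the base model, which leaves the forward search with nothing to add. A more explicit sketch, spelling out the upper scope as a formula (assuming you want all eight predictors available), avoids this ambiguity:

# Supply the scope explicitly so the search space is unambiguous
step.model2 <- stepAIC(base.model, direction = "both",
                       scope = list(lower = ~ 1,
                                    upper = ~ pregnant + glucose + pressure + triceps +
                                              insulin + mass + pedigree + age),
                       trace = FALSE)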

Summarize the final selected model

The step.model object contains the final selected model after the stepwise procedure. You can use the summary function to view the details of the model, such as the coefficients, the standard errors, the p-values, and the AIC.
# Summarize the final selected model
summary(step.model)

The summary shows that, in this run, the stepwise procedure added no predictors: the final model contains only the intercept. The intercept estimate of -0.7027 is simply the overall log odds of having diabetes in the training data, and its small p-value (< 0.001) only says that this log odds differs from zero, which is not informative for prediction. The null deviance and residual deviance are both 398.8, confirming that nothing beyond the intercept was fitted, and the Akaike Information Criterion (AIC) is 400.8. An intercept-only model has no predictive power, so this outcome reflects a problem with the search rather than a finding about the data; it is exactly what the scope caveat above describes, and the explicit scope formulation shown there is the safer choice. To build an informative and accurate model for diabetes prediction, the relevant predictors must actually enter the analysis.

How to Compare Different Stepwise Methods, such as Forward, Backward, and Both-Direction Selection

As mentioned earlier, there are different ways to perform stepwise logistic regression, depending on the direction of the procedure. 

The three main methods are:

  1. Forward selection: This method starts with an intercept-only model and adds variables one at a time based on their significance and contribution to the model fit. The procedure stops when no remaining variable would lower the AIC.
  2. Backward elimination: This method starts with a complete model containing all variables and removes them one at a time based on their significance and contribution to the model fit. The procedure stops when removing any further variable would increase the AIC.
  3. Both-direction selection: This method combines forward and backward steps, adding or removing a variable at each step based on its contribution to the model fit. The procedure stops when no single addition or removal would lower the AIC.
To compare these methods, you can use the same base and scope models as before but change the direction argument in the stepAIC function.
For example, the code below performs forward selection and backward elimination. After performing the selection, compare the resulting models using the anova function with test = "Chisq", which carries out a likelihood ratio test between each pair of nested models.
# Perform forward selection
forward.model <- stepAIC(base.model, 
                         direction = "forward", scope = scope.model, trace = FALSE)
# Perform backward elimination
backward.model <- stepAIC(scope.model, 
                          direction = "backward", scope = scope.model, trace = FALSE)
# Compare the forward model and both-direction model
anova(forward.model, step.model, test = "Chisq")


From the output, there is no difference between the two models in deviance or degrees of freedom: forward and both-direction selection arrived at the same model. In this run, that is because neither search added any predictors, for the scope reason discussed above.

Compare AIC values of all three models

You can also compare the AIC values of each model by using the AIC function. For example, you can use the following code to compare the AIC values of all three models:
# Compare AIC values of all three models
AIC(base.model, forward.model, backward.model, step.model)

The AIC values summarize the goodness of fit of the four models. The base, forward, and step models all have a single degree of freedom, meaning they estimate only the intercept, and they share the identical AIC of 400.8003; this again reflects the empty search scope rather than a genuine finding. The backward model, by contrast, retains six predictor variables and achieves a much lower AIC of 279.7859. This marked reduction in AIC indicates a substantially better fit, so in this analysis the backward model is clearly the one to carry forward: it incorporates a more comprehensive set of relevant predictors and should make more accurate predictions of diabetes.

How to Interpret and Evaluate the Results of Stepwise Logistic Regression

Once you have performed stepwise logistic regression and selected a final model, you need to interpret and evaluate the model's results regarding its fit, performance, explanation, and validation.

One way to do this is by using the following steps:

  1. Fit: You can use the summary function to view the details of the model fit, such as the coefficients, standard errors, p-values, AIC, deviance, etc. You can also use the anova function to compare nested models based on their deviance or a likelihood ratio test.
  2. Performance: You can use various metrics to measure the performance of the model on the training data or a test data set, such as accuracy, precision, recall, F1-score, ROC curve, AUC, etc. You can calculate these metrics using functions from packages such as caret or pROC.
  3. Explanation: You can use various methods to explain and interpret the model's results regarding its predictors and the outcome variable, such as odds ratios, confidence intervals, marginal effects, etc. You can calculate these using functions from packages such as broom or margins, as sketched just after this list.
  4. Validation: You can use various methods to validate and generalize the model's results to new or unseen data, such as cross-validation, the bootstrap, or a hold-out test set. You can use functions from packages such as caret or boot to perform these methods.
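As a minimal sketch of the explanation step, the lines below compute odds ratios with profile-likelihood confidence intervals for a fitted logistic model (the backward model is used here because, in this run, the both-direction model contains no predictors):

# Odds ratios and 95% profile-likelihood confidence intervals
or.table <- exp(cbind(OR = coef(backward.model), confint(backward.model)))
round(or.table, 3)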

For example, you can use the following code to evaluate the results of the both-direction model that you selected earlier:

# Performance: Calculate the accuracy, precision, recall, and F1-score of the both-direction model on the test data
library(caret)
# Obtain predicted probabilities on the test set
pred <- predict(step.model, newdata = test.data, type = "response")
# Convert the predicted classes to a factor with the same levels as test.data$diabetes
pred.class <- factor(ifelse(pred > 0.5, "pos", "neg"), levels = levels(test.data$diabetes))
# Create the confusion matrix
cm <- confusionMatrix(pred.class, test.data$diabetes)
cm


The confusion matrix and associated statistics reveal the performance of the classification model for predicting diabetes outcomes. The matrix displays the number of true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP) for the predicted versus actual diabetes classes. A noteworthy concern arises immediately: there are no positive ("pos") predictions at all, meaning the model failed to identify any positive cases, so both TP and FP are zero. 

The absence of positive predictions leaves some class-wise statistics, such as the negative predictive value, undefined. The model's overall accuracy is 0.6667, suggesting that it correctly classifies approximately two-thirds of the cases. Nevertheless, this accuracy should be interpreted cautiously, as it does not provide a complete picture given the model's inability to predict positive cases.

The Kappa statistic is zero, indicating no agreement between the predicted and actual classes beyond what might be expected by chance. This lack of agreement reinforces the model's limitations in capturing meaningful patterns in the data.

Additionally, McNemar's test p-value is very small (9.443e-07), indicating a significant imbalance between the model's false negatives and false positives.

On the other hand, the balanced accuracy is 0.5000, reflecting the same issues as the sensitivity and specificity, as it considers the performance across both classes.

In conclusion, the model's failure to predict positive cases significantly hampers its usefulness in practical applications. The absence of sensitivity and specificity values, combined with the low Kappa statistic, implies that the model is not effectively capturing the underlying patterns of diabetes outcomes. Thus, further refinement of the model or exploration of different predictive algorithms is crucial to improve its performance and make it suitable for accurate diabetes prediction.
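Beyond the confusion matrix, the ROC curve and AUC mentioned above can be computed with the pROC package. A minimal sketch, assuming pROC is installed and using the backward model, since it actually contains predictors:

# ROC curve and AUC on the test data with the pROC package
library(pROC)
pred.back <- predict(backward.model, newdata = test.data, type = "response")
roc.obj <- roc(response = test.data$diabetes, predictor = pred.back,
               levels = c("neg", "pos"))  # "neg" = controls, "pos" = cases
plot(roc.obj)  # draw the ROC curve
auc(roc.obj)   # area under the curve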

What are the Advantages and Disadvantages of Stepwise Logistic Regression?

Stepwise logistic regression has some advantages and disadvantages you should know before using it.

Some of the advantages are:

  • It can reduce the complexity and improve the model's performance by eliminating irrelevant or redundant variables.
  • It can help to avoid overfitting, multicollinearity, and high variance, as well as to increase interpretability and generalizability.
  • It can be easy and fast to implement and automate using functions such as stepAIC.

Some of the disadvantages are:

  • It can be sensitive to the order of variable entry or removal, which can lead to different final models depending on the starting point and direction of the procedure.
  • Due to the multiple testing and data snooping involved in the process, it can produce biased estimates of the coefficients and standard errors, along with p-values that are too optimistic and confidence intervals that are too narrow.
  • It can ignore meaningful interactions or nonlinear effects among the variables and potential confounding or moderating factors.
  • It can be computationally intensive and time-consuming, especially when dealing with large data sets or many potential predictors.
Therefore, stepwise logistic regression should be used cautiously and supplemented with other variable selection methods, such as domain knowledge, exploratory data analysis, correlation analysis, or regularization techniques.
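As one example of such a regularization technique, the sketch below fits a LASSO-penalized logistic regression with the glmnet package on the same training data; this is a sketch under the assumption that glmnet is installed, not a drop-in replacement for the workflow above:

# LASSO-penalized logistic regression as an alternative to stepwise selection
library(glmnet)
x <- model.matrix(diabetes ~ ., data = train.data)[, -1]  # predictor matrix without intercept
y <- train.data$diabetes
cv.fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # alpha = 1 selects the LASSO penalty
coef(cv.fit, s = "lambda.min")  # coefficients at the best cross-validated penalty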

How to Avoid Some Common Pitfalls and Challenges of Stepwise Logistic Regression

Stepwise logistic regression can be a valuable tool for variable selection, but it also comes with pitfalls and challenges that you should avoid or overcome.

Common pitfalls and Challenges:

Appropriate criterion for variable selection

The stepAIC function uses the AIC as the default criterion for variable selection, but you can also use other criteria such as BIC or Cp. The choice of criterion can affect the final model and its performance, so you should compare different criteria and choose the one that best suits your data and problem.
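For example, stepAIC can mimic BIC-based selection by setting its penalty argument k to log(n) instead of the default 2; a minimal sketch using the models defined earlier:

# BIC-style selection: penalize each parameter by log(n) instead of 2
n <- nrow(train.data)
bic.model <- stepAIC(scope.model, direction = "backward", k = log(n), trace = FALSE)
summary(bic.model)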

Assumptions and diagnostics of logistic regression

Stepwise logistic regression is based on logistic regression, which has some assumptions and diagnostics you should check before and after performing it. For example, you should check for linearity in the logit, independence of errors, absence of multicollinearity, outliers, influential points, etc. You can use functions from packages such as car or ggplot2 to perform these checks.
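A minimal sketch of two such checks, assuming the car package is installed and using the backward model from earlier:

# Variance inflation factors to check multicollinearity (values above ~5-10 are a concern)
library(car)
vif(backward.model)
# Cook's distance plot to flag influential observations
plot(backward.model, which = 4)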

Validating and generalizing the results

Stepwise logistic regression can produce a model that fits the training data well but generalizes poorly to new or unseen data. Therefore, you should validate your results using cross-validation, the bootstrap, or a hold-out test set, and report them with appropriate measures of uncertainty, such as standard errors, confidence intervals, or prediction intervals.
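For instance, a 10-fold cross-validation of the final model can be sketched with caret's train function; the predictors in the formula below are hypothetical placeholders, so substitute the terms of your own selected model:

# 10-fold cross-validation of a logistic regression with caret
set.seed(123)
cv.model <- train(diabetes ~ glucose + mass + pedigree + age,  # hypothetical subset; use your selected terms
                  data = train.data, method = "glm", family = binomial,
                  trControl = trainControl(method = "cv", number = 10))
cv.model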

FAQs

What is the difference between forward and backward stepwise regression?

Forward stepwise regression starts with an intercept-only model and adds variables one at a time based on their significance and contribution to the model fit. Backward stepwise regression starts with a full model and removes variables one at a time based on their significance and contribution to the model fit.

What is the advantage of both-direction stepwise regression?

Both-direction stepwise regression combines forward and backward steps, adding and removing variables based on their significance and contribution to the model fit. This more flexible method can explore more possible models than forward or backward stepwise regression alone.

How to choose the best criterion for variable selection in stepwise logistic regression?

The best criterion for variable selection depends on the data and the problem. Standard criteria include AIC, BIC, and Mallows's Cp. AIC tends to select more variables than BIC or Cp, leading to more complex models that fit the training data more closely but are less parsimonious. BIC and Cp tend to select fewer variables, leading to simpler, more parsimonious models that may sacrifice some fit. You should compare different criteria and choose the one that best suits your data and problem.

How to check the assumptions and diagnostics of logistic regression before and after performing stepwise logistic regression?

Logistic regression has some assumptions and diagnostics that you should check before and after performing stepwise logistic regression. For example, you should check for linearity in the logit, independence of errors, absence of multicollinearity, outliers, influential points, etc. You can use functions from packages such as car or ggplot2 to perform these checks.

How do we validate and generalize the results of stepwise logistic regression to new or unseen data?

You can use various methods to validate and generalize the results of stepwise logistic regression to new or unseen data. Cross-validation splits the data into k folds, training on k-1 folds and testing on the remaining fold in turn. The bootstrap resamples the data with replacement, training on each resample and testing on the observations left out. A hold-out test set splits the data into training and test sets once and uses them for training and evaluation, respectively. You should report your results with appropriate measures of uncertainty, such as standard errors, confidence intervals, or prediction intervals.

Conclusion

In this article, you learned:
  • What is stepwise logistic regression, and why use it
  • How to perform stepwise logistic regression in R using the stepAIC function
  • How to compare different stepwise methods, such as forward, backward, and both-direction selection
  • How to interpret and evaluate the results of stepwise logistic regression
  • What are the advantages and disadvantages of stepwise logistic regression
  • How to avoid some common pitfalls and challenges of stepwise logistic regression
We hope that this article has helped you understand and apply stepwise logistic regression in R. If you have any questions or feedback, please contact us at info@rstudiodatalab.com; if you are stuck with code, join our community or comment on this post. 
You can also hire us for your data analysis projects by filling out this form: Get a Quote.

About the author

Zubair Goraya
Ph.D. Scholar | Certified Data Analyst | Blogger | Completed 5000+ data projects | Passionate about unravelling insights through data.
