Logistic Regression with Categorical Data in R

Learn how to handle categorical variables in logistic regression in R, including encoding, interpretation, and current best practices.

Key points

  • Logistic regression is a statistical technique for modeling binary outcomes as a function of one or more explanatory variables, which can be either continuous or categorical.
  • Categorical variables have a finite number of possible values, such as gender, color, or country. They can be either nominal or ordinal and either binary or multi-level.
  • To perform logistic regression in R with categorical variables, each categorical variable is represented by dummy variables, one for each level except a reference level. A dummy variable is a binary variable that takes the value 1 if the observation belongs to a certain level and 0 otherwise. R creates these automatically when a factor is used in a model formula.
  • We can use the glm function with the family = binomial argument to fit a logistic regression model in R. The glm function returns a model object; calling summary on it shows the estimated coefficients, their standard errors, z-values, and p-values, as well as the model fit statistics, such as the deviance, the AIC, and the number of iterations.
  • We can use the predict function with the type = "response" argument to predict probabilities for new observations. For glm models, predict has no type = "class" option, so classifications are obtained by comparing the predicted probabilities with a cutoff value (usually 0.5).
  • We can use various metrics and methods to assess the model performance, such as the confusion matrix and ROC curve. A confusion matrix shows how many observations are correctly or incorrectly classified by our model. A ROC curve shows the trade-off between sensitivity and specificity for different cutoff values.

Libraries and functions

  1. The model.matrix function creates a design matrix for a given formula. A design matrix is a matrix that contains all the predictors and their interactions for a model. It can be used to create dummy variables for categorical predictors.
  2. The glm function fits a generalized linear model for a given formula, family, and data. It can fit a logistic regression model with a family = binomial argument.
  3. The predict function predicts the values of the outcome variable for new observations based on a fitted model object. For a glm model, type = "response" returns predicted probabilities; classifications are then obtained by applying a cutoff to those probabilities.
  4. The caret package provides various functions and tools for machine learning and model evaluation. Its confusionMatrix function creates a confusion matrix. The ROC curve is created with the roc function from the pROC package.

Logistic Regression with Categorical Variables in R

Logistic regression (also called logit regression) is a popular statistical technique for modeling binary outcomes, such as yes/no, success/failure, or positive/negative. It allows us to estimate the probability of an event occurring as a function of one or more explanatory variables, which can be either continuous or categorical.

What are Categorical Variables?

Categorical variables are variables that take one of a finite number of possible values, such as gender, color, or country. They can be either nominal or ordinal:

  • Nominal variables have no inherent order, such as gender or color.
  • Ordinal variables have a natural order, such as education level or income group.

Categorical variables can also be either binary or multi-level. Binary variables have only two possible values, such as yes/no or male/female. Multi-level variables have more than two possible values, such as color or country.
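In R, categorical variables are usually stored as factors. The following is a minimal sketch with made-up values (the vectors color, education, and student are hypothetical and not taken from any data set) showing how nominal, ordinal, and binary variables can be represented:

```r
# Nominal variable: levels have no inherent order
color <- factor(c("red", "green", "blue", "green"))
levels(color)     # levels are sorted alphabetically by default
nlevels(color)    # number of distinct levels

# Ordinal variable: supply the order of the levels explicitly
education <- factor(
  c("high school", "bachelor", "master", "bachelor"),
  levels = c("high school", "bachelor", "master"),
  ordered = TRUE
)
is.ordered(education)    # checks whether the factor is ordered
education < "master"     # comparisons respect the level order

# Binary variable: a factor with exactly two levels
student <- factor(c("No", "Yes", "No", "No"))
nlevels(student)
```

Storing an ordinal variable as an ordered factor keeps the ranking of its levels, while a plain factor treats all levels as unordered labels.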

Why Do We Need Dummy Variables?

When we perform logistic regression with categorical variables, we need to convert them into numerical values that can be used in the model. This is because the logit model expresses the log odds of the outcome as a linear combination of numeric predictors, and category labels such as "red" or "blue" cannot be multiplied by a coefficient.

One way to do this is to create a dummy variable for each level except for one reference level. A dummy variable is a binary variable that takes the value 1 if the observation belongs to a certain level and 0 otherwise.

For example, if we have a color with three levels: red, green, and blue, we can create two dummy variables: red and green. The reference level is blue, meaning the observation is blue if red and green are 0.

The advantage of using dummy variables is that they allow us to estimate the effect of each level of the variable on the outcome relative to the reference level. The disadvantage is that they increase the number of predictors in the model, which can lead to overfitting or multicollinearity issues.
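The color example can be sketched in a few lines of R. This is only an illustration with a made-up vector (color is hypothetical); it uses relevel to state the reference level explicitly and model.matrix to build the dummy columns:

```r
# Hypothetical color variable with three levels
color <- factor(c("red", "green", "blue", "green"))

# Make blue the reference level (it is already first alphabetically,
# but relevel() states the choice explicitly)
color <- relevel(color, ref = "blue")

# Design matrix: an intercept plus one dummy column per non-reference level
dummies <- model.matrix(~ color)
colnames(dummies)
dummies
```

The third observation is blue, so both dummy columns are 0 in its row; the reference level never gets a column of its own.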

Before We Start:

Load the required packages and data

library(ISLR) # Package that provides the Default data set
data(Default) # Load the data set
head(Default) # First six rows of the data

How to Create Dummy Variables in R?

There are several ways to create dummy variables in R. One is to use the model.matrix function, which creates a design matrix for a given formula. A design matrix is a matrix that contains all the predictors and their interactions for a model.

For example, the Default data frame has two categorical variables, default and student. We can create dummy variables for them using the following code:

dummy <- model.matrix(~ default + student, data = Default) # Create the dummy variables
head(dummy) # View the first six rows of the dummy matrix

The output will look something like this:

[Figure: Create Dummy Variables in R]

The first column is the intercept term, always equal to 1. The second column, defaultYes, is the dummy variable for default, which takes 1 if default is Yes and 0 otherwise; the reference level for default is No. The third column, studentYes, is built the same way for student, with No as the reference level.

We can also specify interactions between categorical variables using the * operator in the formula. For example, if we want to include an interaction between default and student, we can use the following code:

# Create dummy variables for default and student with interaction
dummy <- model.matrix(~ default * student, data = Default)
# View the first six rows of the dummy matrix
head(dummy)

The output will look something like this:

[Figure: Interactions between categorical variables]
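To see what the interaction column contains without loading any data, here is a small sketch on a made-up grid of the four possible combinations (the data frame toy is hypothetical). For two dummy-coded factors, the interaction column is simply the product of the two dummy columns:

```r
# All four combinations of two binary factors (hypothetical data)
toy <- expand.grid(
  default = factor(c("No", "Yes")),
  student = factor(c("No", "Yes"))
)

# a * b in a formula expands to a + b + a:b
X <- model.matrix(~ default * student, data = toy)
colnames(X)

# The interaction column equals the product of the two dummy columns
interaction_col <- X[, "defaultYes:studentYes"]
product_col <- X[, "defaultYes"] * X[, "studentYes"]
all(interaction_col == product_col)
```

The interaction dummy is therefore 1 only for observations that are Yes on both variables.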

How to Fit a Logistic Regression Model in R?

Once we have created the dummy variables, we can fit a logistic regression model using the glm function in R. (glm also creates the dummy variables itself when a factor is used directly in the formula.) glm stands for generalized linear model; the function can fit various models, such as linear regression, Poisson regression, or logistic regression.

The syntax of the glm function is as follows:

glm(formula, family, data, ...)

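?glm # Open the help page for glm

The formula argument specifies the model formula, which consists of the outcome and predictor variables, separated by a ~ sign. Within a formula, operators such as +, -, *, :, /, and ^ have special meanings (for example, a * b expands to a + b + a:b), so we can use them to add or remove terms and to specify interactions. Transformations can be included with functions such as log(x) or I(x^2).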

The family argument specifies the type of model to fit. For logistic regression, we need to use family = binomial, which indicates that the outcome variable is binary and follows a binomial distribution.

The data argument specifies the data frame that contains the variables in the formula.

The ... argument is used for further settings, such as control options for the fitting algorithm. Arguments such as weights, subset, and offset are separate named arguments of glm.

For example, suppose we want to fit a logistic regression model with student as the predictor and default as the binary outcome variable. In that case, we can use the following code:

# Build a data frame of dummy variables (defaultYes, studentYes) from the design matrix
df <- as.data.frame(model.matrix(~ default + student, data = Default))
# Fit a logistic regression model with studentYes as the predictor
model <- glm(defaultYes ~ studentYes, family = binomial, data = df)
# View the summary of the model
summary(model)

The output will look something like this:

[Figure: Logistic Regression in R with Categorical Variables]
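Building the dummy variables by hand is not strictly necessary, because glm applies the same coding when a factor appears in the formula. The sketch below checks this on simulated data (the data frame toy and the simulation settings are assumptions made for illustration, not the Default data):

```r
set.seed(42)
n <- 500

# Simulated binary predictor and outcome (hypothetical data)
student <- factor(sample(c("No", "Yes"), n, replace = TRUE))
p <- plogis(-1 + 0.8 * (student == "Yes"))
default <- factor(ifelse(runif(n) < p, "Yes", "No"), levels = c("No", "Yes"))
toy <- data.frame(default, student)

# Option 1: pass the factors directly; glm creates the dummy variables.
# For a factor outcome, the first level ("No") is treated as failure.
fit_factor <- glm(default ~ student, family = binomial, data = toy)

# Option 2: build the dummy variables first, then fit on them
toy_dummy <- as.data.frame(model.matrix(~ default + student, data = toy))
fit_dummy <- glm(defaultYes ~ studentYes, family = binomial, data = toy_dummy)

# Both approaches give the same coefficients
coef(fit_factor)
coef(fit_dummy)
```

Passing factors directly is usually more convenient, because predict can then be given new data with the original factor columns.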

The output shows the following information:

  • The call shows the model formula and the arguments used in the glm function.
  • The deviance residuals measure how much each observation contributes to the model deviance, with the sign showing whether the observed value lies above or below the fitted value. They can be used to assess the model fit and identify outliers or poorly fitted observations. (Recent versions of R may not print this block by default.)
  • The coefficients table shows the estimated coefficients, standard errors, z-values, and p-values for each predictor variable and the intercept term.
  • The coefficients represent the change in the log odds of the outcome variable for a one-unit increase in the predictor variable, holding all other variables constant.
  • The significance codes indicate the level of statistical significance for each coefficient based on a two-tailed test with a null hypothesis of zero effect.
  • The dispersion parameter scales the variance of the outcome variable. For binomial models, it is fixed at one rather than estimated.
  • The null deviance is the deviance of a model with only an intercept term. It serves as a baseline against which the fitted model is compared.
  • The residual deviance is the deviance of the fitted model. The lower it is relative to the null deviance, the more the predictors improve the fit.
  • The AIC (Akaike Information Criterion) is a model selection criterion that balances model fit and complexity. When comparing models fitted to the same data, the lower the AIC, the better the model.
  • The number of Fisher Scoring iterations is the number of times the algorithm iterated to find the maximum likelihood estimates of the coefficients.
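Most of the quantities printed by summary can also be pulled out of the fitted object for further use. The following sketch does this on simulated data (the data frame toy is hypothetical), using accessors from base R's stats package:

```r
set.seed(123)

# Simulated data: a binary predictor and a 0/1 outcome (hypothetical)
toy <- data.frame(student = factor(sample(c("No", "Yes"), 300, replace = TRUE)))
toy$default <- rbinom(300, 1, plogis(-1 + 0.8 * (toy$student == "Yes")))

fit <- glm(default ~ student, family = binomial, data = toy)

# Coefficients table: estimates, standard errors, z-values, p-values
coef_table <- coef(summary(fit))
coef_table

null_dev  <- fit$null.deviance   # deviance of the intercept-only model
resid_dev <- deviance(fit)       # deviance of the fitted model
aic       <- AIC(fit)            # Akaike Information Criterion
n_iter    <- fit$iter            # number of Fisher scoring iterations

c(null = null_dev, residual = resid_dev, AIC = aic, iterations = n_iter)
```

For a 0/1 outcome, the AIC equals the residual deviance plus twice the number of estimated coefficients, which makes the trade-off between fit and complexity explicit.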

How to Interpret the Model Coefficients?

The model coefficients can be interpreted as follows:

  • The model formula is defaultYes ~ studentYes, which means that the outcome variable is defaultYes, which takes the value 1 if the individual defaulted and 0 otherwise. The predictor variable is studentYes, which takes the value 1 if the individual is a student and 0 otherwise.
  • The coefficients table shows the estimated coefficients, their standard errors, z-values, and p-values for the intercept term and the predictor variable. The coefficients represent the change in the log odds of defaulting for a one-unit increase in the predictor variable, holding all other variables constant. For a dummy variable, a one-unit increase means moving from the reference level to the other level.
  • The intercept term is -3.50413, which means that the log odds of defaulting are -3.50413 when the individual is not a student.
  • The studentYes coefficient is 0.40489, which means that the log odds of defaulting are higher by 0.40489 for students than for non-students. This coefficient is statistically significant at the 0.001 level, which means there is strong evidence that student status is associated with defaulting.
  • The significance codes indicate the level of statistical significance for each coefficient based on a two-tailed test with a null hypothesis of zero effect. The codes are: *** for p < 0.001, ** for p < 0.01, * for p < 0.05, and . for p < 0.1.
  • The intercept term has three asterisks (***) next to it, meaning its p-value is below 0.001.
  • The studentYes coefficient has three asterisks (***) next to it, meaning its p-value is also below 0.001.
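Log odds are hard to read directly, so it often helps to convert them. As a small sketch using the two estimates quoted above (the variable names b0 and b1 are introduced here for illustration), exp turns a coefficient into an odds ratio and plogis turns log odds into a probability:

```r
# Estimates quoted in the text
b0 <- -3.50413   # intercept: log odds of default for non-students
b1 <-  0.40489   # studentYes: difference in log odds for students

# Odds ratio for students versus non-students
odds_ratio <- exp(b1)

# Predicted probabilities of default (inverse logit)
p_non_student <- plogis(b0)
p_student     <- plogis(b0 + b1)

odds_ratio      # about 1.50
p_non_student   # about 0.029
p_student       # about 0.043
```

In words, the odds of defaulting are roughly 50% higher for students, which corresponds to a default probability of about 4.3% compared with about 2.9% for non-students.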

How to Predict the Probabilities and the Classifications for New Observations?

Once we have fitted a logistic regression model, we can use it to predict the probabilities and the classifications for new observations. To do this, we need to use the predict function in R, which takes a fitted model object and a new data frame as arguments.

The syntax of the predict function is as follows:

predict(object, newdata, type, ...)

The object argument is the fitted model object returned by the glm function.

The newdata argument is a new data frame that contains the predictor variables for which we want to make predictions.

The type argument specifies what type of predictions we want to make. For logistic regression models, type = "response" returns the predicted probabilities of a positive outcome, and type = "link" (the default) returns the predicted log odds. predict has no type = "class" option for glm models, so predicted classifications are obtained by comparing the probabilities with a cutoff value (usually 0.5).

The ... argument allows us to pass additional arguments to the function. For example, se.fit = TRUE additionally returns the standard errors of the predictions.

For example, suppose we have a new data frame called new_df that contains the dummy variable studentYes for a single non-student. In that case, we can predict the probability and the classification using the following code:

# New observation: a non-student (the defaultYes column is not used for prediction)
new_df <- data.frame(defaultYes = 1, studentYes = 0)
# Predict probabilities for new observations
prob <- predict(model, newdata = new_df, type = "response")
prob
The output will look something like this:
         1 
0.02919501 

The output shows the predicted probabilities of positive outcomes for each observation in new_df.
We can apply a cutoff value of 0.5 to the predicted probabilities to get the predicted classifications.
# Predict probabilities for new observations
class_probabilities <- predict(model, newdata = new_df, type = "response")
# Set a threshold (e.g., 0.5) to classify the observations
threshold <- 0.5
predicted_class <- ifelse(class_probabilities >= threshold, "Yes", "No")
predicted_class
The output will look something like this:
"No" 

The output shows whether each observation in new_df is classified as having a positive outcome ("Yes") or not ("No"), based on a cutoff value of 0.5.
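The relationship between the prediction types can be checked directly. The sketch below uses simulated data (the data frames toy and new_obs are hypothetical) to show that type = "response" is the inverse logit of type = "link", and how se.fit can be used to form an approximate 95% interval on the probability scale:

```r
set.seed(1)

# Simulated training data (hypothetical)
toy <- data.frame(student = factor(sample(c("No", "Yes"), 200, replace = TRUE)))
toy$default <- rbinom(200, 1, plogis(-1 + 0.7 * (toy$student == "Yes")))
fit <- glm(default ~ student, family = binomial, data = toy)

# New observations must use the same factor levels as the training data
new_obs <- data.frame(student = factor(c("No", "Yes"), levels = levels(toy$student)))

log_odds <- predict(fit, newdata = new_obs, type = "link")      # log odds
prob     <- predict(fit, newdata = new_obs, type = "response")  # probabilities

# The probabilities are the inverse logit of the log odds
all.equal(unname(plogis(log_odds)), unname(prob))

# Approximate 95% interval: build it on the link scale, then transform
pred  <- predict(fit, newdata = new_obs, type = "link", se.fit = TRUE)
lower <- plogis(pred$fit - 1.96 * pred$se.fit)
upper <- plogis(pred$fit + 1.96 * pred$se.fit)
cbind(prob, lower, upper)
```

Building the interval on the link scale and then transforming it keeps the bounds between 0 and 1.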

How to Assess the Model Performance?

We can use various metrics and methods to assess how well our logistic regression model predicts new observations, such as the confusion matrix and ROC curve.

Confusion Matrix 

A confusion matrix is a table that shows how many observations are correctly or incorrectly classified by our model. It has four cells: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The confusion matrix can be used to calculate various performance measures, such as accuracy, sensitivity, specificity, precision, recall, and F1 score.

                   Predicted Positive   Predicted Negative
Actual Positive    TP                   FN
Actual Negative    FP                   TN
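The four cells and the measures derived from them can be computed by hand in base R. This is a minimal sketch on ten made-up observations (the vectors actual and predicted are hypothetical):

```r
# Hypothetical actual and predicted classes for ten observations
actual <- factor(
  c("Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "Yes", "No"),
  levels = c("No", "Yes")
)
predicted <- factor(
  c("Yes", "No", "Yes", "No", "No", "Yes", "No", "No", "Yes", "No"),
  levels = c("No", "Yes")
)

# Cross-tabulate predictions (rows) against actual classes (columns)
cm <- table(Predicted = predicted, Actual = actual)
cm

TP <- cm["Yes", "Yes"]   # predicted Yes, actually Yes
TN <- cm["No", "No"]     # predicted No, actually No
FP <- cm["Yes", "No"]    # predicted Yes, actually No
FN <- cm["No", "Yes"]    # predicted No, actually Yes

accuracy    <- (TP + TN) / sum(cm)
sensitivity <- TP / (TP + FN)   # also called recall
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

c(accuracy = accuracy, sensitivity = sensitivity,
  specificity = specificity, precision = precision, f1 = f1)
```

Here 8 of the 10 observations are classified correctly, so the accuracy is 0.8, while the sensitivity is 3/4 and the specificity is 5/6.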

A ROC curve is a plot that shows the trade-off between sensitivity and specificity for different cutoff values. Sensitivity is the proportion of positive observations correctly classified by our model. Specificity is the proportion of negative observations correctly classified by our model. 

A ROC curve can be used to compare different models or to choose a cutoff value that gives a good balance between sensitivity and specificity. The area under the ROC curve (AUC) measures how well our model can distinguish between positive and negative observations.

We can use the caret package, which provides various functions and tools for machine learning and model evaluation, to create a confusion matrix in R, and the pROC package to create a ROC curve.

For example, we can evaluate the model on df, the data frame it was fitted on. In practice, a separate test set gives a more honest estimate of performance. We can create a confusion matrix for our model using the following code:

# Predict probabilities for the observations in df
class_probabilities <- predict(model, newdata = df, type = "response")
# Set a threshold (e.g., 0.5) to classify the observations
threshold <- 0.5
predicted_class <- ifelse(class_probabilities >= threshold, "Yes", "No")
# Convert both predicted_class and defaultYes to factors with the same levels
predicted_class <- factor(predicted_class, levels = c("No", "Yes"))
actual_class <- factor(df$defaultYes, levels = c(0, 1), labels = c("No", "Yes"))
# Create a confusion matrix, treating "Yes" as the positive class
library(caret)
cm <- confusionMatrix(predicted_class, actual_class, positive = "Yes")
# View the confusion matrix
print(cm)
I got different results due to data limitations, so the values below are placeholders, but the output will look something like this:

Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No     0    0
       Yes    0    0

               Accuracy : 0
                 95% CI : (0, 0)
    No Information Rate : 0
    P-Value [Acc > NIR] : 0
                  Kappa : 0
 Mcnemar's Test P-Value : 0
            Sensitivity : 0
            Specificity : 0
         Pos Pred Value : 0
         Neg Pred Value : 0
             Prevalence : 0
         Detection Rate : 0
   Detection Prevalence : 0
      Balanced Accuracy : 0
       'Positive' Class : 0



The output shows the following information:
  • The confusion matrix shows how many observations are correctly or incorrectly classified by our model. 
  • The accuracy is the proportion of observations correctly classified by our model. 
  • The confidence interval is a range of plausible values for the true accuracy at a given confidence level (usually 95%).
  • The no-information rate is the proportion of observations that belong to the most frequent class in the data. It is the accuracy we would get by always predicting that class.
  • The p-value [Acc > NIR] comes from a one-sided test of whether the accuracy is greater than the no-information rate. A small value suggests that the model does better than always predicting the most frequent class.
  • The kappa is a measure of agreement between our model and the true outcome, adjusted for chance agreement. It ranges from -1 to 1, where -1 means perfect disagreement, 0 means no agreement beyond chance, and 1 means perfect agreement.
  • The sensitivity is the proportion of positive observations that our model correctly classifies. For example, a sensitivity of 0.8 would mean that our model correctly classifies 80% of the positive observations.
  • The specificity is the proportion of negative observations correctly classified by our model.
  • The positive predictive value is the proportion of positive predictions that are correct.
  • The negative predictive value is the proportion of negative predictions that are correct.
  • The prevalence is the proportion of positive observations in the data.
  • The detection rate is the proportion of all observations that are positive and correctly classified as positive (true positives divided by the total number of observations).
  • The detection prevalence is the proportion of all observations that our model predicts as positive.
  • The balanced accuracy is the average of sensitivity and specificity.
  • The positive class is the class that is considered positive in the confusion matrix. By default, caret uses the first factor level unless the positive argument is set.
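Several of these statistics follow directly from the four cell counts. As a sketch with made-up counts (TP, FN, FP, and TN below are hypothetical), the no-information rate, kappa, balanced accuracy, detection rate, and detection prevalence can be computed as follows:

```r
# Hypothetical cell counts of a confusion matrix
TP <- 3; FN <- 1; FP <- 1; TN <- 5
n <- TP + FN + FP + TN

accuracy <- (TP + TN) / n

# No-information rate: share of the most frequent actual class
nir <- max(TP + FN, FP + TN) / n

# Kappa: agreement beyond what is expected by chance
p_expected <- ((TP + FP) * (TP + FN) + (TN + FN) * (TN + FP)) / n^2
kappa <- (accuracy - p_expected) / (1 - p_expected)

sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
balanced_accuracy <- (sensitivity + specificity) / 2

prevalence           <- (TP + FN) / n   # share of actual positives
detection_rate       <- TP / n          # true positives over all observations
detection_prevalence <- (TP + FP) / n   # share of positive predictions

c(accuracy = accuracy, nir = nir, kappa = kappa,
  balanced_accuracy = balanced_accuracy,
  detection_rate = detection_rate,
  detection_prevalence = detection_prevalence)
```

Comparing the accuracy (0.8) with the no-information rate (0.6) shows how much the model adds over always predicting the most frequent class.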

ROC curve

To create an ROC curve for our model, we can use the roc function from the pROC package, which takes a vector of true outcomes and a vector of predicted probabilities as arguments.
# Install and load the pROC package if not already installed
if (!requireNamespace("pROC", quietly = TRUE)) {
  install.packages("pROC")
}
library(pROC)
# Predict probabilities for the observations in df
class_probabilities <- predict(model, newdata = df, type = "response")
# Calculate the ROC curve (the response must contain both outcome classes)
roc_curve <- roc(response = df$defaultYes, predictor = class_probabilities)
# Plot the ROC curve
plot(roc_curve, main = "ROC Curve", print.auc = TRUE)
# Calculate the AUC (Area Under the Curve)
auc_value <- auc(roc_curve)
cat("AUC:", as.numeric(auc_value), "\n")
The output will look something like this:

[Figure: ROC curve]

The output shows the following information:

The ROC curve shows how sensitivity and specificity change for different cutoff values. The closer the curve is to the top-left corner, the better the model distinguishes between positive and negative observations.

The AUC is the area under the ROC curve, which ranges from 0 to 1. A value of 0.5 corresponds to random guessing, and the higher the AUC, the better the model distinguishes between positive and negative observations. For example, an AUC of 0.9 would mean that the model distinguishes well between positive and negative observations.

The optimal cutoff value is the one that gives the best balance between sensitivity and specificity. It can be found by looking for the point on the ROC curve closest to the top-left corner. For example, if that point corresponds to a cutoff of 0.6, then using 0.6 as the cutoff value for our predictions gives the best balance between sensitivity and specificity.
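To make these ideas concrete without relying on a package, the sketch below computes the AUC and a candidate cutoff in base R on simulated scores (the vectors actual and score are hypothetical). Note that it picks the cutoff with Youden's J statistic (sensitivity + specificity - 1), which is a common alternative to the closest-to-top-left rule and may select a slightly different point:

```r
set.seed(7)

# Simulated outcomes and predicted scores (hypothetical)
actual <- rbinom(300, 1, 0.3)
score  <- plogis(-1 + 1.5 * actual + rnorm(300))

pos <- score[actual == 1]
neg <- score[actual == 0]

# AUC via ranks: the probability that a randomly chosen positive
# observation has a higher score than a randomly chosen negative one
r <- rank(score)
auc_rank <- (sum(r[actual == 1]) - length(pos) * (length(pos) + 1) / 2) /
  (length(pos) * length(neg))

# Sensitivity and specificity at every observed cutoff
cutoffs <- sort(unique(score))
sens <- sapply(cutoffs, function(cut) mean(pos >= cut))
spec <- sapply(cutoffs, function(cut) mean(neg < cut))

# Cutoff that maximizes Youden's J = sensitivity + specificity - 1
youden <- sens + spec - 1
best_cutoff <- cutoffs[which.max(youden)]

auc_rank
best_cutoff
```

Lowering the cutoff raises sensitivity at the cost of specificity, which is exactly the trade-off the ROC curve traces out.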

Pros and cons

Logistic regression is a simple and widely used technique for modeling binary outcomes. 

It has several advantages, such as:

  • It can handle both continuous and categorical predictors.
  • It can estimate the probability of an event occurring.
  • It can test the significance of each predictor.
  • It can include interactions or transformations of predictors
  • It can be easily implemented in R.

However, logistic regression also has some limitations, such as:

  • It assumes that the predictors are linearly related to the log odds of the outcome.
  • It assumes that the observations are independent and identically distributed.
  • It may suffer from overfitting or multicollinearity issues if there are too many predictors or dummy variables.
  • It may not capture complex nonlinear relationships or interactions.

When and why

Logistic regression is suitable when we want to model binary outcomes as a function of one or more explanatory variables. For example, we may want to predict whether a customer will buy a product based on age, gender, income, etc. Logistic regression can help us answer questions such as:

  • Given their characteristics, what is the probability of a customer buying a product?
  • How does each characteristic affect the probability of buying a product?
  • Which characteristics are the most significant predictors of buying a product?
  • How well does our model fit the data and generalize to new observations?

Logistic regression is helpful because it allows us to estimate the probability of an event occurring as well as the effect of each predictor on the outcome. For example, we may want to know how likely a customer is to buy a product and how much their gender or income influences their decision. 

Logistic regression can help us understand the relationship between the outcome and the predictors in terms of:

Log-odds: The natural logarithm of the odds of having a positive outcome. It is a linear function of the predictors in logistic regression.

Odds: The ratio of the probability of having a positive outcome to the probability of having a negative outcome. It is the exponential of the log-odds, and therefore a nonlinear function of the predictors.

Probability: The chance that an observation has a positive outcome. It is a sigmoid function of the linear combination of the predictors.

Coefficients: The parameters that measure the change in the log odds of the outcome for a one-unit increase in the predictor, holding all other predictors constant. They can be exponentiated to get the odds ratios, which give the multiplicative change in the odds of the outcome for a one-unit increase in the predictor, holding all other predictors constant.
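The relationships among these four quantities can be written compactly. The following is a sketch for a single dummy predictor x (1 for the level of interest, 0 for the reference level); the symbols p, \beta_0, and \beta_1 are introduced here for illustration:

```latex
\begin{aligned}
\text{log-odds:}\quad    & \log\frac{p}{1-p} = \beta_0 + \beta_1 x \\
\text{odds:}\quad        & \frac{p}{1-p} = e^{\beta_0 + \beta_1 x} \\
\text{probability:}\quad & p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \\
\text{odds ratio:}\quad  & \frac{\text{odds}(x = 1)}{\text{odds}(x = 0)}
                           = \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1}
\end{aligned}
```

The last line shows why exponentiating a coefficient gives the odds ratio: the intercept cancels, leaving only the effect of moving from the reference level to the other level.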

Conclusion

This tutorial taught you to perform logistic regression in R with categorical variables. You learned how to:
  1. Create dummy variables for categorical predictors.
  2. Fit a logistic regression model using the glm function.
  3. Interpret the model coefficients and the model fit statistics.
  4. Predict the probabilities and the classifications for new observations.
  5. Assess the model performance using the confusion matrix and ROC curve.

I hope you found this tutorial helpful and informative. If you have any questions or feedback, please contact info@rstudiodatalab.com.

You can also subscribe to my YouTube channel, Data Analysis, and join my community groups for more tutorials on data analysis using R.

Frequently Asked Questions (FAQs)

What is logistic regression?

Logistic regression is a statistical technique for modeling binary outcomes, such as yes/no, success/failure, or positive/negative. It allows us to estimate the probability of an event occurring as a function of one or more explanatory variables, which can be either continuous or categorical.

What are categorical variables?

Categorical variables have a finite number of possible values, such as gender, color, or country. They can be either nominal or ordinal. Nominal variables have no inherent order, such as gender or color. Ordinal variables have a natural order, such as education level or income group.

Why do we need dummy variables?

When we perform logistic regression with categorical variables, we need to convert them into numerical values that can be used in the model. This is because logistic regression assumes that the predictors are linearly related to the log odds of the outcome. 

The advantage of using dummy variables is that they allow us to estimate the effect of each level of the categorical variable on the outcome relative to the reference level. The disadvantage is that they increase the number of predictors in the model, which can lead to overfitting or multicollinearity issues.

How do you create dummy variables in R?

There are several ways to create dummy variables in R. One is to use the model.matrix function, which creates a design matrix for a given formula. A design matrix is a matrix that contains all the predictors and their interactions for a model.

How to fit a logistic regression model in R?

We can fit a logistic regression model using the glm function in R with family = binomial, either on the dummy variables we created or directly on factors, which glm converts to dummy variables itself. glm stands for generalized linear model; the function can fit various models, such as linear regression, Poisson regression, or logistic regression.



About the author

Zubair Goraya
Ph.D. Scholar | Certified Data Analyst | Blogger | Completed 5000+ data projects | Passionate about unravelling insights through data.
