13  Regression Approach to Classification

We have seen in Section 12.3 that classification is based on a decision rule applied to an estimate of a probability. Any method that can predict probabilities can thus be used to classify observations. The regression approach to classification relies on such models where \(\text{E}[Y]\) is a probability (binary case) or a vector of probabilities (multinomial case). Logistic regression and multinomial regression can be used for predicting the mean response—a probability (vector)—or class membership.

13.1 Binary Data

With binary data we can apply logistic regression to predict events (positives) and non-events (negatives). Recall from Chapter 10 that a logistic regression model (with logit link) is a generalized linear model where \[ \begin{align*} Y &\sim \text{Bernoulli}(\pi) \\ \pi &= \frac{1}{1+\exp\{-\eta\}} \\ \eta &= \textbf{x}^\prime \boldsymbol{\beta} \end{align*} \]

Once \(\boldsymbol{\beta}\) is estimated by maximum likelihood, predicted values on the link scale and on the mean scale are obtained as \[ \begin{align*} \widehat{\eta} &= \textbf{x}^\prime\widehat{\boldsymbol{\beta}} \\ \widehat{\pi} &= \frac{1}{1+\exp\{-\widehat{\eta}\}} \end{align*} \]

and an observation is classified as an event if \(\widehat{\pi} > c\). Setting the threshold to \(c=0.5\) yields the Bayes classifier.
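As a minimal sketch in R (with an illustrative value for \(\widehat{\eta}\)):

eta_hat <- 0.8                      # illustrative linear predictor x'beta-hat
pi_hat <- 1/(1 + exp(-eta_hat))     # predicted probability on the mean scale
ifelse(pi_hat > 0.5, "Yes", "No")   # Bayes classifier (c = 0.5)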

Example: Credit Default–ISLR

We continue the analysis of the credit default data from Chapter 10, but now with the goal of classifying the observations in the test data set as defaulting/not defaulting on their credit and analyzing the performance of the model based on the confusion matrix, the ROC curve, and the precision-recall curve.

Recall that the Default data, part of the ISLR2 library (James et al. 2021), is a simulated data set with ten thousand observations. The target variable is default, whether a customer defaulted on their credit card debt. Input variables include a factor that indicates student status, account balance and income information.

As before, we randomly split the data into 9,000 training observations and 1,000 test observations.

library(ISLR2)
head(Default)
  default student   balance    income
1      No      No  729.5265 44361.625
2      No     Yes  817.1804 12106.135
3      No      No 1073.5492 31767.139
4      No      No  529.2506 35704.494
5      No      No  785.6559 38463.496
6      No     Yes  919.5885  7491.559
set.seed(765)
n <- nrow(Default)
testset <- sort(sample(n,n*0.1))
test <- Default[testset,]
train <- Default[-testset,]
nrow(train)
[1] 9000
nrow(test)
[1] 1000

Next we fit a logistic regression model to the training data.

log_reg <- glm(default ~ ., data=train, family=binomial)
summary(log_reg)

Call:
glm(formula = default ~ ., family = binomial, data = train)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.106e+01  5.252e-01 -21.060   <2e-16 ***
studentYes  -6.098e-01  2.530e-01  -2.410    0.016 *  
balance      5.811e-03  2.491e-04  23.323   <2e-16 ***
income       5.038e-06  8.731e-06   0.577    0.564    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2617.1  on 8999  degrees of freedom
Residual deviance: 1398.0  on 8996  degrees of freedom
AIC: 1406

Number of Fisher Scoring iterations: 8

Student status and account balance are significant predictors of credit default; income seems less important given the other predictors (\(p\)-value of 0.5639).

The confusion matrix can be computed with the confusionMatrix function in the caret package. The option positive="Yes" identifies the level of the factor considered the “positive” level for the calculation of the statistics. This does not affect the overall confusion matrix but does affect the interpretation of sensitivity, specificity, and other statistics. By default, the function uses the first level of the factor as the “positive” result, which would be “No” in our case.

Before calling confusionMatrix, we first calculate the predicted probabilities from the logistic regression model and then apply the Bayes classifier (\(c=0.5\)). The option mode="everything" requests statistics based on sensitivity and specificity as well as statistics based on precision and recall.

library(caret)

predicted_prob_test <- predict(log_reg, newdata=test, type="response")
classify_test <- as.factor(ifelse(predicted_prob_test > 0.5,"Yes","No"))

confusionMatrix(classify_test,test$default, 
                positive="Yes",
                mode="everything")
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  958  26
       Yes   7   9
                                         
               Accuracy : 0.967          
                 95% CI : (0.954, 0.9772)
    No Information Rate : 0.965          
    P-Value [Acc > NIR] : 0.407906       
                                         
                  Kappa : 0.3384         
                                         
 Mcnemar's Test P-Value : 0.001728       
                                         
            Sensitivity : 0.2571         
            Specificity : 0.9927         
         Pos Pred Value : 0.5625         
         Neg Pred Value : 0.9736         
              Precision : 0.5625         
                 Recall : 0.2571         
                     F1 : 0.3529         
             Prevalence : 0.0350         
         Detection Rate : 0.0090         
   Detection Prevalence : 0.0160         
      Balanced Accuracy : 0.6249         
                                         
       'Positive' Class : Yes            
                                         

The accuracy of the model appears high at 96.7%, but the no-information rate of 0.965 shows that the inclusion of the three input variables did not improve the model much. If you were to simply classify all observations as “No”, this naïve approach would achieve an accuracy of 96.5%, simply because defaults are very rare.
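A quick check confirms the no-information rate on the test data:

# Naive rule: classify every observation as "No"
naive <- factor(rep("No", nrow(test)), levels=levels(test$default))
mean(naive == test$default)
[1] 0.965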

The sensitivity of the model is dismal at 0.2571; the false negative rate is correspondingly high (1 - 0.2571 = 0.7429). If someone defaults, the model has only a 25.71% chance of detecting it. Not surprisingly, the specificity is very high, 958/(958+7) = 0.9927. This is again driven by the high number of non-defaulters and a low false positive rate (FPR = 7/(958+7) = 0.0073).
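These quantities follow directly from the cells of the confusion matrix; recomputing them by hand is a useful sanity check:

# Cells of the confusion matrix ("Yes" is the positive class)
TP <- 9; FN <- 26; FP <- 7; TN <- 958
TP/(TP + FN)    # sensitivity (recall)
[1] 0.2571429
TN/(TN + FP)    # specificity
[1] 0.9927461
FP/(FP + TN)    # false positive rate
[1] 0.007253886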

Would changing the threshold improve the model by increasing its sensitivity? If we declare a default whenever the predicted probability exceeds 0.25, we get more positive predictions. How does this affect accuracy and the other measures?

classify_test <- as.factor(ifelse(predicted_prob_test > 0.25,"Yes","No"))

confusionMatrix(classify_test,
                test$default, 
                positive="Yes",
                mode="everything")
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  935  15
       Yes  30  20
                                         
               Accuracy : 0.955          
                 95% CI : (0.9402, 0.967)
    No Information Rate : 0.965          
    P-Value [Acc > NIR] : 0.96033        
                                         
                  Kappa : 0.4479         
                                         
 Mcnemar's Test P-Value : 0.03689        
                                         
            Sensitivity : 0.5714         
            Specificity : 0.9689         
         Pos Pred Value : 0.4000         
         Neg Pred Value : 0.9842         
              Precision : 0.4000         
                 Recall : 0.5714         
                     F1 : 0.4706         
             Prevalence : 0.0350         
         Detection Rate : 0.0200         
   Detection Prevalence : 0.0500         
      Balanced Accuracy : 0.7702         
                                         
       'Positive' Class : Yes            
                                         

The sensitivity of the model increases, as expected, while the specificity does not take much of a hit. Interestingly, the accuracy of this decision rule has now sunk below the no-information rate. Precision has gone down but the \(F_1\) score has gone up. This decision rule strikes a better balance between precision and recall than the Bayes classifier.
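Since the \(F_1\) score is the harmonic mean of precision and recall, it can be verified from the confusion matrix counts:

prec <- 20/50                     # precision: 20 of 50 predicted "Yes" are correct
rec <- 20/35                      # recall: 20 of 35 defaults are detected
2 * prec * rec / (prec + rec)     # F1
[1] 0.4705882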

The next code block uses the ROCR package to compute the ROC curve (Figure 13.1), the AUC, the precision-recall curve (Figure 13.2), and the AUC-PR. The first step in using ROCR is to call the prediction function to create a prediction object. The performance function of the package is then used to compute statistics and visualizations based on that object.

library(ROCR)

pred_prob <- predict(log_reg, newdata=test, type="response")

pred_obj <- prediction(pred_prob,test$default,
                       label.ordering=c("No","Yes"))

perf <- performance(pred_obj,"sens","fpr")
plot(perf,colorize=TRUE)

auc <- performance(pred_obj,"auc")
auc@y.values[[1]]   
[1] 0.9468246
Figure 13.1: ROC curve for credit default logistic regression.

The ROC curve looks quite good for this data–model combination. The cutoffs \(c\) corresponding to the steps in the plot are shown with different colors. As the cutoff drops below 0.35, the sensitivity of the decision rule increases sharply. The area under the curve of 0.9468 is impressive. However, we know that the data are highly unbalanced, so let’s take a look at the precision-recall plot (Figure 13.2).

perf <- performance(pred_obj,"prec","rec")
plot(perf,colorize=TRUE)

aucpr <- performance(pred_obj,"aucpr")
aucpr@y.values[[1]] 
[1] 0.4987905
Figure 13.2: Precision-recall curve for credit default logistic regression.

Another powerful feature of ROCR is the calculation of any measure as a function of the threshold value. The next code block computes and displays accuracy and the \(F_1\) measure as a function of \(c\). Both are maximized for values in the neighborhood of \(c=0.4\).

f1 <- performance(pred_obj,"f")
acc <- performance(pred_obj,"acc")

par(mfrow=c(1,2))
plot(f1@x.values[[1]],f1@y.values[[1]],cex=0.4,
     xlab="Threshold c",
     ylab="F1",
     las=1,bty="l")
maxval <- which.max(f1@y.values[[1]])
abline(v=f1@x.values[[1]][maxval],col="red")

plot(acc@x.values[[1]],acc@y.values[[1]],cex=0.4,
     xlab="Threshold c",
     ylab="Accuracy",
     las=1,bty="l")
maxval <- which.max(acc@y.values[[1]])
abline(v=acc@x.values[[1]][maxval],col="red")
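The maximizing thresholds can also be printed directly from the performance objects (exact values depend on the data and the seed):

# Cutoffs at which F1 and accuracy attain their maxima
f1@x.values[[1]][which.max(f1@y.values[[1]])]
acc@x.values[[1]][which.max(acc@y.values[[1]])]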

13.2 Multinomial Data

Suppose you draw \(n\) observations from a population where labels \(C_1, \cdots, C_k\) occur with probabilities \(\pi_1, \cdots, \pi_k\). The probability to observe \(C_1\) exactly \(y_1\) times, \(C_2\) exactly \(y_2\) times, and so forth, is \[ \begin{align*} \Pr(\textbf{Y}= [y_1,\cdots,y_k]) &= \frac{n!}{y_1!y_2!\cdots y_k!} \pi_1^{y_1} \times \cdots \times \pi_k^{y_k} \\ &= \frac{n!}{\prod_{j=1}^k y_j!} \prod_{j=1}^k \pi_j^{y_j} \end{align*} \] This is the probability mass function of the multinomial distribution. For \(k=2\) this reduces to the binomial distribution, \[ {n \choose y} \pi^y (1-\pi)^{n-y} \] The mean of the multinomial distribution is \(n\) times the vector of the category probabilities \(\boldsymbol{\pi} = [\pi_1, \cdots,\pi_k]^\prime\).
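In R the multinomial pmf is available as dmultinom, and the reduction to the binomial for \(k=2\) is easy to verify:

dmultinom(c(3,7), prob=c(0.3,0.7))   # multinomial with k = 2 categories
[1] 0.2668279
dbinom(3, size=10, prob=0.3)         # the matching binomial probability
[1] 0.2668279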

A classification model for multinomial data can thus be based on a multinomial regression model that predicts the category probabilities, and then applies a classification rule in a second step to determine the predicted category. How these regression models are constructed differs depending on whether the categories of the multinomial are unordered (nominal) or ordered.

Modeling Nominal Data

The models for unordered multinomial data are a direct extension of the logistic regression type models for binary data. The predicted probabilities in the logistic model are \[ \begin{align*} \Pr(Y=1 | \textbf{x}) &= \frac{1}{1+\exp\{-\textbf{x}^\prime\boldsymbol{\beta}\}} \\ \Pr(Y=0 | \textbf{x}) &= \frac{\exp\{-\textbf{x}^\prime\boldsymbol{\beta}\}}{1+\exp\{-\textbf{x}^\prime\boldsymbol{\beta}\}} \end{align*} \]

Softmax function and reference category

The generalization to \(k\) categories is \[ \Pr(Y=j | \textbf{x}) = \frac{\exp\{\textbf{x}^\prime\boldsymbol{\beta}_j\}}{\sum_{l=1}^k \exp\{\textbf{x}^\prime\boldsymbol{\beta}_l\}} \tag{13.1}\]

Each of the \(k\) categories has its own parameter vector \(\boldsymbol{\beta}_j\). But wait, why does the logistic regression model with \(k=2\) have only one (\(k-1\)) parameter vector? Since the category probabilities sum to 1, one of the probabilities is redundant; it can be calculated from the other probabilities. In multinomial logistic regression this constraint is built into the calculations by setting \(\boldsymbol{\beta}_j = \textbf{0}\) for one of the categories. This is called the reference category. Suppose we choose the first category as reference. Then Equation 13.1 becomes

\[ \begin{align*} \Pr(Y=1 | \textbf{x}) &= \frac{1}{1 + \sum_{l=2}^k \exp\{\textbf{x}^\prime\boldsymbol{\beta}_l\}} \\ \Pr(Y=j > 1 | \textbf{x}) &= \frac{\exp\{\textbf{x}^\prime\boldsymbol{\beta}_j\}}{1 + \sum_{l=2}^k \exp\{\textbf{x}^\prime\boldsymbol{\beta}_l\}} \end{align*} \] If the last category is chosen as the reference, Equation 13.1 becomes \[ \begin{align*} \Pr(Y=j < k | \textbf{x}) &= \frac{\exp\{\textbf{x}^\prime\boldsymbol{\beta}_j\}}{1 + \sum_{l=1}^{k-1} \exp\{\textbf{x}^\prime\boldsymbol{\beta}_l\}}\\ \Pr(Y=k | \textbf{x}) &= \frac{1}{1 + \sum_{l=1}^{k-1} \exp\{\textbf{x}^\prime\boldsymbol{\beta}_l\}} \end{align*} \]

Tip

When using software to model multinomial data, make sure to check how the code handles the reference category. There is no consistency across software packages; the choice is arbitrary. By default, SAS uses the last category as the reference, while the nnet::multinom function in R uses the first category. The interpretation of the regression coefficients depends on the choice of the reference category. Fortunately, the predicted category probabilities do not depend on that choice.

In addition, check on how the levels of the target variable are ordered.
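A quick numeric check illustrates this invariance. For a given \(\textbf{x}\), changing the reference category shifts all linear predictors by the same constant, which leaves the softmax probabilities unchanged (the values below are made up for illustration):

softmax <- function(eta) exp(eta)/sum(exp(eta))
eta <- c(0, 1.2, -0.4)    # linear predictors, first category as reference
softmax(eta)              # category probabilities
softmax(eta - eta[3])     # re-referenced to the third category; same result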

In training neural networks, Equation 13.1 is called the softmax activation function (see Chapter 32). Activation functions have two important roles in neural networks: to introduce nonlinearity and to map between input and output of a network layer. The softmax activation function is used in networks that are built to classify data into \(k\) categories. Since neural networks are typically overparameterized, that is, they have more parameters than observations, the softmax transformation is applied there without constraining one of the parameter vectors to zero. This parallel development will lead us in Section 34.2.3.3 to express multinomial regression for classification as a special case of a neural network without hidden layers and with a softmax output activation.

The softmax criterion in multinomial logistic regression is the equivalent of the inverse logit link function in logistic regression: it maps from the linear predictor space to the mean of the target, and the result is a probability. In neural networks there are no distributional assumptions, so the softmax transformation should be seen as mapping \(\textbf{x}^\prime\boldsymbol{\beta}_1, \cdots, \textbf{x}^\prime\boldsymbol{\beta}_k\) to buckets of the (0,1) interval such that \[ \sum_{j=1}^k \frac{\exp\{\textbf{x}^\prime\boldsymbol{\beta}_j\}}{\sum_{l=1}^k \exp\{\textbf{x}^\prime\boldsymbol{\beta}_l\}} = 1 \] It is a stretch to think of the terms as probabilities in that context.

Multinomial Regression in R

Multiple packages can fit multinomial regression models in R.

  • The nnet::multinom function uses a neural network to estimate the parameters. It uses the first category as the reference.

  • The mlogit::mlogit function fits multinomial models by maximum likelihood and has the ability to include random effects (a multinomial mixed model). It uses a discrete choice formulation which is popular in econometrics.

We use the nnet::multinom function here and demonstrate model fitting, prediction, and classification for a simple model using the Iris data.

Before fitting a multinomial regression model, we split the data into training and test data sets, with 1/3 of the observations reserved for testing the model. Stratified sampling via caret::createDataPartition is used to make sure that all species are represented in the training and test data sets in appropriate proportions.

set.seed(654)
trainset <- caret::createDataPartition(iris$Species, p=2/3, list=FALSE,times=1)
iris_train <- iris[trainset,]
iris_test <- iris[-trainset,]

table(iris_train$Species)

    setosa versicolor  virginica 
        34         34         34 
table(iris_test$Species)

    setosa versicolor  virginica 
        16         16         16 

Before we start modeling, let’s confirm the ordering of the Species factor.

str(iris_train$Species)
 Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

There are three species; the first level of the factor is setosa. This will be the reference level in the multinomial regression—that is, the coefficients for the I. setosa category will be set to zero. To choose a different level as the reference level, you can rearrange the factor with relevel(), as sketched below.
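For example, to make I. virginica the reference level (a hypothetical variant; the analysis below keeps setosa as the reference):

# Hypothetical: re-reference the factor so virginica is the baseline
Species_relevel <- relevel(iris_train$Species, ref="virginica")
levels(Species_relevel)   # virginica is now the first (reference) level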

The following code fits a multinomial regression model with a single input variable (Petal.Length) and target Species. The model tries to predict the iris species from just the measured length of the flower petals.

library(nnet)
multi <- multinom(Species ~ Petal.Length, data=iris_train)
# weights:  9 (4 variable)
initial  value 112.058453 
iter  10 value 13.522921
iter  20 value 12.661974
iter  30 value 12.554082
iter  40 value 12.551329
iter  50 value 12.550885
iter  60 value 12.550774
iter  70 value 12.550401
iter  80 value 12.550267
iter  90 value 12.549892
final  value 12.549891 
converged
s <- summary(multi)
s
Call:
multinom(formula = Species ~ Petal.Length, data = iris_train)

Coefficients:
           (Intercept) Petal.Length
versicolor   -25.90912     10.07796
virginica    -68.32912     18.87635

Std. Errors:
           (Intercept) Petal.Length
versicolor    52.89464     21.27824
virginica     54.35859     21.43631

Residual Deviance: 25.09978 
AIC: 33.09978 

The fit converges on the training data after about 90 iterations. The coefficient estimates for the multinomial regression model are

  • \(\boldsymbol{\beta}_{\text{setosa}} = [0, 0]\),

  • \(\boldsymbol{\beta}_{\text{versicolor}} =\) [ -25.9091, 10.078], and

  • \(\boldsymbol{\beta}_{\text{virginica}} =\) [ -68.3291, 18.8764].

Suppose we wish to compute the predicted probabilities for the 35th observation in the training set, a flower with a petal length of 4.7.

iris_train[35,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
51            7         3.2          4.7         1.4 versicolor

First we need to compute the terms \(\exp\{\beta_0 + \beta_1 \text{Petal.Length}\}\) for all three species. Their sum is the denominator of Equation 13.1. The following code computes the linear predictors, the denominator, and the category probabilities for that observation.

# The input vector for the prediction
x_data <- c(1,iris_train[35,"Petal.Length"])

# The coefficient vectors for the three species 
b_setosa <- rep(0,2)
b_versicolor <- s$coefficients[1,]
b_virginica <- s$coefficients[2,]

# the linear predictors for the three species
eta_setosa <- x_data%*%b_setosa
eta_versicolor <- x_data%*%b_versicolor
eta_virginica <- x_data%*%b_virginica

# The denominator for the softmax criterion
denom <- sum(exp(eta_setosa) + 
             exp(eta_versicolor) +
             exp(eta_virginica))

# The category probabilities
pr_setosa <- exp(eta_setosa)/denom
pr_versicolor <- exp(eta_versicolor)/denom
pr_virginica <- exp(eta_virginica)/denom

cat("Pr(setosa | 4.7) = "    , round(pr_setosa,4),"\n")
Pr(setosa | 4.7) =  0 
cat("Pr(versicolor | 4.7) = ", round(pr_versicolor,4),"\n")
Pr(versicolor | 4.7) =  0.7441 
cat("Pr(virginica | 4.7) = " , round(pr_virginica,4),"\n")
Pr(virginica | 4.7) =  0.2559 

The calculation is confirmed by checking the fitted value for this observation:

round(s$fitted.values[35,],4)
    setosa versicolor  virginica 
    0.0000     0.7441     0.2559 
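The step-by-step calculation can also be written compactly in matrix form, using the objects already created:

B <- rbind(setosa=c(0,0), s$coefficients)   # 3 x 2 coefficient matrix
eta <- B %*% x_data                         # linear predictor for each species
round(exp(eta)/sum(exp(eta)), 4)            # softmax; matches the values above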

Predicted values for any value of the input(s) can also be constructed with the predict function:

x_pred <- data.frame("Petal.Length"=4.7)
round(predict(multi,newdata=x_pred,type="probs"),4)
    setosa versicolor  virginica 
    0.0000     0.7441     0.2559 

If you request a prediction of type "class", the function performs classification based on the largest predicted probability.

predict(multi,newdata=x_pred,type="class")
[1] versicolor
Levels: setosa versicolor virginica

In this case, the predicted and observed category agree.

How well does this single-regressor model classify iris species? To answer the question, we turn to the confusion matrix for the test data:

pred_class <- predict(multi,newdata=iris_test,type="class")

cm <- caret::confusionMatrix(pred_class, iris_test$Species, mode="everything")
cm
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         16          0         0
  versicolor      0         14         0
  virginica       0          2        16

Overall Statistics
                                          
               Accuracy : 0.9583          
                 95% CI : (0.8575, 0.9949)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9375          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8750           1.0000
Specificity                 1.0000            1.0000           0.9375
Pos Pred Value              1.0000            1.0000           0.8889
Neg Pred Value              1.0000            0.9412           1.0000
Precision                   1.0000            1.0000           0.8889
Recall                      1.0000            0.8750           1.0000
F1                          1.0000            0.9333           0.9412
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2917           0.3333
Detection Prevalence        0.3333            0.2917           0.3750
Balanced Accuracy           1.0000            0.9375           0.9688

The model classifies extremely well based on just one input, Petal.Length. Two I. versicolor observations are misclassified as I. virginica, and the model has an accuracy of 0.9583.

In binary classification, confusionMatrix returns a single column of confusion statistics. When \(k > 2\), a separate column is returned for each of the factor levels, comparing that level against all other levels combined. This is called the one-versus-all approach. For example, classifying I. setosa against the other two species, the model has perfect sensitivity, specificity, and recall. Classifying I. versicolor against the other species, the model has a sensitivity of 0.875, which can be verified from the table directly.
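The one-versus-all statistics can be reproduced by collapsing the \(3 \times 3\) table. For I. versicolor as the “positive” class:

tab <- cm$table
TP <- tab["versicolor","versicolor"]   # correctly classified versicolor
FN <- sum(tab[,"versicolor"]) - TP     # versicolor classified as another species
TP/(TP + FN)                           # one-versus-all sensitivity
[1] 0.875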

Modeling Ordinal Data

Ordered multinomial data has category labels that imply an ordering in a greater-lesser sense. While numeric distances between the categories are not defined, we at least know that one category is more or less than another category. Examples of ordinal data are ratings (5-star scale), assessments of severity (minor, moderate, extreme), indications of sentiment (strongly disagree, disagree, agree, strongly agree), and so forth.