13  Regression Approach to Classification

We have seen in Section 12.3 that classification is based on a decision rule applied to an estimate of a probability. Any method that can predict probabilities can thus be used to classify observations. The regression approach to classification relies on such models where \(\text{E}[Y]\) is a probability (binary case) or a vector of probabilities (multinomial case). Logistic regression and multinomial regression can be used for predicting the mean response—a probability (vector)—or class membership.

13.1 Binary Data

With binary data we can apply logistic regression to predict events (positives) and non-events (negatives). Recall from Chapter 10 that a logistic regression model (with logit link) is a generalized linear model where \[ \begin{align*} Y &\sim \text{Bernoulli}(\pi) \\ \pi &= \frac{1}{1+\exp\{-\eta\}} \\ \eta &= \textbf{x}^\prime \boldsymbol{\beta} \end{align*} \]

Once \(\boldsymbol{\beta}\) is estimated by maximum likelihood, predicted values on the link scale and on the mean scale are obtained as \[ \begin{align*} \widehat{\eta} &= \textbf{x}^\prime\widehat{\boldsymbol{\beta}} \\ \widehat{\pi} &= \frac{1}{1+\exp\{-\widehat{\eta}\}} \end{align*} \]

and an observation is classified as an event if \(\widehat{\pi} > c\). Setting the threshold to \(c=0.5\) yields the Bayes classifier.
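As a minimal sketch in R (with an illustrative value for \(\widehat{\eta}\)):

eta_hat <- 0.8                      # illustrative linear predictor x'beta-hat
pi_hat <- 1/(1 + exp(-eta_hat))     # predicted probability on the mean scale
ifelse(pi_hat > 0.5, "Yes", "No")   # Bayes classifier (c = 0.5)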

Example: Credit Default–ISLR

We continue the analysis of the credit default data from Chapter 10, but now with the goal of classifying the observations in the test data set as defaulting/not defaulting on their credit and analyzing the performance of the model based on the confusion matrix, the ROC curve, and the precision-recall curve.

Recall that the Default data, part of the ISLR2 library (James et al. 2021), is a simulated data set with ten thousand observations. The target variable is default, whether a customer defaulted on their credit card debt. Input variables include a factor that indicates student status, account balance and income information.

As before, we randomly split the data into 9,000 training observations and 1,000 test observations.

library(ISLR2)
head(Default)
  default student   balance    income
1      No      No  729.5265 44361.625
2      No     Yes  817.1804 12106.135
3      No      No 1073.5492 31767.139
4      No      No  529.2506 35704.494
5      No      No  785.6559 38463.496
6      No     Yes  919.5885  7491.559
set.seed(765)
n <- nrow(Default)
testset <- sort(sample(n,n*0.1))
test <- Default[testset,]
train <- Default[-testset,]
nrow(train)
[1] 9000
nrow(test)
[1] 1000

Next we fit a logistic regression model to the training data.

log_reg <- glm(default ~ ., data=train, family=binomial)
summary(log_reg)

Call:
glm(formula = default ~ ., family = binomial, data = train)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.106e+01  5.252e-01 -21.060   <2e-16 ***
studentYes  -6.098e-01  2.530e-01  -2.410    0.016 *  
balance      5.811e-03  2.491e-04  23.323   <2e-16 ***
income       5.038e-06  8.731e-06   0.577    0.564    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2617.1  on 8999  degrees of freedom
Residual deviance: 1398.0  on 8996  degrees of freedom
AIC: 1406

Number of Fisher Scoring iterations: 8

Student status and account balance are significant predictors of credit default; income seems less important given the other predictors (\(p\)-value of 0.5639).

The confusion matrix can be computed with the confusionMatrix function in the caret package. The option positive="Yes" identifies the level of the factor considered the “positive” level for the calculation of the statistics. This does not affect the overall confusion matrix but does affect the interpretation of sensitivity, specificity, and other statistics. By default, the function uses the first level of the factor as the “positive” result, which would be “No” in our case.

Before calling confusionMatrix, we first calculate the predicted probabilities from the logistic regression model and then apply the Bayes classifier (\(c=0.5\)). The option mode="everything" requests statistics based on sensitivity and specificity as well as statistics based on precision and recall.

library(caret)

predicted_prob_test <- predict(log_reg, newdata=test, type="response")
classify_test <- as.factor(ifelse(predicted_prob_test > 0.5,"Yes","No"))

confusionMatrix(classify_test,test$default, 
                positive="Yes",
                mode="everything")
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  958  26
       Yes   7   9
                                         
               Accuracy : 0.967          
                 95% CI : (0.954, 0.9772)
    No Information Rate : 0.965          
    P-Value [Acc > NIR] : 0.407906       
                                         
                  Kappa : 0.3384         
                                         
 Mcnemar's Test P-Value : 0.001728       
                                         
            Sensitivity : 0.2571         
            Specificity : 0.9927         
         Pos Pred Value : 0.5625         
         Neg Pred Value : 0.9736         
              Precision : 0.5625         
                 Recall : 0.2571         
                     F1 : 0.3529         
             Prevalence : 0.0350         
         Detection Rate : 0.0090         
   Detection Prevalence : 0.0160         
      Balanced Accuracy : 0.6249         
                                         
       'Positive' Class : Yes            
                                         

The accuracy of the model appears high at 96.7%, but the no-information rate of 0.965 shows that the inclusion of the three input variables did not improve the model much. If you were to simply classify all observations as “No”, this naïve approach would achieve an accuracy of 96.5%, simply because defaults are very rare.
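A quick check confirms the no-information rate on the test data:

# Naive rule: classify every observation as "No"
naive <- factor(rep("No", nrow(test)), levels=levels(test$default))
mean(naive == test$default)
[1] 0.965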

The sensitivity of the model is dismal at 0.2571; the false negative rate is correspondingly high (1 - 0.2571 = 0.7429). If someone defaults, the model has only a 25.71% chance of detecting it. Not surprisingly, the specificity is very high, 958/(958+7) = 0.9927. This is again driven by the high number of non-defaulters and a low false positive rate (FPR = 7/(958+7) = 0.0073).
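These quantities follow directly from the cells of the confusion matrix; recomputing them by hand is a useful sanity check:

# Cells of the confusion matrix ("Yes" is the positive class)
TP <- 9; FN <- 26; FP <- 7; TN <- 958
TP/(TP + FN)    # sensitivity (recall)
[1] 0.2571429
TN/(TN + FP)    # specificity
[1] 0.9927461
FP/(FP + TN)    # false positive rate
[1] 0.007253886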

Would changing the threshold improve the model by increasing its sensitivity? If we declare a default whenever the predicted probability exceeds 0.25, we get more positive predictions. How does this affect accuracy and the other measures?

classify_test <- as.factor(ifelse(predicted_prob_test > 0.25,"Yes","No"))

confusionMatrix(classify_test,
                test$default, 
                positive="Yes",
                mode="everything")
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  935  15
       Yes  30  20
                                         
               Accuracy : 0.955          
                 95% CI : (0.9402, 0.967)
    No Information Rate : 0.965          
    P-Value [Acc > NIR] : 0.96033        
                                         
                  Kappa : 0.4479         
                                         
 Mcnemar's Test P-Value : 0.03689        
                                         
            Sensitivity : 0.5714         
            Specificity : 0.9689         
         Pos Pred Value : 0.4000         
         Neg Pred Value : 0.9842         
              Precision : 0.4000         
                 Recall : 0.5714         
                     F1 : 0.4706         
             Prevalence : 0.0350         
         Detection Rate : 0.0200         
   Detection Prevalence : 0.0500         
      Balanced Accuracy : 0.7702         
                                         
       'Positive' Class : Yes            
                                         

The sensitivity of the model increases, as expected, while the specificity does not take much of a hit. Interestingly, the accuracy of this decision rule has now sunk below the no-information rate. Precision has gone down but the \(F_1\) score has gone up. This decision rule strikes a better balance between precision and recall than the Bayes classifier.
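Since the \(F_1\) score is the harmonic mean of precision and recall, it can be verified from the confusion matrix counts:

prec <- 20/50                     # precision: 20 of 50 predicted "Yes" are correct
rec <- 20/35                      # recall: 20 of 35 defaults are detected
2 * prec * rec / (prec + rec)     # F1
[1] 0.4705882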

The next code block uses the ROCR package to compute the ROC curve (Figure 13.1), the AUC, the precision-recall curve (Figure 13.2), and the AUC-PR. The first step in using ROCR is to call the prediction function to create a prediction object. The performance function of the package is then used to compute statistics and visualizations based on that object.

library(ROCR)

pred_prob <- predict(log_reg, newdata=test, type="response")

pred_obj <- prediction(pred_prob,test$default,
                       label.ordering=c("No","Yes"))

perf <- performance(pred_obj,"sens","fpr")
plot(perf,colorize=TRUE)

auc <- performance(pred_obj,"auc")
auc@y.values[[1]]   
[1] 0.9468246
Figure 13.1: ROC curve for credit default logistic regression.

The ROC curve looks quite good for this data–model combination. The cutoffs \(c\) corresponding to the steps in the plot are shown with different colors. As the cutoff drops below 0.35, the sensitivity of the decision rule increases sharply. The area under the curve of 0.9468 is impressive. However, we know that the data are highly unbalanced, so let’s take a look at the precision-recall plot (Figure 13.2).

perf <- performance(pred_obj,"prec","rec")
plot(perf,colorize=TRUE)

aucpr <- performance(pred_obj,"aucpr")
aucpr@y.values[[1]] 
[1] 0.4987905
Figure 13.2: Precision-recall curve for credit default logistic regression.

Another powerful feature of ROCR is the calculation of any measure as a function of the threshold value. The next code block computes and displays accuracy and the \(F_1\) measure as a function of \(c\). Both are maximized for values in the neighborhood of \(c=0.4\).

f1 <- performance(pred_obj,"f")
acc <- performance(pred_obj,"acc")

par(mfrow=c(1,2))
plot(f1@x.values[[1]],f1@y.values[[1]],cex=0.4,
     xlab="Threshold c",
     ylab="F1",
     las=1,bty="l")
maxval <- which.max(f1@y.values[[1]])
abline(v=f1@x.values[[1]][maxval],col="red")

plot(acc@x.values[[1]],acc@y.values[[1]],cex=0.4,
     xlab="Threshold c",
     ylab="Accuracy",
     las=1,bty="l")
maxval <- which.max(acc@y.values[[1]])
abline(v=acc@x.values[[1]][maxval],col="red")
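The maximizing thresholds can also be printed directly from the performance objects (exact values depend on the data and the seed):

# Cutoffs at which F1 and accuracy attain their maxima
f1@x.values[[1]][which.max(f1@y.values[[1]])]
acc@x.values[[1]][which.max(acc@y.values[[1]])]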

13.2 Multinomial Data

Suppose you draw \(n\) observations from a population where labels \(C_1, \cdots, C_k\) occur with probabilities \(\pi_1, \cdots, \pi_k\). The probability to observe \(C_1\) exactly \(y_1\) times, \(C_2\) exactly \(y_2\) times, and so forth, is \[ \begin{align*} \Pr(\textbf{Y}= [y_1,\cdots,y_k]) &= \frac{n!}{y_1!y_2!\cdots y_k!} \pi_1^{y_1} \times \cdots \times \pi_k^{y_k} \\ &= \frac{n!}{\prod_{j=1}^k y_j!} \prod_{j=1}^k \pi_j^{y_j} \end{align*} \] This is the probability mass function of the multinomial distribution. For \(k=2\) this reduces to the binomial distribution, \[ {n \choose y} \pi^y (1-\pi)^{n-y} \] The mean of the multinomial distribution is \(n\) times the vector of the category probabilities \(\boldsymbol{\pi} = [\pi_1, \cdots,\pi_k]^\prime\).
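In R the multinomial pmf is available as dmultinom, and the reduction to the binomial for \(k=2\) is easy to verify:

dmultinom(c(3,7), prob=c(0.3,0.7))   # multinomial with k = 2 categories
[1] 0.2668279
dbinom(3, size=10, prob=0.3)         # the matching binomial probability
[1] 0.2668279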

A classification model for multinomial data can thus be based on a multinomial regression model that predicts the category probabilities, and then applies a classification rule in a second step to determine the predicted category. How these regression models are constructed differs depending on whether the categories of the multinomial are unordered (nominal) or ordered.

Modeling Nominal Data

The models for unordered multinomial data are a direct extension of the logistic regression type models for binary data. The predicted probabilities in the logistic model are \[ \begin{align*} \Pr(Y=1 | \textbf{x}) &= \frac{1}{1+\exp\{-\textbf{x}^\prime\boldsymbol{\beta}\}} \\ \Pr(Y=0 | \textbf{x}) &= \frac{\exp\{-\textbf{x}^\prime\boldsymbol{\beta}\}}{1+\exp\{-\textbf{x}^\prime\boldsymbol{\beta}\}} \end{align*} \]

Softmax function and reference category

The generalization to \(k\) categories is \[ \Pr(Y=j | \textbf{x}) = \frac{\exp\{\textbf{x}^\prime\boldsymbol{\beta}_j\}}{\sum_{l=1}^k \exp\{\textbf{x}^\prime\boldsymbol{\beta}_l\}} \tag{13.1}\]

Each of the \(k\) categories has its own parameter vector \(\boldsymbol{\beta}_j\). But wait, why does the logistic regression model with \(k=2\) have only one (\(k-1\)) parameter vector? Since the category probabilities sum to 1, one of the probabilities is redundant; it can be calculated from the other probabilities. In multinomial logistic regression this constraint is built into the calculations by setting \(\boldsymbol{\beta}_j = \textbf{0}\) for one of the categories. This is called the reference category. Suppose we choose the first category as reference. Then Equation 13.1 becomes

\[ \begin{align*} \Pr(Y=1 | \textbf{x}) &= \frac{1}{1 + \sum_{l=2}^k \exp\{\textbf{x}^\prime\boldsymbol{\beta}_l\}} \\ \Pr(Y=j > 1 | \textbf{x}) &= \frac{\exp\{\textbf{x}^\prime\boldsymbol{\beta}_j\}}{1 + \sum_{l=2}^k \exp\{\textbf{x}^\prime\boldsymbol{\beta}_l\}} \end{align*} \] If the last category is chosen as the reference, Equation 13.1 becomes \[ \begin{align*} \Pr(Y=j < k | \textbf{x}) &= \frac{\exp\{\textbf{x}^\prime\boldsymbol{\beta}_j\}}{1 + \sum_{l=1}^{k-1} \exp\{\textbf{x}^\prime\boldsymbol{\beta}_l\}}\\ \Pr(Y=k | \textbf{x}) &= \frac{1}{1 + \sum_{l=1}^{k-1} \exp\{\textbf{x}^\prime\boldsymbol{\beta}_l\}} \end{align*} \]

Tip

When using software to model multinomial data, make sure to check how the code handles the reference category. There is no consistency across software packages; the choice is arbitrary. By default, SAS uses the last category as the reference, while the nnet::multinom function in R uses the first category. The interpretation of the regression coefficients depends on the choice of the reference category. Fortunately, the predicted category probabilities do not depend on that choice.

In addition, check on how the levels of the target variable are ordered.
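A quick numeric check illustrates this invariance. For a given \(\textbf{x}\), changing the reference category shifts all linear predictors by the same constant, which leaves the softmax probabilities unchanged (the values below are made up for illustration):

softmax <- function(eta) exp(eta)/sum(exp(eta))
eta <- c(0, 1.2, -0.4)    # linear predictors, first category as reference
softmax(eta)              # category probabilities
softmax(eta - eta[3])     # re-referenced to the third category; same result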

In training neural networks, Equation 13.1 is called the softmax activation function (see Chapter 32). Activation functions have two important roles in neural networks: to introduce nonlinearity and to map between input and output of a network layer. The softmax activation function is used in networks that are built to classify data into \(k\) categories. Since neural networks are typically overparameterized, that is, they have more parameters than observations, the softmax transformation is applied there without constraining one of the parameter vectors to zero. This parallel development will lead us in Section 34.2.3.3 to express multinomial regression for classification as a special case of a neural network without hidden layers and with a softmax output activation.

The softmax criterion in multinomial logistic regression is the equivalent of the inverse logit link function in logistic regression: it maps from the linear predictor space to the mean of the target, and the result is a probability. In neural networks there are no distributional assumptions, so the softmax transformation should be seen as mapping \(\textbf{x}^\prime\boldsymbol{\beta}_1, \cdots, \textbf{x}^\prime\boldsymbol{\beta}_k\) to buckets of the (0,1) interval such that \[ \sum_{j=1}^k \frac{\exp\{\textbf{x}^\prime\boldsymbol{\beta}_j\}}{\sum_{l=1}^k \exp\{\textbf{x}^\prime\boldsymbol{\beta}_l\}} = 1 \] It is a stretch to think of the terms as probabilities in that context.

Multinomial Regression in R

Multiple packages can fit multinomial regression models in R.

  • The nnet::multinom function uses a neural network to estimate the parameters. It uses the first category as the reference.

  • The mlogit::mlogit function fits multinomial models by maximum likelihood and has the ability to include random effects (a multinomial mixed model). It uses a discrete choice formulation which is popular in econometrics.

We use the nnet::multinom function here and demonstrate model fitting, prediction, and classification for a simple model using the Iris data.

Before fitting a multinomial regression model, we split the data into training and test data sets, with 1/3 of the observations reserved for testing the model. Stratified sampling via caret::createDataPartition is used to make sure that all species are represented in the training and test data sets in appropriate proportions.

set.seed(654)
trainset <- caret::createDataPartition(iris$Species, p=2/3, list=FALSE,times=1)
iris_train <- iris[trainset,]
iris_test <- iris[-trainset,]

table(iris_train$Species)

    setosa versicolor  virginica 
        34         34         34 
table(iris_test$Species)

    setosa versicolor  virginica 
        16         16         16 

Before we start modeling, let’s confirm the ordering of the Species factor.

str(iris_train$Species)
 Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

There are three species; the first level of the factor is setosa. This will be the reference level in the multinomial regression—that is, the coefficients for the I. setosa category will be set to zero. To choose a different level as the reference level, you can rearrange the factor with relevel(), as sketched below.
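For example, to make I. virginica the reference level (a hypothetical variant; the analysis below keeps setosa as the reference):

# Hypothetical: re-reference the factor so virginica is the baseline
Species_relevel <- relevel(iris_train$Species, ref="virginica")
levels(Species_relevel)   # virginica is now the first (reference) level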

The following code fits a multinomial regression model with a single input variable (Petal.Length) and target Species. The model tries to predict the iris species from just the measured length of the flower petals.

library(nnet)
multi <- multinom(Species ~ Petal.Length, data=iris_train)
# weights:  9 (4 variable)
initial  value 112.058453 
iter  10 value 13.522921
iter  20 value 12.661974
iter  30 value 12.554082
iter  40 value 12.551329
iter  50 value 12.550885
iter  60 value 12.550774
iter  70 value 12.550401
iter  80 value 12.550267
iter  90 value 12.549892
final  value 12.549891 
converged
s <- summary(multi)
s
Call:
multinom(formula = Species ~ Petal.Length, data = iris_train)

Coefficients:
           (Intercept) Petal.Length
versicolor   -25.90912     10.07796
virginica    -68.32912     18.87635

Std. Errors:
           (Intercept) Petal.Length
versicolor    52.89464     21.27824
virginica     54.35859     21.43631

Residual Deviance: 25.09978 
AIC: 33.09978 

The fit converges on the training data after about 90 iterations. The coefficient estimates for the multinomial regression model are

  • \(\boldsymbol{\beta}_{\text{setosa}} = [0, 0]\),

  • \(\boldsymbol{\beta}_{\text{versicolor}} =\) [ -25.9091, 10.078], and

  • \(\boldsymbol{\beta}_{\text{virginica}} =\) [ -68.3291, 18.8764].

Suppose we wish to compute the predicted probabilities for the 35th observation in the training set, a flower with a petal length of 4.7.

iris_train[35,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
51            7         3.2          4.7         1.4 versicolor

First we need to compute the terms \(\exp\{\beta_0 + \beta_1 \text{Petal.Length}\}\) for all three species. Their sum is the denominator of Equation 13.1. The following code computes the linear predictors, the denominator, and the category probabilities for that observation.

# The input vector for the prediction
x_data <- c(1,iris_train[35,"Petal.Length"])

# The coefficient vectors for the three species 
b_setosa <- rep(0,2)
b_versicolor <- s$coefficients[1,]
b_virginica <- s$coefficients[2,]

# the linear predictors for the three species
eta_setosa <- x_data%*%b_setosa
eta_versicolor <- x_data%*%b_versicolor
eta_virginica <- x_data%*%b_virginica

# The denominator for the softmax criterion
denom <- sum(exp(eta_setosa) + 
             exp(eta_versicolor) +
             exp(eta_virginica))

# The category probabilities
pr_setosa <- exp(eta_setosa)/denom
pr_versicolor <- exp(eta_versicolor)/denom
pr_virginica <- exp(eta_virginica)/denom

cat("Pr(setosa | 4.7) = "    , round(pr_setosa,4),"\n")
Pr(setosa | 4.7) =  0 
cat("Pr(versicolor | 4.7) = ", round(pr_versicolor,4),"\n")
Pr(versicolor | 4.7) =  0.7441 
cat("Pr(virginica | 4.7) = " , round(pr_virginica,4),"\n")
Pr(virginica | 4.7) =  0.2559 

The calculation is confirmed by checking the fitted value for this observation:

round(s$fitted.values[35,],4)
    setosa versicolor  virginica 
    0.0000     0.7441     0.2559 
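The step-by-step calculation can also be written compactly in matrix form, using the objects already created:

B <- rbind(setosa=c(0,0), s$coefficients)   # 3 x 2 coefficient matrix
eta <- B %*% x_data                         # linear predictor for each species
round(exp(eta)/sum(exp(eta)), 4)            # softmax; matches the values above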

Predicted values for any value of the input(s) can also be constructed with the predict function:

x_pred <- data.frame("Petal.Length"=4.7)
round(predict(multi,newdata=x_pred,type="probs"),4)
    setosa versicolor  virginica 
    0.0000     0.7441     0.2559 

If you request a prediction of type "class", the function performs classification based on the largest predicted probability.

predict(multi,newdata=x_pred,type="class")
[1] versicolor
Levels: setosa versicolor virginica

In this case, the predicted and observed category agree.

How well does this single-regressor model classify iris species? To answer the question, we turn to the confusion matrix for the test data:

pred_class <- predict(multi,newdata=iris_test,type="class")

cm <- caret::confusionMatrix(pred_class, iris_test$Species, mode="everything")
cm
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         16          0         0
  versicolor      0         14         0
  virginica       0          2        16

Overall Statistics
                                          
               Accuracy : 0.9583          
                 95% CI : (0.8575, 0.9949)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9375          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8750           1.0000
Specificity                 1.0000            1.0000           0.9375
Pos Pred Value              1.0000            1.0000           0.8889
Neg Pred Value              1.0000            0.9412           1.0000
Precision                   1.0000            1.0000           0.8889
Recall                      1.0000            0.8750           1.0000
F1                          1.0000            0.9333           0.9412
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2917           0.3333
Detection Prevalence        0.3333            0.2917           0.3750
Balanced Accuracy           1.0000            0.9375           0.9688

The model classifies extremely well based on just one input, Petal.Length. Two I. versicolor observations are misclassified as I. virginica, and the model has an accuracy of 0.9583.

In binary classification, confusionMatrix returns a single column of confusion statistics. When \(k > 2\), a separate column is returned for each of the factor levels, comparing that level against all other levels combined. This is called the one-versus-all approach. For example, classifying I. setosa against the other two species, the model has perfect sensitivity, specificity, and recall. Classifying I. versicolor against the other species, the model has a sensitivity of 0.875, which can be verified from the table directly.
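The one-versus-all statistics can be reproduced by collapsing the \(3 \times 3\) table. For I. versicolor as the “positive” class:

tab <- cm$table
TP <- tab["versicolor","versicolor"]   # correctly classified versicolor
FN <- sum(tab[,"versicolor"]) - TP     # versicolor classified as another species
TP/(TP + FN)                           # one-versus-all sensitivity
[1] 0.875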

Modeling Ordinal Data

Ordered multinomial data has category labels that imply an ordering in a greater-lesser sense. While numeric distances between the categories are not defined, we at least know that one category is more or less than another category. Examples of ordinal data are ratings (5-star scale), assessments of severity (minor, moderate, extreme), indications of sentiment (strongly disagree, disagree, agree, strongly agree), and so forth.