
10 Discrete Target Variables
10.1 Introduction
The classical linear model (Chapter 7) is appropriate when the mean function is linear in the parameters and the errors are uncorrelated with constant variance. The error conditions are met in a random sample from a Gaussian distribution. It does not take much to shake the foundation of the classical model. For example, when the target variable is the time until a manufactured item fails, the exponential or Weibull distribution is more appropriate than the Gaussian and the assumption of constant variance no longer holds. A characteristic of these distributions is a dependence of the variance on the mean. A random variable \(Y\) has an exponential distribution if its density function is given by \[ f(y) = \left \{ \begin{array}{ll}\lambda e^{-\lambda y} & y \ge 0 \\ 0 & y < 0 \end{array} \right . \] The mean and variance are \(\text{E}[Y] = 1/\lambda\) and \(\text{Var}[Y] = 1/\lambda^2 = \text{E}[Y]^2\). If the mean changes, for example as a function of inputs, the variance changes as well.
Also, the support of the exponential random variable differs from that of the Gaussian. The former takes on non-negative values (the time elapsed between telephone calls cannot be negative) whereas the Gaussian takes on values from \(-\infty\) to \(+\infty\). A linear model for the mean function is probably a bad idea because there is no guarantee that predicted values \(\widehat{y} = \textbf{x}^\prime \widehat{\boldsymbol{\beta}}\) are non-negative. To ensure that predicted values comply with the support of the random variable, a transformation such as \[ \widehat{y} = \exp\{ \textbf{x}^\prime \widehat{\boldsymbol{\beta}} \} \] might be useful. The linear predictor \(\widehat{\eta} = \textbf{x}^\prime\widehat{\boldsymbol{\beta}}\) can now take on values between \(-\infty\) and \(+\infty\) and the predicted values obey \(\widehat{y} \ge 0\). But the exponential transformation made the model for the mean nonlinear in the parameters. Simply changing the distribution of the target variable–even to another continuous distribution such as that of the time between events–quickly breaks the classical linear model framework.
When the target variable is discrete–rather than continuous–the linear model breaks down in the same way; the consequences of sticking with the classical linear model framework, however, are more severe.
Recall from Section 1.2.0.4 that target variables are categorized as continuous or discrete depending on whether the number of possible values is countable. Discrete variables, with countable support, are further categorized into count variables and categorical variables. The value of a count variable is an integer that indicates how many times something occurred. The value of a categorical variable is a label that describes a state. For example, the number of vehicle accidents per hour is a count variable; the make of the vehicle involved in an accident is a categorical variable.
In this chapter we consider regression models for discrete random variables: binary variables that take on two values, ordinal and nominal variables that take on more than two values and true count variables. The discussion will be somewhat informal. The models will be re-introduced and discussed more formally in a later chapter under the umbrella of generalized linear models (GLMs, Chapter 28).
How does the failure time example in this introduction relate to modeling discrete targets? If the time between events has an exponential distribution, the number of events that occur within a unit of time–a count variable–has a Poisson distribution. Both distributions are members of a special family of probability distributions, known as the exponential family. GLMs are statistical models for target variables that have a distribution in the exponential family—more details in Section 28.2.
Do not confuse the exponential distribution with the exponential family of distributions.
The term exponential in the family of distributions comes from a particular way of writing the mass or density function using exponentiation (see Section 28.2). Many distributions can be written that way, including the exponential density shown earlier. The exponential distribution is one member of the exponential family; other members include the Binary, Binomial, Negative Binomial, Poisson, Gaussian, Beta, and Gamma distributions.
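The connection between the exponential distribution and Poisson counts is easy to verify by simulation. The following sketch (variable names are illustrative) draws exponential inter-event times and counts how many events fall within one unit of time; the resulting counts behave like a Poisson sample whose mean equals its variance.
set.seed(42)
lambda <- 3
counts <- replicate(10000, {
    times <- cumsum(rexp(100, rate=lambda))  # cumulative event times
    sum(times <= 1)                          # number of events within one unit of time
})
c(mean=mean(counts), var=var(counts))        # both should be close to lambda = 3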
10.2 Modeling Binary Data
A categorical variable is called binary if it takes on two possible values. The two states (or classes) are sometimes referred to as “success” and “failure”; we prefer the generic terms “event” and “non-event”. When modeling the recurrence of cancer, it is best not to refer to the event being modeled as a “success”. The states are coded numerically as \((1,0)\) or \((-1,1)\). The former is common in regression modeling, and the state coded as 1 is the event of interest. Coding the states as \(-1\) and \(1\) can be useful in classification models.
A binary random variable takes on the values 1 and 0 with probabilities \(\pi\) and \(1-\pi\), respectively. The probability mass function can be written conveniently as \[ \Pr(Y=y) = \pi^y \, (1-\pi)^{(1-y)} \]
It is easy to show that \(\text{E}[Y] = \pi\) and \(\text{Var}[Y] = \pi(1-\pi)\). The model for the mean function of a binary target variable must be suitable for a probability. If the model has a linear predictor component \(\eta = \textbf{x}^\prime\boldsymbol{\beta}\), some transformation is needed to map between \[-\infty < \eta < \infty\] and \[0 \le \pi \le 1\]
This transformation, when applied to the mean, is called the link function. The inverse transformation, the inverse link function, is applied to the linear predictor
\[\begin{align*} g(\pi) &= \eta \\ \pi &= g^{-1}(\eta) \end{align*}\]
Logistic Regression
Where do we find functions that can be used as link/inverse link functions? One place to look is among the cumulative distribution functions of random variables that are defined on \(-\infty < x < \infty\). For example, take the c.d.f. of the logistic distribution \[F(x) = \Pr(X \le x) = \frac{1}{1+\exp\{-(x-\mu)/\phi\}} \quad -\infty < x < \infty\]
\(\mu\) and \(\phi\) are the mean and scale parameter of the logistic distribution. The standard logistic has \(\mu=0\) and \(\phi=1\) and c.d.f. \[ F(x) = \frac{1}{1+e^{-x}} \] The inverse operation, to compute the value of \(x\) associated with a cumulative probability \(p\), is the quantile function. For the standard logistic distribution the quantile function is \[ Q(p) = \log \left(\frac{p}{1-p} \right) \] This quantile function is known as the logit.
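In R, the standard logistic c.d.f. and its quantile function are available as plogis and qlogis, so the logit and its inverse can be checked numerically; a quick sketch:
plogis(0)          # F(0) = 0.5
qlogis(0.75)       # log(0.75/0.25) = log(3), about 1.0986
log(0.75/0.25)     # the same value, computed from the definition of the logit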
Definition: Logistic Regression
A logistic regression model is a generalized linear model for Binary data with linear predictor \(\eta = \textbf{x}^\prime \boldsymbol{\beta}\) and logit link function \[ \log\left(\frac{\text{E}[Y|\textbf{x}]}{1-\text{E}[Y|\textbf{x}]} \right) = \eta = \textbf{x}^\prime\boldsymbol{\beta} \] \[ \text{E}[Y|\textbf{x}] = \frac{1}{1+\exp\{-\eta\}}=\frac{1}{1+\exp\{-\textbf{x}^\prime\boldsymbol{\beta}\}} \]
The logit link function is a convenient choice for modeling binary data: it is symmetric about zero, has a simple expression, and its inverse is computed easily. With a logit link the coefficients have a simple and appealing interpretation in terms of log odds ratios (see below). Many other functions could serve as link functions. For example, if \(\Phi(x)\) is the c.d.f. of a standard Gaussian random variable, the quantile function \(\Phi^{-1}(p)\) can be used as a link function. This configuration is called *probit* regression. The logistic and Gaussian c.d.f. and quantile functions have similar shapes (Figure 10.1), but computation of Gaussian probabilities and quantiles is numerically more involved, requiring approximations.

The term logistic regression is often used for any generalized linear model for binary data, regardless of which link function is used. Other link functions you might encounter besides the logit and probit are the log-log and complementary log-log links. These are similar in shape to the logit link but are not symmetric about zero.
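A comparison along the lines of Figure 10.1 can be sketched with the built-in logistic and Gaussian distribution functions; this is an illustrative sketch, not necessarily the code that produced the figure.
par(mfrow=c(1,2))
x <- seq(-4, 4, by=0.01)
plot(x, plogis(x), type="l", ylab="F(x)", main="C.d.f.")
lines(x, pnorm(x), lty=2)      # Gaussian c.d.f.
legend("topleft", legend=c("Logistic","Gaussian"), lty=1:2)
p <- seq(0.01, 0.99, by=0.01)
plot(p, qlogis(p), type="l", ylab="Quantile", main="Quantile function")
lines(p, qnorm(p), lty=2)      # Gaussian quantile function (probit)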
Example: Credit Default–ISLR
The Default data is part of the ISLR2 library (James et al. 2021), a simulated data set with ten thousand observations. The target variable is default, whether a customer defaulted on their credit card debt. Input variables include a factor that indicates student status, account balance, and income information.
library(ISLR2)
head(Default)
default student balance income
1 No No 729.5265 44361.625
2 No Yes 817.1804 12106.135
3 No No 1073.5492 31767.139
4 No No 529.2506 35704.494
5 No No 785.6559 38463.496
6 No Yes 919.5885 7491.559
The default target variable is a factor. It can be passed directly to the glm function, which treats the first level of the factor as the non-event and all other levels as events. It is thus important to know how the factor levels are ordered:
str(Default$default)
Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
If you wish to work with the underlying values of the factor, use the unclass function:
unique(unclass(Default$default))
[1] 1 2
Next, we split the data into a train and test data set using a 90:10 split.
set.seed(765)
n <- nrow(Default)
testset <- sort(sample(n, n*0.1))
test <- Default[testset,]
train <- Default[-testset,]
nrow(train)
[1] 9000
nrow(test)
[1] 1000
Working with binary data is easier in some ways compared to continuous response data and trickier in other ways. The following plot shows the responses (coded Default=1, non-Default=0) against the income input.
plot(train$income,
ifelse(train$default=="No",0,1),
ylim=c(0,1),
ylab="Credit default: Yes = 1",
xlab="Income",
col=c("red","blue")[unclass(train$student)],yaxt="n")
axis(2, at = seq(0, 1, by = 1), las=2)
legend("left",legend=c("Non-student","Student"),fill=c("red","blue"))
What do we learn from this plot? Students have lower incomes than non-students, but are students more likely to default on credit card debt? It is difficult to discern too much from a plot that shows only two response values.
x <- table(Student=Default$student, Default=Default$default)
x
Default
Student No Yes
No 6850 206
Yes 2817 127
Cross-tabulating student status and default status shows that the proportion of defaults is higher among students (0.0431) than among non-students (0.0292).
The logistic regression of default on all input variables is performed with the glm function in R.
log_reg <- glm(default ~ ., data=train, family=binomial)
summary(log_reg)
Call:
glm(formula = default ~ ., family = binomial, data = train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.106e+01 5.252e-01 -21.060 <2e-16 ***
studentYes -6.098e-01 2.530e-01 -2.410 0.016 *
balance 5.811e-03 2.491e-04 23.323 <2e-16 ***
income 5.038e-06 8.731e-06 0.577 0.564
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2617.1 on 8999 degrees of freedom
Residual deviance: 1398.0 on 8996 degrees of freedom
AIC: 1406
Number of Fisher Scoring iterations: 8
The Binary distribution is a special case of the Binomial distribution: the sum of \(n\) independent Bernoulli(\(\pi\)) experiments is a Binomial(\(n,\pi\)) random variable. family=binomial covers both the Binary and Binomial cases. The default link function for the binomial family in R is the logit. A probit regression would use family=binomial(link="probit").
How do we use the coefficients in the output to make predictions? Consider a student with $20,000 income and a credit card balance of $1,500. How likely are they to default?
new_x <- data.frame(student="Yes", balance=1500, income=20000)

pred_linp <- predict(log_reg, newdata=new_x, type="link")
pred_mean <- predict(log_reg, newdata=new_x, type="response")
cat("Predicted linear predictor :", pred_linp)
Predicted linear predictor : -2.852752
cat("Predicted probability to default: ", pred_mean)
Predicted probability to default: 0.05453925
The predict function for a generalized linear model takes a type= argument that determines the kind of prediction. The default is type="link" for a prediction on the scale of the linear predictor: \(\widehat{\eta} = \textbf{x}^\prime\widehat{\boldsymbol{\beta}}\). To obtain a prediction on the inverse link scale, the scale of the mean response, use type="response". The following calculations verify:
linp = coef(log_reg)[1] + coef(log_reg)[2] + 1500*coef(log_reg)[3] +
       20000*coef(log_reg)[4]
prob = 1 / (1+exp(-linp))
cat("Predicted linear predictor :", linp)
Predicted linear predictor : -2.852752
cat("Predicted probability to default: ", prob)
Predicted probability to default: 0.05453925
The predicted probability of default for a student with an account balance of $1,500 and an income of $20,000 is 0.05454.
It is easy to forget on which scale predictions are calculated–different software implementations have different defaults. In the previous example it was obvious that the negative value -2.8528 is not estimating a probability. How about the following prediction?
predict(log_reg,newdata=data.frame(student="No",balance=2000,income=20000))
1
0.6623438
Interpretation
The sign of a coefficient tells us whether the predicted probability increases or decreases as the corresponding input increases. It is important to check which of the two binary states is being modeled. If the target variable is a factor, glm in R models the second level of the factor–default="Yes" in our example. The probability of credit card default thus increases with account balance and income and is lower for student="Yes" compared to student="No".
In the classical linear model the coefficients measure the change in the mean response when the associated input variable increases by one unit and all other inputs are held constant. In the logistic regression model, because of the link function, this interpretation applies on the link scale. In other words, the coefficient \(\widehat{\beta}_2 =\) 0.0058 for balance measures the change in the logit when the account balance increases by one dollar. Because the logistic function is nonlinear, the effect of an input on the predicted mean, the probability of default, is also nonlinear. The effect of \(\beta_j\) on the probability is greatest when the linear predictor is near zero, where the logistic function has the greatest slope.
The odds \(O(\eta)\) are defined as the ratio of events to non-events for the value of the linear predictor: \[ O(\eta) = \frac{\pi(\eta)}{1-\pi(\eta)} = e^\eta \] The odds ratio compares the odds for two different linear predictors \[ OR(\eta_1,\eta_2) = \frac{O(\eta_1)}{O(\eta_2)}=\exp\{\eta_1 - \eta_2\} \] \(OR(\eta_1,\eta_2)\) measures how much the odds have changed from \(\eta_1\) to \(\eta_2\). These comparisons are particularly meaningful if the linear predictors differ in an effect we wish to test. For example, if \(\eta_1\) is the linear predictor for a student and \(\eta_2\) is the linear predictor for a non-student (with the same income and account balance), then \(\eta_1 - \eta_2 = \widehat{\beta}_1\), and exponentiating \(\widehat{\beta}_1 = -0.6098\) gives the odds ratio for student status.
We have finally established that the logistic regression coefficients are interpreted as changes in the log odds ratios–this holds only for the logit link.
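As a quick check of this interpretation, the odds ratio for student status can be computed by exponentiating the fitted coefficient; a one-line sketch, assuming the log_reg fit from above:
exp(coef(log_reg)["studentYes"])  # about exp(-0.6098) = 0.54: students have lower odds of default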
From Regression to Classification
Classification methods assign observations to categories. Any statistical method that yields predicted probabilities can be used as a classifier. The difference between regression and classification lies in the statements
- The predicted probability that a student with $x account balance and $xx income defaults is …
- A student with $x account balance and $xx income is classified as a (non-)defaulter
All we need to turn the predictions from a logistic regression model into a classification model is a rule that assigns events or non-events based on the predicted probabilities. The cutoff \(c = 0.5\) is common; classifying an observation into its most probable category in this way is known as the Bayes classifier. If a predicted probability is greater than 0.5 the observation is classified as an event, otherwise as a non-event. Classification models are explored in much more detail in the next part of the material.
The results of a binary classification are displayed in a 2 x 2 table, called the confusion matrix. It compares the observed and predicted categories. The cells of the matrix contain the number of data points that fall into the cross-classification when the decision rule is applied to \(n\) observations. If one of the categories is labeled positive and the other is labeled negative, the cells give the number of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions.
| Predicted Category | Observed: Yes (Positive) | Observed: No (Negative) |
|---|---|---|
| Yes (Positive) | True Positive (TP) | False Positive (FP) |
| No (Negative) | False Negative (FN) | True Negative (TN) |
Given the counts in the four cells, we can calculate a number of statistics (Table 10.1). A more in-depth look at confusion statistics follows in Section 12.4.
| Statistic | Calculation | Notes |
|---|---|---|
| Sensitivity | TP / (TP + FN) | This is the true positive rate; also called Recall |
| Specificity | TN / (FP + TN) | This is the true negative rate |
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall proportion of correct classifications |
| Misclassification rate | (FP + FN) / (TP + TN + FP + FN) | Overall proportion of incorrect classifications, 1 – Accuracy |
| Precision | TP / (TP + FP) | Ratio of true positives to anything predicted as positive |
The model accuracy is measured by the proportion of observations that were correctly classified: the sum of the diagonal cells divided by the total number of observations. The misclassification rate, another common performance measure, is the complement of the accuracy.
The sensitivity is the ratio of true positives to the number of observations that should have been predicted as positive. The specificity is the ratio of true negatives to the number of observations that should have been predicted as negative. Note that these are not complements of each other; they are calculated with different denominators.
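The statistics in Table 10.1 are simple functions of the four cell counts. The following small helper is a sketch (confusion_stats is a hypothetical name, not from any package) that computes them:
confusion_stats <- function(TP, FP, FN, TN) {
    c(sensitivity = TP / (TP + FN),
      specificity = TN / (FP + TN),
      accuracy    = (TP + TN) / (TP + TN + FP + FN),
      misclass    = (FP + FN) / (TP + TN + FP + FN),
      precision   = TP / (TP + FP))
}
confusion_stats(TP=40, FP=10, FN=5, TN=45)  # illustrative counts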
Example: Credit Default–ISLR (Cont’d)
With the logistic regression model fitted to the training data we can compute the confusion matrix for the test data. If the application of the logistic regression is classification, metrics based on confusion statistics for test data are a good method for judging the performance of the model.
The following statements apply the Bayes classification rule to the test data and call the confusionMatrix function in the caret package. The option positive="Yes" identifies the level of the factor that is considered the “positive” level for the calculation of the statistics. This does not affect the overall confusion matrix but does affect the interpretation of sensitivity, specificity, and other statistics. By default, the function uses the first level of a factor as the “positive” result.
predicted_prob_test <- predict(log_reg, newdata=test, type="response")
classify_test <- as.factor(ifelse(predicted_prob_test > 0.5, "Yes", "No"))

caret::confusionMatrix(classify_test, test$default, positive="Yes")
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 958 26
Yes 7 9
Accuracy : 0.967
95% CI : (0.954, 0.9772)
No Information Rate : 0.965
P-Value [Acc > NIR] : 0.407906
Kappa : 0.3384
Mcnemar's Test P-Value : 0.001728
Sensitivity : 0.2571
Specificity : 0.9927
Pos Pred Value : 0.5625
Neg Pred Value : 0.9736
Prevalence : 0.0350
Detection Rate : 0.0090
Detection Prevalence : 0.0160
Balanced Accuracy : 0.6249
'Positive' Class : Yes
The classification has an accuracy of 96.7%, which seems impressive. But consider that the proportion of observations in the larger observed class is (958 + 7)/1,000 = 0.965. If you were to take a naïve approach and predict all observations as “No” without looking at the data, that decision rule would have an accuracy of 96.5%. In light of this, using three inputs has not improved the quality of the model by much.
The model is very specific, but not sensitive. It is much less likely to predict a “Yes” when the true state is “No” than it is to predict a “No” when the true state is “Yes”. Whether we can accept a model with a sensitivity of only 25.71% is questionable, despite its high accuracy. An evaluation of this model should consider whether the two errors, false positive and false negative predictions, are of equal importance and consequence.
We will revisit decision rules for these data in Chapter 13.
10.3 Modeling Counts
Count targets, where the possible values of the response variables are countable, \(0, 1, 2, \cdots\), arise in many situations. Two situations, which reflect different distributional properties, are
Counts out of a total. We are counting the number of event occurrences out of a total number of experiments. The support for these counts has an upper limit and the counts can be expressed as proportions. For example, the number of defective manufactured items out of 100 items has support \(\{0,1,2,\cdots,100\}\).
Counts per unit. We are counting how many times an event of interest occurs per day, per square foot, or per some other unit. The counts cannot be converted to proportions and they depend on the unit of measurement. By increasing the units by a factor of 2, one expects twice as many events. Examples are the number of disease incidences per 10,000 population, the number of customer calls per day received by a call center, the number of chocolate chips on a chocolate chip cookie, and so on.
Counts out of a total have a direct connection to binary data. If \(Y_1, Y_2, \cdots, Y_n\) are independent Bernoulli(\(\pi\)) random variables, then their sum is a Binomial(\(n,\pi\)) random variable. The approach to model such counts is also called Binomial regression.
The typical assumption for counts per unit is that they follow a Poisson distribution.
An introduction to binomial and Poisson regression follows here. Other distributions such as the Negative Binomial and concepts such as overdispersion and zero-inflated count processes are covered in more detail in the chapter on Generalized Linear Models (Chapter 28).
Binomial Regression
A Binomial(\(n,\pi\)) random variable is the sum of \(n\) independent Bernoulli(\(\pi\)) random variables. The mean and variance of a \(Y\sim \text{Binomial}(n,\pi\)) random variable follow from this definition \[ \text{E}[Y] = n\pi \quad \text{Var}[Y] = n\pi(1-\pi) \]
A regression for binomial data is thus a special case of a logistic regression with grouped data.
Example: Insecticide Dose Response
The data for this problem are from §6.7.1 of (Schabenberger and Pierce 2001). For each of seven concentrations of an insecticide, 20 larvae were exposed and the number of larvae killed was recorded.
conc   <- c(0.375, 0.75, 1.5, 3.0, 6.0, 12.0, 24.0)
killed <- c(0, 1, 8, 11, 16, 18, 20)
total  <- c(20, 20, 20, 20, 20, 20, 20)
insect <- data.frame(conc, killed, total)
knitr::kable(insect, format="html")
| conc | killed | total |
|---|---|---|
| 0.375 | 0 | 20 |
| 0.750 | 1 | 20 |
| 1.500 | 8 | 20 |
| 3.000 | 11 | 20 |
| 6.000 | 16 | 20 |
| 12.000 | 18 | 20 |
| 24.000 | 20 | 20 |
None of the twenty larvae exposed to a concentration of 0.375 succumbed to the insecticide. At a concentration of 3.0, 11 of the twenty larvae exposed to that concentration died, and so on.
Each larva’s response to the insecticide is a Bernoulli (binary) random variable. Since the larvae respond independently, the number of larvae killed out of 20 is a Binomial(20,\(\pi\)) random variable.
A goal of the data analysis is to derive the \(LD_{50}\) value, the insecticide concentration that leads to 50% mortality.
Binomial regression models the event probability \(\pi\) as a function of predictor (input) variables.
For each concentration we can calculate the proportion \(p = \text{killed}/\text{total}\) and its logit \(\log(p/(1-p))\) and plot the logits against the concentration. This is not possible for binary data where outcomes are coded either \(Y=1\) or \(Y=0\) and a logit cannot be calculated.
That gives us an idea of the relationship between mortality and concentration on the logit scale.
No larvae were killed at the first concentration and all larvae were killed at the last concentration. This leads to an undefined logit. To approximately visualize the observed logits we add/subtract a small amount to the counts for those concentrations. The original data will be used in modeling.
library(dplyr)
k2 <- c(0.2, 1, 8, 11, 16, 18, 19.8)
insect2 <- insect %>% mutate(prop = k2/total,
                             observed_logit = log(prop/(1-prop)),
                             logconc = log10(conc))
par(mfrow=c(1,2))
plot(insect2$conc,insect2$observed_logit,
xlab="Concentration",
ylab="Logit of sample proportion", type="b")
plot(insect2$logconc,insect2$observed_logit,
xlab="log10(Concentration)",
ylab="Logit of sample proportion", type="b")
The relationship between the logits of the sample proportion and insecticide concentration is not linear. Taking the log of the concentration removes the curvature. The graph in the right-hand panel suggests the following binomial regression model
- \(Y | x \sim \text{Binomial}(n,\pi(x))\)
- \(\eta = \beta_0 + \beta_1 \text{log10}(x)\)
- link function: logit
Example: Insecticide Dose Response (Cont’d)
To signal to the glm() function that the response is a binomial proportion we specify the response as a two-column matrix where the first column contains the number of events (larvae killed) and the second column contains the number of non-events (larvae that survived).
bin_reg <- glm(cbind(killed, total-killed) ~ logconc,
               family="binomial",
               data=insect2)
summary(bin_reg)
Call:
glm(formula = cbind(killed, total - killed) ~ logconc, family = "binomial",
data = insect2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7305 0.3741 -4.626 3.73e-06 ***
logconc 4.1651 0.6520 6.388 1.68e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 98.2178 on 6 degrees of freedom
Residual deviance: 4.6206 on 5 degrees of freedom
AIC: 23.019
Number of Fisher Scoring iterations: 4
The regression coefficient for logconc is highly significant. Figure 10.2 shows the predicted logistic probability function of the model and the \(LD_{50}\) value.
The \(LD_{50}\) is a value on the \(x\)-axis. This is an inverse prediction problem: to find the value on the \(x\)-axis that is associated with a predicted mortality of 0.5. Any value on the x-axis corresponding to a particular mortality rate \(\alpha\) can be found by solving the following equation for \(x_\alpha\)
\[ \text{logit}(\alpha) = \beta_0 + \beta_1\text{log}_{10}(x_\alpha) \]
The solution is \[ x_\alpha = 10^{(\text{logit}(\alpha) - \beta_0)/\beta_1} \]
With \(\alpha = 0.5\) and applying the estimated regression coefficients you get \[ x_{0.5} = 10^{(\text{logit}(0.5) + 1.7305)/4.1651} = 10^{0.4155} = 2.603 \]
The insecticide dosage that is fatal for 50% of the larvae, the \(LD_{50}\) value, is 2.603 (the log10 concentration is 0.4155).
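The same calculation can be carried out directly from the fitted model object; a short sketch, assuming the bin_reg fit from above (qlogis(0.5) is the logit of 0.5, which equals zero):
LD50 <- 10^((qlogis(0.5) - coef(bin_reg)[1]) / coef(bin_reg)[2])  # invert the fitted logit model
as.numeric(LD50)                                                  # approximately 2.603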
Poisson Regression
The Poisson distribution is arguably the go-to distribution when the target variable is a count per unit. It is a simple yet powerful probability model. A Poisson random variable has support \(\{0,1,2,\cdots\}\) and probability mass function (p.m.f.) \[ \text{Pr}(Y=y) = \frac{\lambda^y}{y!} e^{-\lambda} \] The parameter \(\lambda\) is the average count per unit and is also the variance of \(Y\), \(\text{E}[Y] = \text{Var}[Y] = \lambda\).
The following plots show the p.m.f. for Poisson distributions with \(\lambda = 2\) and \(\lambda = 5\).
par(mfrow=c(1,2))
barplot(dpois(seq(0,10,1),2), names.arg=seq(0,10,1), main = "Poisson(2)")
barplot(dpois(seq(0,14,1),5), names.arg=seq(0,14,1), main = "Poisson(5)")
The natural link function for the Poisson distribution is the log link. A Poisson regression model has the following configuration:
- \(Y | \textbf{x}\sim \text{Poisson}(\lambda(\textbf{x}))\)
- \(\eta = \textbf{x}^\prime \boldsymbol{\beta}\)
- \(\log(\lambda(\textbf{x})) = \eta\)
Example: Poisson Regression for Bike Sharing Data
The data for this example come with the ISLR2 library and contain 8,645 records of the number of bike rentals per hour in Washington, DC, along with time/date and weather information.
library(ISLR2)
attach(Bikeshare)
str(Bikeshare)
'data.frame': 8645 obs. of 15 variables:
$ season : num 1 1 1 1 1 1 1 1 1 1 ...
$ mnth : Factor w/ 12 levels "Jan","Feb","March",..: 1 1 1 1 1 1 1 1 1 1 ...
$ day : num 1 1 1 1 1 1 1 1 1 1 ...
$ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
$ holiday : num 0 0 0 0 0 0 0 0 0 0 ...
$ weekday : num 6 6 6 6 6 6 6 6 6 6 ...
$ workingday: num 0 0 0 0 0 0 0 0 0 0 ...
$ weathersit: Factor w/ 4 levels "clear","cloudy/misty",..: 1 1 1 1 1 2 1 1 1 1 ...
$ temp : num 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
$ atemp : num 0.288 0.273 0.273 0.288 0.288 ...
$ hum : num 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
$ windspeed : num 0 0 0 0 0 0.0896 0 0 0 0 ...
$ casual : num 3 8 5 3 0 0 2 1 1 8 ...
$ registered: num 13 32 27 10 1 1 0 2 7 6 ...
$ bikers : num 16 40 32 13 1 1 2 3 8 14 ...
The following statements fit the same Poisson regression model as in (James et al. 2021, 187)
mod.pois <- glm(bikers ~ mnth + hr + workingday + temp + weathersit,
                data = Bikeshare,
                family = "poisson")
summary(mod.pois)
Call:
glm(formula = bikers ~ mnth + hr + workingday + temp + weathersit,
family = "poisson", data = Bikeshare)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.693688 0.009720 277.124 < 2e-16 ***
mnthFeb 0.226046 0.006951 32.521 < 2e-16 ***
mnthMarch 0.376437 0.006691 56.263 < 2e-16 ***
mnthApril 0.691693 0.006987 98.996 < 2e-16 ***
mnthMay 0.910641 0.007436 122.469 < 2e-16 ***
mnthJune 0.893405 0.008242 108.402 < 2e-16 ***
mnthJuly 0.773787 0.008806 87.874 < 2e-16 ***
mnthAug 0.821341 0.008332 98.573 < 2e-16 ***
mnthSept 0.903663 0.007621 118.578 < 2e-16 ***
mnthOct 0.937743 0.006744 139.054 < 2e-16 ***
mnthNov 0.820433 0.006494 126.334 < 2e-16 ***
mnthDec 0.686850 0.006317 108.724 < 2e-16 ***
hr1 -0.471593 0.012999 -36.278 < 2e-16 ***
hr2 -0.808761 0.014646 -55.220 < 2e-16 ***
hr3 -1.443918 0.018843 -76.631 < 2e-16 ***
hr4 -2.076098 0.024796 -83.728 < 2e-16 ***
hr5 -1.060271 0.016075 -65.957 < 2e-16 ***
hr6 0.324498 0.010610 30.585 < 2e-16 ***
hr7 1.329567 0.009056 146.822 < 2e-16 ***
hr8 1.831313 0.008653 211.630 < 2e-16 ***
hr9 1.336155 0.009016 148.191 < 2e-16 ***
hr10 1.091238 0.009261 117.831 < 2e-16 ***
hr11 1.248507 0.009093 137.304 < 2e-16 ***
hr12 1.434028 0.008936 160.486 < 2e-16 ***
hr13 1.427951 0.008951 159.529 < 2e-16 ***
hr14 1.379296 0.008999 153.266 < 2e-16 ***
hr15 1.408149 0.008977 156.862 < 2e-16 ***
hr16 1.628688 0.008805 184.979 < 2e-16 ***
hr17 2.049021 0.008565 239.221 < 2e-16 ***
hr18 1.966668 0.008586 229.065 < 2e-16 ***
hr19 1.668409 0.008743 190.830 < 2e-16 ***
hr20 1.370588 0.008973 152.737 < 2e-16 ***
hr21 1.118568 0.009215 121.383 < 2e-16 ***
hr22 0.871879 0.009536 91.429 < 2e-16 ***
hr23 0.481387 0.010207 47.164 < 2e-16 ***
workingday 0.014665 0.001955 7.502 6.27e-14 ***
temp 0.785292 0.011475 68.434 < 2e-16 ***
weathersitcloudy/misty -0.075231 0.002179 -34.528 < 2e-16 ***
weathersitlight rain/snow -0.575800 0.004058 -141.905 < 2e-16 ***
weathersitheavy rain/snow -0.926287 0.166782 -5.554 2.79e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1052921 on 8644 degrees of freedom
Residual deviance: 228041 on 8605 degrees of freedom
AIC: 281159
Number of Fisher Scoring iterations: 5
Input variables mnth, hr, and weathersit are factors; their first levels serve as the reference levels in the model (mnth="Jan", hr=0, weathersit="clear"). The coefficient estimates inform us that bikeshare usage is high during commute hours (hr 8, 17, 18) and highest in spring and fall. Clear weather has higher usage than any of the other recorded weather situations.
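Because of the log link, exponentiated coefficients act as multiplicative effects on the expected count; a small sketch, assuming the mod.pois fit from above:
# multiplicative effects on the expected hourly count, relative to the reference level
exp(coef(mod.pois)[c("workingday","weathersitcloudy/misty","weathersitlight rain/snow")])
# e.g., cloudy/misty weather multiplies the expected count by about exp(-0.075) = 0.93,
# roughly a 7% reduction relative to clear weather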
As the mean of the Poisson distribution increases, the distribution becomes more symmetric and approaches a Gaussian(\(\lambda,\lambda\)) distribution. The average hourly bike sharing count in the bike share data set is 143.7944, so it might be tempting to analyze the bikers variable with a classical linear model.
Example: Classical Linear Model for Bike Sharing Data
mod.lm <- lm(bikers ~ mnth + hr + workingday + temp + weathersit,
             data = Bikeshare)
summary(mod.lm)
Call:
lm(formula = bikers ~ mnth + hr + workingday + temp + weathersit,
data = Bikeshare)
Residuals:
Min 1Q Median 3Q Max
-299.00 -45.70 -6.23 41.08 425.29
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -68.632 5.307 -12.932 < 2e-16 ***
mnthFeb 6.845 4.287 1.597 0.110398
mnthMarch 16.551 4.301 3.848 0.000120 ***
mnthApril 41.425 4.972 8.331 < 2e-16 ***
mnthMay 72.557 5.641 12.862 < 2e-16 ***
mnthJune 67.819 6.544 10.364 < 2e-16 ***
mnthJuly 45.324 7.081 6.401 1.63e-10 ***
mnthAug 53.243 6.640 8.019 1.21e-15 ***
mnthSept 66.678 5.925 11.254 < 2e-16 ***
mnthOct 75.834 4.950 15.319 < 2e-16 ***
mnthNov 60.310 4.610 13.083 < 2e-16 ***
mnthDec 46.458 4.271 10.878 < 2e-16 ***
hr1 -14.579 5.699 -2.558 0.010536 *
hr2 -21.579 5.733 -3.764 0.000168 ***
hr3 -31.141 5.778 -5.389 7.26e-08 ***
hr4 -36.908 5.802 -6.361 2.11e-10 ***
hr5 -24.135 5.737 -4.207 2.61e-05 ***
hr6 20.600 5.704 3.612 0.000306 ***
hr7 120.093 5.693 21.095 < 2e-16 ***
hr8 223.662 5.690 39.310 < 2e-16 ***
hr9 120.582 5.693 21.182 < 2e-16 ***
hr10 83.801 5.705 14.689 < 2e-16 ***
hr11 105.423 5.722 18.424 < 2e-16 ***
hr12 137.284 5.740 23.916 < 2e-16 ***
hr13 136.036 5.760 23.617 < 2e-16 ***
hr14 126.636 5.776 21.923 < 2e-16 ***
hr15 132.087 5.780 22.852 < 2e-16 ***
hr16 178.521 5.772 30.927 < 2e-16 ***
hr17 296.267 5.749 51.537 < 2e-16 ***
hr18 269.441 5.736 46.976 < 2e-16 ***
hr19 186.256 5.714 32.596 < 2e-16 ***
hr20 125.549 5.704 22.012 < 2e-16 ***
hr21 87.554 5.693 15.378 < 2e-16 ***
hr22 59.123 5.689 10.392 < 2e-16 ***
hr23 26.838 5.688 4.719 2.41e-06 ***
workingday 1.270 1.784 0.711 0.476810
temp 157.209 10.261 15.321 < 2e-16 ***
weathersitcloudy/misty -12.890 1.964 -6.562 5.60e-11 ***
weathersitlight rain/snow -66.494 2.965 -22.425 < 2e-16 ***
weathersitheavy rain/snow -109.745 76.667 -1.431 0.152341
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 76.5 on 8605 degrees of freedom
Multiple R-squared: 0.6745, Adjusted R-squared: 0.6731
F-statistic: 457.3 on 39 and 8605 DF, p-value: < 2.2e-16
The linear model explains 67.45% of the variability in the number of bike shares and has many highly significant predictors. Are the predictions comparable to those from the Poisson model? A look at Figure 10.3 shows that they are not at all the same. Predictions are higher under the Poisson model for small and large values, and the predictions of the linear model extend into negative values.
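A comparison along the lines of Figure 10.3 can be sketched by plotting the two sets of fitted means against each other; an illustrative sketch, assuming mod.pois and mod.lm from above (not necessarily the code behind the figure):
pred_lm  <- predict(mod.lm)                     # linear model predictions, mean scale
pred_poi <- predict(mod.pois, type="response")  # Poisson predictions, mean scale
plot(pred_lm, pred_poi,
     xlab="Predictions, linear model",
     ylab="Predictions, Poisson regression")
abline(a=0, b=1, lty=2)  # 45-degree reference line
abline(v=0, col="red")   # linear-model predictions left of this line are negative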

10.4 Log-likelihood, Deviance, and the Likelihood-ratio Test
Interestingly, all coefficients are highly statistically significant in the Poisson analysis of the bike share data. As we will discover in greater detail in the chapter on generalized linear models, the analysis shows a red flag: the ratio of the residual deviance to the residual degrees of freedom should generally be about 1.
The deviance of a model is twice the difference between the largest possible log likelihood, achievable in a saturated (yet overfit) model, and the log likelihood evaluated at the parameter estimates. The Null deviance reported by glm is the deviance for a model without input variables (intercept-only model). The Residual deviance is the deviance for the model with the specified input variables.
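For a Poisson model the saturated log likelihood sets each fitted mean equal to the observed count, so the residual deviance can be verified from its definition; a sketch, assuming the mod.pois fit from above:
y <- Bikeshare$bikers
ll_sat <- sum(dpois(y, lambda=y, log=TRUE))  # saturated model: fitted mean = observed count
2 * (ll_sat - as.numeric(logLik(mod.pois)))  # matches the residual deviance (about 228041)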
mod.null <- glm(bikers ~ 1, data=Bikeshare, family=poisson)
summary(mod.null)
Call:
glm(formula = bikers ~ 1, family = poisson, data = Bikeshare)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.9683848 0.0008969 5539 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1052921 on 8644 degrees of freedom
Residual deviance: 1052921 on 8644 degrees of freedom
AIC: 1105961
Number of Fisher Scoring iterations: 5
In the null model the Null and Residual deviances are identical and match the null deviance reported in the previous model with input variables.
The amount by which the deviance decreases through the addition of the input variables is the test statistic for the likelihood ratio test (LRT) that all coefficients except the intercept are simultaneously zero.
Example: Likelihood Ratio Test for Bike Sharing Data
The logLik function extracts the log likelihood for model classes that export that information.
LRT <- 2*(logLik(mod.pois) - logLik(mod.null))
as.numeric(LRT)
[1] 824880.2
mod.pois$null.deviance - mod.pois$deviance
[1] 824880.2
Under certain conditions, this likelihood-ratio test statistic has a \(\chi^2\) distribution with degrees of freedom equal to the difference in the number of estimated parameters between the two models–here \(r(\textbf{X}) - 1 = 39\). The LRT is the equivalent in likelihood-based estimation of the sum-of-squares reduction test in least-squares based estimation (Section 7.2).
Definition: Likelihood Ratio Test
Suppose that \(\ell_f\) and \(\ell_r\) are the log likelihoods in a full and reduced model where the reduced model is nested within the full model through a constraint (a hypothesis) \(H\) with \(q\) degrees of freedom. The likelihood ratio test statistic \[ \Lambda = 2(\ell_f - \ell_r) \] has an asymptotic \(\chi^2_q\) distribution under the hypothesis.
The LRT for the constraint \(H: \beta_1 = \beta_2 = \cdots = \beta_{39} = 0\) has a near-zero \(p\)-value, indicating that not all of the coefficients are simultaneously zero.
pchisq(LRT,df=39,lower.tail=FALSE)
'log Lik.' 0 (df=40)
The anova function provides a convenient way of performing the LRT for nested models:
anova(mod.null,mod.pois,test="LRT")
Analysis of Deviance Table
Model 1: bikers ~ 1
Model 2: bikers ~ mnth + hr + workingday + temp + weathersit
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 8644 1052921
2 8605 228041 39 824880 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Under certain conditions the residual deviance has a \(\chi^2\) distribution with \(n-r(\textbf{X})\) degrees of freedom.
Since the mean of a \(\chi^2_\nu\) random variable is equal to the degrees of freedom, \(\nu\), this suggests a simple calculation with the residual deviance: the ratio of the deviance and its degrees of freedom should be about 1. We can see that this ratio is far from 1 in the Poisson regression:
mod.pois$deviance / mod.pois$df.residual
[1] 26.50099
This value indicates that there remains considerably more variability than expected: the model is not correct in some way. This condition is called overdispersion and can have many causes:
- important input variables are missing from the model
- the data are auto-correlated, for example in time series or longitudinal data
- the data do not follow a Poisson distribution but a model with greater dispersion, e.g., the Negative Binomial distribution
- the data were generated by more than one process and we observe a mixture.
Overdispersion is addressed by turning the appropriate knob. If overdispersion is not addressed, the precision of the fitted model is overstated: standard error estimates are too small, confidence and prediction intervals are too narrow, \(p\)-values are too small. More on this topic in Section 28.7.
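One simple adjustment, by no means the only remedy and not necessarily the approach taken in Section 28.7, is a quasi-Poisson fit, which estimates a dispersion parameter and scales the standard errors accordingly; a sketch, assuming the Bikeshare model from the previous section:
mod.quasi <- glm(bikers ~ mnth + hr + workingday + temp + weathersit,
                 data = Bikeshare,
                 family = quasipoisson)  # same mean model, but a dispersion parameter is estimated
summary(mod.quasi)$dispersion            # expect a value far greater than 1 here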
10.5 Working with Offsets
An offset is an input variable that has a known coefficient of 1. Consider the model \[ g(\mu) = \beta_0 + \beta_1x_1 + \beta_2x_2 + x^o \] \(x^o\) acts as an offset to the linear predictor. It is not an intercept, which takes on the same value for all observations; the offset can change from one observation to the next. But it is also not an input variable whose coefficient needs to be estimated: we know a priori that the offset enters the model with a coefficient of 1.
Offsets are important when working with count variables that relate to different units. If \(Y_1\) is a count per day and \(Y_2\) is a count of the same random process per week, the two do not have the same mean, \(\text{E}[Y_2] = 7\text{E}[Y_1]\). When incidences (crime, cancer, …) are counted in areas of different size or population (by city, by zip code, by county, by age group,…) we are more interested in modeling the event rate (cancer rate in Seattle per 100,000) than the raw counts (cancer cases in Seattle).
If count variables relate to different units we need to adjust the analysis, otherwise changes in the mean due to size differences are attributed to the wrong effects. For Poisson regression the problem can be formalized as follows:
\[ Y_i \sim \text{Poisson}( x_i^o \lambda) \] and \(\lambda = \exp\{\textbf{x}^\prime\boldsymbol{\beta}\}\). In other words, we have to account for the effect of the differing unit sizes \(x_i^o\) on the mean of the data. In the case of a log link, the model for the mean of \(Y_i\) becomes \[ \text{E}[Y_i] = x_i^o\exp\{\textbf{x}^\prime\boldsymbol{\beta}\} = \exp\{\textbf{x}^\prime\boldsymbol{\beta}+ \log(x_i^o)\} \] The log of the size of the units is an offset variable on the linear predictor. The interpretation of \(\lambda\) is now in terms of a rate of events rather than a raw count.
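In R's glm an offset can be supplied either through the offset= argument or with offset() inside the formula; the two specifications are equivalent. A minimal sketch with simulated, hypothetical variables (y, x, size):
set.seed(1)
size <- rep(c(1, 2, 5), each=20)                     # hypothetical unit sizes
x    <- rnorm(60)
y    <- rpois(60, lambda = size * exp(0.5 + 0.3*x))  # counts scale with the unit size
fit1 <- glm(y ~ x, offset=log(size), family=poisson)
fit2 <- glm(y ~ x + offset(log(size)), family=poisson)
cbind(coef(fit1), coef(fit2))                        # identical coefficient estimates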
Example: Non-melanoma Skin Cancer
The following data appear in (Kleinbaum et al. 2013) and describe the incidence of non-melanoma skin cancer among women by age group in two cities.
nonmel <- read.table(header = TRUE,
                     text = "
cases city u1 u2 u3 u4 u5 u6 u7 n
1 1 0 1 0 0 0 0 0 0 172675
2 16 0 0 1 0 0 0 0 0 123065
3 30 0 0 0 1 0 0 0 0 96216
4 71 0 0 0 0 1 0 0 0 92051
5 102 0 0 0 0 0 1 0 0 72159
6 130 0 0 0 0 0 0 1 0 54722
7 133 0 0 0 0 0 0 0 1 32185
8 40 0 0 0 0 0 0 0 0 8328
9 4 1 1 0 0 0 0 0 0 181343
10 38 1 0 1 0 0 0 0 0 146207
11 119 1 0 0 1 0 0 0 0 121374
12 221 1 0 0 0 1 0 0 0 111353
13 259 1 0 0 0 0 1 0 0 83004
14 310 1 0 0 0 0 0 1 0 55932
15 226 1 0 0 0 0 0 0 1 29007
16 65 1 0 0 0 0 0 0 0 7583
")
nonmel <- within(nonmel, {
    agegroup <- rep(c("15_24","25_34","35_44","45_54","55_64","65_74","75_84","85+"), 2)
    agegroup <- factor(agegroup)
    city <- factor(city, 0:1, c("Minneapolis", "Dallas"))
})

nonmel <- nonmel[c("cases","n","city","agegroup")]

knitr::kable(nonmel, format="html")
| cases | n | city | agegroup |
|---|---|---|---|
| 1 | 172675 | Minneapolis | 15_24 |
| 16 | 123065 | Minneapolis | 25_34 |
| 30 | 96216 | Minneapolis | 35_44 |
| 71 | 92051 | Minneapolis | 45_54 |
| 102 | 72159 | Minneapolis | 55_64 |
| 130 | 54722 | Minneapolis | 65_74 |
| 133 | 32185 | Minneapolis | 75_84 |
| 40 | 8328 | Minneapolis | 85+ |
| 4 | 181343 | Dallas | 15_24 |
| 38 | 146207 | Dallas | 25_34 |
| 119 | 121374 | Dallas | 35_44 |
| 221 | 111353 | Dallas | 45_54 |
| 259 | 83004 | Dallas | 55_64 |
| 310 | 55932 | Dallas | 65_74 |
| 226 | 29007 | Dallas | 75_84 |
| 65 | 7583 | Dallas | 85+ |
The cases variable is the raw count for the number of cases. Is the rate of non-melanoma skin cancer in 25–34 year old individuals in Dallas higher than in Minneapolis? To answer this question we need to adjust the raw counts for the population size. If the log of the population size n is used as the offset in a Poisson regression, then we are modeling the cancer rate per person in the two cities. Other offsets are possible, for example, to model the rate per 10,000 population. What matters is that the raw counts are adjusted to a common unit.
Since city and agegroup are factors in the analysis, we choose reference levels for both. The results are expressed relative to the oldest age group in Minneapolis.
nonmel$city <- relevel(nonmel$city, ref="Minneapolis")
nonmel$agegroup <- relevel(nonmel$agegroup, ref="85+")
The offset=log(n) option in the glm call defines the offset for the model.
mod_poi <- glm(cases ~ city + agegroup,
               offset=log(n),
               family="poisson",
               data=nonmel)
summary(mod_poi)
Call:
glm(formula = cases ~ city + agegroup, family = "poisson", data = nonmel,
offset = log(n))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.4834 0.1037 -52.890 < 2e-16 ***
cityDallas 0.8039 0.0522 15.399 < 2e-16 ***
agegroup15_24 -6.1742 0.4577 -13.488 < 2e-16 ***
agegroup25_34 -3.5440 0.1675 -21.160 < 2e-16 ***
agegroup35_44 -2.3268 0.1275 -18.254 < 2e-16 ***
agegroup45_54 -1.5790 0.1138 -13.871 < 2e-16 ***
agegroup55_64 -1.0869 0.1109 -9.800 < 2e-16 ***
agegroup65_74 -0.5288 0.1086 -4.868 1.13e-06 ***
agegroup75_84 -0.1157 0.1109 -1.042 0.297
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2789.6810 on 15 degrees of freedom
Residual deviance: 8.2585 on 7 degrees of freedom
AIC: 120.5
Number of Fisher Scoring iterations: 4
With the chosen reference levels, the intercept is the linear predictor for residents of Minneapolis in the 85+ age group. Exponentiating the intercept gives the predicted cancer rate (per person) in that group.
cat("The predicted cancer rate among residents of \nMinneapolis in the 85+ age group is ",
exp(mod_poi$coefficients[1]),"\n")
The predicted cancer rate among residents of
Minneapolis in the 85+ age group is 0.004155093
cat("The predicted number of non-melanoma skin cancer cases \namong 1,000 residents of Minneapolis 85 years is ",
1000*exp(mod_poi$coefficients[1]), "\n")
The predicted number of non-melanoma skin cancer cases
among 1,000 residents of Minneapolis 85 years is 4.155093
The number of cases for a particular population size can also be obtained by supplying the value for the offset variable:
xvals <- data.frame(city="Minneapolis", agegroup="85+", n=1000)
predict(mod_poi, newdata=xvals, type="response")
1
4.155093
Since we are modeling a standardized rate, comparisons across cities and age groups are now possible. For example, the following computes the excess number of non-melanoma skin cancer cases per 100,000 individuals in the 35–44 age group compared to the 15–24 age group for the two cities:
xvals <- data.frame(city="Minneapolis", agegroup="15_24", n=100000)
xvals[nrow(xvals)+1,] <- list(city="Minneapolis", agegroup="35_44", n=100000)
xvals[nrow(xvals)+1,] <- list(city="Dallas", agegroup="15_24", n=100000)
xvals[nrow(xvals)+1,] <- list(city="Dallas", agegroup="35_44", n=100000)

p <- predict(mod_poi, newdata=xvals, type="response")
cat("Excess cancers per 100,000 in Minneapolis ", p[2]-p[1], "\n")
Excess cancers per 100,000 in Minneapolis 39.69063
cat("Excess cancers per 100,000 in Dallas ", p[4]-p[3], "\n")
Excess cancers per 100,000 in Dallas 88.67815
The non-melanoma skin cancer rate in Dallas is 2.2342 times the rate in Minneapolis. Since the model does not include an interaction between city and age group, this multiplier applies to all age groups. You can find it more easily by exponentiating the coefficient for cityDallas:
as.numeric(exp(mod_poi$coefficients[2]))
[1] 2.234234
10.6 Zero-inflated Models
We briefly mentioned overdispersion in this chapter, the condition by which the data are more dispersed than is permissible under an assumed probability model. Gaussian data can never be overdispersed, because the mean and variance of the Gaussian are not related: any mean can be combined with any variance. That is quite unusual among probability distributions. In most cases, there is some relationship between the mean and the variance. For example, the Poisson(\(\lambda\)) distribution has \[ \text{E}[Y] = \text{Var}[Y] = \lambda \]
Regression explicitly models the mean function; when mean and variance are functionally related, the model implicitly captures the variance. Suppose you perform linear Poisson regression with \[ \log(\text{E}[Y]) = \beta_0 + \beta_1 x \] and you find that the variability grows faster in \(x\) than the mean. Either the mean function is not correctly specified, or \(Y|x\) does not follow a Poisson distribution, or both.
Especially with count data, a possible reason for overdispersion in the data is zero-inflated responses. Under this condition we observe more zero counts than expected under the reference distribution, usually a Poisson or Negative Binomial distribution. For example, if you ask visitors to a state park how many fish they caught that day, you will probably receive more zero responses than expected under a Poisson model because multiple processes are at work, generating zeros. Some visitors do not fish at all and naturally won’t catch any fish. Some visitors fish but are unlucky anglers that day. Zero count values appear in your data from multiple sources, inflating the overall count.
Example: Simulated Zero Inflation
The following code simulates a zero-inflated Poisson process. 1,000 observations are drawn from a Poisson(5) process. A second process, p2, is deterministic and generates only zeros. With probability 0.2 an observation is drawn from p2, and with probability 0.8 an observation is drawn from the Poisson(5) process p1. The result is a zero-inflated Poisson process.
set.seed(876)
u <- runif(1000)

p1 <- rpois(1000, 5)
p2 <- 0
zip <- ifelse(u < 0.2, p2, p1)
# Effect of zero inflation
par(mfrow=c(1,2))
hist(zip)
hist(p1)
The excess zeros in the zero-inflated process lead to greater variability in the data compared to the non-inflated process.
sd(zip)
[1] 2.83468
sd(p1)
[1] 2.214244
If unaccounted for, zero inflation manifests itself as overdispersion in count data.
summary(glm(p2 ~ 1, family="poisson"))$deviance
[1] 4.122307e-10
summary(glm(zip ~ 1, family="poisson"))$deviance
[1] 2589.894
A zero-inflated model is a special case of a finite mixture model (FMM). In a \(k\)-component finite mixture the probability distribution of the target variable is expressed as a weighted combination of \(k\) distributions \[ p(y) = \sum_{j=1}^k w_j p_j(y) \]
The \(w_j\) are called the mixing probabilities of the FMM. In a homogeneous mixture model the component distributions \(p_j(y)\) are from the same family, for example, they are all Gaussian distributions but with different parameters: \(p_j(y) = G(\mu_j,\sigma^2_j)\). A heterogeneous mixture model features component distributions from different families.
The zero-inflated Poisson model is a heterogeneous, two-component mixture in which one component distribution is concentrated at 0 and the other distribution is Poisson(\(\lambda\))
\[ p_1(y) = \left \{ \begin{array}{ll} 1 & y = 0 \\ 0 & \text{otherwise}\end{array}\right . \] \[ p_2(y) = \frac{\lambda^y}{y!} e^{-\lambda} \] The overall model can be written as \[ \Pr(Y=y) = \pi \, p_1(y) + (1-\pi) \, p_2(y) \] where \(\pi\) is the mixing probability of the zero component.
Opportunities to introduce input variables are the mixing probability \(\pi\) and the mean of the Poisson model. The mixing probability can be modeled as a logistic-type model. The model for the mean of \(p_2(y)\) is a Poisson regression with log link. The two models can depend on the same inputs or different inputs. For example, the likelihood of excess zeros might depend on age of an individual but the mean of the Poisson variable is not a function of age.
Using \(\textbf{z}\) for the vector of inputs in the model for the mixing probability and \(\textbf{x}\) for the vector of inputs in the Poisson model, the zero-inflated Poisson model in terms of estimable parameters is
\[\begin{align*} \text{logit}(\pi) &= \textbf{z}^\prime \boldsymbol{\alpha}\\ \log(\lambda) &= \textbf{x}^\prime \boldsymbol{\beta} \end{align*}\]
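The mixture above implies, for example, \(\Pr(Y=0) = \pi + (1-\pi)e^{-\lambda}\) and \(\text{E}[Y] = (1-\pi)\lambda\). A small sketch (dzipois is a hypothetical helper, not from any package) evaluates the zero-inflated Poisson p.m.f. and checks the mean numerically:
dzipois <- function(y, lambda, pi) {
    ifelse(y == 0,
           pi + (1 - pi) * dpois(0, lambda),  # zeros come from both components
           (1 - pi) * dpois(y, lambda))       # positive counts only from the Poisson part
}
sum(0:50 * dzipois(0:50, lambda=5, pi=0.2))   # approximately E[Y] = (1 - 0.2)*5 = 4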
Example: Catching Fish
The data for this example comes from the documentation of the FMM procedure in SAS/STAT software. The data represent gender, age, and the number of fish reportedly caught by visitors of a park.
suppressWarnings(library("duckdb"))
Loading required package: DBI
con <- dbConnect(duckdb(), dbdir="ads.ddb", read_only=TRUE)
catch <- dbGetQuery(con, "SELECT * FROM catch")
dbDisconnect(con)
head(catch)
gender age count
1 F 54 18
2 M 37 0
3 F 48 12
4 M 27 0
5 M 55 0
6 M 32 0
A Poisson generalized linear model with a common intercept and different age slopes for the genders exhibits considerable overdispersion.
poi_reg <- glm(count ~ gender:age, data=catch, family="poisson")
summary(poi_reg)
Call:
glm(formula = count ~ gender:age, family = "poisson", data = catch)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.98109 0.54391 -7.319 2.49e-13 ***
genderF:age 0.12778 0.01149 11.125 < 2e-16 ***
genderM:age 0.10442 0.01224 8.531 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 315.58 on 51 degrees of freedom
Residual deviance: 104.84 on 49 degrees of freedom
AIC: 188.71
Number of Fisher Scoring iterations: 5
A zero-inflated Poisson model can be fit in R with the zeroinfl function in the pscl package. The function specifies separate linear predictors for the count model and the zero-inflation part, separated by a vertical bar. count ~ gender:age | 1 implies the same mean model for the count part as in the Poisson model above and an intercept-only model for the zero-inflation part. The dist= option specifies the distribution of the count variable; the link= option specifies the link function for the zero-inflation part.
library(pscl)
Classes and Methods for R originally developed in the
Political Science Computational Laboratory
Department of Political Science
Stanford University (2002-2015),
by and under the direction of Simon Jackman.
hurdle and zeroinfl functions by Achim Zeileis.
zip <- zeroinfl(count ~ gender:age | 1,
                data=catch,
                dist="poisson",  # link for the GLM part assumed to be log()
                link="logit")    # this is the link for the zero-inflation part!
summary(zip)
Call:
zeroinfl(formula = count ~ gender:age | 1, data = catch, dist = "poisson",
link = "logit")
Pearson residuals:
Min 1Q Median 3Q Max
-1.3540 -0.6922 -0.3865 0.6969 2.4075
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.52151 0.64473 -5.462 4.71e-08 ***
genderF:age 0.12157 0.01344 9.046 < 2e-16 ***
genderM:age 0.10562 0.01393 7.580 3.45e-14 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.8342 0.4768 -1.749 0.0802 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 10
Log-likelihood: -72.81 on 4 Df
The estimate of the intercept in the zero-inflation model is \(\widehat{\alpha}_0 =\) -0.8342. Applying the inverse link yields the estimated mixing probability:
coef <- zip$optim$par
coef
(Intercept) genderF:age genderM:age (Intercept)
-3.5215051 0.1215663 0.1056187 -0.8341843
1/(1+exp(-coef[4])) #mixing prob. for the zero-inflated part
(Intercept)
0.3027611
30.3% of the observations come from the zero-inflation process. The remaining 69.7% of the observations come from a Poisson process with estimated mean
\[ \exp\{\widehat{\beta}_0 + \widehat{\beta}_{1f} \,\text{age} \} = \exp\{-3.52151 + 0.12157\times \text{age}\} \]
for females and \[ \exp\{\widehat{\beta}_0 + \widehat{\beta}_{1m} \, \text{age} \} = \exp\{-3.52151 + 0.10562\times \text{age}\} \] for males.
How do you calculate the predicted values in the zero-inflated Poisson model? The predictions of the two processes and the mixing probabilities need to be combined. For the first two observations,
mixprob <- 1/(1+exp(-coef[4]))

cat("Predicted value for obs 1: ",
    mixprob * 0 + (1-mixprob) * exp(coef[1] + catch[1,"age"]*coef[2]), "\n")
Predicted value for obs 1: 14.62083
cat("Predicted value for obs 2: ",
    mixprob * 0 + (1-mixprob) * exp(coef[1] + catch[2,"age"]*coef[3]))
Predicted value for obs 2: 1.026095
predict(zip,newdata=catch[1:2,])
1 2
14.620832 1.026095
A likelihood-ratio test can be used to test whether adding the zero-inflation to the Poisson model improves the overall model. The two models are nested; the ZIP model reduces to a Poisson model as \(\pi \rightarrow 0\), or equivalently, \(\alpha_0 \rightarrow -\infty\).
lrt <- 2*(logLik(zip) - logLik(poi_reg))
pvalue <- pchisq(lrt, df=1, lower.tail=FALSE)
cat("p-value of LRT :", pvalue)
p-value of LRT : 1.120097e-09
The ZIP model provides a significantly better fit than the Poisson model.
The next model adds the gender variable to the zero-inflation part of the model. There are now two mixing probabilities, one for males and one for females.
zip_g <- zeroinfl(count ~ gender:age | gender,
                  data=catch,
                  dist="poisson",
                  link="logit")
summary(zip_g)
Call:
zeroinfl(formula = count ~ gender:age | gender, data = catch, dist = "poisson",
link = "logit")
Pearson residuals:
Min 1Q Median 3Q Max
-1.7975 -0.6309 -0.3893 0.5731 2.3716
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.40113 0.63517 -5.355 8.57e-08 ***
genderF:age 0.11893 0.01327 8.965 < 2e-16 ***
genderM:age 0.10448 0.01372 7.617 2.60e-14 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.5405 0.7141 -2.157 0.0310 *
genderM 1.5856 0.8858 1.790 0.0734 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 11
Log-likelihood: -71.1 on 5 Df
lrt <- 2*(logLik(zip_g) - logLik(zip))
pvalue <- pchisq(lrt, df=1, lower.tail=FALSE)
cat("p-value of LRT :", pvalue)
p-value of LRT : 0.06473306
The likelihood ratio test of a gender-specific mixing probability is not significant at the 5% level (\(p =\) 0.0647), very similar to the Wald-type \(p\)-value of 0.0734 reported in the output.