7  The Classical Linear Model

The classical (standard) linear model is \[ \textbf{Y}= \textbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}, \quad \quad \boldsymbol{\epsilon}\sim (\textbf{0},\sigma^2\textbf{I}) \] \(\textbf{Y}\) is an \((n \times 1)\) vector of target values, \(\textbf{X}\) is an \((n \times (p+1))\) matrix of \(p\) input variables plus an intercept, and \(\boldsymbol{\epsilon}\) is an \((n \times 1)\) random vector whose elements have mean zero and the same variance \(\sigma^2\). If \(\textbf{X}\) is of full rank, the OLS estimator \[ \widehat{\boldsymbol{\beta}} = \left(\textbf{X}^\prime\textbf{X}\right)^{-1} \textbf{X}^\prime\textbf{Y} \] is the best linear unbiased estimator: among all linear unbiased estimators it has the smallest variance. Predicted values and fitted residuals are given by

\[\begin{align*} \widehat{\textbf{y}} &= \textbf{X}\widehat{\boldsymbol{\beta}} = \textbf{H}\textbf{y}\\ \widehat{\boldsymbol{\epsilon}} = \textbf{y}- \widehat{\textbf{y}} &= (\textbf{I}- \textbf{H})\textbf{y} \end{align*}\]

where \(\textbf{H}\) is the “Hat” matrix \(\textbf{H}= \textbf{X}(\textbf{X}^\prime\textbf{X})^{-1}\textbf{X}^\prime\).

The OLS estimator has variance \[ \text{Var}[\widehat{\boldsymbol{\beta}}] = \sigma^2 (\textbf{X}^\prime\textbf{X})^{-1} \] We saw in Section 3.6 that \(\textbf{H}\) is an orthogonal projection matrix–it is symmetric and idempotent. Because \(\textbf{H}\) is a projection matrix, \(\textbf{I}- \textbf{H}\) is also a projection matrix. To emphasize this fact we can write \(\textbf{y}\) as the sum of two components \[ \textbf{y}= \textbf{H}\textbf{y}+ (\textbf{I}- \textbf{H})\textbf{y} \] This identity simply states that the target variable is projected onto two spaces: the space generated by the columns of \(\textbf{X}\), represented by the projection matrix \(\textbf{H}\), and its complement (residual) space. Furthermore, these projections are orthogonal: \[ \textbf{H}(\textbf{I}- \textbf{H}) = \textbf{H}- \textbf{H}\textbf{H}= \textbf{H}- \textbf{H}= \textbf{0} \]
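
These projection properties are easy to confirm numerically. The short sketch below builds the Hat matrix for a small simulated design matrix (the data are made up for illustration) and checks symmetry, idempotency, and the orthogonality of \(\textbf{H}\) and \(\textbf{I}-\textbf{H}\).

import numpy as np

# simulated design matrix (illustration only)
rng = np.random.default_rng(1)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # intercept + 3 inputs
H = X @ np.linalg.inv(X.T @ X) @ X.T                         # Hat matrix

print(np.allclose(H, H.T))                      # symmetric
print(np.allclose(H @ H, H))                    # idempotent
print(np.allclose(H @ (np.eye(n) - H), 0))      # H and I - H are orthogonal projections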

Projection properties are useful in establishing properties of fitted values and residuals.

\[\begin{align*} \textbf{X}^\prime \widehat{\boldsymbol{\epsilon}} &= \textbf{0}\\ \text{Var}[\widehat{\textbf{y}}] &= \text{Var}[\textbf{H}\textbf{y}] = \sigma^2\textbf{H}\\ \text{Var}[\widehat{y}_i] &= \sigma^2 h_{ii} \\ \text{Var}[\widehat{\boldsymbol{\epsilon}}] &= \text{Var}[(\textbf{I}-\textbf{H})\textbf{y}] = \sigma^2(\textbf{I}- \textbf{H}) \\ \text{Var}[\widehat{\epsilon}_i] &= \sigma^2(1-h_{ii}) \end{align*}\]

Several interesting facts can be gleaned from these expressions.

  1. The first result, that the fitted residuals are orthogonal to the inputs, follows from the orthogonality of least squares, but can be shown easily using projection properties \[ \textbf{X}^\prime\widehat{\boldsymbol{\epsilon}} = \textbf{X}^\prime(\textbf{I}-\textbf{H})\textbf{y}= (\textbf{X}^\prime - \textbf{X}^\prime\textbf{H})\textbf{y}= (\textbf{X}^\prime-\textbf{X}^\prime)\textbf{y}= \textbf{0} \] The orthogonality of inputs and residuals implies that for any column of \(\textbf{X}\) \[ \sum_{i=1}^n \, x_{ij}\widehat{\epsilon}_i = 0, \quad j=1,\cdots, p \] This is also true for the intercept column, so that \(\sum_{i=1}^n \widehat{\epsilon}_i = 0\); the fitted residuals sum to zero.

  2. The variance of the \(i\)th fitted value depends on the irreducible variability \(\sigma^2\) and on the \(i\)th diagonal element of the Hat matrix, \(h_{ii}\). The Hat matrix in turn depends only on the input variables. In other words, the variability of the fitted values does not depend on any of the target values in the data set.

  3. The variance of a fitted residual is \(\sigma^2(1-h_{ii})\). Since \(\sigma^2\) is the variance of the model errors (the irreducible variance), the variability of a residual is smaller than that of the corresponding model error whenever \(h_{ii} > 0\). The leverages \(h_{ii}\) cannot exceed 1, otherwise \(\sigma^2(1-h_{ii})\) would not be a valid variance. In fact, the leverages are bounded, \(1/n \le h_{ii} \le 1\). A numerical check of these properties follows this list.
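
The following sketch verifies the listed facts on a small simulated data set (all values are made up for illustration), working directly from the Hat matrix.

import numpy as np

# simulated design matrix and response (illustration only)
rng = np.random.default_rng(1)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])   # intercept + 3 inputs
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T      # Hat matrix
resid = (np.eye(n) - H) @ y               # fitted residuals
h = np.diag(H)                            # leverages

print(np.allclose(X.T @ resid, 0))        # residuals orthogonal to every column of X
print(np.isclose(resid.sum(), 0))         # and hence sum to zero (intercept column)
print(h.min() >= 1/n - 1e-12, h.max() <= 1 + 1e-12)   # leverages bounded by 1/n and 1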

7.1 Simple and Multiple Linear Regression

The SLR and MLR models are examples of the classical linear model. In the SLR case there is a single input variable \[ Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \] In the MLR case there are multiple input variables. These can be distinct variables, or transformations and/or combinations of variables. For example, \[ Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1}x_{i2} + \epsilon_i \] has three inputs formed from the main effects of \(x_1\) and \(x_2\) and their interaction \(x_1x_2\). More on the use and interpretation of interactions below.

Computing Estimates

Example: Auto Data from ISLR

The Auto data is a data set used in James et al. (2021) (ISLR2). It comprises information on 392 automobiles, such as mileage (mpg), horsepower, number of cylinders, engine displacement (cu. inches), weight (lbs), etc. The original data set has 397 observations of which five have a missing value for the horsepower variable. The data set that comes with the ISLR2 library does not contain these five observations.

Suppose we want to develop a model that predicts mpg from input variables cylinders, displacement, weight and horsepower.

Most R functions for statistical modeling support a formula expression to specify models directly based on information in data frames. You do not have to set up separate objects for \(\textbf{X}\) and \(\textbf{y}\). Also, R provides standard errors, \(t\)-statistics, \(p\)-values, and other estimates by default, and has default methods for handling missing values.

library(ISLR2)
data(Auto)
head(Auto)
  mpg cylinders displacement horsepower weight acceleration year origin
1  18         8          307        130   3504         12.0   70      1
2  15         8          350        165   3693         11.5   70      1
3  18         8          318        150   3436         11.0   70      1
4  16         8          304        150   3433         12.0   70      1
5  17         8          302        140   3449         10.5   70      1
6  15         8          429        198   4341         10.0   70      1
                       name
1 chevrolet chevelle malibu
2         buick skylark 320
3        plymouth satellite
4             amc rebel sst
5               ford torino
6          ford galaxie 500
linreg <- lm(mpg ~ cylinders + displacement + weight + horsepower,
             data=Auto)
summary(linreg)

Call:
lm(formula = mpg ~ cylinders + displacement + weight + horsepower, 
    data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.5248  -2.7964  -0.3568   2.2577  16.3221 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  45.7567705  1.5200437  30.102  < 2e-16 ***
cylinders    -0.3932854  0.4095522  -0.960 0.337513    
displacement  0.0001389  0.0090099   0.015 0.987709    
weight       -0.0052772  0.0007166  -7.364 1.08e-12 ***
horsepower   -0.0428125  0.0128699  -3.327 0.000963 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.242 on 387 degrees of freedom
Multiple R-squared:  0.7077,    Adjusted R-squared:  0.7046 
F-statistic: 234.2 on 4 and 387 DF,  p-value: < 2.2e-16

The Estimate column of the lm summary reports the \(\widehat{\beta}_k\) estimates, the Std. Error column reports their standard errors.

The residual standard error of 4.241859 is \(\widehat{\sigma}\), the square root of the estimator \(\widehat{\sigma}^2\) derived above.

The following statements load the data set from DuckDB into a Pandas dataframe, remove the observations with missing values, and display the first observations.

import pandas as pd
import duckdb 

con = duckdb.connect(database="ads.ddb", read_only=True)
Auto_data = con.sql("SELECT * FROM auto").df().dropna()
con.close()

Auto_data.head()
    mpg  cylinders  displacement  ...  year  origin                       name
0  18.0          8         307.0  ...    70       1  chevrolet chevelle malibu
1  15.0          8         350.0  ...    70       1          buick skylark 320
2  18.0          8         318.0  ...    70       1         plymouth satellite
3  16.0          8         304.0  ...    70       1              amc rebel sst
4  17.0          8         302.0  ...    70       1                ford torino

[5 rows x 9 columns]

The following statements fit the multiple linear regression model in Python.

Suppose again that we want to develop a model that predicts mpg from input variables cylinders, displacement, weight and horsepower. The model is fit with the statsmodels library as follows.

import statsmodels.api as sm
import numpy as np

X = Auto_data[['cylinders', 'displacement', 'weight', 'horsepower']]
X = sm.add_constant(X)
y = Auto_data['mpg']

linmod = sm.OLS(y, X).fit()

print(linmod.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.708
Model:                            OLS   Adj. R-squared:                  0.705
Method:                 Least Squares   F-statistic:                     234.2
Date:                Wed, 30 Apr 2025   Prob (F-statistic):          6.18e-102
Time:                        17:23:20   Log-Likelihood:                -1120.1
No. Observations:                 392   AIC:                             2250.
Df Residuals:                     387   BIC:                             2270.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           45.7568      1.520     30.102      0.000      42.768      48.745
cylinders       -0.3933      0.410     -0.960      0.338      -1.199       0.412
displacement     0.0001      0.009      0.015      0.988      -0.018       0.018
weight          -0.0053      0.001     -7.364      0.000      -0.007      -0.004
horsepower      -0.0428      0.013     -3.327      0.001      -0.068      -0.018
==============================================================================
Omnibus:                       37.596   Durbin-Watson:                   0.862
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               50.918
Skew:                           0.696   Prob(JB):                     8.77e-12
Kurtosis:                       4.085   Cond. No.                     2.23e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.23e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

An intercept is not automatically added to the model in the fitting routine, so it must be added to the X matrix prior to calling sm.OLS.

The coefficient estimates, their standard errors, \(t\)-values and \(p\)-values are shown in the body of the OLS Regression Results.

The regression model trained by ordinary least squares is \[ \text{mpg}_i = \beta_0 + \beta_1\text{cylinders}_i+\beta_2\text{displacement}_i+\beta_3\text{weight}_i+\beta_4\text{horsepower}_i + \epsilon_{i} \] and it is assumed that \(\epsilon_i \sim iid (0,\sigma^2)\). The OLS estimates are

  • \(\widehat{\beta}_0 = 45.7568\)
  • \(\widehat{\beta}_1 = -0.3933\)
  • \(\widehat{\beta}_2 = 0.0001\)
  • \(\widehat{\beta}_3 = -0.0053\)
  • \(\widehat{\beta}_4 = -0.0428\)

The predicted miles per gallon for an automobile from the population represented by this data frame is

\[\begin{align*} \widehat{\text{mpg}} = 45.7568 &-0.3933\times\text{cylinders} + 0.0001\times\text{displacement} \\ &-0.0053\times \text{weight} -0.0428\times\text{horsepower} \end{align*}\]
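
As a quick illustration of how the fitted equation is used, the sketch below evaluates it for a single hypothetical automobile; the input values are made up, and linmod is the statsmodels fit from above.

import numpy as np

# hypothetical car: 4 cylinders, 120 cu. in., 2,800 lbs, 90 hp (made-up values)
x_new = np.array([1.0, 4.0, 120.0, 2800.0, 90.0])    # intercept first, matching the column order of X
print(float(np.dot(x_new, linmod.params.values)))     # predicted mpg, about 25.6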

The \(p\)-values in the regression summary associated with individual input variables test the hypotheses that the coefficient for the input is zero, given all other input variables in the model. These are known as partial tests because the other variables remain in the model. For example, the \(p\)-value for the test of the hypothesis that cylinders makes a significant contribution given that displacement, weight, and horsepower are in the model is 0.338.

The \(F\)-statistic of 234.2 and its very small associated \(p\)-value tests that the coefficients of the four variables are simultaneously zero, \(H: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0\). This hypothesis is soundly rejected.

The four input variables combined explain 70.8% of the variability of mileage (mpg) through a multiple linear regression.

Coefficient Interpretation

How do we interpret the regression coefficients of the fitted model

\[\begin{align*} \widehat{\text{mpg}} = 45.7568 &-0.3933\times\text{cylinders} + 0.0001\times\text{displacement} \\ &-0.0053\times \text{weight} -0.0428\times\text{horsepower} \end{align*}\]

Since the model is linear, it is tempting to state that, for example, a change of 1 unit in displacement causes a change of 0.0001389 in miles per gallon. This interpretation is not correct, because

  1. We cannot conclude causality between inputs and the target variable. The data are purely observational so we can at best state that changes in the input variables are associated with different predicted values for miles per gallon.

  2. We cannot interpret one input variable in the absence of the others. The signs of the regression coefficients are somewhat counter-intuitive. Why would mileage go down for cars with more cylinders but go up with greater displacement? Does adding cylinders not imply a larger engine displacement? The point is that the inputs are related to each other; they do not vary freely of one another. When we interpret the magnitude of a regression coefficient in terms of the change in the target variable that corresponds to a unit change in the input variable, we are implicitly holding all other predictors fixed (ceteris paribus).

The correct interpretation of the displacement coefficient is thus

when displacement increases by one cubic inch and all other attributes remain constant, the expected mileage increases by 0.0001389 miles per gallon.

Note

The all-other-variables-held-fixed interpretation is also important when interpreting hypothesis test results. The \(p\)-values of variables cylinders and displacement are 0.33 and 0.98, respectively, suggesting that these variables do not make significant contributions toward explaining miles per gallon. These \(p\)-values are interpreted in the presence of the other variables in the model.

The correct interpretation is that the number of cylinders is not a significant predictor of miles per gallon in a model that accounts for engine displacement, vehicle weight, and horsepower.

\(R^2\), the Coefficient of Determination

The variability in the target \(\textbf{y}\), not accounting for any information provided by the input variables, can be estimated as \[ s^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \overline{y})^2 \] If the \(y_i\) had the same mean, this would be an unbiased estimator of \(\text{Var}[Y]\). However, the regression model states very clearly that the mean of \(Y\) is a function of the \(x\)-inputs. This estimator is then a biased estimator of \(\text{Var}[Y] = \sigma^2\). The numerator of \(s^2\) is called the total sum of squares (SST). If SST captures variability of \(Y\) about a constant mean, how much of this is attributable to the input variables? To answer this we can look at the variability not attributable to the \(x\)s, the error sum of squares \[ \text{SSE} = \sum_{i=1}^n \widehat{\epsilon}_i^2 = \sum_{i=1}^n (y_i - \widehat{y}_i)^2 \] The ratio \[ R^2 = \frac{\text{SST}-\text{SSE}}{\text{SST}}=1-\frac{\text{SSE}}{\text{SST}} \] is known as the coefficient of determination or R-squared. The name R-squared comes from a simple relationship between \(R^2\) and the Pearson correlation coefficient in the SLR case. If \(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\), then \(R^2\) is the square of the correlation coefficient between \(X\) and \(Y\).

Example: Auto Data (Cont’d)

slr <- lm(mpg ~ horsepower, data=Auto)
summary(slr)$r.squared
[1] 0.6059483
cor(Auto$mpg, Auto$horsepower)^2
[1] 0.6059483

The squared correlation coefficient between mpg and horsepower is the same as the \(R^2\) statistic shown in the lm summary.

You can compute the correlation coefficient between the two variables from simple linear regression output, but you need to take into account the sign of the regression coefficient. The correlation coefficient has the same sign as \(\beta_1\).

cor(Auto$mpg, Auto$horsepower)
[1] -0.7784268
as.numeric(sign(slr$coefficients[2])) * sqrt(summary(slr)$r.squared)
[1] -0.7784268
import numpy as np

X = Auto_data[['horsepower']]
X = sm.add_constant(X)
y = Auto_data['mpg']

slr = sm.OLS(y, X).fit()

r_squared = slr.rsquared
print(round(r_squared,5))
0.60595
corr_squared = np.corrcoef(Auto_data['mpg'], Auto_data['horsepower'])[0, 1]**2
print(round(corr_squared,5))
0.60595

The squared correlation coefficient between mpg and horsepower is the same as the \(R^2\) statistic reported by the sm.OLS fit.

corr = np.corrcoef(Auto_data['mpg'], Auto_data['horsepower'])[0, 1]
print(round(corr,5))
-0.77843
sign_coef = np.sign(slr.params[1]) * np.sqrt(slr.rsquared)
print(round(sign_coef,5))
-0.77843

\(R^2\) ranges between 0 and 1. It achieves the lower bound \(R^2=0\) if SSE = SST; the input variables then do not explain any variability in \(Y\). \(R^2 = 1\) results when SSE = 0; the model fits the data “perfectly” in the sense that it interpolates the \(y_i\).

The straightforward interpretation of \(R^2\) as the proportion of variability explained by the input variables unfortunately can lead one to chase models that have a high \(R^2\). This is a terrible practice for a number of reasons:

  • The value of \(R^2\) does not tell us whether the model assumptions are met. You can explain a substantial amount of variability in the data with a seriously deficient model. For example, the four-regressor model fit to the Auto data earlier has \(R^2 = 0.708\), explaining 70.8% of the variability in miles per gallon. A look at the residuals from that model shows substantial structure (Figure 7.1): larger fitted values have larger variability and there is a definite trend in the residuals; the model is not (yet) correct.
Figure 7.1

Here is an example with simulated data where the mean function is not a straight line and the error variance depends on \(x\). A simple linear regression model is clearly not appropriate, but it explains more than 75% of the variability in \(y\).

x <- seq(0.15, 1, l = 100)
set.seed(123456)
eps <- rnorm(n = 100, sd = 0.25 * x^2)
y <- 1 - 2 * x * (1 + 0.25 * sin(4 * pi * x)) + eps
slr <- lm(y ~ x)

plot(x,y,type="p",col="red",las=1,bty="l")
abline(slr$coefficients)

summary(slr)$r.squared
[1] 0.7694696
import matplotlib.pyplot as plt
from scipy import stats

x = np.linspace(0.15, 1, 100)
np.random.seed(123456)
eps = np.random.normal(scale=0.25 * x**2, size=100)

y = 1 - 2 * x * (1 + 0.25 * np.sin(4 * np.pi * x)) + eps

X = sm.add_constant(x)
slr = sm.OLS(y, X).fit()

plt.figure(figsize=(10, 6))
plt.scatter(x, y, color='red')

x_range = np.array([min(x), max(x)])
X_range = sm.add_constant(x_range)
y_pred = slr.predict(X_range)
plt.plot(x_range, y_pred, color='blue')

plt.xlabel('x')
plt.ylabel('y')
plt.grid(True, alpha=0.3)
plt.show()

print(round(slr.rsquared,5))
0.79301
  • \(R^2\) is a function of SSE, the prediction error on the training data set. This can be made arbitrarily small by adding input variables. If \(R^2_{\text{cur}}\) is the coefficient of determination in a linear regression model, and you add a new predictor \(x_{p+1}\), then \[ R^2_{\text{new}} \ge R^2_{\text{cur}} \] Predictors that make no relevant contribution will increase \(R^2\) because their addition still reduces SSE.

  • A model that interpolates the data (fits “perfectly”) is not a good model. It is certainly not perfect if the goal is to build a model that generalizes well to unseen observations. Different metrics are needed to develop models that generalize and do not overfit the data.

Chasing \(R^2\) values invariably leads to overfitting: models that memorize too much of the training observations do not perform well on new data.
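
The following sketch illustrates these points with simulated data (all names and values are made up): the target depends on \(x_1\) only, yet \(R^2\) on the training data never decreases as pure-noise predictors are added.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 50
x1 = rng.normal(size=n)
y = 1 + 2 * x1 + rng.normal(size=n)        # y depends on x1 only
noise = rng.normal(size=(n, 10))           # ten irrelevant predictors

for k in [0, 2, 5, 10]:
    X = np.hstack([np.ones((n, 1)), x1.reshape(-1, 1), noise[:, :k]])
    fit = sm.OLS(y, X).fit()
    print(k, round(fit.rsquared, 4))       # R-squared creeps up as noise columns are added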

Measuring Prediction Error

Since we can make SSE on the training data arbitrarily small by adding more input variables, the mean squared prediction error on the training data is not a good metric if we want a model that performs well in predicting new observations. Instead of the training error \(\text{MSE}_{Tr}\), the test error \(\text{MSE}_{Te}\) should be used. As discussed in Section 2.5, the test error can be estimated by holding out some observations from training in a test data set or by cross-validation (Section 2.6).

In the classical linear model leave-one-out cross-validation is particularly appealing. In addition to not depending on a random selection of data points, as held-out test sets and \(k\)-fold cross-validation do, the LOOCV error can be calculated in the linear model without re-fitting the model \(n\) times. All the necessary pieces to compute the LOOCV error can be assembled on the same pass through the data that calculates the OLS estimates. The key is the Sherman-Morrison-Woodbury formula.

Sherman-Morrison-Woodbury Formula

This remarkable formula is at the heart of many regression-type diagnostics and cross-validation techniques. A version of this formula was first given by Gauss in 1821. Around 1950, it appeared in several papers by Sherman and Morrison, and Woodbury.

Suppose we are in a full-rank linear modeling context with design matrix \(\textbf{X}_{(n \times (p+1))}\), so that the inverse \(\left( \textbf{X}^\prime\textbf{X}\right)^{-1}\) exists. In diagnosing the quality of a model, we are interested in measuring the prediction error for the \(i\)th observation as if the data point had not contributed to the analysis. This is an example of a leave-one-out estimate: remove an observation from the data, redo the analysis, and measure how well the quantity of interest can be computed for the withheld observation.

If you do this in turn for all \(n\) observations, you must fit the model \(n + 1\) times, an overall fit to the training data with \(n\) observations, and \(n\) additional fits with training data sets of size \(n - 1\), leaving out each observation in turn. The computationally expensive part of fitting the linear model is building the cross-product matrix \(\textbf{X}^\prime\textbf{X}\) and computing its inverse \(\left( \textbf{X}^\prime\textbf{X}\right)^{-1}\).

The Sherman-Morrison-Woodbury formula allows us to compute the inverse of the cross-product matrix based on \(\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\) as if the \(i\)th observation had been removed.

Denote as \(\textbf{X}_{-i}\) the design matrix with the \(i\)th observation removed. Then

\[ \left( \textbf{X}_{-i}^\prime\textbf{X}_{-i} \right)^{- 1} = \left( \textbf{X}^\prime\textbf{X}- \textbf{x}_{i}\textbf{x}_{i}^{\prime} \right)^{-1}\ = \left( \textbf{X}^\prime\textbf{X}\right)^{- 1} + \frac{\left( \textbf{X}^\prime\textbf{X}\right)^{-1}{\textbf{x}_{i}\textbf{x}_{i}^{\prime}\left( \textbf{X}^\prime\textbf{X}\right)}^{- 1}}{1 - \textbf{x}_{i}^{\prime}\left( \textbf{X}^\prime\textbf{X}\right)^{-1}\textbf{x}_{i}} \]

The quantities on the right hand side are available in the standard regression calculations based on \(n\) data points. Because of this remarkable result, leave-one-out statistics can be calculated easily—without retraining any models—based on the fit to the full training data alone.
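
A quick numerical check of the identity with a simulated design matrix (everything here is made up for illustration): the left-hand side refits the cross-product matrix without row \(i\), while the right-hand side uses only quantities from the full fit.

import numpy as np

rng = np.random.default_rng(7)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])

XtX_inv = np.linalg.inv(X.T @ X)
i = 5                                    # leave out one observation
xi = X[i, :]

X_minus_i = np.delete(X, i, axis=0)
lhs = np.linalg.inv(X_minus_i.T @ X_minus_i)
rhs = XtX_inv + np.outer(XtX_inv @ xi, xi @ XtX_inv) / (1 - xi @ XtX_inv @ xi)
print(np.allclose(lhs, rhs))             # True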

Note

Note that the quantity in the denominator of the right-hand side is the \(i\)th diagonal element of \(\textbf{I}- \textbf{H}\), where \(\textbf{H}\) is the Hat matrix. If \(h_{ii}\) denotes the \(i\)th diagonal element of \(\textbf{H}\), we can write the update formula as

\[ \left( \textbf{X}^\prime\textbf{X}\right)^{- 1} + \frac{\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}{\textbf{x}_{i}\textbf{x}_{i}^{\prime}\left( \textbf{X}^\prime\textbf{X}\right)}^{- 1}}{1 - h_{ii}} \]

The leverage values \(h_{ii}\) play an important role in the computation of residual, influence, and case-deletion diagnostics in linear models.

PRESS Statistic

If we denote the predicted value of \(y_i\), obtained in a regression without the \(i\)th observation, as \(\widehat{y}_{-i}\), then the leave-one-out residual \[ y_i - \widehat{y}_{-i} \] is the test error for the \(i\)th observation. Using the Sherman-Morrison-Woodbury result, it is a neat exercise to show that this is simply \[ y_i - \widehat{y}_{-i} = \frac{y_i - \widehat{y}_i}{1-h_{ii}} \] The leave-one-out error for the \(i\)th observation is obtained by dividing the \(i\)th residual by one minus the leverage value. When these deviations are squared and summed across the entire data set the PRESS statistic results \[ \text{PRESS} = \sum_{i=1}^n (y_i - \widehat{y}_{-i})^2 = \sum_{i=1}^n \left(\frac{y_i - \widehat{y}_i}{1-h_{ii}}\right)^2 \] The name is derived from prediction sum of squares. The average PRESS value estimates the mean square test error \[ \text{MSE}_{Te} = \frac{1}{n}\text{PRESS} \]
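
For reference, here is a sketch of that exercise. Write \(\textbf{A}= (\textbf{X}^\prime\textbf{X})^{-1}\), so that \(h_{ii} = \textbf{x}_i^\prime\textbf{A}\textbf{x}_i\) and \(\widehat{\boldsymbol{\beta}}_{-i} = \left(\textbf{X}_{-i}^\prime\textbf{X}_{-i}\right)^{-1}\left(\textbf{X}^\prime\textbf{y}- \textbf{x}_i y_i\right)\). Applying the Sherman-Morrison-Woodbury update,

\[\begin{align*} \widehat{y}_{-i} = \textbf{x}_i^\prime\widehat{\boldsymbol{\beta}}_{-i} &= \textbf{x}_i^\prime\left(\textbf{A}+ \frac{\textbf{A}\textbf{x}_i\textbf{x}_i^\prime\textbf{A}}{1-h_{ii}}\right)\left(\textbf{X}^\prime\textbf{y}- \textbf{x}_i y_i\right) \\ &= \widehat{y}_i - h_{ii}y_i + \frac{h_{ii}\widehat{y}_i - h_{ii}^2 y_i}{1-h_{ii}} = \frac{\widehat{y}_i - h_{ii}y_i}{1-h_{ii}} \end{align*}\]

and therefore \[ y_i - \widehat{y}_{-i} = \frac{(1-h_{ii})y_i - \widehat{y}_i + h_{ii}y_i}{1-h_{ii}} = \frac{y_i - \widehat{y}_i}{1-h_{ii}} \]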

Example: Auto Data (Cont’d)

For the four-regressor model the PRESS statistic and \(\text{MSE}_{Te}\) can be calculated by extracting the leverage values:

mlr <- lm(mpg ~ cylinders + displacement + weight + horsepower,
             data=Auto)

leverage <- hatvalues(mlr)
PRESS_res <- mlr$residuals / (1-leverage)
PRESS <- sum(PRESS_res^2)
MSE_Te  <- PRESS/length(leverage)
cat("PRESS statistic: ", PRESS, "\n")
PRESS statistic:  7138.559 
cat("MSE Test based on LOOCV: ", MSE_Te,"\n")
MSE Test based on LOOCV:  18.21061 

You can validate this result the hard way by fitting \(n\) separate regression models, leaving one observation out each time and predicting that observation to obtain the PRESS residual.

PRESS <- 0
for (i in 1:nrow(Auto)) {
    m <- lm(mpg ~ cylinders + displacement + weight + horsepower, data=Auto[-i,])
    yhat_minus_i <- predict(m,newdata=Auto[i,])
    PRESS <- PRESS + (Auto[i,"mpg"] - yhat_minus_i)^2
}
cat("MSE Test based on LOOCV: ", PRESS/nrow(Auto))
MSE Test based on LOOCV:  18.21061
import pandas as pd

X = Auto_data[['cylinders', 'displacement', 'weight', 'horsepower']]
X = sm.add_constant(X)
y = Auto_data['mpg']
mlr = sm.OLS(y, X).fit()

leverage = mlr.get_influence().hat_matrix_diag

# Calculate PRESS residuals and PRESS statistic
PRESS_res = mlr.resid / (1 - leverage)
PRESS = np.sum(PRESS_res**2)

# Calculate MSE Test based on LOOCV
MSE_Te = PRESS / len(leverage)

print(f"PRESS statistic: {PRESS:.4f}")
PRESS statistic: 7138.5591
print(f"MSE Test based on LOOCV: {MSE_Te:.4f}")
MSE Test based on LOOCV: 18.2106

You can validate this result the hard way by fitting \(n\) separate regression models, leaving one observation out each time and predicting that observation to obtain the PRESS residual.

# Alternative calculation of PRESS using explicit leave-one-out
PRESS = 0
for i in range(len(Auto_data)):
    # Create training data without the i-th observation
    X_train = Auto_data.drop(Auto_data.index[i])[['cylinders', 'displacement', 'weight', 'horsepower']]
    X_train = sm.add_constant(X_train)
    y_train = Auto_data.drop(Auto_data.index[i])['mpg']
    
    # Fit the model
    mlr = sm.OLS(y_train, X_train).fit()
    
    # Predict for the held-out observation
    X_test = Auto_data.iloc[i][['cylinders', 'displacement', 'weight', 'horsepower']].to_frame().T
    X_test.insert(0,'intercept',1)
    yhat_minus_i = float(mlr.predict(X_test).iloc[0])
    
    # Add squared error to PRESS
    PRESS += (Auto_data.iloc[i]['mpg'] - yhat_minus_i)**2

print(f"MSE Test based on LOOCV: {PRESS/len(Auto_data):.4f}")
MSE Test based on LOOCV: 18.2106

Interactions

What is the difference between the following models

\[\begin{align*} Y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\\ Y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon \end{align*}\]

Both models depend on \(x_1\) and \(x_2\). In the first case the variables enter the model as main effects, that is, by themselves. Each input variable is allowed to make its contribution on the outcome given the presence of the other. The second model contains the additional interaction term \(x_1 x_2\). To be more precise, this is a two-way interaction term because it involves two input variables. A three-way interaction term would be \(x_1 x_2 x_3\).

Suppose that \(\beta_3\) is not zero, how should we interpret the presence of an interaction term in the model? We can no longer state that \(\beta_1\) measures the effect on the target variable when \(x_1\) is changed by one unit. The effect of changing \(x_1\) by one unit in the second model is now a function of \(x_2\). To see this, consider the mean of \(Y\) at two points, \(x_1\) and \(x_1+1\).

\[\begin{align*} \text{E}[Y | x_1] &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 \\ \text{E}[Y | x_1 + 1] &= \beta_0 + \beta_1 (x_1+1) + \beta_2 x_2 + \beta_3 (x_1+1)x_2 \end{align*}\]

The difference between the two is \[ \text{E}[Y | x_1 + 1] - \text{E}[Y | x_1] = \beta_1 + \beta_3 x_2 \] The effect of \(x_1\) is now a function of \(x_2\). That is the very meaning of an interaction. The effect of one variable (or factor) depends on another variable (or factor) and vice versa. In the example the effect of \(x_1\) is a linear regression in \(x_2\).
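
A small simulation makes this concrete (all names and values below are made up). After fitting a model with an interaction through the statsmodels formula interface, the estimated effect of a one-unit increase in \(x_1\) is \(\widehat{\beta}_1 + \widehat{\beta}_3 x_2\), which changes with \(x_2\).

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1 + 2*df["x1"] - df["x2"] + 1.5*df["x1"]*df["x2"] + rng.normal(scale=0.5, size=n)

fit = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit()
b = fit.params
for x2_value in [-1.0, 0.0, 1.0]:
    # estimated change in E[Y] for a one-unit increase in x1, at this value of x2
    print(x2_value, round(b["x1"] + b["x1:x2"] * x2_value, 3))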

Tip

When building models with interactions, it is customary to include lower-order effects in the model if higher-order effects are significant. For example, if \(x_1 x_2\) is in the model one includes the main effects \(x_1\) and \(x_2\) regardless of their significance. Similarly, if a three-way interaction is significant one includes the two-way interactions and the main effects in the model. The argument for doing so is that in order for two things to interact they must be present–otherwise, what interacts?

The presence/absence of an interaction between a categorical input variable and a numeric input variable can be seen by comparing the trends as in Figure 7.2. With a two-level categorical factor, representing for example, a treatment and a placebo, the absence of interactions manifests itself in parallel lines. The effect of the treatment is the distance between the two lines and is the same for all values of \(x_1\). Similarly, the effect of \(x_1\), the slope of the line, is the same for both groups. In the presence of an interaction the slopes are not the same and the distance between the lines (the effect of \(x_2\)) depends on the value for \(x_1\).
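
The same comparison can be sketched in code for a two-level factor (the data below are simulated and the group labels are made up): without the interaction the model forces parallel lines, while the x1:group term in the second fit estimates how much the slopes differ.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 100
df = pd.DataFrame({"x1": rng.uniform(0, 10, size=n),
                   "group": rng.choice(["placebo", "treatment"], size=n)})
slope = np.where(df["group"] == "treatment", 1.5, 0.5)          # groups have different slopes
df["y"] = 2 + slope * df["x1"] + rng.normal(scale=1.0, size=n)

parallel = smf.ols("y ~ x1 + group", data=df).fit()             # main effects only: parallel lines
interact = smf.ols("y ~ x1 * group", data=df).fit()             # slopes allowed to differ
print(interact.params["x1:group[T.treatment]"])                 # estimated slope difference, near 1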

Example: Auto Data (Cont’d)

We are now considering a series of models for the Auto data, based on the same four input variables used earlier.

The first four models add inputs and also two-way interactions of all inputs in the model. The formula expression y ~ (x1 + x2 + x3)^2 is a shorthand for including the main effects and two-way interactions of the three inputs. R displays the interaction terms as x1:x2, x1:x3, and x2:x3 in the output.

Models 5 and 6 then add up to three-way and four-way interactions, respectively. For each model we calculate the number of non-zero coefficients (the rank of \(\textbf{X}\)), SSE, \(R^2\) and the PRESS statistic.

calcPress <- function(linModel) {
    leverage <- hatvalues(linModel)
    r <- linModel$residuals
    Press_res <- r/(1-leverage)
    Press <- sum(Press_res^2)
    return(list("ncoef"=linModel$rank,
                "R2"   =summary(linModel)$r.squared, 
                "SSE"  =sum(r^2),
                "Press"=Press))
}

l1 <- lm(mpg ~ cylinders, data=Auto)
l2 <- lm(mpg ~ (cylinders + displacement)^2, data=Auto )
l3 <- lm(mpg ~ (cylinders + displacement + horsepower)^2, data=Auto )
l4 <- lm(mpg ~ (cylinders + displacement + horsepower + weight)^2, data=Auto )
l5 <- lm(mpg ~ (cylinders + displacement + horsepower + weight)^3, data=Auto )
l6 <- lm(mpg ~ (cylinders + displacement + horsepower + weight)^4, data=Auto )

df <- rbind(as.data.frame(calcPress(l1)), 
            as.data.frame(calcPress(l2)),
            as.data.frame(calcPress(l3)),
            as.data.frame(calcPress(l4)),
            as.data.frame(calcPress(l5)),
            as.data.frame(calcPress(l6))
      )
knitr::kable(df,format="simple")
 ncoef         R2       SSE     Press
------  ---------  --------  --------
     2  0.6046890  9415.910  9505.534
     4  0.6769104  7695.670  7846.982
     7  0.7497691  5960.247  6197.767
    11  0.7607536  5698.608  6089.687
    15  0.7752334  5353.714  5818.953
    16  0.7752893  5352.382  5861.117

The model complexity increases from the first to the sixth model; the models have more parameters and more intricate interaction terms. The SSE values decrease as terms are added to the model, and \(R^2\) increases accordingly. The PRESS statistic is always larger than the SSE, which makes sense because it is based on squaring the PRESS residuals \((y_i - \widehat{y}_i)/(1-h_{ii})\), which are larger in magnitude than the ordinary residuals \(y_i - \widehat{y}_i\).

From the fifth to the sixth model only one additional parameter is added, the four-way interaction of all inputs. \(R^2\) barely increases but the PRESS statistic increases compared to the model with interactions up to three-way. Interestingly, none of the effects in the four-way model are significant given the presence of the other terms in the model.

summary(l6)

Call:
lm(formula = mpg ~ (cylinders + displacement + horsepower + weight)^4, 
    data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.0497  -1.9574  -0.2334   1.8091  18.8569 

Coefficients:
                                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                               6.104e+01  7.744e+01   0.788    0.431
cylinders                                -4.974e+00  1.465e+01  -0.339    0.734
displacement                             -4.338e-01  5.655e-01  -0.767    0.444
horsepower                               -6.459e-01  7.861e-01  -0.822    0.412
weight                                    2.735e-02  2.623e-02   1.042    0.298
cylinders:displacement                    4.737e-02  7.861e-02   0.603    0.547
cylinders:horsepower                      1.069e-01  1.324e-01   0.808    0.420
cylinders:weight                         -2.757e-03  4.565e-03  -0.604    0.546
displacement:horsepower                   8.090e-03  6.253e-03   1.294    0.197
displacement:weight                      -7.045e-05  1.775e-04  -0.397    0.692
horsepower:weight                        -2.047e-04  2.689e-04  -0.761    0.447
cylinders:displacement:horsepower        -1.041e-03  8.238e-04  -1.263    0.207
cylinders:displacement:weight             8.719e-06  2.422e-05   0.360    0.719
cylinders:horsepower:weight               1.351e-05  4.130e-05   0.327    0.744
displacement:horsepower:weight           -4.645e-07  1.920e-06  -0.242    0.809
cylinders:displacement:horsepower:weight  7.702e-08  2.517e-07   0.306    0.760

Residual standard error: 3.773 on 376 degrees of freedom
Multiple R-squared:  0.7753,    Adjusted R-squared:  0.7663 
F-statistic: 86.48 on 15 and 376 DF,  p-value: < 2.2e-16

The first four models add inputs and also two-way interactions of all inputs in the model. The code uses patsy’s dmatrices function to build model expressions similar to the formula expression in R.

Models 5 and 6 then add up to three-way and four-way interactions, respectively. For each model we calculate the number of non-zero coefficients (the rank of \(\textbf{X}\)), SSE, \(R^2\) and the PRESS statistic.

import statsmodels.api as sm
import pandas as pd
import numpy as np
from patsy import dmatrices
from tabulate import tabulate

# Define function to calculate PRESS and other statistics
def calc_press(model):
    leverage = model.get_influence().hat_matrix_diag
    r = model.resid
    press_res = r / (1 - leverage)
    press = np.sum(press_res**2)
    
    return {
        "ncoef": model.df_model + 1,  # +1 for intercept
        "R2": model.rsquared,
        "SSE": np.sum(r**2),
        "Press": press
    }

formula1 = "mpg ~ cylinders"
y, X = dmatrices(formula1, data=Auto_data, return_type='dataframe')
l1 = sm.OLS(y, X).fit()

formula2 = "mpg ~ (cylinders + displacement)**2"
y, X = dmatrices(formula2, data=Auto_data, return_type='dataframe')
l2 = sm.OLS(y, X).fit()

formula3 = "mpg ~ (cylinders + displacement + horsepower)**2"
y, X = dmatrices(formula3, data=Auto_data, return_type='dataframe')
l3 = sm.OLS(y, X).fit()

formula4 = "mpg ~ (cylinders + displacement + horsepower + weight)**2"
y, X = dmatrices(formula4, data=Auto_data, return_type='dataframe')
l4 = sm.OLS(y, X).fit()

formula5 = "mpg ~ (cylinders + displacement + horsepower + weight)**3"
y, X = dmatrices(formula5, data=Auto_data, return_type='dataframe')
l5 = sm.OLS(y, X).fit()

formula6 = "mpg ~ (cylinders + displacement + horsepower + weight)**4"
y, X = dmatrices(formula6, data=Auto_data, return_type='dataframe')
l6 = sm.OLS(y, X).fit()

df = pd.DataFrame([
    calc_press(l1),
    calc_press(l2),
    calc_press(l3),
    calc_press(l4),
    calc_press(l5),
    calc_press(l6)
])

print(tabulate(df, headers='keys', tablefmt='simple'))
      ncoef        R2      SSE    Press
--  -------  --------  -------  -------
 0        2  0.604689  9415.91  9505.53
 1        4  0.67691   7695.67  7846.98
 2        7  0.749769  5960.25  6197.77
 3       11  0.760754  5698.61  6089.69
 4       15  0.775233  5353.71  5818.95
 5       16  0.775289  5352.38  5861.12
print(l6.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.775
Model:                            OLS   Adj. R-squared:                  0.766
Method:                 Least Squares   F-statistic:                     86.48
Date:                Wed, 30 Apr 2025   Prob (F-statistic):          9.05e-112
Time:                        17:23:22   Log-Likelihood:                -1068.6
No. Observations:                 392   AIC:                             2169.
Df Residuals:                     376   BIC:                             2233.
Df Model:                          15                                         
Covariance Type:            nonrobust                                         
============================================================================================================
                                               coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------
Intercept                                   61.0371     77.437      0.788      0.431     -91.227     213.301
cylinders                                   -4.9738     14.653     -0.339      0.734     -33.786      23.839
displacement                                -0.4338      0.566     -0.767      0.444      -1.546       0.678
horsepower                                  -0.6459      0.786     -0.822      0.412      -2.192       0.900
weight                                       0.0273      0.026      1.042      0.298      -0.024       0.079
cylinders:displacement                       0.0474      0.079      0.603      0.547      -0.107       0.202
cylinders:horsepower                         0.1069      0.132      0.808      0.420      -0.153       0.367
cylinders:weight                            -0.0028      0.005     -0.604      0.546      -0.012       0.006
displacement:horsepower                      0.0081      0.006      1.294      0.197      -0.004       0.020
displacement:weight                      -7.045e-05      0.000     -0.397      0.692      -0.000       0.000
horsepower:weight                           -0.0002      0.000     -0.761      0.447      -0.001       0.000
cylinders:displacement:horsepower           -0.0010      0.001     -1.263      0.207      -0.003       0.001
cylinders:displacement:weight             8.719e-06   2.42e-05      0.360      0.719   -3.89e-05    5.63e-05
cylinders:horsepower:weight               1.351e-05   4.13e-05      0.327      0.744   -6.77e-05    9.47e-05
displacement:horsepower:weight           -4.645e-07   1.92e-06     -0.242      0.809   -4.24e-06    3.31e-06
cylinders:displacement:horsepower:weight  7.702e-08   2.52e-07      0.306      0.760   -4.18e-07    5.72e-07
==============================================================================
Omnibus:                       51.766   Durbin-Watson:                   1.035
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              136.295
Skew:                           0.636   Prob(JB):                     2.53e-30
Kurtosis:                       5.594   Cond. No.                     4.33e+11
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.33e+11. This might indicate that there are
strong multicollinearity or other numerical problems.

7.2 Hypothesis Testing

When testing hypotheses in statistical models it is useful to think of the hypothesis as imposing a constraint on the model. The test then boils down to comparing a constrained and an unconstrained model in such a way that we can make probability statements about the validity of the constraint. If it is highly unlikely that the constraint holds, we reject the hypothesis.

This principle applies to hypothesis testing in many model families; what differs is how the impact of the constraint on the model is measured. In least-squares estimation we look at how a sum of squares changes as the constraint is imposed. In models fit by maximum likelihood we measure how much the log-likelihood changes when the constraint is imposed.

Suppose we have a model with four predictors, \[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \epsilon \] and want to test the hypothesis that the absence of \(x_3\) and \(x_4\) does not make the model worse. The constraint we impose on the model is \[ H: \beta_3 = \beta_4 = 0 \] This is a hypothesis with two degrees of freedom, since two parameters of the model are constrained simultaneously.

Tip

You can usually figure out the degrees of freedom in a hypothesis by counting equal signs.

The unconstrained and constrained models are also called the full and the reduced models, respectively. In this case the full model is \[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \epsilon \] and the reduced model is \[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon \]

Definition: Nested Models

Two models are said to be nested when one model can be derived from the other by imposing constraints on the model parameters. The full and reduced models in the hypothesis testing context are nested models. If two models are not nested, the hypothesis testing framework described here does not apply.

If based on the data the hypothesis cannot be rejected, we conclude that the full model is not significantly improved over the reduced model. The strength of evidence in favor of the hypothesis depends on how much variability is accounted for in the model when the constraint is relaxed, relative to the overall variability in the system. To measure that we need to introduce the idea of partitioning sums of squares.

Partitioning Variability through Sums of Squares

The difference between the total sum of squares, \(\text{SST} = \sum_{i=1}^n(y_i - \overline{y})^2\), which does not depend on the inputs, and the error sum of squares \(\text{SSE} = \sum_{i=1}^n (y_i - \widehat{y}_i)^2\), is the model sum of squares \[ \text{SSM} = \sum_{i=1}^n\left(\widehat{y}_i-\overline{y}\right)^2 = \widehat{\boldsymbol{\beta}}^\prime\textbf{X}^\prime\textbf{y}- n\overline{y}^2 \]

The total and model sums of squares are also called the corrected total and model sums of squares because they adjust for an overall estimate of the mean of \(Y\) if there are no inputs. In other words, they account for the intercept \(\beta_0\) in the model; \(\overline{y}\) is the estimate of \(\beta_0\) in an intercept-only model.

Another way of looking at SSM is as a measure of the combined contribution of \(\beta_1, \ldots, \beta_p\) beyond the intercept. The notation \[ \text{SSM} = SS(\beta_1, \cdots,\beta_p | \beta_0) \] makes this explicit. We can now think of other sums of squares, for example, \(SS(\beta_3, \beta_4 | \beta_0, \beta_1, \beta_2)\) is the sum of squares contribution when \(\beta_3\) and \(\beta_4\) are added to a model that contains \(\beta_0, \beta_1\), and \(\beta_2\). Algebraically, \[ SS(\beta_3, \beta_4 | \beta_0, \beta_1, \beta_2) = SS(\beta_1, \beta_2, \beta_3, \beta_4 | \beta_0) - SS(\beta_1, \beta_2 | \beta_0) \] This will be one part of measuring the evidence against hypothesis \(H: \beta_3 = \beta_4 = 0\): the change in the model sum of squares between the full model with four inputs and the reduced model with two inputs. However, there has to be more to it. If the data are very noisy, this change will have to be large to convince us of evidence against the hypothesis. If the data have small error variability, a smaller change in model sums of squares will suffice. This leads to considering the following test statistic: \[ F_{obs} = \frac{SS(\beta_3, \beta_4 | \beta_0, \beta_1, \beta_2) / 2}{\widehat{\sigma}^2} \] Notice that the sum of squares in the numerator is divided by the degrees of freedom of the hypothesis. To find an estimator for the variance in the denominator we rely on SSE in the larger of the two models, the unconstrained model, because it is more likely to be an unbiased estimator of \(\sigma^2\).

Definition: Sum of Squares Reduction Test

Suppose that \(\text{SSE}_f\) and \(\text{SSE}_r\) are the error sums of squares in a full and a reduced model, where the reduced model is defined by a constraint \(H\) with \(q\) degrees of freedom imposed on the full model. The statistic \[ F_{obs} = \frac{(\text{SSE}_r - \text{SSE}_f)/q}{\text{SSE}_f/\text{dfE}_f} \] follows an F distribution with \(q\) numerator and \(\text{dfE}_f\) denominator degrees of freedom if the model errors follow a Gaussian distribution, \(\boldsymbol{\epsilon}\sim G(\textbf{0},\sigma^2\textbf{I})\). \(\text{dfE}_f\) are the degrees of freedom associated with SSE in the full model, \(n-r(\textbf{X}_f)\).

The sum of squares reduction test is a very general and powerful tool to answer questions about the parameters in a linear model. However, in order to use any hypothesis testing framework that works with probability statements, a distributional assumption is required. We can always calculate \(F_{obs}\) between two models. Computing \(p\)-values, that is, the probability \(\Pr(F_{q,\text{dfE}_f} > F_{obs})\), is only valid if the model errors are normally distributed.

Caution

If you reject the hypothesis \(H\) based on a small \(p\)-value, you do not conclude that the full model is the correct model. You can say that there is significant evidence that the constraint can be relaxed. You might have compared a really bad model and a bad model.

Example: Auto Data (Cont’d)

To test the hypothesis that the coefficients for weight and horsepower are simultaneously zero in a model that accounts for cylinders and displacement, we can use the sum of squares reduction test.

lm_full <- lm(mpg ~ cylinders + displacement + weight + horsepower, data=Auto)
lm_red  <- lm(mpg ~ cylinders + displacement                      , data=Auto)

SSE_full <- sum(lm_full$residuals^2)
SSE_red  <- sum(lm_red$residuals^2)
q <- lm_full$rank - lm_red$rank
sigma2_hat <- SSE_full / (lm_full$df.residual)

Fobs <- ((SSE_red-SSE_full)/q)/sigma2_hat
pvalue <- 1-pf(Fobs,q,lm_full$df.residual)

cat("SSE_r: ", SSE_red, "SSE_f: ", SSE_full, "\n")
SSE_r:  8342.566 SSE_f:  6963.433 
cat("sigma2_hat: ", sigma2_hat, "\n")
sigma2_hat:  17.99337 
cat("Fobs: ", Fobs, "Pr(F > Fobs): ", pvalue)
Fobs:  38.32337 Pr(F > Fobs):  6.661338e-16

Removing weight and horsepower from the four-predictor model increases the error sum of squares from 6963.433 to 8342.566. The F statistic for this reduction test is \(F_{obs} =\) 38.323 and the \(p\)-value is very small (\(6.66 \times 10^{-16}\)). We reject the hypothesis; the two-predictor model is not a sufficient model to explain mileage compared to the four-predictor model.

You can get to this result more quickly by using the anova function in R:

anova(lm_red, lm_full)
Analysis of Variance Table

Model 1: mpg ~ cylinders + displacement
Model 2: mpg ~ cylinders + displacement + weight + horsepower
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    389 8342.6                                  
2    387 6963.4  2    1379.1 38.323 6.529e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
import scipy.stats as stats

X_full = Auto_data[['cylinders', 'displacement', 'weight', 'horsepower']]
X_full = sm.add_constant(X_full)
y = Auto_data['mpg']
lm_full = sm.OLS(y, X_full).fit()

# Fit the reduced model
X_red = Auto_data[['cylinders', 'displacement']]
X_red = sm.add_constant(X_red)
lm_red = sm.OLS(y, X_red).fit()

SSE_full = sum(lm_full.resid**2)
SSE_red = sum(lm_red.resid**2)

q = lm_full.df_model - lm_red.df_model
df_residual = lm_full.df_resid

sigma2_hat = SSE_full / df_residual

Fobs = ((SSE_red - SSE_full) / q) / sigma2_hat
pvalue = 1 - stats.f.cdf(Fobs, q, df_residual)

# Print results
print(f"SSE_r: {SSE_red:.4f} SSE_f: {SSE_full:.4f}")
SSE_r: 8342.5664 SSE_f: 6963.4333
print(f"sigma2_hat: {sigma2_hat:.4f}")
sigma2_hat: 17.9934
print(f"Fobs: {Fobs:.4f} Pr(F > Fobs): {pvalue:.6f}")
Fobs: 38.3234 Pr(F > Fobs): 0.000000

Removing weight and horsepower from the four-predictor model increases the error sum of squares from 6963.433 to 8342.566. The F statistic for this reduction test is \(F_{obs} =\) 38.323 and the \(p\)-value is very small. We reject the hypothesis; the two-predictor model is not a sufficient model to explain mileage compared to the four-predictor model.

You can get to this result more quickly by using the anova_lm function from statsmodels:

from statsmodels.stats.anova import anova_lm
print("\nANOVA Table:")

ANOVA Table:
print(anova_lm(lm_red, lm_full))
   df_resid          ssr  df_diff      ss_diff          F        Pr(>F)
0     389.0  8342.566366      0.0          NaN        NaN           NaN
1     387.0  6963.433344      2.0  1379.133022  38.323371  6.528992e-16

Sequential and Partial Sums of Squares

The sum of squares reduction test is helpful to understand the difference between two important special types of sums of squares, sequential and partial ones. How do we measure the contribution an individual predictor makes to the model \[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \] Sequential sums of squares measure the contribution of an input relative to the inputs that precede it in the model. Partial sums of squares measure the contribution relative to all other inputs in the model. For example, if the input of interest is \(x_3\) in a four-regressor model, \[ SS(\beta_3 | \beta_0, \beta_1, \beta_2) \] is the sequential sum of squares for \(x_3\) and \[ SS(\beta_3 | \beta_0, \beta_1, \beta_2, \beta_4) \] is the partial sum of squares. \(\beta_4\) does not appear in the sequential sum of squares for \(x_3\) because \(x_4\) appears after \(x_3\) in the model formula. In other words, the sequential sum of squares for \(x_3\) does not adjust for \(x_4\) at all, while the partial sum of squares does. Clearly, the two types of sums of squares are not identical, unless the inputs are orthogonal (independent).

Sequential sums of squares have an appealing additive property: the overall model sum of squares can be accumulated from a series of sequential terms,

\[\begin{align*} \text{SSM} &= SS(\beta_1 | \beta_0)\\ &+ SS(\beta_2 | \beta_0, \beta_1) \\ &+ SS(\beta_3 | \beta_0, \beta_1, \beta_2) \\ &+ \cdots \\ &+ SS(\beta_p | \beta_0, \beta_1, \beta_2, \ldots, \beta_{p-1}) \end{align*}\]

Partial sums of squares do not add up to anything meaningful, unless the inputs are orthogonal.
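
In Python, both flavors can be requested from a formula-based statsmodels fit with anova_lm: typ=1 gives sequential (Type I) sums of squares and, for a main-effects-only model like the one used here, typ=2 gives the partial sums of squares. A sketch, assuming the Auto_data frame loaded earlier:

import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

mlr_f = smf.ols("mpg ~ cylinders + displacement + weight + horsepower", data=Auto_data).fit()
print(anova_lm(mlr_f, typ=1))    # sequential (Type I) sums of squares
print(anova_lm(mlr_f, typ=2))    # partial (Type II) sums of squares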

This raises an interesting question. If we divide a coefficient estimate by its standard error and compute a \(p\)-value from a \(t\)-distribution, what kind of hypothesis is being tested?

Example: Auto Data (Cont’d)

Here is the summary of the four-predictor model for mpg.

summary(lm(mpg ~ cylinders + displacement + weight + horsepower, data=Auto))

Call:
lm(formula = mpg ~ cylinders + displacement + weight + horsepower, 
    data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.5248  -2.7964  -0.3568   2.2577  16.3221 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  45.7567705  1.5200437  30.102  < 2e-16 ***
cylinders    -0.3932854  0.4095522  -0.960 0.337513    
displacement  0.0001389  0.0090099   0.015 0.987709    
weight       -0.0052772  0.0007166  -7.364 1.08e-12 ***
horsepower   -0.0428125  0.0128699  -3.327 0.000963 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.242 on 387 degrees of freedom
Multiple R-squared:  0.7077,    Adjusted R-squared:  0.7046 
F-statistic: 234.2 on 4 and 387 DF,  p-value: < 2.2e-16

How do we interpret the \(p\)-value of 0.337513 for the cylinders variable or the \(p\)-value of 0.987709 for the displacement variable? A two-sided \(t\)-test is equivalent to an F test with 1 numerator degree of freedom where \(F_{obs}\) is the square of the \(t_{obs}\) statistic. Let’s first compute the sequential sum of squares tests and see if they match the \(p\)-values in the output.

lm1 <- lm(mpg ~ cylinders, data=Auto)
lm2 <- lm(mpg ~ cylinders + displacement, data=Auto)
lm3 <- lm(mpg ~ cylinders + displacement + weight, data=Auto)
lm4 <- lm(mpg ~ cylinders + displacement + weight + horsepower, data=Auto)

anova(lm1, lm2, lm3, lm4)
Analysis of Variance Table

Model 1: mpg ~ cylinders
Model 2: mpg ~ cylinders + displacement
Model 3: mpg ~ cylinders + displacement + weight
Model 4: mpg ~ cylinders + displacement + weight + horsepower
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    390 9415.9                                  
2    389 8342.6  1   1073.34 59.652 9.795e-14 ***
3    388 7162.5  1   1180.02 65.581 7.342e-15 ***
4    387 6963.4  1    199.12 11.066 0.0009633 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The \(p\)-values are different from those shown in the lm output, so Pr(>|t|) = 0.987709 cannot have a sequential interpretation (adding displacement to a model that contains cylinders only). The \(p\)-values in the lm summary have a partial interpretation as seen by performing reduction tests between the following reduced models and the full model with four predictors:

lm_nocyl <- lm(mpg ~ displacement + weight + horsepower, data=Auto)
lm_nodis <- lm(mpg ~ cylinders + weight + horsepower, data=Auto)
lm_nowgt <- lm(mpg ~ cylinders + displacement + horsepower, data=Auto)
lm_nohp  <- lm(mpg ~ cylinders + displacement + weight, data=Auto)

anova(lm_nocyl,lm4)
Analysis of Variance Table

Model 1: mpg ~ displacement + weight + horsepower
Model 2: mpg ~ cylinders + displacement + weight + horsepower
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    388 6980.0                           
2    387 6963.4  1    16.592 0.9221 0.3375
anova(lm_nodis,lm4)
Analysis of Variance Table

Model 1: mpg ~ cylinders + weight + horsepower
Model 2: mpg ~ cylinders + displacement + weight + horsepower
  Res.Df    RSS Df Sum of Sq     F Pr(>F)
1    388 6963.4                          
2    387 6963.4  1  0.004276 2e-04 0.9877
anova(lm_nowgt,lm4)
Analysis of Variance Table

Model 1: mpg ~ cylinders + displacement + horsepower
Model 2: mpg ~ cylinders + displacement + weight + horsepower
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    388 7939.2                                  
2    387 6963.4  1    975.72 54.227 1.085e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm_nohp ,lm4)
Analysis of Variance Table

Model 1: mpg ~ cylinders + displacement + weight
Model 2: mpg ~ cylinders + displacement + weight + horsepower
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    388 7162.5                                  
2    387 6963.4  1    199.12 11.066 0.0009633 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The \(p\)-values of the partial reduction F-tests are identical to those for the \(t\)-tests in the lm output. The \(F_{obs}\) statistics are the squared values of the \(t_{obs}\) statistics.

On the other hand, if you ask for the analysis of variance on the full model with the aov function, you get sequential tests of the inputs.

summary(aov(lm4))
              Df Sum Sq Mean Sq F value   Pr(>F)    
cylinders      1  14403   14403  800.47  < 2e-16 ***
displacement   1   1073    1073   59.65 9.79e-14 ***
weight         1   1180    1180   65.58 7.34e-15 ***
horsepower     1    199     199   11.07 0.000963 ***
Residuals    387   6963      18                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
from patsy import dmatrices
from statsmodels.stats.anova import anova_lm
import statsmodels.api as sm

formula1 = "mpg ~ cylinders + displacement + weight + horsepower"
y, X = dmatrices(formula1, data=Auto_data, return_type='dataframe')
mlr = sm.OLS(y, X).fit()
print(mlr.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.708
Model:                            OLS   Adj. R-squared:                  0.705
Method:                 Least Squares   F-statistic:                     234.2
Date:                Wed, 30 Apr 2025   Prob (F-statistic):          6.18e-102
Time:                        17:23:23   Log-Likelihood:                -1120.1
No. Observations:                 392   AIC:                             2250.
Df Residuals:                     387   BIC:                             2270.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       45.7568      1.520     30.102      0.000      42.768      48.745
cylinders       -0.3933      0.410     -0.960      0.338      -1.199       0.412
displacement     0.0001      0.009      0.015      0.988      -0.018       0.018
weight          -0.0053      0.001     -7.364      0.000      -0.007      -0.004
horsepower      -0.0428      0.013     -3.327      0.001      -0.068      -0.018
==============================================================================
Omnibus:                       37.596   Durbin-Watson:                   0.862
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               50.918
Skew:                           0.696   Prob(JB):                     8.77e-12
Kurtosis:                       4.085   Cond. No.                     2.23e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.23e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

How do we interpret the \(p\)-value of 0.338 for the cylinders variable or the \(p\)-value of 0.988 for the displacement variable? A two-sided \(t\)-test is equivalent to an F test with one numerator degree of freedom, where \(F_{obs}\) is the square of the \(t_{obs}\) statistic. Let’s first compute the sequential sum of squares tests and see if they match the \(p\)-values in the output.

y, X = dmatrices("mpg ~ cylinders", data=Auto_data, return_type='dataframe')
lm1 = sm.OLS(y, X).fit()

y, X = dmatrices("mpg ~ cylinders + displacement", data=Auto_data, return_type='dataframe')
lm2 = sm.OLS(y, X).fit()

y, X = dmatrices("mpg ~ cylinders + displacement + weight", data=Auto_data, return_type='dataframe')
lm3 = sm.OLS(y, X).fit()

anova_lm(lm1, lm2, lm3, mlr)
   df_resid          ssr  df_diff      ss_diff          F        Pr(>F)
0     390.0  9415.910391      0.0          NaN        NaN           NaN
1     389.0  8342.566366      1.0  1073.344025  59.652203  9.696240e-14
2     388.0  7162.549156      1.0  1180.017210  65.580675  7.297943e-15
3     387.0  6963.433344      1.0   199.115812  11.066067  9.633308e-04

The \(p\)-values are different from those shown in the summary of the four input model, so Pr(>|t|) = 0.988 cannot have a sequential interpretation (adding displacement to a model that contains cylinders only). The \(p\)-values in the model summary have a partial interpretation as seen by performing reduction tests between the following reduced models and the full model with four predictors:

y, X = dmatrices("mpg ~ displacement + weight + horsepower", data=Auto_data, return_type='dataframe')
lm_nocyl = sm.OLS(y, X).fit()

y, X = dmatrices("mpg ~ cylinders + weight + horsepower", data=Auto_data, return_type='dataframe')
lm_nodis = sm.OLS(y, X).fit()

y, X = dmatrices("mpg ~ cylinders + displacement + horsepower", data=Auto_data, return_type='dataframe')
lm_nowgt = sm.OLS(y, X).fit()


y, X = dmatrices("mpg ~ cylinders + displacement + weight", data=Auto_data, return_type='dataframe')
lm_nohp = sm.OLS(y, X).fit()

anova_lm(lm_nocyl,mlr)
   df_resid          ssr  df_diff    ss_diff         F    Pr(>F)
0     388.0  6980.025762      0.0        NaN       NaN       NaN
1     387.0  6963.433344      1.0  16.592418  0.922141  0.337513
anova_lm(lm_nodis,mlr)
   df_resid          ssr  df_diff   ss_diff         F    Pr(>F)
0     388.0  6963.437620      0.0       NaN       NaN       NaN
1     387.0  6963.433344      1.0  0.004276  0.000238  0.987709
anova_lm(lm_nowgt,mlr)
   df_resid          ssr  df_diff     ss_diff          F        Pr(>F)
0     388.0  7939.158297      0.0         NaN        NaN           NaN
1     387.0  6963.433344      1.0  975.724953  54.226922  1.084846e-12
anova_lm(lm_nohp ,mlr)
   df_resid          ssr  df_diff     ss_diff          F    Pr(>F)
0     388.0  7162.549156      0.0         NaN        NaN       NaN
1     387.0  6963.433344      1.0  199.115812  11.066067  0.000963

7.3 Prediction

To obtain the fitted values in the linear model \(\textbf{Y}= \textbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}\) we apply the OLS estimates and get \[ \widehat{\textbf{y}} = \textbf{X}\widehat{\boldsymbol{\beta}} \] To predict at a previously unobserved value of the inputs, a new observation \(\textbf{x}_0\), we apply the regression equation and the OLS estimates to \(\textbf{x}_0\): \[ \widehat{\textbf{y}}_0 = \textbf{x}_0^\prime \widehat{\boldsymbol{\beta}} \]

This seems too simple; there must be a trick to it. The “trick” lies in the question: what is being predicted?

The target of the prediction (the what) can be \(Y\), the target variable, or \(\text{E}[Y]\), the mean of the target variable. How are these different? \(Y\) is a random variable and \(\text{E}[Y]\) is a constant; \(\text{Var}[Y] = \sigma^2\) and \(\text{Var}[\text{E}[Y]] = 0\). The uncertainty in predicting an individual observation and predicting the mean function will differ. The former has to take into account the inherent variability \(\sigma^2\) in the data.

How are the predicted values themselves different?

For illustration, consider an SLR model \(Y = \beta_0 + \beta_1x + \epsilon\). To predict \(Y\) at \(X=x_0\) we choose the obvious expression \[ \widehat{Y}_0 = \widehat{\beta}_0 + \widehat{\beta}_1x_0 + \widehat{\epsilon} \] substituting estimates for all unknowns on the right hand side. Since \(\epsilon\) is a zero-mean random variable and cannot be observed directly, the best estimate is \(\widehat{\epsilon} = 0\) which leads to \[ \widehat{Y}_0 = \widehat{\beta}_0 + \widehat{\beta}_1x_0 \]

That is the same expression we use to predict \(\text{E}[Y_0] = \beta_0 + \beta_1 x_0\), substituting the OLS estimates for unknowns on the right hand side: \[ \widehat{\text{E}}[Y_0] = \widehat{\beta}_0 + \widehat{\beta}_1x_0 \] And therein lies the crux. The predicted values for an observation and for the mean of an observation are the same. But their variability is not the same. We need to be very clear about what it is we are shooting for. More frequently one is interested in predicting observations, not averages of observations. In a study of health outcomes over time you might be more interested in predicting how a patient does at time \(t\), rather than how the population of patients does on average at time \(t\). Yet when folks see the different levels of confidence we have in the two predictions, they wish they could make predictions for the average.

Prediction Variance

The variance of \(\widehat{\text{E}}[Y_0] = \textbf{x}_0^\prime\widehat{\boldsymbol{\beta}}\) is straightforward and depends only on the variance of \(\widehat{\boldsymbol{\beta}}\): \[ \text{Var}\left[\widehat{\text{E}}[Y_0]\right] = \sigma^2 \, \textbf{x}_0^\prime(\textbf{X}^\prime\textbf{X})^{-1}\textbf{x}_0 \] To account for the variability in the data, the proper variance to consider when predicting an individual observation is \[ \text{Var}[\widehat{Y}_0 - Y_0] = \text{Var}[\widehat{Y}_0 - \epsilon_0] = \text{Var}[\textbf{x}_0^\prime\widehat{\boldsymbol{\beta}}-\epsilon_0] = \sigma^2\left(1+\textbf{x}_0^\prime(\textbf{X}^\prime\textbf{X})^{-1}\textbf{x}_0\right) \] where \(\epsilon_0\) is the error of the new observation, which is independent of the training data and hence of \(\widehat{\boldsymbol{\beta}}\).

The additional \(1+\) inside the parentheses does not seem like a big deal but has substantial numerical consequences. For example, at the center of the \(x\)-data in a simple linear regression, \(\textbf{x}_0^\prime(\textbf{X}^\prime\textbf{X})^{-1}\textbf{x}_0 = 1/n\): the variance for estimating the mean is \(\sigma^2/n\), while the variance for predicting an individual observation is \(\sigma^2(1+1/n)\), which stays close to \(\sigma^2\) no matter how large the sample.

Confidence and Prediction Intervals

When the errors have a Gaussian distribution, the following random variables have \(t\) distributions with \(n-r(\textbf{X})\) degrees of freedom:

\[ \begin{align*} t &= \frac{\widehat{Y}_0 - \text{E}[Y_0]}{\sqrt{\widehat{\text{Var}}[\widehat{Y}_0]}} \\ t &= \frac{\widehat{Y}_0 - Y_0}{\sqrt{\widehat{\text{Var}}[\widehat{Y}_0 - Y_0]}} \end{align*} \] where \(\widehat{\text{Var}}\) denotes the variance with the estimate \(\widehat{\sigma}^2\) substituted for \(\sigma^2\).

The first is used to construct \((1-\alpha)\)-level confidence intervals for the mean of the target \[ \widehat{y}_0 \pm t_{\frac{\alpha}{2},n-(p+1)} \sqrt{\widehat{\sigma}^2 \, \textbf{x}_0^\prime(\textbf{X}^\prime\textbf{X})^{-1}\textbf{x}_0} \] The second is used to construct \((1-\alpha)\)-level prediction intervals for an individual target \[ \widehat{y}_0 \pm t_{\frac{\alpha}{2},n-(p+1)} \sqrt{\widehat{\sigma}^2 \left(1+ \textbf{x}_0^\prime(\textbf{X}^\prime\textbf{X})^{-1}\textbf{x}_0 \right)} \]

Example: PISA OECD Study

PISA (Program for International Student Assessment) is an OECD study in 65 countries to evaluate the performance of 15-year-old students in math, science, and reading. Among the questions the study is trying to address is whether the educational level in a country is influenced by economic wealth, and if so, to what extent.

library(duckdb)

con <- dbConnect(duckdb(),dbdir="ads.ddb", read_only=FALSE)
pisa <- dbGetQuery(con,"SELECT * from pisa;")
dbDisconnect(con)

head(pisa)
               Country MathMean MathShareLow MathShareTop ReadingMean
1       Shanghai-China      613          3.8         55.4         570
2            Singapore      573          8.3         40.0         542
3 Hong Kong SAR, China      561          8.5         33.7         545
4       Chinese Taipei      560         12.8         37.2         523
5                Korea      554          9.1         30.9         536
6     Macao SAR, China      538         10.8         24.3         509
  ScienceMean     GDPp  logGDPp HighIncome
1         580  6264.60  8.74267      FALSE
2         551 54451.21 10.90506       TRUE
3         555 36707.77 10.51074       TRUE
4         523       NA       NA         NA
5         538 24453.97 10.10455       TRUE
6         521 77145.04 11.25344       TRUE

The following statements compute the simple linear regression of MathMean on the log GDP and 95% prediction and confidence intervals.

pisa_slr <- lm(MathMean ~ logGDPp, data=pisa)

xvals <- data.frame(logGDPp=seq(from=6, to = 14, by=0.1))
p_pred <- data.frame(predict(pisa_slr, newdata=xvals, interval="prediction"))
p_conf <- data.frame(predict(pisa_slr, newdata=xvals, interval="confidence"))

head(p_pred)
       fit      lwr      upr
1 357.8802 250.6125 465.1479
2 360.7589 254.0404 467.4775
3 363.6376 257.4570 469.8183
4 366.5163 260.8621 472.1706
5 369.3951 264.2555 474.5346
6 372.2738 267.6372 476.9104
head(p_conf)
       fit      lwr      upr
1 357.8802 307.8856 407.8749
2 360.7589 311.9537 409.5642
3 363.6376 316.0200 411.2553
4 366.5163 320.0844 412.9483
5 369.3951 324.1466 414.6435
6 372.2738 328.2065 416.3410
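
The intervals returned by predict() can be reproduced directly from the formulas above; a minimal sketch in R at logGDPp = 6, reusing pisa_slr (compare to the first rows of p_conf and p_pred above):

X     <- model.matrix(pisa_slr)                 # design matrix of the fitted model
x0    <- c(1, 6)                                # intercept and logGDPp = 6
s2    <- summary(pisa_slr)$sigma^2              # estimate of sigma^2
lev0  <- drop(t(x0) %*% solve(crossprod(X)) %*% x0)
fit0  <- drop(x0 %*% coef(pisa_slr))
tcrit <- qt(0.975, df=pisa_slr$df.residual)
fit0 + c(-1, 1)*tcrit*sqrt(s2*lev0)             # 95% confidence interval for the mean
fit0 + c(-1, 1)*tcrit*sqrt(s2*(1 + lev0))       # 95% prediction interval for an observation
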
import duckdb
import pandas as pd

con = duckdb.connect(database="ads.ddb", read_only=True)
pisa = con.sql("SELECT * FROM pisa;").df().dropna()
con.close()

pisa.head()
                Country  MathMean  ...    logGDPp  HighIncome
0        Shanghai-China       613  ...   8.742670       False
1             Singapore       573  ...  10.905060        True
2  Hong Kong SAR, China       561  ...  10.510744        True
4                 Korea       554  ...  10.104548        True
5      Macao SAR, China       538  ...  11.253443        True

[5 rows x 9 columns]

The following statements compute the simple linear regression of MathMean on the log GDP and 95% prediction and confidence intervals.

import statsmodels.api as sm
import pandas as pd
import numpy as np

X = sm.add_constant(pisa['logGDPp'])
y = pisa['MathMean']
pisa_slr = sm.OLS(y, X).fit()

xvals = pd.DataFrame({'logGDPp': np.arange(6, 14.1, 0.1)})
X_pred = sm.add_constant(xvals)

intervals = pd.DataFrame(
    pisa_slr.get_prediction(X_pred).summary_frame(alpha=0.05),
    columns=['mean', 'mean_se', 'mean_ci_lower', 'mean_ci_upper', 'obs_ci_lower', 'obs_ci_upper']
)

p_pred = intervals[['mean', 'obs_ci_lower', 'obs_ci_upper']]
p_pred.columns = ['fit', 'lwr', 'upr']  

p_conf = intervals[['mean', 'mean_ci_lower', 'mean_ci_upper']]
p_conf.columns = ['fit', 'lwr', 'upr']  

print("Prediction intervals:")
Prediction intervals:
print(p_pred.head())
          fit         lwr         upr
0  357.880227  250.612505  465.147949
1  360.758933  254.040380  467.477487
2  363.637639  257.456954  469.818324
3  366.516345  260.862057  472.170633
4  369.395050  264.255515  474.534586
print("\nConfidence intervals:")

Confidence intervals:
print(p_conf.head())
          fit         lwr         upr
0  357.880227  307.885571  407.874884
1  360.758933  311.953695  409.564171
2  363.637639  316.020002  411.255276
3  366.516345  320.084350  412.948339
4  369.395050  324.146586  414.643514

Both types of intervals are narrowest near the center of the \(x\) data range and widen toward the edges (Figure 7.3). A prediction outside the hull of the data will be much less precise than a prediction near the center of the data. The additional variance term that distinguishes the variance of \(\widehat{y}_0\) from that of \(\widehat{y}_0 - y_0\) has a considerable effect; the prediction intervals are much wider than the confidence intervals.

Figure 7.3: 95% prediction and confidence intervals.

7.4 Diagnostics

Model diagnostics are useful to examine model assumptions and to decide whether the model is an adequate description of the data. Questions we would like to answer through model diagnostics include

  • Are the assumptions of the model met?

    • Linearity (zero mean errors, \(\text{E}[\boldsymbol{\epsilon}] = \textbf{0}\))

    • Equal variance (homoscedasticity, \(\text{Var}[\boldsymbol{\epsilon}] = \sigma^2\textbf{I}\))

    • Independence of the errors \(\text{Cov}[\epsilon_i, \epsilon_j] = 0\)

    • Can we assume that the errors are normally distributed?

  • How well does the model predict new observations?

  • Do observations have undue influence on the results?

  • Does the relationship between the inputs negatively affect the analysis?

The three basic types of linear model diagnostics are

  1. Residual diagnostics to examine linearity, equal variance assumptions, and to detect outliers. Residual analysis relies on functions of \(y_i - \widehat{y}_i\) to study the behavior of the unobservable \(\epsilon_i\).

  2. Case-deletion diagnostics find data points that exert high influence on the analysis. These diagnostics ask how an aspect of the analysis (variability, predicted values, coefficient estimates, …) changes if an observation is removed from the analysis.

  3. Collinearity diagnostics examine the relationships among the inputs and whether they impact the analysis in a negative way. The extremes are completely orthogonal inputs (\(\textbf{X}^\prime\textbf{X}\) is a diagonal matrix) and inputs that are exact linear combinations of each other (\(\textbf{X}^\prime\textbf{X}\) is singular and a unique OLS solution does not exist). Most applications fall in-between unless one or more inputs are factors.

Leverage

Linear model diagnostics depend on the leverage values \(h_{ii}\), the diagonal values of the Hat matrix \(\textbf{H}\). That is not surprising because the fitted values are linear combinations of the target values, with weights given by the entries of the Hat matrix \[ \widehat{\textbf{y}} = \textbf{H}\textbf{y} \] The \(i\)th fitted value is a linear combination of the elements of \(\textbf{y}\), with weights from the \(i\)th row of \(\textbf{H}\): \[ \widehat{y}_i = \textbf{x}_i^\prime\widehat{\boldsymbol{\beta}} = \sum_{j=1}^n h_{ij}y_j \]

From the last expression it is easy to establish that \[ \text{Var}[\widehat{y}_i] = \sigma^2 \, \textbf{x}_i^\prime(\textbf{X}^\prime\textbf{X})^{-1}\textbf{x}_i = \sigma^2 h_{ii} \]

We can think of the leverages \(h_{ii}\) as standardized squared distance measures that tell us how far the \(i\)th data point is from the center of the \(x\)-data.

It is illustrative to look at the average variance of the fitted values across the data set, \[ \frac{1}{n}\sum_{i=1}^n \text{Var}[\widehat{y}_i] = \frac{1}{n} \sum_{i=1}^n \sigma^2h_{ii} = \sigma^2\left(\frac{p+1}{n}\right) \tag{7.1}\]

The last result follows because \(\textbf{H}\) is a projection matrix and thus \(tr(\textbf{H})\) equals its rank, \(p+1\).

What does Equation 7.1 tell us about the precision of the estimated regression output? Suppose \(p=4\) and \(n=10\). The average variance is then \(\sigma^2/2\). When \(p=4\) and \(n=1000\), the average variance is \(\sigma^2/200\). As the sample size increases while \(p\) remains fixed, the estimates become more precise. As \(p\) increases for a given sample size, the average variance of the fitted values increases. In high-dimensional problems, where \(p\) is large, OLS estimates have high variability and are unstable. Regularized estimation methods such as Ridge or Lasso regression can perform better in those circumstances.

Since \(\sum_{i=1}^n h_{ii} = p+1\), the average leverage value in the data is \((p+1)/n\), and a common threshold for flagging high leverage points is \(h_{ii} > 2(p+1)/n\). A high leverage point is not necessarily a problematic data point; it simply is an outlying point in the \(x\)-space. High leverage points have the potential to influence aspects of the analysis, that is, to become highly influential data points. More on this below.
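
A quick numerical check of these leverage facts, in R, reusing the four-input Auto model lm4 fit earlier in this chapter (a sketch):

h <- hatvalues(lm4)        # leverage values h_ii
sum(h)                     # equals the trace (= rank) of H, i.e., p+1 = 5
mean(h)                    # average leverage (p+1)/n = 5/392
sum(h > 2*mean(h))         # count of high leverage points by the 2(p+1)/n rule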

In summary, here are some important results involving the leverage values \(h_{ii}\):

  • \(h_{ii} = \textbf{x}_i^\prime (\textbf{X}^\prime\textbf{X})^{-1}\textbf{x}_i\)

  • \(\frac{1}{n} \le h_{ii} \le 1\). This holds only for the \(n\) training observations, the leverage \(\textbf{x}_0^\prime(\textbf{X}^\prime\textbf{X})^{-1}\textbf{x}_0\) of a new data point is not bounded in this way.

  • \(\overline{h} = (p+1)/n\); high leverage points are those for which \(h_{ii} > 2(p+1)/n\).

  • \(\text{Var}[\widehat{y}_i] = \sigma^2 h_{ii}\)

  • \(\text{Var}[y_i - \widehat{y}_i] = \sigma^2 (1-h_{ii})\)

  • \(\widehat{y}_i = \sum_{j=1}^n h_{ij}y_j\), the \(i\)th fitted value is a linear combination of the target values with the values in the \(i\)th row of \(\textbf{H}\).

  • \(\sum_{j=1}^n h_{ij} = 1\), the leverage values in the \(i\)th row of \(\textbf{H}\) sum to 1. Because the \(h_{ij}\) are bounded, sum to 1 within a row, and serve as the weights in the fitted values, a data point with \(h_{ii} \approx 1\) has outsize influence: its fitted value is almost entirely determined by its own target value.

Example: Same Leverage–Different Influence

This simulation demonstrates the concept of high leverage points with and without high influence. Data are simulated under the model \[ Y = \beta_0 + \beta_1 x + \epsilon \] with \(\epsilon \sim G(0,0.25^2), \beta_0 = 1, \beta_1 = 0.5\). The design points for the input variable are spread evenly from 1 to 2, and a high leverage point is added at \(x=4\). There are 22 observations, so the threshold for a high leverage point is \(2(p+1)/n = 2*2/22 = 0.18\).

Two data sets are simulated. One in which the target value at \(x=4\) concurs with the mean function, one in which the target value is unusually high (Figure 7.4).
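
The simulated values used below are read from a file; a minimal sketch in R of how data of this kind could be generated (the specific values here are illustrative and are not those in sim_leverage.csv):

set.seed(1)
x  <- c(seq(1, 2, length.out=21), 4)     # evenly spaced design plus a high leverage point at x=4
y1 <- 1 + 0.5*x + rnorm(22, sd=0.25)     # target agrees with the mean function everywhere
y2 <- y1
y2[22] <- y2[22] + 3                     # make the target at x=4 unusually high (shift chosen arbitrarily)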

datfr <- read.csv(file="data/sim_leverage.csv")

lm1 <- lm(y1 ~ x, data=datfr)
lm2 <- lm(y2 ~ x, data=datfr)

lm1$coefficients
(Intercept)           x 
  0.9868996   0.5024693 
lm2$coefficients
(Intercept)           x 
 0.02317379  1.16105181 
# the point at xlev has the same leverage in both regressions
round(hatvalues(lm1),4)
     1      2      3      4      5      6      7      8      9     10     11 
0.0932 0.0857 0.0789 0.0727 0.0671 0.0622 0.0579 0.0543 0.0512 0.0488 0.0471 
    12     13     14     15     16     17     18     19     20     21     22 
0.0460 0.0455 0.0456 0.0464 0.0478 0.0499 0.0525 0.0558 0.0598 0.0644 0.7671 
round(hatvalues(lm2),4)
     1      2      3      4      5      6      7      8      9     10     11 
0.0932 0.0857 0.0789 0.0727 0.0671 0.0622 0.0579 0.0543 0.0512 0.0488 0.0471 
    12     13     14     15     16     17     18     19     20     21     22 
0.0460 0.0455 0.0456 0.0464 0.0478 0.0499 0.0525 0.0558 0.0598 0.0644 0.7671 
import numpy as np
import pandas as pd
import statsmodels.api as sm

datfr = pd.read_csv("data/sim_leverage.csv")

X = sm.add_constant(datfr['x'])
y1 = datfr['y1']
y2 = datfr['y2']

lm1 = sm.OLS(y1, X).fit()
lm2 = sm.OLS(y2, X).fit()

print("lm1 coefficients:")
lm1 coefficients:
print(lm1.params)
const    0.986900
x        0.502469
dtype: float64
print("\nlm2 coefficients:")

lm2 coefficients:
print(lm2.params)
const    0.023174
x        1.161052
dtype: float64
leverage1 = np.round(lm1.get_influence().hat_matrix_diag, 4)
leverage2 = np.round(lm2.get_influence().hat_matrix_diag, 4)

# the point at xlev has the same leverage in both regressions
print(leverage1)
[0.0932 0.0857 0.0789 0.0727 0.0671 0.0622 0.0579 0.0543 0.0512 0.0488
 0.0471 0.046  0.0455 0.0456 0.0464 0.0478 0.0499 0.0525 0.0558 0.0598
 0.0644 0.7671]
print(leverage2)
[0.0932 0.0857 0.0789 0.0727 0.0671 0.0622 0.0579 0.0543 0.0512 0.0488
 0.0471 0.046  0.0455 0.0456 0.0464 0.0478 0.0499 0.0525 0.0558 0.0598
 0.0644 0.7671]

The fitted linear regressions are quite different. The parameter estimates in the first model, where \(y|x=4\) is not unusual, are close to the true values \(\beta_0 = 1, \beta_1 = 0.5\). In the second regression the parameter estimates are very different from the true values; the estimates are biased.

The leverage values in both models are identical, since they depend only on the \(x\)-data. The data point at \(x=4\) has high leverage; its value of 0.7671 considerably exceeds the threshold of 0.18. No other data point has high leverage.

Figure 7.4 displays the data and fitted regressions for the two data sets. Although \(x=4\) is a high leverage point, it has no undue influence on the estimated regression in the left panel. In the right panel, the high leverage point is a highly influential point due to its unusual \(y\)-value. The data point exerts its leverage by pulling the estimated regression towards it.

Figure 7.4: A data point with high leverage at \(x=4\) has little influence in one analysis and is highly influential in another, depending on its \(y\)-value. The dashed line is the true mean function.

Residual Diagnostics

Basic questions about the correctness of the model revolve around the properties of \(\boldsymbol{\epsilon}\). The usual assumption \(\boldsymbol{\epsilon}\sim (\textbf{0}, \sigma^2\textbf{I})\) states that the errors have zero mean, are uncorrelated, and have equal variance. Although \(\boldsymbol{\epsilon}\) is unobservable, we should be able to check those assumptions by looking at the OLS residuals of the fitted model, \(\widehat{\boldsymbol{\epsilon}} = \textbf{y}- \widehat{\textbf{y}}\). These are also called the raw residuals.

Unfortunately, the properties of \(\widehat{\boldsymbol{\epsilon}}\) match the properties of \(\boldsymbol{\epsilon}\) only partially. Because \(\widehat{\boldsymbol{\epsilon}}\) is the result of fitting a model to data, the fitted residuals obey constraints that do not affect the model errors \(\boldsymbol{\epsilon}\). Because \[ \textbf{X}^\prime \widehat{\boldsymbol{\epsilon}} = \textbf{X}^\prime (\textbf{I}-\textbf{H})\textbf{y}= \textbf{0} \] the raw residuals sum to zero across each column of the \(\textbf{X}\) matrix. In other words, there are only \(n-r(\textbf{X})\) degrees of freedom in the raw residual vector. From a statistical perspective, the residuals have zero mean, \(\text{E}[\widehat{\boldsymbol{\epsilon}}] = \textbf{0}\) and share this property with the model errors. The variance of \(\boldsymbol{\epsilon}\) and \(\widehat{\boldsymbol{\epsilon}}\) is different, however:

\[\begin{align*} \text{Var}[\boldsymbol{\epsilon}] &= \sigma^2 \textbf{I}\\ \text{Var}[\widehat{\boldsymbol{\epsilon}}] &= \sigma^2(\textbf{I}- \textbf{H}) \\ \text{Var}[\widehat{\epsilon}_i] &= \sigma^2 (1-h_{ii}) \end{align*}\]

While the model errors are uncorrelated, the fitted residuals are correlated because \(\textbf{I}- \textbf{H}\) is not a diagonal matrix. The fitted residuals also do not have the same variance; the variance depends on the leverage of the \(i\)th data point.

These properties (or lack thereof) should give us pause before using the raw residuals to diagnose the assumptions of equal variance or uncorrelated errors. Instead, residual diagnostics use transformations of the raw residuals.

Studentized residuals

The unequal variance of the residuals can be handled by standardizing, dividing the residual by its standard deviation (the square root of its variance) \[ \frac{\widehat{\epsilon}_i}{\sigma\sqrt{1-h_{ii}}} \] \(\sigma\) is unknown and the obvious solution is to substitute an estimator. Statisticians refer to this process, using an estimate to scale a random variable, as studentization. The studentized residual is thus \[ r_i = \frac{\widehat{\epsilon}_i}{\widehat{\sigma}\sqrt{1-h_{ii}}} \] The usual estimator for \(\sigma\) is the square root of the estimator of \(\sigma^2\), \[ \widehat{\sigma} = \sqrt{ \frac{\text{SSE}}{n-r(\textbf{X})}} \]

R-student residuals

A further adjustment can be made to the studentized residuals. Rather than use an estimate of the variance that is derived from all the data, an estimator can be used that does not depend on the \(i\)th observation. This technique is called external studentization in contrast to the internal studentization that gives rise to \(r_i\).

Fortunately such an external estimate of the variance \(\sigma^2\) that does not rely on the \(i\)th observation can be computed based on the analysis of all \(n\) observations. Not surprisingly, as with PRESS residuals discussed earlier, the leverage plays a role again: \[ \widehat{\sigma}^2_{-i} = \frac{(n-r(\textbf{X}))\widehat{\sigma}^2 - \frac{\widehat{\epsilon}^2_i}{1-h_{ii}}}{n-r(\textbf{X})-1} \]

The externally studentized residual is called the R-student residual, \[ t_i = \frac{\widehat{\epsilon}_i}{\widehat{\sigma}_{-i}\sqrt{1-h_{ii}}} \]

Since \(\widehat{\epsilon}_i\) and \(\widehat{\sigma}_{-i}\) are independent, \(t_i\) behaves like a \(t\)-distributed random variable. The R-student residuals are good diagnostics to detect outliers and high-influence points (hips). Outliers are observations unusual in \(y\)-space. They are not necessarily hips, unless they are also high leverage points (see the previous example).

The \(t_i\) work well for outliers and hips because an outlier has large \(\widehat{\epsilon}_i\) and a hip has small \(\sqrt{1-h_{ii}}\). Both effects increase the value of \(t_i\). This is also true for the (internally) studentized residual. In addition, outliers or hips will have a large \[ \frac{\widehat{\epsilon}^2_i}{1-h_{ii}} \] the adjustment term in the computation of \(\widehat{\sigma}^2_{-i}\). For those data points \(\widehat{\sigma}^2_{-i} < \widehat{\sigma}^2\) and \(t_i\) will be more sensitive than \(r_i\).

You can obtain all three sets of residuals easily in R:

  • The residuals component of the object returned by lm contains the raw residuals \(\widehat{\epsilon}_i\).

  • The rstandard() function returns the studentized residuals \(r_i\) (unfortunate function name)

  • The rstudent() function returns the R-student residuals \(t_i\)

When testing model assumptions such as linearity and equal variance (homoscedasticity), and when checking for outliers, the R-student residuals are the preferred quantities. The threshold \(|t_i| > 2\) is often applied to flag outlying observations.
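
The studentized and R-student residuals can also be reconstructed from the formulas above; a minimal sketch in R using the four-input Auto model lm4, comparing against rstandard() and rstudent():

e    <- resid(lm4)                       # raw residuals
h    <- hatvalues(lm4)                   # leverages
nu   <- df.residual(lm4)                 # n - r(X)
s2   <- sum(e^2)/nu                      # sigma^2-hat = SSE/(n - r(X))
r    <- e/sqrt(s2*(1 - h))               # internally studentized residuals
s2_i <- (nu*s2 - e^2/(1 - h))/(nu - 1)   # leave-one-out variance estimates
t_i  <- e/sqrt(s2_i*(1 - h))             # externally studentized (R-student) residuals
all.equal(unname(r),   unname(rstandard(lm4)))
all.equal(unname(t_i), unname(rstudent(lm4)))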

Example: Boston Housing Values

To demonstrate residual analysis in a linear model we use the Boston housing data that is part of the MASS library in R. The data set comprises 506 observations on the median value of owner-occupied houses (medv in $000s) in Boston suburbs and 13 variables describing the towns and properties.

The following statements fit a multiple linear regression model to predict median home value as a function of all but two inputs. The formula syntax medv ~ . includes all other variables in the dataframe as inputs; medv ~ . -indus -age includes all inputs except indus and age.

library(MASS)
fit <- lm(medv ~ . - indus - age, data=Boston)
summary(fit)

Call:
lm(formula = medv ~ . - indus - age, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.5984  -2.7386  -0.5046   1.7273  26.2373 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
crim         -0.108413   0.032779  -3.307 0.001010 ** 
zn            0.045845   0.013523   3.390 0.000754 ***
chas          2.718716   0.854240   3.183 0.001551 ** 
nox         -17.376023   3.535243  -4.915 1.21e-06 ***
rm            3.801579   0.406316   9.356  < 2e-16 ***
dis          -1.492711   0.185731  -8.037 6.84e-15 ***
rad           0.299608   0.063402   4.726 3.00e-06 ***
tax          -0.011778   0.003372  -3.493 0.000521 ***
ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
black         0.009291   0.002674   3.475 0.000557 ***
lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.736 on 494 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7348 
F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16

All 11 input variables are significant in this model; it explains 74% of the variability in median home values.

A plot of the R-student residuals against the observation number helps to identify outlying observations. Observations with \(t_i\) values outside the [-2, +2] interval are outliers. The plot also shows whether the equal variance assumption is reasonable; if the assumption is met the residuals show as a band of equal width.

RStudent <- rstudent(fit)

par(mar=c(5.1, 4.1, 2, 3))
plot(RStudent, xlab="Obs no.", las=1,bty="l")
abline(h= 2, lty="dashed")
abline(h=-2, lty="dashed")

A plot of the residuals against leverage shows observations that are unusual with respect to \(y\) (large absolute value of the residual), with respect to \(x\) (large leverage), or both. A high leverage point that is also an outlier is a highly influential data point.

leverage <- hatvalues(fit)
lev_threshold <- 2*fit$rank/length(fit$residuals)

par(mar=c(5.1, 4.1, 2, 3))
plot(leverage, RStudent, xlab="Leverage")
abline(h= 2, lty="dashed")
abline(h=-2, lty="dashed")
abline(v=lev_threshold, lty="dotted")

import pandas as pd
import duckdb 
print(duckdb.__version__)
1.2.2

con = duckdb.connect(database="ads.ddb", read_only=True)
Boston = con.sql("SELECT * FROM Boston").df().dropna()
con.close()
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

X = Boston.drop(['medv', 'indus', 'age'], axis=1)
X = sm.add_constant(X)
y = Boston['medv']

fit = sm.OLS(y, X).fit()
print(fit.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   medv   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.735
Method:                 Least Squares   F-statistic:                     128.2
Date:                Wed, 30 Apr 2025   Prob (F-statistic):          5.54e-137
Time:                        17:23:24   Log-Likelihood:                -1498.9
No. Observations:                 506   AIC:                             3022.
Df Residuals:                     494   BIC:                             3072.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         36.3411      5.067      7.171      0.000      26.385      46.298
crim          -0.1084      0.033     -3.307      0.001      -0.173      -0.044
zn             0.0458      0.014      3.390      0.001       0.019       0.072
chas           2.7187      0.854      3.183      0.002       1.040       4.397
nox          -17.3760      3.535     -4.915      0.000     -24.322     -10.430
rm             3.8016      0.406      9.356      0.000       3.003       4.600
dis           -1.4927      0.186     -8.037      0.000      -1.858      -1.128
rad            0.2996      0.063      4.726      0.000       0.175       0.424
tax           -0.0118      0.003     -3.493      0.001      -0.018      -0.005
ptratio       -0.9465      0.129     -7.334      0.000      -1.200      -0.693
black          0.0093      0.003      3.475      0.001       0.004       0.015
lstat         -0.5226      0.047    -11.019      0.000      -0.616      -0.429
==============================================================================
Omnibus:                      178.430   Durbin-Watson:                   1.078
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              787.785
Skew:                           1.523   Prob(JB):                    8.60e-172
Kurtosis:                       8.300   Cond. No.                     1.47e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.47e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

All 11 input variables are significant in this model; it explains 74% of the variability in median home values.

A plot of the R-student residuals against the observation number helps to identify outlying observations. Observations with \(t_i\) values outside the [-2, +2] interval are outliers. The plot also shows whether the equal variance assumption is reasonable; if the assumption is met the residuals show as a band of equal width.

from statsmodels.graphics.regressionplots import influence_plot
from statsmodels.stats.outliers_influence import OLSInfluence

influence = OLSInfluence(fit)
rstudent = influence.resid_studentized_external

plt.figure(figsize=(10, 6))
plt.plot(rstudent, 'o')
plt.axhline(y=2, linestyle='--', color='r')
plt.axhline(y=-2, linestyle='--', color='r')
plt.xlabel('Obs no.')
plt.ylabel('Studentized Residuals')
plt.title('Studentized Residuals Plot')
plt.grid(True, alpha=0.3)
plt.show()

A plot of the residuals against leverage shows observations that are unusual with respect to \(y\) (large absolute value of the residual), with respect to \(x\) (large leverage), or both. A high leverage point that is also an outlier is a highly influential data point.

leverage = influence.hat_matrix_diag
lev_threshold = 2 * (fit.df_model + 1) / len(fit.resid)

# Plot leverage vs studentized residuals
plt.figure(figsize=(10, 6))
plt.scatter(leverage, rstudent)
plt.axhline(y=2, linestyle='--', color='r')
plt.axhline(y=-2, linestyle='--', color='r')
plt.axvline(x=lev_threshold, linestyle=':', color='g')
plt.xlabel('Leverage')
plt.ylabel('Studentized Residuals')
plt.title('Leverage vs Studentized Residuals')
plt.grid(True, alpha=0.3)
plt.show()

In a simple linear regression model you can plot the residuals against the input \(x\). In a multiple linear regression model (or an SLR) you plot the residuals against the fitted values. The residuals should display no obvious trend against \(\widehat{y}\).

When plotting the residuals against the fitted value, add the predictions from a smoothing spline or other nonparametric model to identify trends in the residuals. Ideally, the smoothing spline should not show any gross departures from a flat line at zero.

The following plot raises concerns. We might not have the right set of inputs; inputs might need to be transformed, or additional or different terms might be needed in the model, for example, interactions between the inputs.

yhat <- predict(fit)
plot(yhat, RStudent)
abline(h= 2, lty="dashed")
abline(h=-2, lty="dashed")
# add a loess smooth of the R-student residuals against the fitted values
loe <- loess(RStudent ~ yhat)
ord <- order(yhat)
lines(yhat[ord], predict(loe)[ord], col="red", lwd=2)

from statsmodels.nonparametric.smoothers_lowess import lowess

yhat = fit.fittedvalues

# Plot fitted values vs studentized residuals
plt.figure(figsize=(10, 6))
plt.scatter(yhat, rstudent)

# Add reference lines
plt.axhline(y=2, linestyle='--', color='r')
plt.axhline(y=-2, linestyle='--', color='r')

lowess_result = lowess(rstudent, yhat, frac=3/4, it=3)
plt.plot(lowess_result[:, 0], lowess_result[:, 1], color='red', linewidth=2)

plt.xlabel('Fitted Values')
plt.ylabel('Studentized Residuals')
plt.title('Fitted Values vs Studentized Residuals')
plt.grid(True, alpha=0.3)
plt.show()

Partial regression plots

It is tempting to create plots comparing residuals against the values of the input variables. This is meaningful in a simple linear regression with one input and can help suggest transformations of \(X\) to achieve linearity and help diagnose heteroscedasticity.

In Figure 7.5, the variance of \(Y\) increases with the value of \(X\) and the relationship between \(Y\) and \(X\) is a second-degree polynomial.

Figure 7.5: Simulated data where the variance of the observations increases with \(X\).

The plot of the R-student residuals shows the increasing variability with \(x\) and a systematic quadratic trend in the residuals.

In a multiple linear regression (MLR) model it is tempting to create plots such as the previous one for all input variables. Unfortunately, such plots can be misleading because in an MLR model values of one input are changing with the values of other inputs. A way around this problem seems to be a plot of residuals versus the fitted values \(\widehat{y}_i\) but that is not an optimal solution either; this plot does not tell us anything about the \(X\)s that could be transformed to improve the model. It mushes together the contributions of all inputs.

So, how can we diagnose whether \(X_j\) needs to be transformed to account for non-linearity and visualize the relationship in such a way that accounts for the other inputs in the model? The answer is the partial regression plot, also known as the added-variable plot.

Suppose we partition the \(\textbf{X}\) matrix of the MLR model as follows \[ \textbf{X}= [\textbf{X}_{-j}, \textbf{x}_j] \] so that \(\textbf{X}_{-j}\) contains all inputs (including the intercept) except for the \(j\)th input. Now consider two new regression models:

  • Regress \(\textbf{y}\) on \(\textbf{X}_{-j}\). Denote the residuals from this regression as \(\textbf{e}_{y,-j}\)

  • Regress \(\textbf{x}_j\) on \(\textbf{X}_{-j}\). Denote the residuals from this regression as \(\textbf{e}_{x,-j}\)

The added variable plot for input \(j\) displays \(\textbf{e}_{y,-j}\) on the vertical axis and \(\textbf{e}_{x,-j}\) on the horizontal axis.

Example: Auto (Cont’d)

For the four-regressor model in the Auto example, added-variable plots can be constructed with the avPlots() function in the car package.

mlr <- lm(mpg ~ cylinders + displacement + weight + horsepower,
             data=Auto)
car::avPlots(mlr)

For the four-regressor model in the Auto example, added-variable (partial regression) plots can be constructed with the plot_partregress_grid function in statsmodels.graphics.regressionplots. (The plot_ccpr function in the same module produces component-plus-residual plots, a related but different diagnostic.)

from statsmodels.graphics.regressionplots import plot_partregress_grid

X = Auto_data[['cylinders', 'displacement', 'weight', 'horsepower']]
X = sm.add_constant(X)
y = Auto_data['mpg']
mlr = sm.OLS(y, X).fit()

# Added-variable (partial regression) plots, the equivalent of car::avPlots in R
fig = plt.figure(figsize=(12, 10))
plot_partregress_grid(mlr,
                      exog_idx=['cylinders', 'displacement', 'weight', 'horsepower'],
                      fig=fig)
plt.tight_layout()
plt.show()

Consider the scatter of points in an added-variable plot. A regression through the origin (a no-intercept model) fit to these points has the form \[ \textbf{e}_{y,-j} = \beta_j \, \textbf{e}_{x,-j} + \boldsymbol{\epsilon}^* \]

Using \(\beta_j\) to denote the slope in this regression is no accident. The estimate of \(\beta_j\) in this regression is the same as the estimate in the multiple linear regression model.
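
A minimal sketch in R with the Auto data verifies this for the weight input, reusing the mlr fit from the R code above:

e_y <- resid(lm(mpg    ~ cylinders + displacement + horsepower, data=Auto))
e_x <- resid(lm(weight ~ cylinders + displacement + horsepower, data=Auto))
coef(lm(e_y ~ e_x - 1))    # slope of the no-intercept partial regression
coef(mlr)["weight"]        # the same estimate as in the four-input model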

The added-variable plot is a visual representation of how \(X_j\) fares in the \(p\)-input model, even though we are looking only at a two-dimensional plot. The partial regression shows the effect of \(X_j\) on \(y\) as if it were added last to the model. If the residual point scatter in the added-variable plot suggests nonlinearity, a transformation of \(X_j\) is in order.

Inputs that are highly significant in the multiple linear regression model will have a tight point cloud in the added-variable plot. Inputs that are correctly specified in the model will show non-systematic scatter of the points around the line. Note that we are not looking for horizontal point clouds in the added-variable plots: the points are arranged around a line whose slope corresponds to the coefficient estimate.

Case Deletion Diagnostics

Case deletion diagnostics express how much an aspect of the model changes when an observation is removed from the analysis. The R-student residual is a leave-one-out diagnostic in this spirit, as it uses an estimator of \(\sigma\) that does not incorporate the \(i\)th data point.

The two most important case deletion diagnostics are Cook’s D and the DFFITS. The name DFFITS stands for difference in fit, standardized. The statistic measures the change in the predicted value, in units of standard errors, when the \(i\)th observation is deleted. We are concerned when a DFFITS value exceeds the threshold \(2\sqrt{(p+1)/n}\) in absolute value.

The Cook’s D (“D” for distance) statistic measures the change in the parameter estimates \(\boldsymbol{\beta}\) when the \(i\)th observation is removed. If the purpose of modeling is to build a model that predicts well, focus more on DFFITS. If the purpose of modeling is to test hypotheses, then focus more on Cook’s D. We are concerned if the D statistic exceeds 1.
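
For reference, both statistics can be written in terms of the studentized residuals and the leverage values introduced earlier; these are standard identities, stated here without derivation: \[ \text{DFFITS}_i = t_i\sqrt{\frac{h_{ii}}{1-h_{ii}}}, \quad \quad D_i = \frac{r_i^2}{p+1}\,\frac{h_{ii}}{1-h_{ii}} \] A large R-student (or studentized) residual or a leverage value close to 1 drives both statistics up.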

Cook’s D

Example: Boston Housing Values (Cont’d)

Computing Cook’s D for our model is easy:

D <- cooks.distance(fit)
D[which.max(D)]
      369 
0.1612148 
D[which(D > 0.1)]
      369       373 
0.1612148 0.1085777 

There are no data points with a D > 1. We conclude that there are no data points that unduly influence the regression coefficients.

A plot of the D statistic against the observation number shows that there is a group of data points with much higher values of the D statistic. This group also coincides with larger residuals in the previous plots.

plot(D, xlab="Obs no.")

Here are the observations with the larger D values:

D[which(D > 0.035)]
       365        366        368        369        370        371        372 
0.07895966 0.07399805 0.04632638 0.16121476 0.06142387 0.04956350 0.04241984 
       373        381        413        415 
0.10857770 0.03622523 0.05795739 0.03974459 

What do these observations have in common?

subset(Boston, D > 0.035)
        crim zn indus chas   nox    rm   age    dis rad tax ptratio  black
365  3.47428  0  18.1    1 0.718 8.780  82.9 1.9047  24 666    20.2 354.55
366  4.55587  0  18.1    0 0.718 3.561  87.9 1.6132  24 666    20.2 354.70
368 13.52220  0  18.1    0 0.631 3.863 100.0 1.5106  24 666    20.2 131.42
369  4.89822  0  18.1    0 0.631 4.970 100.0 1.3325  24 666    20.2 375.52
370  5.66998  0  18.1    1 0.631 6.683  96.8 1.3567  24 666    20.2 375.33
371  6.53876  0  18.1    1 0.631 7.016  97.5 1.2024  24 666    20.2 392.05
372  9.23230  0  18.1    0 0.631 6.216 100.0 1.1691  24 666    20.2 366.15
373  8.26725  0  18.1    1 0.668 5.875  89.6 1.1296  24 666    20.2 347.88
381 88.97620  0  18.1    0 0.671 6.968  91.9 1.4165  24 666    20.2 396.90
413 18.81100  0  18.1    0 0.597 4.628 100.0 1.5539  24 666    20.2  28.79
415 45.74610  0  18.1    0 0.693 4.519 100.0 1.6582  24 666    20.2  88.27
    lstat medv
365  5.29 21.9
366  7.12 27.5
368 13.33 23.1
369  3.26 50.0
370  3.73 50.0
371  2.96 50.0
372  9.53 50.0
373  8.88 50.0
381 17.21 10.4
413 34.37 17.9
415 36.98  7.0

For our model, Cook’s D is extracted from the influence statistics computed by statsmodels:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import OLSInfluence

influence = OLSInfluence(fit)
D = influence.cooks_distance[0]  # cooks_distance returns a tuple, we need the first element

max_idx = np.argmax(D)
print(f"Maximum Cook's distance: {D[max_idx]:.4f} at observation {max_idx}")
Maximum Cook's distance: 0.1612 at observation 368
# Find observations with Cook's distance > 0.1
large_D_indices = np.where(D > 0.1)[0]
if len(large_D_indices) > 0:
    print(f"Observations with Cook's distance > 0.1:")
    for idx in large_D_indices:
        print(f"Observation {idx}: {D[idx]:.4f}")
Observations with Cook's distance > 0.1:
Observation 368: 0.1612
Observation 372: 0.1086

There are no data points with a D > 1. We conclude that there are no data points that unduly influence the regression coefficients.

A plot of the D statistic against the observation number shows that there is a group of data points with much higher values of the D statistic. This group also coincides with larger residuals in the previous plots.

plt.figure(figsize=(10, 6))
plt.stem(D, markerfmt='ro', basefmt='b-')
plt.xlabel('Obs no.')
plt.ylabel("Cook's distance")
plt.title("Cook's Distance Plot")
plt.grid(True, alpha=0.3)
plt.show()

Here are the observations with the larger D values. What do these observations have in common?

# Find observations with Cook's distance > 0.035
influential_indices = np.where(D > 0.035)[0]
if len(influential_indices) > 0:
    print(f"Observations with Cook's distance > 0.035:")
    influential_data = Boston.iloc[influential_indices]
    print("\nDetails of influential observations:")
    print(influential_data)
Observations with Cook's distance > 0.035:

Details of influential observations:
         crim   zn  indus  chas    nox  ...    tax  ptratio   black  lstat  medv
364   3.47428  0.0   18.1     1  0.718  ...  666.0     20.2  354.55   5.29  21.9
365   4.55587  0.0   18.1     0  0.718  ...  666.0     20.2  354.70   7.12  27.5
367  13.52220  0.0   18.1     0  0.631  ...  666.0     20.2  131.42  13.33  23.1
368   4.89822  0.0   18.1     0  0.631  ...  666.0     20.2  375.52   3.26  50.0
369   5.66998  0.0   18.1     1  0.631  ...  666.0     20.2  375.33   3.73  50.0
370   6.53876  0.0   18.1     1  0.631  ...  666.0     20.2  392.05   2.96  50.0
371   9.23230  0.0   18.1     0  0.631  ...  666.0     20.2  366.15   9.53  50.0
372   8.26725  0.0   18.1     1  0.668  ...  666.0     20.2  347.88   8.88  50.0
380  88.97620  0.0   18.1     0  0.671  ...  666.0     20.2  396.90  17.21  10.4
412  18.81100  0.0   18.1     0  0.597  ...  666.0     20.2   28.79  34.37  17.9
414  45.74610  0.0   18.1     0  0.693  ...  666.0     20.2   88.27  36.98   7.0

[11 rows x 14 columns]

DFFITs

Example: Boston Housing Values (Cont’d)

You can calculate the DFFITs statistics with the dffits() function. The following plot shows those statistics along with the threshold.

p_plus1 <- length(fit$coefficients)
n_used <- length(fit$fitted.values)
threshold_dff <- 2 * sqrt(p_plus1/n_used) 
dff <- dffits(fit)
dff[which(abs(dff) > threshold_dff)]
        65        142        149        162        163        164        167 
 0.3844495  0.4775638  0.3858408  0.4608252  0.4684540  0.4413915  0.4735623 
       187        196        204        205        215        226        234 
 0.4237227  0.3407192  0.3117844  0.3319161  0.4907820  0.4158602  0.3670077 
       254        263        268        365        366        368        369 
 0.6332907  0.3271603  0.3633354 -0.9841664  0.9496438  0.7503581  1.4375975 
       370        371        372        373        374        375        376 
 0.8702277  0.7792767  0.7342909  1.1726437  0.3126816  0.6200353 -0.3475920 
       381        406        413        415        506 
-0.6593141 -0.3173979  0.8436891  0.6941816 -0.3201876 
plot(dff,ylab="DFFITs")
abline(h=threshold_dff)
abline(h=-threshold_dff)

You can extract the DFFITs statistics from the influence object returned by OLSInfluence in statsmodels. The following plot shows those statistics along with the threshold.

influence = OLSInfluence(fit)
dff = influence.dffits[0]  # dffits returns a tuple, we need the first element

# Calculate threshold for DFFITS
# 2 * sqrt(p/n) where p is the number of parameters and n is the number of observations
p = fit.df_model + 1  # +1 for intercept
n = len(fit.resid)
threshold_dff = 2 * np.sqrt(p/n)
print(f"DFFITS threshold: {threshold_dff:.4f}")
DFFITS threshold: 0.3080
# Find observations with DFFITS > threshold
influential_indices = np.where(np.abs(dff) > threshold_dff)[0]
if len(influential_indices) > 0:
    print("Observations with DFFITS > threshold:")
    for idx in influential_indices:
        print(f"Observation {idx}: {dff[idx]:.4f}")
Observations with DFFITS > threshold:
Observation 64: 0.3844
Observation 141: 0.4776
Observation 148: 0.3858
Observation 161: 0.4608
Observation 162: 0.4685
Observation 163: 0.4414
Observation 166: 0.4736
Observation 186: 0.4237
Observation 195: 0.3407
Observation 203: 0.3118
Observation 204: 0.3319
Observation 214: 0.4908
Observation 225: 0.4159
Observation 233: 0.3670
Observation 253: 0.6333
Observation 262: 0.3272
Observation 267: 0.3633
Observation 364: -0.9842
Observation 365: 0.9496
Observation 367: 0.7504
Observation 368: 1.4376
Observation 369: 0.8702
Observation 370: 0.7793
Observation 371: 0.7343
Observation 372: 1.1726
Observation 373: 0.3127
Observation 374: 0.6200
Observation 375: -0.3476
Observation 380: -0.6593
Observation 405: -0.3174
Observation 412: 0.8437
Observation 414: 0.6942
Observation 505: -0.3202
# Plot DFFITS
plt.figure(figsize=(10, 6))
plt.plot(dff, 'o')
plt.axhline(y=threshold_dff, color='r', linestyle='-')
plt.axhline(y=-threshold_dff, color='r', linestyle='-')  
plt.xlabel('Obs no.')
plt.ylabel('DFFITS')
plt.title('DFFITS Plot')
plt.grid(True, alpha=0.3)
plt.show()

The same group of observations with high D values also has high DFFITS values. In contrast to Cook’s D, where no observation exceeded the threshold of 1, these observations do exceed the DFFITS threshold. We conclude that the data points are not influential on the regression coefficient estimates, but they are influential on the predicted values. If the model is used to predict median home values, we should consider refining the model or excluding the outlying observations and refitting.

Collinearity Diagnostics

When inputs are related to each other, there are two important consequences:

  • we cannot interpret the regression coefficient for one input without considering the other inputs
  • with increasing dependence among the inputs, the least squares estimation procedure becomes increasingly numerically unstable.

The condition in which inputs are linearly related is called collinearity; it negatively affects any calculation that involves the \((\textbf{X}^\prime\textbf{X})^{-1}\) matrix (which is just about all of them).

The extreme case occurs when an input variable is a linear combination of other inputs, for example \[ Z = aX_2 + bX_3 \] Adding \(Z\) to a model that contains inputs \(X_2\) and \(X_3\) leads to a singular (rank-deficient) \(\textbf{X}\) matrix. The inverse cross-product matrix \((\textbf{X}^\prime\textbf{X})^{-1}\) does not exist and the OLS estimator cannot be computed. Software instead uses a generalized inverse \((\textbf{X}^\prime\textbf{X})^{-}\) to find a solution, which is not unique.

Example: Boston Housing Values

The following code fits a linear model with four inputs. The variable newvar is the sum of the zn and nox variables. With zn and nox already in the model, newvar does not provide any additional information. The \(\textbf{X}\) matrix is singular and an estimate for the coefficient of newvar cannot be found.

B <- Boston
B$newvar <- B$zn+B$nox
singular_model <- lm(medv ~ crim + zn + nox + newvar, data=B)
summary(singular_model)

Call:
lm(formula = medv ~ crim + zn + nox + newvar, data = B)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.659  -4.620  -1.684   2.211  31.584 

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  32.22033    2.20618  14.605  < 2e-16 ***
crim         -0.27519    0.04516  -6.094 2.19e-09 ***
zn            0.07749    0.01764   4.392 1.37e-05 ***
nox         -17.25946    3.83527  -4.500 8.45e-06 ***
newvar             NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.915 on 502 degrees of freedom
Multiple R-squared:  0.2637,    Adjusted R-squared:  0.2593 
F-statistic: 59.93 on 3 and 502 DF,  p-value: < 2.2e-16

R indicates the singularity in \(\textbf{X}\) with NA for the coefficient associated with the singularity. Note that this depends on the order in which the variables enter the model. When newvar is placed before zn and nox in the model, the nox coefficient is NA.

# Create a copy of Boston
B = Boston.copy()

B['newvar'] = B['zn'] + B['nox']

# Add a constant (intercept) to the model
X = sm.add_constant(B[['crim', 'zn', 'nox', 'newvar']])
singular_model = sm.OLS(B['medv'], X).fit()

print(singular_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   medv   R-squared:                       0.264
Model:                            OLS   Adj. R-squared:                  0.259
Method:                 Least Squares   F-statistic:                     59.93
Date:                Wed, 30 Apr 2025   Prob (F-statistic):           3.96e-33
Time:                        17:23:26   Log-Likelihood:                -1762.8
No. Observations:                 506   AIC:                             3534.
Df Residuals:                     502   BIC:                             3550.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         32.2203      2.206     14.605      0.000      27.886      36.555
crim          -0.2752      0.045     -6.094      0.000      -0.364      -0.186
zn             5.8048      1.273      4.561      0.000       3.304       8.305
nox          -11.5321      2.554     -4.515      0.000     -16.550      -6.514
newvar        -5.7273      1.281     -4.470      0.000      -8.245      -3.210
==============================================================================
Omnibus:                      185.375   Durbin-Watson:                   0.744
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              564.513
Skew:                           1.766   Prob(JB):                    2.62e-123
Kurtosis:                       6.781   Cond. No.                     2.97e+15
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 7.75e-26. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

The second note at the bottom of the output alerts us to a singularity problem in the model. Although coefficient values are reported for the intercept and all four input variables (statsmodels computes the estimates with a generalized inverse rather than refusing to estimate one of them), the model has only 3 model degrees of freedom (Df Model: 3) because of the exact linear relationship between the variables.
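
The rank deficiency can also be confirmed directly. A minimal check, assuming the design matrix X constructed with sm.add_constant() above:

import numpy as np

# X has five columns (constant, crim, zn, nox, newvar) but only rank 4,
# because newvar = zn + nox is an exact linear combination of two other columns
print("Number of columns:", X.shape[1])
print("Rank of X:        ", np.linalg.matrix_rank(np.asarray(X)))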

Collinearity is the condition where inputs are highly correlated. They do not follow exact linear dependencies but are close to linearly dependent (a so-called near-linear dependency). The situation can involve more than two input variables: for example, \(X_1\) might not be strongly correlated with \(X_2\) alone, but it can be strongly correlated with a linear combination of \(X_2\), \(X_3\), and \(X_8\). This condition is called multicollinearity.

When multicollinearity is strong, the OLS estimates of the regression coefficients become unstable: small perturbations in the target values or inputs can lead to large changes in the coefficients, and the \(\widehat{\beta}_j\) can have the wrong magnitude and/or sign.

A nontechnical diagnostic for multicollinearity is to compute the matrix of pairwise correlations among the inputs. A large value of \(\text{Corr}[X_j,X_k]\) is sufficient for collinearity, but it is not necessary: even with weak pairwise correlations there can be strong linear dependencies among multiple inputs.
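
To see why pairwise correlations can miss a joint dependency, consider a small simulation (a hypothetical illustration, not part of the Boston data): five independent inputs and a sixth that is their noisy sum. None of the pairwise correlations looks alarming, yet the sixth input is almost perfectly predicted by the other five.

import numpy as np

rng = np.random.default_rng(42)
n = 500
Z = rng.normal(size=(n, 5))                          # five independent inputs
x6 = Z.sum(axis=1) + rng.normal(scale=0.1, size=n)   # near-linear combination of the five

# Pairwise correlations of x6 with each input are modest (about 1/sqrt(5), roughly 0.45)
print(np.round([np.corrcoef(Z[:, j], x6)[0, 1] for j in range(5)], 2))

# But regressing x6 on the other five inputs gives an R-squared close to 1
Xmat = np.column_stack([np.ones(n), Z])
beta, *_ = np.linalg.lstsq(Xmat, x6, rcond=None)
resid = x6 - Xmat @ beta
print("R-squared of x6 on the other inputs:", round(1 - resid.var() / x6.var(), 3))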

Nevertheless, the pairwise correlation plot is a good place to start.

library("corrplot")
corrplot 0.94 loaded
X <- model.matrix(fit)
corrplot(cor(X[,2:12]), method = "circle", diag=FALSE, tl.col="black")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Extract the design matrix, similar to model.matrix() in R
X = fit.model.exog

# Convert to DataFrame for easier handling of column names
X_df = pd.DataFrame(X, columns=fit.model.exog_names)
X_without_intercept = X_df.iloc[:, 1:12]

corr_matrix = X_without_intercept.corr()

# Create correlation plot similar to corrplot with circle method
plt.figure(figsize=(10, 8))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))  # Hide the upper triangle and diagonal
sns.heatmap(corr_matrix, 
            mask=mask,
            annot=True, 
            fmt=".2f", 
            cmap="coolwarm", 
            linewidths=0.5,
            square=True,
            cbar_kws={"shrink": .5})

plt.tight_layout()
plt.show()

We see some strong pairwise relationships between tax and rad, between dis and nox, and between lstat and rm. Do we need to worry?

Here are some other, informal, ways to diagnose a multicollinearity problem:

  • The \(R^2\) statistic indicates the model explains substantial variability in \(Y\), but none or only a few of the inputs have statistically significant \(p\)-values. Because the \(t\)-tests are partial tests, the other input variables act as proxies for the variable being tested.

  • Different variable selection methods lead to very different models.

  • Standard errors of coefficients and/or fitted values are unusually large.

  • Slight perturbations of the data, for example, adding a small amount of Gaussian random noise, change the results dramatically (see the sketch after this list).
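
The following sketch illustrates the last point. It is a minimal example that assumes the statsmodels result object fit for the full Boston model from earlier; it perturbs the target with a small amount of Gaussian noise, refits, and reports how much each coefficient moves. With strong multicollinearity even tiny perturbations can shift some coefficients substantially.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Design matrix and target from the fitted model
X = fit.model.exog
y = fit.model.endog

# Perturb the target with Gaussian noise of 1% of its standard deviation
y_pert = y + rng.normal(scale=0.01 * y.std(), size=len(y))

b_orig = sm.OLS(y, X).fit().params
b_pert = sm.OLS(y_pert, X).fit().params

# Relative change in each coefficient; large changes hint at unstable estimates
rel_change = np.abs((b_pert - b_orig) / b_orig)
for name, rc in zip(fit.model.exog_names, rel_change):
    print(f"{name:>10s}: {rc:.3%}")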

A formal diagnosis relies on the computation of variance inflation factors (VIFs) or condition indices.

Variance inflation factors

Each predictor (input) variable in a linear model is associated with a variance inflation factor that quantifies the strength of linear dependencies between this input and all other inputs.

The \(j\)th VIF measures by how much the variance of the \(j\)th standardized coefficient is inflated because \(X_j\) is involved in linear dependencies with the other \(X\)s.

You can find the \(\text{VIF}_j\) from the \(R^2\) statistic of a multiple linear regression of \(X_j\) on all the other input variables. \[ \text{VIF}_j = \frac{1}{1-R^2_j} \] For example, the VIF for rad can be obtained by regressing rad on all the other inputs

Example: Boston Housing Values (Cont’d)

vif_calc <- lm(rad ~ crim + zn + chas + nox + rm + dis + tax + 
                   ptratio + black + lstat, data=Boston)
R2_rad <- summary(vif_calc)$r.squared
VIF_rad <- 1 / (1-R2_rad)
VIF_rad
[1] 6.861126
import statsmodels.api as sm
import pandas as pd
import numpy as np
from patsy import dmatrices

formula1 = "rad ~ crim + zn + chas + nox + rm + dis + tax + ptratio + black + lstat"
y, X = dmatrices(formula1, data=Boston, return_type='dataframe')
vif_calc = sm.OLS(y, X).fit()

R2_rad = vif_calc.rsquared
VIF_rad = 1 / (1-R2_rad)
print(f"Variance inflation factor for rad: {VIF_rad:.4f}")
Variance inflation factor for rad: 6.8611
Caution

When calculating variance inflation factors this way, make sure that the response variable does not appear on the right hand side of the model formula. The expression lm(rad ~ .) in R would include the target variable medv on the right hand side. Variance inflation factors capture relationships among the inputs and are not related to the target variable.

Notice that a variance inflation factor does not depend on \(Y\); it is based solely on relationships among the inputs (predictors). Also, you can see from the regression above that the VIF captures more than pairwise dependence: it models one input as a function of all the other inputs.

Example: Boston Housing Values (Cont’d)

To compute variance inflation factors in R directly, use the vif function in the car package (Companion to Applied Regression).

library(car)
vif(fit)
    crim       zn     chas      nox       rm      dis      rad      tax 
1.789704 2.239229 1.059819 3.778011 1.834806 3.443420 6.861126 7.272386 
 ptratio    black    lstat 
1.757681 1.341559 2.581984 

The VIF for rad reported by vif() matches the previous calculation.

To compute variance inflation factors in Python directly, use the variance_inflation_factor function in statsmodels.

from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd
import numpy as np

X = fit.model.exog 

vif_data = pd.DataFrame()
vif_data["Variable"] = fit.model.exog_names
vif_data["VIF"] = [variance_inflation_factor(X, i) for i in range(X.shape[1])]

print(vif_data)
   Variable         VIF
0     const  579.255845
1      crim    1.789704
2        zn    2.239229
3      chas    1.059819
4       nox    3.778011
5        rm    1.834806
6       dis    3.443420
7       rad    6.861126
8       tax    7.272386
9   ptratio    1.757681
10    black    1.341559
11    lstat    2.581984

Another way to compute the variance inflation factors is to fit the linear regression with a scaled and centered \(\textbf{X}^*\) matrix. The columns of \(\textbf{X}^*\) are centered at their sample means and scaled by dividing by \(\sqrt{n-1}\) times their standard deviation. As a result, the \(\textbf{X}^{*\prime} \textbf{X}^*\) matrix is the matrix of the empirical pairwise correlations of the inputs. The regression coefficients of this model are called the standardized coefficients (\(\beta^*_j\)) and the variance inflation factors are \[ \text{VIF}_j = \widehat{\text{Var}}[\widehat{\beta}^*_j] / \widehat{\sigma}^2 \]

Example: Variance Inflation Factors from Scaled-Centered Regression; Boston Data (Cont’d)

To compute the VIFs in the model for the Boston data, we compute first the centered-and-scaled \(\textbf{X}\) matrix. The scale() function in R centers and scales the data by default but uses the standard deviation as the scaling factor. We use a custom scaling so that the \(\textbf{X}^{*\prime}\textbf{X}^*\) matrix equals the empirical correlation matrix.

We use model.matrix() to extract the \(\textbf{X}\) matrix from the model object computed earlier. The intercept column is replaced with the target values so we can use this matrix as input to a call to lm.

X <- model.matrix(fit)
n_1 <- dim(X)[1] - 1
scaled_X <- scale(X,
                  center=apply(X,2,mean),
                  scale =apply(X,2,sd)*sqrt(n_1))
scaled_X[,1] <- Boston$medv
colnames(scaled_X)[1] <- "medv"
ll <- lm(medv ~ ., data=data.frame(scaled_X))
summary(ll)

Call:
lm(formula = medv ~ ., data = data.frame(scaled_X))

Residuals:
     Min       1Q   Median       3Q      Max 
-15.5984  -2.7386  -0.5046   1.7273  26.2373 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  22.5328     0.2106 107.018  < 2e-16 ***
crim        -20.9558     6.3361  -3.307 0.001010 ** 
zn           24.0276     7.0873   3.390 0.000754 ***
chas         15.5179     4.8758   3.183 0.001551 ** 
nox         -45.2476     9.2059  -4.915 1.21e-06 ***
rm           60.0245     6.4155   9.356  < 2e-16 ***
dis         -70.6350     8.7888  -8.037 6.84e-15 ***
rad          58.6248    12.4060   4.726 3.00e-06 ***
tax         -44.6079    12.7724  -3.493 0.000521 ***
ptratio     -46.0495     6.2792  -7.334 9.24e-13 ***
black        19.0611     5.4858   3.475 0.000557 ***
lstat       -83.8570     7.6104 -11.019  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.736 on 494 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7348 
F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16

The variance inflation factors are obtained by dividing the squared values in the Std. Error column by the estimate of the residual variance (the value reported for the intercept is not a variance inflation factor and can be ignored)

s <- summary(ll)
vif <- s$coefficients[,2]^2 / s$sigma^2
vif
(Intercept)        crim          zn        chas         nox          rm 
0.001976285 1.789704160 2.239228671 1.059819222 3.778010991 1.834806373 
        dis         rad         tax     ptratio       black       lstat 
3.443420336 6.861126315 7.272386358 1.757681497 1.341558750 2.581984268 
import numpy as np
import pandas as pd
import statsmodels.api as sm

X = fit.model.exog 

n_1 = X.shape[0] - 1

# Scale the design matrix; the constant column has zero standard deviation, which
# triggers a harmless RuntimeWarning since that column is replaced by medv below
scaled_X = (X - np.mean(X, axis=0)) / (np.std(X, axis=0, ddof=1) * np.sqrt(n_1))
<string>:3: RuntimeWarning: invalid value encountered in divide
scaled_X[:, 0] = Boston['medv']

# Create a DataFrame with column names
scaled_X_df = pd.DataFrame(scaled_X)
scaled_X_df.columns = fit.model.exog_names
scaled_X_df.rename(columns={scaled_X_df.columns[0]: 'medv'}, inplace=True)

# Fit linear model with standardized variables
formula = 'medv ~ ' + ' + '.join(scaled_X_df.columns[1:])
ll = sm.formula.ols(formula=formula, data=scaled_X_df).fit()

print(ll.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   medv   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.735
Method:                 Least Squares   F-statistic:                     128.2
Date:                Wed, 30 Apr 2025   Prob (F-statistic):          5.54e-137
Time:                        17:23:27   Log-Likelihood:                -1498.9
No. Observations:                 506   AIC:                             3022.
Df Residuals:                     494   BIC:                             3072.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     22.5328      0.211    107.018      0.000      22.119      22.946
crim         -20.9558      6.336     -3.307      0.001     -33.405      -8.507
zn            24.0276      7.087      3.390      0.001      10.103      37.953
chas          15.5179      4.876      3.183      0.002       5.938      25.098
nox          -45.2476      9.206     -4.915      0.000     -63.335     -27.160
rm            60.0245      6.415      9.356      0.000      47.420      72.629
dis          -70.6350      8.789     -8.037      0.000     -87.903     -53.367
rad           58.6248     12.406      4.726      0.000      34.250      83.000
tax          -44.6079     12.772     -3.493      0.001     -69.703     -19.513
ptratio      -46.0495      6.279     -7.334      0.000     -58.387     -33.712
black         19.0611      5.486      3.475      0.001       8.283      29.839
lstat        -83.8570      7.610    -11.019      0.000     -98.810     -68.904
==============================================================================
Omnibus:                      178.430   Durbin-Watson:                   1.078
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              787.785
Skew:                           1.523   Prob(JB):                    8.60e-172
Kurtosis:                       8.300   Cond. No.                         80.5
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The variance inflation factors are obtained by dividing the squared standard errors by the estimate of the residual variance

s_coefficients = ll.bse
s_sigma = ll.scale ** 0.5

vif = s_coefficients[1:] ** 2 / s_sigma ** 2

print(vif)
crim       1.789704
zn         2.239229
chas       1.059819
nox        3.778011
rm         1.834806
dis        3.443420
rad        6.861126
tax        7.272386
ptratio    1.757681
black      1.341559
lstat      2.581984
dtype: float64

The smallest possible VIF value is 1.0; it indicates that the input is not linearly related to the other inputs. Common thresholds are as follows (a short sketch applying them to the VIFs computed above follows the list):

  • 1 < VIF < 10: moderate collinearity
  • 10 < VIF < 30: moderate to severe collinearity
  • VIF > 30: severe collinearity problem
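
A minimal sketch applying these cutoffs, assuming the vif_data DataFrame computed with variance_inflation_factor above; the intercept row is dropped because its value is not a meaningful VIF.

import pandas as pd

# Drop the intercept row and bin the remaining VIFs into the three ranges above
vifs = vif_data[vif_data["Variable"] != "const"].copy()
vifs["severity"] = pd.cut(vifs["VIF"],
                          bins=[1, 10, 30, float("inf")],
                          labels=["moderate", "moderate to severe", "severe"])
print(vifs.sort_values("VIF", ascending=False))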

Condition index and condition number

A formal diagnostic for multicollinearity, based on the eigenvalue decomposition of the scaled-and-centered \(\textbf{X}^*\) matrix, examines the spread of the eigenvalues of \(\textbf{X}^{*\prime}\textbf{X}^*\). If \(\textbf{X}^*\) is a centered and scaled version of \(\textbf{X}\) such that \(\textbf{X}^{*\prime}\textbf{X}^*\) is the empirical correlation matrix, its eigendecomposition is \[ \textbf{X}^{*\prime}\textbf{X}^* = \textbf{Q}\boldsymbol{\Lambda}\textbf{Q}^\prime \] where \(\textbf{Q}\) is a \((p \times p)\) orthogonal matrix of eigenvectors and \(\boldsymbol{\Lambda}\) is a diagonal matrix with the eigenvalues \(\lambda_j\) on its diagonal. The number of eigenvalues close to zero indicates the number of near-linear dependencies in \(\textbf{X}\) (or \(\textbf{X}^*\)). In this centered-and-scaled form the eigenvalues satisfy \(\sum_{j=1}^p \lambda_j = p\), so if some eigenvalues get small, others must get bigger.

The condition index associated with the \(j\)th eigenvalue is \[ \phi_j = \frac{\lambda_{\max}}{\lambda_j} \] and the condition number is \(\max_j(\phi_j) = \lambda_{\max}/\lambda_{\min}\).

library(Matrix)

X <- model.matrix(fit)[,2:12]
X_star <- scale(X,
                center=apply(X,2,mean),
                scale=apply(X,2,sd)*sqrt((n_1)))
XpX_scaled <- t(X_star) %*% X_star
eigen_decomp <- eigen(XpX_scaled)
evals <- eigen_decomp$values

evals
 [1] 4.88096727 1.27924337 1.23924750 0.83617610 0.83188592 0.65049762
 [7] 0.49759783 0.30025208 0.23419413 0.17181685 0.07812133
cat("Sum of eigenvalues: ", sum(evals),"\n")
Sum of eigenvalues:  11 
cond_index <- max(evals)/evals
cat("Condition indices: ", cond_index,"\n")
Condition indices:  1 3.815511 3.938654 5.837248 5.867352 7.503436 9.809061 16.25623 20.84154 28.40797 62.47931 
cond_number <- max(cond_index)
cat("Condition number: ", cond_number,"\n")
Condition number:  62.47931 
import numpy as np
import pandas as pd
from scipy import linalg

X = fit.model.exog[:, 1:12] 

n_1 = X.shape[0] - 1

# Scale the design matrix
X_means = np.mean(X, axis=0)
X_sds = np.std(X, axis=0, ddof=1) * np.sqrt(n_1)
X_star = (X - X_means) / X_sds

XpX_scaled = X_star.T @ X_star

# Perform eigendecomposition
evals, evecs = linalg.eigh(XpX_scaled)

print("Eigenvalues:", evals)
Eigenvalues: [0.07812133 0.17181685 0.23419413 0.30025208 0.49759783 0.65049762
 0.83188592 0.8361761  1.2392475  1.27924337 4.88096727]
print("Sum of eigenvalues: ", sum(evals))
Sum of eigenvalues:  10.999999999999993
cond_index = max(evals) / evals
print("Condition indices: ", cond_index)
Condition indices:  [62.47931219 28.40796641 20.84154415 16.25623145  9.80906065  7.50343606
  5.86735168  5.83724803  3.93865412  3.8155111   1.        ]
cond_number = max(cond_index)
print(f"Condition number: {cond_number:.4f}")
Condition number: 62.4793

Condition indices larger than 900 indicate that near-linear dependencies exist.

Caution

In contrast to variance inflation factors, where the \(j\)th factor is associated with the \(j\)th input, the eigenvalue \(\lambda_j\) is not associated with a particular input. It is associated with a linear combination of all the \(X\)s.
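
To see which inputs participate in a near-linear dependency, one can inspect the eigenvector belonging to the smallest eigenvalue. A sketch reusing evals and evecs from the Python code above (linalg.eigh returns eigenvalues in ascending order, so the first column of evecs belongs to the smallest eigenvalue):

import numpy as np
import pandas as pd

# Names of the scaled inputs (without the intercept)
names = fit.model.exog_names[1:12]

# Eigenvector for the smallest eigenvalue; inputs with large absolute loadings
# are the ones involved in the weakest (near-dependent) dimension
loadings = pd.Series(evecs[:, 0], index=names)
order = loadings.abs().sort_values(ascending=False).index
print(loadings.loc[order].round(3))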

An obvious remedy for multicollinearity is to remove inputs associated with high variance inflation factors and to refit the model. If the variables cannot be removed from the model, a different estimation method is called for. Regularization methods such as Ridge regression or Lasso regression handle high-dimensional problems and reduce the instability of the least-squares estimates by shrinking their values (suppressing the high variability of the coefficients). At the cost of introducing some bias, these estimators drastically reduce variability for an overall better mean squared prediction error.
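
A minimal ridge regression sketch for the Boston data, using scikit-learn (an assumption; the text does not prescribe a particular library). The inputs are standardized so the L2 penalty treats all coefficients on the same scale; the penalty weight alpha below is chosen only for illustration and would normally be selected by cross-validation (e.g., with RidgeCV).

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Inputs and target from the Boston data frame (same columns as the full model)
inputs = ['crim', 'zn', 'chas', 'nox', 'rm', 'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat']
X = Boston[inputs]
y = Boston['medv']

# Standardize the inputs, then shrink the coefficients with an L2 penalty
ridge = make_pipeline(StandardScaler(), Ridge(alpha=10.0))
ridge.fit(X, y)

# Shrunken (standardized-scale) coefficients; compare with the OLS fits above
for name, coef in zip(inputs, ridge.named_steps['ridge'].coef_):
    print(f"{name:>10s}: {coef:8.4f}")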