62  Classical Linear Model

The classical (standard) linear model is \[ \textbf{Y} = \textbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad \quad \boldsymbol{\epsilon} \sim (\textbf{0},\sigma^2\textbf{I}) \] \(\textbf{Y}\) is an \((n \times 1)\) vector of target values, \(\textbf{X}\) is an \((n \times (p+1))\) matrix of \(p\) input variables plus an intercept, and \(\boldsymbol{\epsilon}\) is an \((n \times 1)\) random vector whose elements have mean zero, have the same variance \(\sigma^2\), and are mutually uncorrelated. Expanding the vectors and matrices, the model can be represented as \[ \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} \]

The \(X\)s can be different input variables (features), sets of encoded columns of qualitative variables, or transformations of each other. A polynomial model is a special case where the \(X\)s are powers of the same input variable. For example, a second-order polynomial model is \[ Y_i = \beta_0 + \beta_1x_i + \beta_2 x_i^2 + \epsilon_i \] Such polynomial models can be seen as approximations of a general, unknown relationship between \(Y\) and \(X\). To see this, imagine that \(Y = f(X) + \epsilon\) and we approximate the unknown function \(f(X)\) with a Taylor series about \(x=0\): \[ f(x) = f(0) + \frac{x}{1!} \left.\frac{\partial f}{\partial x}\right|_{0} + \frac{x^2}{2!}\left.\frac{\partial^2 f}{\partial x^2}\right|_{0} + \cdots + \frac{x^r}{r!}\left.\frac{\partial^r f}{\partial x^r}\right|_{0} + R_r(x) \] \(R_r(x)\) is the remainder term of the series. If we drop this term, the approximate model is a polynomial in \(X\): \[ f(0) + \frac{x}{1!} \left.\frac{\partial f}{\partial x}\right|_{0} + \frac{x^2}{2!}\left.\frac{\partial^2 f}{\partial x^2}\right|_{0} + \cdots + \frac{x^r}{r!}\left.\frac{\partial^r f}{\partial x^r}\right|_{0} = \beta_0 + \beta_1x + \beta_2x^2 + \cdots + \beta_rx^r \]
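To make the polynomial-as-linear-model idea concrete, here is a minimal NumPy sketch that fits a quadratic to data simulated from an unknown smooth function; the choice \(f(x)=\sin(x)\), the sample size, and the noise level are assumptions made only for this illustration.

```python
import numpy as np

# Minimal sketch: a second-order polynomial fit expressed as a linear model.
# The data-generating function f(x) = sin(x), the sample size, and the noise
# level are assumptions made only for this illustration.
rng = np.random.default_rng(42)
x = rng.uniform(-1.0, 1.0, size=100)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)   # Y = f(X) + epsilon

X = np.column_stack([np.ones_like(x), x, x**2])      # columns 1, x, x^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)     # OLS fit of the quadratic
print(beta_hat)  # roughly the Taylor coefficients f(0), f'(0), f''(0)/2 of sin(x)
```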

62.1 OLS Estimation

Full Rank Case

If \(\textbf{X}\) is of full rank \(p+1\), the OLS (ordinary least squares) estimator \[ \widehat{\boldsymbol{\beta}} = \left(\textbf{X}^\prime\textbf{X}\right)^{-1} \textbf{X}^\prime\textbf{Y} \] is the best linear unbiased estimator: among all linear unbiased estimators it has the smallest variance (by the Gauss-Markov Theorem). Predicted values and fitted residuals are given by

\[\begin{align*} \widehat{\textbf{y}} &= \textbf{X}\widehat{\boldsymbol{\beta}} = \textbf{H}\textbf{y} \\ \widehat{\boldsymbol{\epsilon}} = \textbf{y} - \widehat{\textbf{y}} &= (\textbf{I} - \textbf{H})\textbf{y} \end{align*}\]

where \(\textbf{H}\) is the “Hat” matrix \(\textbf{H} = \textbf{X}(\textbf{X}^\prime\textbf{X})^{-1}\textbf{X}^\prime\).

The OLS estimator has variance \[ \text{Var}[\widehat{\boldsymbol{\beta}}] = \sigma^2 (\textbf{X}^\prime\textbf{X})^{-1} \]
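A minimal sketch of these full-rank computations with NumPy, using a small simulated data set; the design, coefficients, and error variance are assumptions made purely for the demonstration.

```python
import numpy as np

# Minimal sketch of the full-rank OLS quantities on simulated data.
rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept + p inputs
beta = np.array([2.0, 1.0, -0.5, 0.3])
sigma = 0.7
y = X @ beta + rng.normal(scale=sigma, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y              # OLS estimator
H = X @ XtX_inv @ X.T                     # hat matrix
y_hat = H @ y                             # fitted values
resid = y - y_hat                         # fitted residuals

# Projection properties: X'e = 0 and H is idempotent (up to rounding error).
print(np.allclose(X.T @ resid, 0), np.allclose(H @ H, H))
var_beta_hat = sigma**2 * XtX_inv         # Var[beta_hat] (sigma^2 known here)
```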

We saw in Section 60.6.1 that \(\textbf{H}\) is an orthogonal projection matrix—it is symmetric and idempotent. Because \(\textbf{H}\) is a projection matrix, \(\textbf{I} - \textbf{H}\) is also a projection matrix. To emphasize this fact we can write \(\textbf{Y}\) as the sum of two components \[ \textbf{Y} = \textbf{H}\textbf{Y} + (\textbf{I} - \textbf{H})\textbf{Y} \] This identity simply states that the target variable is projected onto two spaces: the space generated by the columns of \(\textbf{X}\), represented by the projection matrix \(\textbf{H}\), and its complement (residual) space. Furthermore, these projections are orthogonal: \[ \textbf{H} (\textbf{I} - \textbf{H}) = \textbf{H} - \textbf{H}\textbf{H} = \textbf{H} - \textbf{H} = \textbf{0} \]

The projection properties of \(\textbf{H}\) are useful in establishing properties of fitted values and residuals.

\[\begin{align*} \text{1.} &\quad \textbf{X}^\prime \widehat{\boldsymbol{\epsilon}} = \textbf{0} \\ \text{2.} &\quad \widehat{\textbf{Y}}^\prime \widehat{\boldsymbol{\epsilon}} = \textbf{0} \\ \text{3.} &\quad \text{Var}[\widehat{\textbf{Y}}] = \text{Var}[\textbf{H}\textbf{Y}] = \sigma^2\textbf{H} \\ \text{4.} &\quad \text{Var}[\widehat{Y}_i] = \sigma^2 h_{ii} \\ \text{5.} &\quad \text{Var}[\widehat{\boldsymbol{\epsilon}}] = \text{Var}[(\textbf{I}-\textbf{H})\textbf{Y}] = \sigma^2(\textbf{I} - \textbf{H}) \\ \text{6.} &\quad \text{Var}[\widehat{\epsilon}_i] = \sigma^2(1-h_{ii}) \end{align*}\]

Let’s discuss these properties in turn.

  1. The first result states that the fitted residuals are orthogonal to the inputs. This follows from the orthogonality of least squares, but can be shown easily using projection properties \[ \textbf{X}^\prime\widehat{\boldsymbol{\epsilon}} = \textbf{X}^\prime(\textbf{I}-\textbf{H})\textbf{Y} = (\textbf{X}^\prime - \textbf{X}^\prime\textbf{H})\textbf{Y} = (\textbf{X}^\prime-\textbf{X}^\prime)\textbf{Y} = \textbf{0} \] The orthogonality of inputs and residuals implies that for any column in \(\textbf{X}\) \[ \sum_{i=1}^n \, x_{ij}\widehat{\epsilon}_i = 0, \quad j=1,\cdots, p \] This is also true for the intercept column, so that \(\sum_{i=1}^n \widehat{\epsilon}_i = 0\); the fitted residuals sum to zero.

  2. The fitted values are also orthogonal to the residuals: \[ \widehat{\textbf{Y}}^\prime\widehat{\boldsymbol{\epsilon}} = (\textbf{H}\textbf{Y})^\prime(\textbf{I}-\textbf{H})\textbf{Y} = \textbf{Y}^\prime\textbf{H}^\prime(\textbf{I}-\textbf{H})\textbf{Y} = \textbf{Y}^\prime(\textbf{H}-\textbf{H}\textbf{H})\textbf{Y} = \textbf{Y}^\prime\textbf{0}\textbf{Y} = \textbf{0} \]

  3. The variability of the fitted values depends on the irreducible variability \(\sigma^2\) and on the Hat matrix. It does not depend directly on any of the target values in the data set. The Hat matrix is purely a function of the \(X\)s. Notice that \(\textbf{H}\) is not a diagonal matrix. Although the elements of \(\textbf{Y}\) are mutually uncorrelated on account of the elements of \(\boldsymbol{\epsilon}\) being mutually uncorrelated, the fitted values are correlated. This makes sense because they jointly obey constraints, for example \(\widehat{\textbf{Y}}^\prime\widehat{\boldsymbol{\epsilon}} = \textbf{0}\).

  4. The variance of the \(i\)th fitted value depends on the irreducible variability \(\sigma^2\) and on the \(i\)th diagonal element of the Hat matrix, \(h_{ii}\). The \(h_{ii}\) are known as the leverage values.

  5. The variance of the vector of fitted residuals is \(\sigma^2(\textbf{I} - \textbf{H})\). Since \(\textbf{H}\) is not diagonal, the fitted residuals are not uncorrelated, although the unobservable model errors are mutually uncorrelated. Also, the fitted residuals are not homoscedastic, unlike the unobservable model errors. In fact, the only property that carries forward from \(\boldsymbol{\epsilon}\) to \(\widehat{\boldsymbol{\epsilon}}\) seems to be the zero mean: \(\text{E}[\boldsymbol{\epsilon}] = \text{E}[\widehat{\boldsymbol{\epsilon}}] = \textbf{0}\).

  6. The variance of a fitted residual is \(\sigma^2(1-h_{ii})\). Since \(\sigma^2\) is the variance of the model errors, the variability of the \(i\)th residual is smaller than \(\sigma^2\) if \(h_{ii} > 0\). The leverages \(h_{ii}\) cannot be larger than 1, because \(\sigma^2(1-h_{ii})\) is a variance and cannot be negative. In fact, the leverages are bounded, \(1/n \le h_{ii} \le 1\).

OLS does not provide a direct estimate of the error variance \(\sigma^2\). However, an estimator can be constructed easily. Because \(\widehat{\boldsymbol{\epsilon}} = (\textbf{I}-\textbf{H})\textbf{Y}\), consider the quadratic form \[ \text{SSE} = \widehat{\boldsymbol{\epsilon}}^\prime\widehat{\boldsymbol{\epsilon}} = \textbf{Y}^\prime(\textbf{I} - \textbf{H})\textbf{Y} \] where we used the projection properties of \(\textbf{I} - \textbf{H}\) in the expansion. This is simply the sum of the squared fitted residuals, also known as the sum of squared errors or the error sum of squares. Applying the results about expectations of quadratic forms in Section 60.7.1, we find the expected value of SSE as

\[\begin{align*} \text{E}[\text{SSE}] &= \text{E}[\textbf{Y}^\prime(\textbf{I} - \textbf{H})\textbf{Y}] = \text{tr}(\sigma^2(\textbf{I}-\textbf{H})) + \boldsymbol{\beta}^\prime\textbf{X}^\prime(\textbf{I}-\textbf{H})\textbf{X}\boldsymbol{\beta} \\ &= \text{tr}(\sigma^2(\textbf{I}-\textbf{H})) + \boldsymbol{\beta}^\prime(\textbf{X}^\prime\textbf{X} - \textbf{X}^\prime\textbf{H}\textbf{X})\boldsymbol{\beta} \\ &= \sigma^2\text{tr}(\textbf{I}-\textbf{H}) + \boldsymbol{\beta}^\prime(\textbf{X}^\prime\textbf{X} - \textbf{X}^\prime\textbf{X})\boldsymbol{\beta} \\ &= \sigma^2(n-r(\textbf{X})) \end{align*}\]

The final result follows from the fact that \(\textbf{I}-\textbf{H}\) is idempotent; its trace is equal to its rank, which is \(n\) minus the rank of \(\textbf{X}\). The natural estimator of \(\sigma^2\) is thus \[ \widehat{\sigma}^2 =\frac{\text{SSE}}{n-r(\textbf{X})} \] This estimator is unbiased for \(\sigma^2\) if the assumption \(\boldsymbol{\epsilon} \sim (\textbf{0},\sigma^2\textbf{I})\) is met—in other words, it does not require a Gaussian assumption.
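A short self-contained sketch of this estimator on simulated data; note that `numpy.linalg.lstsq` returns the residual sum of squares and the rank of \(\textbf{X}\) directly.

```python
import numpy as np

# Self-contained sketch of the error variance estimator SSE / (n - r(X));
# the simulated design and coefficients are arbitrary.
rng = np.random.default_rng(2)
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(scale=2.0, size=n)

# lstsq returns the residual sum of squares (SSE) and the rank of X directly.
beta_hat, sse, rank, _ = np.linalg.lstsq(X, y, rcond=None)
sigma2_hat = sse[0] / (n - rank)   # unbiased for sigma^2 (here sigma^2 = 4)
print(sigma2_hat)
```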

Coefficient Interpretation

Suppose we trained a linear model to predict the mileage of cars from various attributes about the vehicles. Observational data was collected on a number of automobiles and the following predictive equation was derived (\(\widehat{Y} = \textbf{x}^\prime\widehat{\boldsymbol{\beta}}\)).

\[\begin{align*} \widehat{\text{mpg}} = 45.7587 &-0.3933\times\text{cylinders} + 0.00013\times\text{displacement} \\ &-0.0053\times \text{weight} -0.0428\times\text{horsepower} \end{align*}\]

How do we interpret the regression coefficients of the trained model?

Since the model is linear, it is tempting to state that, for example, a change in 1 unit of displacement causes a change of 0.00013 in miles per gallon. This interpretation is not correct, because

  1. We cannot conclude causality between inputs and the target variable. The data are purely observational so we can at best state that changes in the input variables are associated with different predicted values for miles per gallon.

  2. We cannot interpret one input variable in the absence of the others. The signs of the regression coefficients are somewhat counterintuitive. Why would mileage go down for cars with more cylinders but go up with greater displacement? Does adding cylinders not imply a larger engine displacement? The point is that the inputs are related to each other; they do not vary freely from each other. When we interpret the magnitude of a regression coefficient in terms of the change in the target variable that corresponds to a unit change in the input variable, we are implicitly holding all other predictors fixed (ceteris paribus).

The correct interpretation of the displacement coefficient is thus

when displacement increases by one cubic inch and all other attributes remain constant, the expected mileage increases by 0.00013 miles per gallon.

This is known as the partial interpretation of the coefficients. See below for the concept of partial tests of the coefficients. A partial interpretation might not make sense if inputs cannot be varied freely of each other.

Rank Deficient Case

If \(\textbf{X}\) is not of full column rank, the regular inverse \((\textbf{X}^\prime\textbf{X})^{-1}\) does not exist and the OLS solution cannot be computed as above. Suppose that \(\textbf{X}\) is an \((n \times (p+1))\) matrix of rank \(r(\textbf{X}) = k < p+1 < n\). As long as we can find any solution to the normal equations we can make progress, even if that solution is not unique.

There are several ways in which the rank deficiency can be addressed:

  1. Change the model to one of full rank. If we know which columns cause the loss of rank, then we can eliminate them from the \(\textbf{X}\) matrix. For example, in a model with an intercept and a qualitative \(k\)-level input variable, one-hot encoding the values as \(k\) columns of zeros and ones will lead to a rank deficiency. Removing one of the columns and coding only \(k-1\) levels will remove the problem. The level eliminated from the design matrix then serves as the reference level for the other categories.

  2. Impose identifiability constraints. This method consists of adding a set of constraints \(\textbf{A}\boldsymbol{\beta} = \textbf{0}\) to take up the slack in \(\boldsymbol{\beta}\).

  3. Use a generalized inverse \((\textbf{X}^\prime\textbf{X})^{-}\) to solve the unmodified normal equations.

We are discussing the third approach here. The normal equations \[ \textbf{X}^\prime\textbf{X}\boldsymbol{\beta} = \textbf{X}^\prime\textbf{Y} \] are a consistent set of linear equations and can be solved by using any generalized inverse \((\textbf{X}^\prime\textbf{X})^{-}\) of \(\textbf{X}^\prime\textbf{X}\): \[ \widehat{\boldsymbol{\beta}} = (\textbf{X}^\prime\textbf{X})^{-}\textbf{X}^\prime\textbf{Y} \] The ratio of the error sum of squares and the error degrees of freedom remains an unbiased estimator of \(\sigma^2\) in this situation: \[ \widehat{\sigma}^2 = \frac{\text{SSE}}{n-r(\textbf{X})} = \frac{\text{SSE}}{n-k} \]

Estimable functions

While the regular inverse \(\textbf{A}^{-1}\) of a matrix \(\textbf{A}\) is unique, there are many generalized inverses. The solution \(\widehat{\boldsymbol{\beta}}\) depends on which particular g-inverse is used. This raises an important question: are there functions of the parameters \(\boldsymbol{\beta}\) that can be estimated unambiguously, where the choice of g-inverse is irrelevant?

The answer is Yes!
Estimable functions can be estimated unambiguously. A function \(\textbf{L}\boldsymbol{\beta}\) is estimable if \(\textbf{L}\) lies in the row space of \(\textbf{X}\), meaning that the rows of \(\textbf{L}\) can be generated from the rows of \(\textbf{X}\) (\(\textbf{L} = \textbf{T}\textbf{X}\) for some matrix \(\textbf{T}\)). Put differently, the rows of \(\textbf{X}\) serve as a generating set for the set of estimable functions. That means, of course, that \(\textbf{X}\) itself is part of that set and hence \(\textbf{X}\boldsymbol{\beta}\) is estimable. Phew! This proves that, although the OLS estimator \(\widehat{\boldsymbol{\beta}}\) is not unique if \(\textbf{X}\) is rank deficient, the predicted values \(\textbf{X}\widehat{\boldsymbol{\beta}}\) are unique. Regardless of which generalized inverse you choose, you get the same fitted (predicted) values.
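The following sketch illustrates both facts on a small simulated example with an intercept plus a one-hot encoded three-level factor: different generalized inverses give different coefficient vectors, but identical fitted values. The particular g-inverses used here (the Moore-Penrose inverse and a block-inverse construction) are just convenient choices for the demonstration.

```python
import numpy as np

# Sketch of the rank-deficient case: an intercept plus a one-hot encoded
# three-level factor makes X'X singular. Different g-inverses give different
# coefficient vectors but identical fitted values.
rng = np.random.default_rng(3)
levels = np.repeat(np.arange(3), 10)                      # 30 observations, 3 levels
X = np.column_stack([np.ones(30), np.eye(3)[levels]])     # rank 3, not 4
y = 1.0 + np.array([0.0, 0.5, -0.5])[levels] + rng.normal(scale=0.3, size=30)

XtX = X.T @ X
b_pinv = np.linalg.pinv(XtX) @ X.T @ y                    # Moore-Penrose g-inverse
G = np.zeros((4, 4))                                      # another g-inverse: invert a
G[1:, 1:] = np.linalg.inv(XtX[1:, 1:])                    # full-rank 3x3 block, pad zeros
b_other = G @ X.T @ y

print(np.allclose(b_pinv, b_other))          # False: the two solutions differ
print(np.allclose(X @ b_pinv, X @ b_other))  # True: the fitted values agree
```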

Testable hypotheses

The ambiguity in the choice of generalized inverse also goes away if we test hypotheses about \(\boldsymbol{\beta}\), such as \(H:\textbf{A}\boldsymbol{\beta} = \textbf{d}\), provided that \(\textbf{A}\boldsymbol{\beta}\) is estimable. We say in that case that the hypothesis \(H:\textbf{A}\boldsymbol{\beta} = \textbf{d}\) is testable.

Before we can describe a general procedure for testing hypotheses in the classic linear model, we need to discuss some distributional results, and we need to introduce an additional assumption: the errors are Gaussian distributed.

62.2 Partitioning Variability

We have so far encountered the error sum of squares \[ \text{SSE} = \sum_{i=1}^n \widehat{\epsilon}_i^2 = \sum_{i=1}^n (Y_i - \widehat{Y}_i)^2 = \textbf{Y}^\prime\textbf{Y} - \textbf{Y}^\prime\textbf{X}\widehat{\boldsymbol{\beta}} \] Several other sum of squares terms are important in analyzing linear models. The total sum of squares \[ \text{SST} = \sum_{i=1}^n (Y_i - \overline{Y})^2 = \textbf{Y}^\prime\textbf{Y} - n\overline{Y}^2 \] measures the total variability in the target variable and does not depend on the input variables in the model. SSE, on the other hand, depends on the structure of the model and on the chosen input variables. The difference between the two is called the model sum of squares (SSM) \[ \text{SSM} = \text{SST} - \text{SSE} = \textbf{Y}^\prime\textbf{X}\widehat{\boldsymbol{\beta}} - n\overline{Y}^2 \]

SSM and SST are also known as the corrected model and total sum of squares, due to the term \(n\overline{Y}^2\), which adjusts for the mean response. SST measures total variability about the mean of the data.

Important

The sums of squares are random variables because they depend on \(\textbf{Y}\). When we investigate the stochastic behavior (mean, variance, distribution) of a sum of squares, we use capital notation for the target variable. For example, \[ \text{SSE} = \sum_{i=1}^n (Y_i - \widehat{Y}_i)^2 \] refers to a random variable. When we are concerned with the actual value for the error sum of squares based on a set of \(n\) observations, we use lowercase notation. \[ \text{SSE} = \sum_{i=1}^n (y_i - \widehat{y}_i)^2 \] is the value of SSE in a particular analysis.

Each sum of square is associated with a number of degrees of freedom, which represent the number of independent pieces of information consumed in its calculation. For example, SST is a sum over \(n\) terms, but one degree of freedom is lost because \(\overline{Y}\) is calculated from the data, and it depends on the sum of the \(Y_i\). If \(\overline{Y}\) is given, you can only choose \(n-1\) values freely, the final one is determined since the average has to be \(\overline{Y}\).

Similarly, the number of free terms in the SSE calculation is \(n\) minus the rank of \(\textbf{X}\); we lose one degree of freedom for every coefficient that is estimated.

When error and model sum of squares are divided by their respective degrees of freedom, we get the mean square terms for these sources of variability. No mean square is associated with the total source of variability.

The degrees of freedom, sums of squares, and mean squares are typically arranged in an analysis of variance (ANOVA) table (Table 62.1).

Table 62.1: The classical analysis of variance (ANOVA) table
| Source | df | SS | MS |
|--------|----|----|----|
| Model | \(\text{dfM} = r(\textbf{X})-1\) | \(\text{SSM} = \textbf{Y}^\prime\textbf{X}\widehat{\boldsymbol{\beta}} - n\overline{Y}^2\) | \(\text{MSM} = \text{SSM}/\text{dfM}\) |
| Error | \(\text{dfE} = n-r(\textbf{X})\) | \(\text{SSE} = \textbf{Y}^\prime\textbf{Y} - \textbf{Y}^\prime\textbf{X}\widehat{\boldsymbol{\beta}}\) | \(\text{MSE} = \text{SSE}/\text{dfE}\) |
| Total | \(\text{dfT} = n-1\) | \(\text{SST} = \textbf{Y}^\prime\textbf{Y} - n\overline{Y}^2\) | |

Sequential and Partial Sums of Squares

Suppose our model contains an intercept, \(p\) predictor variables, and \(\textbf{X}\) is of full rank \(p+1\). The degrees of freedom are defined as the number of independent contributions to a sum of squares. What are the \(p\) components from which SSM in Table 62.1 can be assembled?

Imagine that we start with an “empty” model, one that has only an intercept. To this model we add \(x_1\). To that single predictor model we add \(x_2\), and so forth. At each stage we measure the increase of SSM as another input variable is added. Denote as \(\text{SS}(\boldsymbol{\beta}^* | \boldsymbol{\beta})\) the increase in SSM when the inputs corresponding to \(\boldsymbol{\beta}^*\) are added to a model that contains the input variables corresponding to \(\boldsymbol{\beta}\). For example, \[ \text{SS}(\beta_2,\beta_3 | \beta_0, \beta_1) \] is the increase in SSM when \(x_2\) and \(x_3\) are added to a model that contains an intercept and \(x_1\). This has to be an increase because adding terms to a model reduces SSE (or at least does not change it), while SST is a constant.

The model sum of squares in Table 62.1 can be written in this notation as \(\text{SSM} = \text{SS}(\beta_1, \cdots, \beta_p | \beta_0)\), the joint contribution of all \(p\) input variables.

With this notation we can establish a type of sum of squares arithmetic.

\[ \begin{align*} \text{SSM} &= \text{SS}(\beta_1 | \beta_0) \\ &+ \text{SS}(\beta_2 | \beta_0, \beta_1) \\ &+ \text{SS}(\beta_3 | \beta_0, \beta_1, \beta_2) \\ &+ \cdots \\ &+ \text{SS}(\beta_p | \beta_0, \beta_1, \cdots, \beta_{p-1}) \end{align*} \tag{62.1}\]

The terms on the right hand side of Equation 62.1 are called sequential sums of squares as variables enter the model sequentially. The model sum of squares can be partitioned into the sequential contributions as each variable is added to a model containing its preceding terms.

The sequential sums of squares always add up to the model sum of squares, but the individual contributions depend on the order in which the variables enter the model: \[ \text{SS}(\beta_p | \beta_0) \ne \text{SS}(\beta_p | \beta_0, \beta_1, \cdots, \beta_{p-1}) \] When the input variables are orthogonal, that is, \(\textbf{X}^\prime\textbf{X}\) is a diagonal matrix, the order of the variables is immaterial. In that case \[ \text{SS}(\beta_j | \beta_0) = \text{SS}(\beta_j | \beta_0, \beta_1) = \text{SS}(\beta_j | \forall \beta_k, k \ne j) \]
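A small simulated sketch of this sum of squares arithmetic: the sequential sums of squares add up to SSM, and the contribution of a variable changes with the order of entry when the inputs are correlated. The simulated inputs and coefficients are arbitrary.

```python
import numpy as np

# Sketch of sequential sums of squares with two correlated inputs.
rng = np.random.default_rng(4)
n = 80
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(scale=0.8, size=n)     # x2 correlated with x1
y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)

def sse(cols):
    """Error sum of squares for a model with an intercept plus the given columns."""
    X = np.column_stack([np.ones(n)] + cols)
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - yhat) ** 2)

sst = np.sum((y - y.mean()) ** 2)                 # = sse([]), the intercept-only SSE
ss_x1 = sse([]) - sse([x1])                       # SS(b1 | b0)
ss_x2_after_x1 = sse([x1]) - sse([x1, x2])        # SS(b2 | b0, b1)
ssm = sst - sse([x1, x2])
print(np.isclose(ss_x1 + ss_x2_after_x1, ssm))          # True: sequential SS add to SSM
print(np.isclose(sse([]) - sse([x2]), ss_x2_after_x1))  # False: order of entry matters
```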


A partial sum of squares measures the change in model sum of squares between two models that are nested. The inputs in one model are a subset of the inputs in the other model. For example \(\text{SS}(\beta_2, \beta_3 | \beta_0, \beta_1)\) is the partial sum of squares for adding \(x_2\) and \(x_3\) to a model with intercept and \(x_1\). This can be written as the difference between the sums of squares of two models: \[ \text{SS}(\beta_2, \beta_3 | \beta_0, \beta_1) = \text{SS}(\beta_1, \beta_2, \beta_3 | \beta_0) - \text{SS}(\beta_1 | \beta_0) \] The models represented on the right hand side are called the full and the reduced model. You can derive the reduced model from the full model by imposing a constraint on the full model. In this example the constraint is \(\beta_2= \beta_3 = 0\).

The partial sums of squares for individual predictor variables in a \(p\)-predictor model are \[ \begin{align*} \text{SS}(\beta_1 | \beta_0,\beta_2,\beta_3,\cdots,\beta_p) \\ \text{SS}(\beta_2 | \beta_0,\beta_1,\beta_3,\cdots,\beta_p) \\ \text{SS}(\beta_3 | \beta_0,\beta_1,\beta_2,\beta_4,\cdots,\beta_p) \\ \vdots \\ \text{SS}(\beta_p | \beta_0,\beta_1,\beta_2,\cdots,\beta_{p-1}) \\ \end{align*} \tag{62.2}\]

Unless the input variables are orthogonal, the individual partial sums of squares in Equation 62.2 do not add up to anything meaningful. Each sum of squares represents the contribution of a variable in the presence of all others.

Partial and sequential sums of squares play an important role in testing hypotheses in the classical linear model. More on this below.

Coefficient of Determination, \(R^2\)

The variability in the target variable \(Y\) can be estimated as \[ s^2 = \frac{1}{n-1}\sum_{i=1}^n (Y_i - \overline{Y})^2 \] If the \(Y_i\) had the same mean, this would be an unbiased estimator of \(\text{Var}[Y]\). However, the regression model states that the mean of \(Y\) is a function of the \(X\)s. The variability captured by \(s^2\) includes the systematic effects of \(X\) on \(\text{E}[Y]\), it is thus a biased estimator of \(\text{Var}[Y] = \sigma^2\). You recognize the numerator of \(s^2\) as the total sum of squares (SST). If SST captures variability of \(Y\) about a constant mean, how much of this is attributable to the input variables? To answer this we can look at the variability not attributable to the \(X\)s, the error sum of squares \[ \text{SSE} = \sum_{i=1}^n \widehat{\epsilon}_i^2 = \sum_{i=1}^n (Y_i - \widehat{Y}_i)^2 \] The ratio \[ R^2 = \frac{\text{SST}-\text{SSE}}{\text{SST}}=1-\frac{\text{SSE}}{\text{SST}} \] is known as the coefficient of determination or R-squared. The name R-squared comes from a simple relationship between \(R^2\) and the Pearson correlation coefficient in the simple linear regression case. If \(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\), then \(R^2\) is the square of the correlation coefficient between \(X\) and \(Y\): \[ R^2 = r_{xy}^2 \] In the multiple linear regression case \(R^2\) retains the interpretation as a squared correlation, between the observed and the predicted values: \[ R^2 = r_{y\widehat{y}}^2 \] This interpretation also holds in the simple case.
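A quick numerical check of this relationship on simulated data; the design and coefficients below are arbitrary choices for the illustration.

```python
import numpy as np

# Sketch: R^2 computed from SSE/SST equals the squared correlation between
# observed and fitted values.
rng = np.random.default_rng(5)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat
sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - sse / sst
r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2
print(np.isclose(r2, r2_corr))    # True
```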


\(R^2\) ranges between 0 and 1; it achieves the lower bound \(R^2=0\) if SSE = SST, that is, when the input variables do not explain any variability in \(Y\). \(R^2\) equals one when SSE is zero; the model then fits the data “perfectly” and interpolates the \(y_i\).

The straightforward interpretation of \(R^2\) as the proportion of variability explained by the input variables unfortunately can lead one to chase models that have a high \(R^2\). This is a terrible practice for a number of reasons.

  • The value of \(R^2\) does not tell us whether the model assumptions are met. You can explain a substantial amount of variability in the data with a seriously deficient model.

  • \(R^2\) is a function of SSE, the prediction error on the training data set. This can be made arbitrarily small by adding input variables. If \(R^2_{\text{cur}}\) is the coefficient of determination in a linear regression model, and you add a new predictor \(x_{p+1}\), then \[ R^2_{\text{new}} \ge R^2_{\text{cur}} \] Predictors that make no relevant contribution will increase \(R^2\) because their addition still reduces SSE.

  • A model that interpolates the data (fits “perfectly”) is not a good model. It is certainly not perfect if the goal is to build a model that generalizes well to unseen observations. Different metrics are needed to develop models that generalize and do not overfit the data.

Chasing \(R^2\) values invariably leads to overfitting and models that memorize too much of the training observations to perform well on new data.

62.3 Gaussian Distribution

The previous results do not require a Gaussian assumption for the \(\boldsymbol{\epsilon}\). If the distributional properties of the model errors are sharpened from \(\boldsymbol{\epsilon} \sim (\textbf{0},\sigma^2\textbf{I})\) to \[ \boldsymbol{\epsilon} \sim G(\textbf{0},\sigma^2\textbf{I}) \] then the target variable is also Gaussian distributed because it is a linear combination of the \(\boldsymbol{\epsilon}\): \[ \textbf{Y} \sim G(\textbf{X}\boldsymbol{\beta},\sigma^2\textbf{I}) \] Since the OLS estimator \(\widehat{\boldsymbol{\beta}} = (\textbf{X}^\prime\textbf{X})^{-1}\textbf{X}^\prime\textbf{Y}\) is a linear combination of \(\textbf{Y}\), it follows that \(\widehat{\boldsymbol{\beta}}\) also has a multivariate Gaussian distribution: \[ \widehat{\boldsymbol{\beta}} \sim G(\boldsymbol{\beta},\sigma^2(\textbf{X}^\prime\textbf{X})^{-1}) \] By the same token, \(\widehat{\textbf{Y}} = \textbf{H}\textbf{Y}\) and \(\widehat{\boldsymbol{\epsilon}} = (\textbf{I}-\textbf{H})\textbf{Y}\) are Gaussian distributed:

\[\begin{align*} \widehat{\textbf{Y}} &\sim G(\textbf{X}\boldsymbol{\beta},\sigma^2\textbf{H}) \\ \widehat{\boldsymbol{\epsilon}} &\sim G(\textbf{0},\sigma^2(\textbf{I}-\textbf{H})) \end{align*}\]

Testing Hypotheses

Suppose we wish to test the hypothesis \(H: \textbf{A}\boldsymbol{\beta} = \textbf{d}\) where \(\textbf{A}\) is a known \((k \times (p+1))\) matrix of rank \(k\). For example, in a randomized complete block design with four treatments in three blocks, denote the parameters \(\boldsymbol{\beta} = [\beta_0, \rho_1,\rho_2,\rho_3, \tau_1, \tau_2, \tau_3, \tau_4]\). The \(\rho\) parameters correspond to block effects, the \(\tau\) parameters to the treatment effects. The hypothesis of absence of any treatment effects is \(H: \tau_1 = \tau_2 = \tau_3 = \tau_4\). Equivalently, it can be expressed as the simultaneous equations:

\[\begin{align*} \tau_1 &= \tau_2 \\ \tau_1 + \tau_2 &= 2 \tau_3 \\ \tau_1 + \tau_2 + \tau_3 &= 3\tau_4 \end{align*}\]

The matrix of treatment contrasts for the hypothesis is \[ \textbf{A} = \left \lbrack \begin{array}{rrrrrrrr} 0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & -2 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & -3 \\ \end{array} \right \rbrack \] and \(\textbf{d} = \textbf{0}\).

Sum of squares reduction test

A test statistic for \(H: \textbf{A}\boldsymbol{\beta} = \textbf{d}\) can be constructed by considering two models: the full model and a reduced model subject to \(H\).

The difference in the error sum of squares between these two models is \[ \text{SSE}_r - \text{SSE}_f = \left(\textbf{A}\widehat{\boldsymbol{\beta}} - \textbf{d}\right)^\prime \left[\textbf{A}(\textbf{X}^\prime\textbf{X})^{-1}\textbf{A}^\prime\right]^{-1} \left(\textbf{A}\widehat{\boldsymbol{\beta}} - \textbf{d}\right) \tag{62.3}\]

The error sum of squares in the reduced model, \(\text{SSE}_r\), is at least as large as \(\text{SSE}_f\), since the hypothesis acts as a constraint on the parameters. Under the assumption of Gaussian errors, the test statistic \[ F = \frac{(\text{SSE}_r-\text{SSE}_f)/k}{\text{SSE}_f/(n-r(\textbf{X}))} = \frac{(\text{SSE}_r-\text{SSE}_f)/k}{\text{MSE}_f} \] has an \(F\) distribution with \(k\) numerator and \(n-r(\textbf{X})\) denominator degrees of freedom. Notice that the term in the denominator is simply the error mean square from the full model. This test is known as the sum of squares reduction test. We could have used the error mean square from the reduced model in the denominator of \(F\). We choose the larger model, because its error mean square is an unbiased estimator whether \(H\) is true or not.

In case \(\textbf{X}\) is rank deficient, the inverse in Equation 62.3 is replaced with a generalized inverse, and the hypothesis can be tested as long as \(\textbf{A}\boldsymbol{\beta}\) is estimable.

In the Gaussian case discussed here, the sum of squares reduction test is also the likelihood ratio test.
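Here is a sketch of the sum of squares reduction test for a hypothesis that sets two coefficients to zero; SciPy's F distribution supplies the \(p\)-value, and the simulated data and coefficients are purely illustrative.

```python
import numpy as np
from scipy import stats

# Sketch of the sum of squares reduction test for H: beta_2 = beta_3 = 0,
# comparing a full model (x1, x2, x3) against the reduced model (x1 only).
rng = np.random.default_rng(6)
n = 60
Xfull = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = Xfull @ np.array([1.0, 0.8, 0.0, 0.4]) + rng.normal(size=n)
Xred = Xfull[:, :2]                                  # intercept and x1 only

def sse(X):
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - yhat) ** 2)

k = Xfull.shape[1] - Xred.shape[1]                   # number of constraints
dfe = n - Xfull.shape[1]                             # n - r(X) for the full model
F = ((sse(Xred) - sse(Xfull)) / k) / (sse(Xfull) / dfe)
p_value = stats.f.sf(F, k, dfe)                      # upper-tail F probability
print(F, p_value)
```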

Special Cases

Here are some special cases of the sum of squares reduction test.

ANOVA tests

The hypothesis \(H: \beta_1 = \beta_2 = \cdots = \beta_p = 0\) can be tested based on the overall ANOVA table (Table 62.1). The difference \(\text{SSE}_r - \text{SSE}_f\) corresponds to the partial sum of squares \(\text{SSM} = \text{SS}(\beta_1, \cdots, \beta_p | \beta_0)\).

The \(F\) test for the simultaneous absence of any variable contributions has test statistic \[ F = \frac{\text{MSM}}{\text{MSE}} \] which follows an \(F_{\text{dfM},\text{dfE}}\) distribution under \(H\).

Partial tests

A special case of the sum of square reduction test in the previous section is \(\textbf{A} = [0,0,0,1,0,0], \textbf{d}=0\), a test of \(H: \beta_j = 0\) (where the 1 is in the \((j+1)\)st position of \(\textbf{A}\)). This is a sum of squares reduction test with \[ \text{SSE}_r - \text{SSE}_f = \text{SS}(\beta_j | \beta_0, \text{ all others }) \] It can be shown that the test statistic reduces to \[ F = \frac{\widehat{\beta}_j^2}{\text{ese}(\widehat{\beta}_j)^2} \] where \[ \text{ese}(\widehat{\beta}_j) = \sqrt{\widehat{\sigma}^2 \ c_{jj}} \] is the estimated standard error of \(\widehat{\beta}_j\) and \(c_{jj}\) is the \(j\)th diagonal value of \((\textbf{X}^\prime\textbf{X})^{-1}\).

This statistic has an \(F_{1,n-r(\textbf{X})} = F_{1,\text{dfE}}\) distribution; consequently its square root follows a \(t_{n-r(\textbf{X})} = t_{\text{dfE}}\) distribution. An alternative test statistic for \(H:\beta_j = 0\) is thus given by \[ T = \frac{\widehat{\beta}_j}{\text{ese}(\widehat{\beta}_j)} \]

These test statistics are commonly reported by statistical software for all model coefficients, accompanied by \(p\)-values for the two-sided alternative \(H:\beta_j \ne 0\), and/or confidence limits.
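A sketch of how such a coefficient table can be computed directly from the formulas above; the simulated data are arbitrary and SciPy provides the \(t\) tail probabilities.

```python
import numpy as np
from scipy import stats

# Sketch of the partial t-tests T = beta_hat_j / ese(beta_hat_j) for each
# coefficient, with two-sided p-values.
rng = np.random.default_rng(7)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 0.8, 0.0, 0.4]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
dfe = n - X.shape[1]                              # n - r(X)
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / dfe
ese = np.sqrt(sigma2_hat * np.diag(XtX_inv))      # estimated standard errors
t_stats = beta_hat / ese
p_values = 2 * stats.t.sf(np.abs(t_stats), dfe)   # two-sided p-values
print(np.column_stack([beta_hat, ese, t_stats, p_values]))
```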

These are partial tests, meaning that they test a coefficient against zero, given all the other inputs in the model. For example, a partial test of \(H:\beta_2 = 0\) in the model \[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \] tests whether \(x_2\) is an important predictor in the presence of \(x_1\) and \(x_3\). While \(x_2\) might be an important variable by itself, because it is correlated with \(x_1\), its partial contribution might be small. The same can hold for \(x_1\).

Note

Imagine that the \(p\)-values of variables cylinders and displacement in the miles per gallon model are 0.33 and 0.58, respectively. That would suggest that these variables do not make significant contributions toward explaining miles per gallon. The \(p\)-values are interpreted in the presence of the other variables in the model.

The correct interpretation is that the number of cylinders is not a significant predictor of miles per gallon in a model that accounts for engine displacement, vehicle weight, and horsepower and similarly for the displacement variable.

Sequential tests

A partial test evaluates the significance of a predictor variable in the presence of all other variables in the model (conditional on all others). A sequential test evaluates the significance of a predictor variable conditional on the predictor variables that precede it in the model.

The sum of squares reduction test for the sequential test of \(H:\beta_j = 0\) uses \[ \text{SSE}_r - \text{SSE}_f = \text{SS}(\beta_j | \beta_0, \beta_1, \cdots, \beta_{j-1}) \] and \[ F = \frac{\text{SS}(\beta_j | \beta_0, \beta_1, \cdots, \beta_{j-1})}{\text{MSE}_f} \] with an \(F_{1,\text{dfE}}\) reference distribution.

Confidence Intervals for \(\beta_j\)

Based on the pivot statistic for the partial \(t\)-test we can construct a confidence interval for \(\beta_j\). The \((1-\alpha)\times 100\%\) confidence interval for \(\beta_j\) is given by \[ \widehat{\beta}_j \pm t_{\alpha/2,n-r(\textbf{X})}\times \text{ese}(\widehat{\beta}_j) \]

62.4 Prediction

To obtain the fitted values in the linear model \(\textbf{Y} = \textbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\) we apply the OLS estimates and get \[ \widehat{\textbf{Y}} = \textbf{X}\widehat{\boldsymbol{\beta}} \] To predict at a previously unobserved value of the inputs, a new observation \(\textbf{x}_0\), we apply the regression equation and the OLS estimates to \(\textbf{x}_0\): \[ \widehat{Y}_0 = \textbf{x}_0^\prime \widehat{\boldsymbol{\beta}} \]

This seems simple; surely there must be a trick to it. The “trick” lies in the question: what is being predicted?

The target of the prediction (the what) can be \(Y\), the target variable, or \(\text{E}[Y]\), the mean of the target variable. How are these different? \(Y\) is a random variable and \(\text{E}[Y]\) is a constant; \(\text{Var}[Y] = \sigma^2\) and \(\text{Var}[\text{E}[Y]] = 0\). The uncertainty in predicting an individual observation and predicting the mean function differ. The former has to take into account the inherent variability \(\sigma^2\) in the data.

How are the predicted values themselves different?

For illustration, consider an SLR model \(Y = \beta_0 + \beta_1x + \epsilon\). To predict \(Y\) at \(X=x_0\) we choose the obvious expression \[ \widehat{Y}_0 = \widehat{\beta}_0 + \widehat{\beta}_1x_0 + \widehat{\epsilon} \] substituting estimates for all unknowns on the right hand side. Since \(\epsilon\) is a zero-mean random variable and cannot be observed directly, the best estimate is \(\widehat{\epsilon} = \text{E}[\epsilon] = 0\) which leads to \[ \widehat{Y}_0 = \widehat{\beta}_0 + \widehat{\beta}_1x_0 \]

That is the same expression we use to predict \(\text{E}[Y] = \beta_0 + \beta_1 x_0\), substituting the OLS estimates for unknowns on the right hand side: \[ \widehat{\text{E}}[Y_0] = \widehat{\beta}_0 + \widehat{\beta}_1x_0 \] And therein lies the crux. The predicted values for an observation and for the mean of an observation are the same. But their variability is not the same. We need to be very clear about what it is we are shooting for. More frequently one is interested in predicting observations, not averages of observations. In a study of health outcomes over time you might be more interested in predicting how a patient does at time \(t\), rather than how the population of patients does on average at time \(t\). Yet when folks see the different levels of confidence we have in the two predictions, they wish they could make predictions for the average.

Prediction Variance

The variance of \(\widehat{\text{E}}[Y_0] = \textbf{x}_0^\prime\widehat{\boldsymbol{\beta}}\) is straightforward and depends only on the variance of \(\widehat{\boldsymbol{\beta}}\): \[ \text{Var}[\widehat{\text{E}}[Y_0]] = \sigma^2 \, \textbf{x}_0^\prime(\textbf{X}^\prime\textbf{X})^{-1}\textbf{x}_0 \] To account for the variability in the data, the proper variance to consider when predicting an individual observation is the variance of the prediction error \(\widehat{Y}_0 - Y_0\). Because the error \(\epsilon_0\) of the new observation is independent of the training data used to compute \(\widehat{Y}_0\), the two sources of variability add: \[ \text{Var}[\widehat{Y}_0 - Y_0] = \text{Var}[\widehat{Y}_0] + \text{Var}[\epsilon_0] = \sigma^2\left(1+\textbf{x}_0^\prime(\textbf{X}^\prime\textbf{X})^{-1}\textbf{x}_0\right) \]

The additional \(1+\) inside the parentheses does not seem like a big deal but has substantial numeric consequences.

Confidence and Prediction Intervals

When the errors have a Gaussian distribution, the following random variables have \(t\) distributions with \(n-r(\textbf{X})\) degrees of freedom:

\[ \begin{align*} T &= \frac{\widehat{Y}_0 - \text{E}[Y_0]}{\sqrt{\widehat{\text{Var}}[\widehat{Y}_0]}} \\ T &= \frac{\widehat{Y}_0 - Y_0}{\sqrt{\widehat{\text{Var}}[\widehat{Y}_0 - Y_0]}} \end{align*} \]

The first is used to construct \((1-\alpha)\times 100\%\)-level confidence intervals for the mean of the target \[ \widehat{y}_0 \pm t_{\frac{\alpha}{2},n-r(\textbf{X})} \sqrt{\widehat{\sigma}^2 \, \textbf{x}_0^\prime(\textbf{X}^\prime\textbf{X})^{-1}\textbf{x}_0} \] The second is used to construct \((1-\alpha)\times 100\%\)-level prediction intervals for an individual target \[ \widehat{y}_0 \pm t_{\frac{\alpha}{2},n-r(\textbf{X})} \sqrt{\widehat{\sigma}^2 \left(1+ \textbf{x}_0^\prime(\textbf{X}^\prime\textbf{X})^{-1}\textbf{x}_0 \right)} \]
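A sketch of both intervals at a single new input value \(x_0\), using simulated single-predictor data and the formulas above; note how much wider the prediction interval is.

```python
import numpy as np
from scipy import stats

# Sketch of a 95% confidence interval for E[Y_0] and a 95% prediction
# interval for Y_0 at a new input x0.
rng = np.random.default_rng(8)
n = 40
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = 2.0 + 0.5 * x + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
dfe = n - X.shape[1]
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / dfe
tcrit = stats.t.ppf(0.975, dfe)

x0 = np.array([1.0, 5.0])                         # intercept, x = 5
y0_hat = x0 @ beta_hat
lev0 = x0 @ XtX_inv @ x0                          # x0'(X'X)^{-1} x0
ci = y0_hat + np.array([-1, 1]) * tcrit * np.sqrt(sigma2_hat * lev0)
pi = y0_hat + np.array([-1, 1]) * tcrit * np.sqrt(sigma2_hat * (1 + lev0))
print(ci, pi)                                     # the prediction interval is wider
```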

62.5 Diagnostics

Model diagnostics are useful to examine model assumptions and to decide whether the model is an adequate description of the data. Questions we would like to answer through model diagnostics include

  • Are the assumptions of the model met?

    • Linearity (zero mean errors, \(\text{E}[\boldsymbol{\epsilon}] = \textbf{0}\))
    • Equal variance (homoscedasticity, \(\text{Var}[\boldsymbol{\epsilon}] = \sigma^2\textbf{I}\))
    • Independence of the errors, \(\text{Cov}[\epsilon_i, \epsilon_j] = 0, \; i \ne j\)
    • Can we assume that the errors are normally distributed?
  • How well does the model predict new observations?

  • Do observations have undue influence on the results?

  • Does the relationship between the inputs negatively affect the analysis?

The three basic types of linear model diagnostics are

  1. Residual diagnostics to examine linearity, equal variance assumptions, and to detect outliers. Residual analysis relies on functions of \(Y_i - \widehat{Y}_i\) to study the behavior of the unobservable \(\epsilon_i\).

  2. Case-deletion diagnostics find data points that exert high influence on the analysis. These diagnostics ask how an aspect of the analysis (variability, predicted values, coefficient estimates, …) changes if an observation is removed from the analysis.

  3. Collinearity diagnostics examine the relationships among the inputs and whether they impact the analysis in a negative way. The two extremes are completely orthogonal inputs (\(\textbf{X}^\prime\textbf{X}\) is a diagonal matrix) and inputs that are linear combinations of each other (\(\textbf{X}^\prime\textbf{X}\) is singular and a unique OLS solution does not exist). Most applications fall in-between unless one or more inputs are factors.

Leverage

Linear model diagnostics depend on the leverage values \(h_{ii}\), the diagonal values of the Hat matrix \(\textbf{H}\). That is not surprising, because the fitted values are linear combinations of the entries in the Hat matrix \[ \widehat{\textbf{Y}} = \textbf{H}\textbf{Y} \] The \(i\)th fitted value is a linear combination of the entries in the \(i\)th row of \(\textbf{H}\) and the elements of \(\textbf{Y}\) \[ \widehat{Y}_i = \textbf{x}_i^\prime\widehat{\boldsymbol{\beta}} = \sum_{j=1}^n h_{ij}Y_j \]

From the last expression it is easy to establish that \[ \text{Var}[\widehat{Y}_i] = \sigma^2 \, \textbf{x}_i^\prime(\textbf{X}^\prime\textbf{X})^{-1}\textbf{x}_i = \sigma^2 h_{ii} \]

We can think of the leverages \(h_{ii}\) as standardized squared distance measures that tell us how far the \(i\)th data point is from the center of the \(x\)-data.

It is illustrative to look at the average variance of the fitted values across the data set. In the full rank case: \[ \frac{1}{n}\sum_{i=1}^n \text{Var}[\widehat{Y}_i] = \frac{1}{n} \sum_{i=1}^n \sigma^2h_{ii} = \sigma^2\left(\frac{p+1}{n}\right) \tag{62.4}\]

The last result follows because \(\textbf{H}\) is a projection matrix and thus \(\text{tr}(\textbf{H})\) equals its rank, \(p+1\).

What does Equation 62.4 tell us about the precision of the estimated regression output? Suppose \(p=4\) and \(n=10\). The average variance is then \(\sigma^2/2\). When \(p=4\) and \(n=1000\), the average variance is \(\sigma^2/200\). As sample size increases, the estimates become more precise as \(p\) remains fixed. As \(p\) increases for a given sample size, the average variance of the fitted values increases. In high-dimensional problems—where \(p\) is large—OLS estimates have high variability and are unstable. Regularized estimation methods such as Ridge or Lasso regression can perform better in those circumstances.

Since \(\sum_{i=1}^n h_{ii} = p+1\), the average leverage value in the data is \((p+1)/n\), and a common threshold for flagging high leverage points is \(h_{ii} > 2(p+1)/n\). A high leverage point is not necessarily a problematic data point; it is simply an outlying point in the \(x\)-space. High leverage points have the potential to influence aspects of the analysis, that is, to be highly influential data points.

In summary, here are some important results involving the leverage values \(h_{ii}\):

  • \(h_{ii} = \textbf{x}_i^\prime (\textbf{X}^\prime\textbf{X})^{-1}\textbf{x}_i\)

  • \(\frac{1}{n} \le h_{ii} \le 1\). This holds only for the \(n\) training observations, the leverage \(\textbf{x}_0^\prime(\textbf{X}^\prime\textbf{X})^{-1}\textbf{x}_0\) of a new data point is not bounded in this way.

  • \(\overline{h} = (p+1)/n\); high leverage points are those for which \(h_{ii} > 2(p+1)/n\).

  • \(\text{Var}[\widehat{Y}_i] = \sigma^2 h_{ii}\)

  • \(\text{Var}[Y_i - \widehat{Y}_i] = \sigma^2 (1-h_{ii})\)

  • \(\widehat{Y}_i = \sum_{j=1}^n h_{ij}Y_j\), the \(i\)th fitted value is a linear combination of the target values with the values in the \(i\)th row of \(\textbf{H}\).

  • \(\sum_{j=1}^n h_{ij} = 1\), the entries in the \(i\)th row of \(\textbf{H}\) sum to 1 (for models with an intercept). Since the leverages are bounded, the rows of \(\textbf{H}\) sum to 1, and the fitted values are linear combinations of the target values with weights from \(\textbf{H}\), a data point with \(h_{ii} \approx 1\) has outsize influence: its fitted value is almost entirely determined by that single observation.
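Several of these results can be verified numerically. The sketch below simulates a small data set with one point placed far out in the \(x\)-space; the particular placement and the \(2(p+1)/n\) flag follow the discussion above.

```python
import numpy as np

# Sketch of the leverage calculations: diagonal of H, the average (p+1)/n,
# and the common 2(p+1)/n flag for high-leverage points.
rng = np.random.default_rng(9)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
X[0, 1:] = 6.0                                    # one point far out in x-space

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                                    # leverage values h_ii
print(np.isclose(h.mean(), (p + 1) / n))          # average leverage = (p+1)/n
print(np.allclose(H.sum(axis=1), 1.0))            # rows of H sum to 1 (intercept model)
print(np.where(h > 2 * (p + 1) / n)[0])           # flagged points (observation 0 is flagged)
```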

Residual Diagnostics

Basic questions about the correctness of the model revolve around the properties of \(\boldsymbol{\epsilon}\). The usual assumption \(\boldsymbol{\epsilon} \sim (\textbf{0}, \sigma^2\textbf{I})\) states that the errors have zero mean, are uncorrelated, and have equal variance. Although \(\boldsymbol{\epsilon}\) is unobservable, we should be able to check those assumptions by looking at the OLS residuals of the fitted model, \(\widehat{\boldsymbol{\epsilon}} = \textbf{y} - \widehat{\textbf{y}}\). These are also called the raw residuals.

Unfortunately, the properties of \(\widehat{\boldsymbol{\epsilon}}\) match the properties of \(\boldsymbol{\epsilon}\) only partially. Because \(\widehat{\boldsymbol{\epsilon}}\) is the result of fitting a model to data, the fitted residuals obey constraints that do not affect the model errors \(\boldsymbol{\epsilon}\). Because \[ \textbf{X}^\prime \widehat{\boldsymbol{\epsilon}} = \textbf{X}^\prime (\textbf{I}-\textbf{H})\textbf{Y} = \textbf{0} \] the raw residuals sum to zero across each column of the \(\textbf{X}\) matrix. In other words, there are only \(n-r(\textbf{X})\) degrees of freedom in the raw residual vector. From a statistical perspective, the residuals have zero mean, \(\text{E}[\widehat{\boldsymbol{\epsilon}}] = \textbf{0}\) and share this property with the model errors. The variance of \(\boldsymbol{\epsilon}\) and \(\widehat{\boldsymbol{\epsilon}}\) is different, however:

\[\begin{align*} \text{Var}[\boldsymbol{\epsilon}] &= \sigma^2 \textbf{I} \\ \text{Var}[\widehat{\boldsymbol{\epsilon}}] &= \sigma^2(\textbf{I} - \textbf{H}) \\ \text{Var}[\widehat{\epsilon}_i] &= \sigma^2 (1-h_{ii}) \end{align*}\]

While the model errors are uncorrelated, the fitted residuals are correlated because \(\textbf{I} - \textbf{H}\) is not a diagonal matrix. The fitted residuals also do not have the same variance; the variance depends on the leverage of the \(i\)th data point.

These properties (or lack thereof) should give pause in using the raw residuals to diagnose the assumptions of equal variance or uncorrelated errors. Instead, residual diagnostics use transformations of the raw residuals.

Studentized residuals

The unequal variance of the residuals can be handled by standardizing, dividing the residual by its standard deviation (the square root of its variance) \[ \frac{\widehat{\epsilon}_i}{\sigma\sqrt{1-h_{ii}}} \] \(\sigma\) is unknown and the obvious solution is to substitute an estimator. Statisticians refer to this process, using an estimate to scale a random variable, as studentization. The studentized residual is \[ r_i = \frac{\widehat{\epsilon}_i}{\widehat{\sigma}\sqrt{1-h_{ii}}} \] The usual estimator for \(\sigma\) is the square root of the estimator of \(\sigma^2\), \[ \widehat{\sigma} = \sqrt{ \frac{\text{SSE}}{n-r(\textbf{X})}} \]

R-student residuals

A further adjustment can be made to the studentized residuals. Rather than use an estimate of the variance that is derived from all the data, an estimator can be used that does not depend on the \(i\)th observation. This technique is called external studentization in contrast to the internal studentization that gives rise to \(r_i\).

Fortunately, such an external estimate of the variance \(\sigma^2\) that does not rely on the \(i\)th observation can be computed based on the analysis of all \(n\) observations. Not surprisingly, the leverage plays a role again: \[ \widehat{\sigma}^2_{-i} = \frac{(n-r(\textbf{X}))\widehat{\sigma}^2 - \frac{\widehat{\epsilon}^2_i}{1-h_{ii}}}{n-r(\textbf{X})-1} \]

The externally studentized residual is called the R-student residual, \[ t_i = \frac{\widehat{\epsilon}_i}{\widehat{\sigma}_{-i}\sqrt{1-h_{ii}}} \]

Since \(\widehat{\epsilon}_i\) and \(\widehat{\sigma}_{-i}\) are independent, \(t_i\) behaves like a \(t\)-distributed random variable. The R-student residuals are good diagnostics to detect outliers and high-influence points (hips). Outliers are observations unusual in \(y\)-space. They are not necessarily hips, unless they are also high leverage points.

The \(t_i\) work well for outliers and hips because an outlier has large \(\widehat{\epsilon}_i\) and a hip has small \(\sqrt{1-h_{ii}}\). Both effects increase the value of \(t_i\). This is also true for the (internally) studentized residual. In addition, outliers or hips will have a large \[ \frac{\widehat{\epsilon}^2_i}{1-h_{ii}} \] the adjustment term in the computation of \(\widehat{\sigma}^2_{-i}\). For those data points \(\widehat{\sigma}^2_{-i} < \widehat{\sigma}^2\) and \(t_i\) will be more sensitive than \(r_i\).

When testing model assumptions such as linearity and equal variance (homoscedasticity), and when checking for outliers, the R-student residuals are the recommended quantities. The threshold \(|t_i| > 2\) is often applied to flag outlying observations.
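A sketch of both residual types, with one artificial outlier planted in the simulated response so that the \(|t_i| > 2\) rule has something to flag; the data and coefficients are arbitrary.

```python
import numpy as np

# Sketch of internally studentized (r_i) and externally studentized
# R-student (t_i) residuals, with |t_i| > 2 as an outlier flag.
rng = np.random.default_rng(10)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)
y[3] += 4.0                                        # plant one outlier in y-space

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y                                      # raw residuals
dfe = n - (p + 1)
sigma2_hat = np.sum(e**2) / dfe

r = e / np.sqrt(sigma2_hat * (1 - h))              # studentized residuals
sigma2_minus_i = (dfe * sigma2_hat - e**2 / (1 - h)) / (dfe - 1)
t = e / np.sqrt(sigma2_minus_i * (1 - h))          # R-student residuals
print(np.where(np.abs(t) > 2)[0])                  # flags observation 3 (perhaps others)
```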


In a simple linear regression model you can plot the residuals against the input \(x\). In a general linear model you plot the residuals against the fitted values. The residuals should display no obvious trend when plotted against \(\widehat{y}\).

You can add the predictions from a smoothing spline or other nonparametric model to identify trends in the residuals. Ideally, the smoothing spline should not show any gross departures from a flat line at zero.

Case Deletion Diagnostics

Case deletion diagnostics express how much an aspect of the model changes when an observation is removed from the analysis. The R-student residual is a leave-one-out diagnostic in this spirit, as it uses an estimator of \(\sigma\) that does not incorporate the \(i\)th data point.

The two most important case deletion diagnostics are Cook’s D and the DFFITS. The name DFFITS stands for difference in fit, standardized. The statistic measures the change in predicted value in units of standard errors when the \(i\)th observation is deleted. We are concerned when a DFFIT exceeds in absolute value the threshold of \(2\sqrt{(p+1)/n}\).

The Cook’s D (“D” for distance) statistic measures the change in the parameter estimates \(\boldsymbol{\beta}\) when the \(i\)th observation is removed. If the purpose of modeling is to build a model that predicts well, focus more on DFFITS. If the purpose of modeling is to test hypotheses, then focus more on Cook’s D. We are concerned if the D statistic exceeds 1.
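Both diagnostics can be computed from the leverages and studentized residuals without refitting the model \(n\) times. The sketch below uses the standard leverage-based closed forms on simulated data chosen only for illustration.

```python
import numpy as np

# Sketch of the two case-deletion diagnostics, Cook's D and DFFITS,
# computed from leverages and (R-)studentized residuals.
rng = np.random.default_rng(11)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
e = y - H @ y
dfe = n - (p + 1)
sigma2_hat = np.sum(e**2) / dfe
r = e / np.sqrt(sigma2_hat * (1 - h))                       # studentized residuals
sigma2_minus_i = (dfe * sigma2_hat - e**2 / (1 - h)) / (dfe - 1)
t = e / np.sqrt(sigma2_minus_i * (1 - h))                   # R-student residuals

cooks_d = r**2 / (p + 1) * h / (1 - h)
dffits = t * np.sqrt(h / (1 - h))
print(np.where(cooks_d > 1)[0])                             # Cook's D threshold
print(np.where(np.abs(dffits) > 2 * np.sqrt((p + 1) / n))[0])  # DFFITS threshold
```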

Collinearity Diagnostics

When inputs are related to each other, there are two important consequences:

  • we cannot interpret the regression coefficient for one input without considering the other inputs
  • with increasing dependence among the inputs, the least squares estimation procedure becomes increasingly numerically unstable.

The condition in which inputs are linearly related is called collinearity; it negatively affects any calculation that involves the \((\textbf{X}^\prime\textbf{X})^{-1}\) matrix (which is just about every calculation).

The extreme case is when an input variable is a linear combination of other inputs, for example \[ Z = aX_2 + bX_3 \] Adding \(Z\) to a model that contains inputs \(X_2\) and \(X_3\) leads to a singular (rank-deficient) \(\textbf{X}\) matrix. The inverse cross-product matrix \((\textbf{X}^\prime\textbf{X})^{-1}\) does not exist and the OLS estimator cannot be computed.

Collinearity is the condition where inputs are highly correlated. They do not follow exact linear dependencies but are close to linearly dependent (a so-called near-linear dependency). The situation is more complex than involving just two input variables. For example, \(X_1\) might not be strongly correlated with \(X_2\) but can be strongly correlated with a linear combination of \(X_2\), \(X_3\), and \(X_8\). This condition is called multicollinearity.

When multicollinearity is strong, the OLS estimates of the regression coefficients become unstable; small perturbations in the target values or inputs can lead to large changes in the coefficients. The \(\widehat{\beta}_j\) estimate can be of the wrong size and/or sign.

A nontechnical diagnostic for multicollinearity is to compute the matrix of pairwise correlations among the inputs. A large value of \(\text{Corr}[X_j,X_k]\) is a sufficient condition for collinearity, but it is not a necessary one. Even with weak pairwise correlations you can have strong linear dependencies among multiple inputs.

Nevertheless, the pairwise correlation plot is a good place to start.

Here are some other, informal, ways to diagnose a multicollinearity problem:

  • The \(R^2\) statistic indicates the model explains substantial variability in \(Y\), but none (or few) of the inputs show statistically significant \(p\)-values. Because the \(t\)-tests are partial tests, the other input variables act as proxy for the variable being tested.

  • Different variable selection methods lead to very different models.

  • Standard errors of coefficients and/or fitted values are unusually large.

  • Slight perturbations of the data, for example, by adding some small Gaussian random noise, change the results dramatically.

A formal diagnosis relies on the computation of variance inflation factors (VIFs) or condition indices.

Variance inflation factors

Each predictor (input) variable in a linear model is associated with a variance inflation factor that quantifies the strength of linear dependencies between this input and all other inputs.

The \(j\)th VIF measures by how much the variance of the \(j\)th standardized regression coefficient is inflated because \(X_j\) is involved in linear dependencies with the other \(X\)s.

You can find the \(\text{VIF}_j\) from the \(R^2\) statistic of a multiple linear regression of \(X_j\) on all the other input variables. \[ \text{VIF}_j = \frac{1}{1-R^2_j} \]

Notice that a variance inflation factor does not depend on \(Y\). It is solely based on relationships among the inputs (predictors). Also, you can see from the model equation above that the VIF discovers more than a pairwise dependence on other variables. It models one input as a function of all other inputs.

Another way to compute the variance inflation factors is to fit the linear regression with a scaled and centered \(\textbf{X}^*\) matrix. The columns of \(\textbf{X}^*\) are centered at their sample means and scaled by dividing by \(\sqrt{n-1}\) times their standard deviations. As a result, the \(\textbf{X}^{*\prime} \textbf{X}^*\) matrix is the matrix of the empirical pairwise correlations of the inputs. The regression coefficients of this model are called the standardized coefficients (\(\beta^*_j\)) and the variance inflation factors are \[ \text{VIF}_j = \widehat{\text{Var}}[\widehat{\beta}^*_j] / \widehat{\sigma}^2 \]
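A sketch of the first computation, regressing each input on all the others; the near-linear dependency in the simulated inputs is planted deliberately so that some VIFs are large.

```python
import numpy as np

# Sketch: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing input
# X_j on all the other inputs (intercept included).
rng = np.random.default_rng(12)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.9 * x1 + 0.3 * x2 + rng.normal(scale=0.2, size=n)   # near-linear dependency
Xin = np.column_stack([x1, x2, x3])                        # inputs only, no intercept

def vif(j):
    Z = np.column_stack([np.ones(n), np.delete(Xin, j, axis=1)])
    fit = Z @ np.linalg.lstsq(Z, Xin[:, j], rcond=None)[0]
    r2_j = 1 - np.sum((Xin[:, j] - fit) ** 2) / np.sum((Xin[:, j] - Xin[:, j].mean()) ** 2)
    return 1.0 / (1.0 - r2_j)

print([round(vif(j), 1) for j in range(Xin.shape[1])])     # x1 and x3 have large VIFs
```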

The smallest possible VIF value is 1.0; it indicates that the input is not linearly related to the other variables. The thresholds are as follows:

  • 1 < VIF < 10: moderate collinearity
  • 10 < VIF < 30: moderate to severe collinearity
  • VIF > 30: severe collinearity problem

Condition index and condition number

A formal diagnostic for multicollinearity, based on the eigenvalue decomposition of the (scaled-and-centered) \(\textbf{X}^*\) matrix, examines the spread of the eigenvalues of \(\textbf{X}^{*\prime}\textbf{X}^*\). If \(\textbf{X}^*\) is a centered and scaled version of \(\textbf{X}\) such that \(\textbf{X}^{*\prime}\textbf{X}^*\) is the empirical correlation matrix, its eigendecomposition is \[ \textbf{X}^{*\prime}\textbf{X}^* = \textbf{Q}\boldsymbol{\Lambda}\textbf{Q}^\prime \] where \(\textbf{Q}\) is a \((p \times p)\) orthogonal matrix of eigenvectors and \(\boldsymbol{\Lambda}\) is a diagonal matrix with the eigenvalues \(\lambda_j\) on its diagonal. The number of eigenvalues close to zero indicates the number of near-linear dependencies in \(\textbf{X}\) (or \(\textbf{X}^*\)). In this centered-and-scaled form the eigenvalues satisfy \(\sum_{j=1}^p \lambda_j = p\), so if some eigenvalues get small, others must get larger.

The condition index associated with the \(j\)th eigenvalue is \[ \phi_j = \frac{\lambda_{\max}}{\lambda_j} \] and the condition number is \(\max_j(\phi_j) = \lambda_{\max}/\lambda_{\min}\).

Condition indices larger than 900 indicate that near linear dependencies exist.
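A sketch of the eigenvalue computation on the same kind of simulated collinear inputs used above; the scaling step reproduces the empirical correlation matrix before the decomposition.

```python
import numpy as np

# Sketch of condition indices from the eigenvalues of the correlation matrix
# of the inputs (the scaled-and-centered cross-product).
rng = np.random.default_rng(13)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.9 * x1 + 0.3 * x2 + rng.normal(scale=0.2, size=n)
Xin = np.column_stack([x1, x2, x3])

Xs = (Xin - Xin.mean(axis=0)) / (np.sqrt(n - 1) * Xin.std(axis=0, ddof=1))
corr = Xs.T @ Xs                                  # empirical correlation matrix
eigvals = np.linalg.eigvalsh(corr)                # eigenvalues sum to p = 3
phi = eigvals.max() / eigvals                     # condition indices
print(eigvals, phi.max())                         # one small eigenvalue -> large condition number
```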

Caution

In contrast to variance inflation factors, where the \(j\)th factor is associated with the \(j\)th input, the eigenvalue \(\lambda_j\) is not associated with a particular input. It is associated with a linear combination of all the \(X\)s.

An obvious remedy for multicollinearity is to remove inputs that are associated with high variance inflation factors and to refit the model. If you cannot remove the variables from the model, a different estimation method is called for. Regularization methods such as Ridge regression or Lasso regression handle high-dimensional problems and reduce the instability of the least-squares estimates by shrinking their values (suppressing the high variability of the coefficients). At the cost of introducing some bias, these estimators can drastically reduce variability for an overall better mean square prediction error.