47 Linear Algebra
47.1 Basics
Command of linear algebra is essential in data science; models and estimators are often expressed in terms of tensors, matrices, and vectors. Scalar-based arithmetic becomes tedious very quickly as models grow more complex. For example, the simple linear regression model and the straight-line-through-the-origin model can be written as
\[Y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i}\]
\[Y_{i} = \beta_{1}x_{i} + \epsilon_{i}\]
Using scalar algebra, the estimates of the slope are quite different:
\[{\widehat{\beta}}_{1} = \frac{\left( \sum_{i = 1}^{n}{\left( y_{i} - \overline{y} \right)\left( x_{i} - \overline{x} \right)} \right)}{\sum_{i = 1}^{n}\left( x_{i} - \overline{x} \right)^{2}}\]
\[{\widehat{\beta}}_{1} = \frac{\left( \sum_{i = 1}^{n}{y_{i}x_{i}} \right)}{\sum_{i = 1}^{n}x_{i}^{2}}\]
The formulas get messier as we add another input variable to the model. Using matrix-vector notation, the estimator of all the regression coefficients takes the same form, regardless of the size of the model:
\[ \widehat{\boldsymbol{\beta}} = \left( \textbf{X}^{\prime}\textbf{X}\right)^{- 1}\textbf{X}^{\prime}\textbf{Y} \]
This chapter presents basic linear algebra concepts necessary to formulate and manipulate data science models and to perform basic operations on vectors of random variables. There are many great texts available that cover linear algebra for statisticians and data scientists in much greater depth—some of my favorites are Graybill (1976) and Graybill (1983).
A scalar is a single real number. A vector is an array of scalars arranged in a single column (a column vector) or a single row (a row vector). A matrix is a two-dimensional array of scalars, and a tensor is a multi-dimensional array.
The order of a vector or matrix is specified as (rows × columns) and is sometimes used as a subscript for clarity. For example, \(\textbf{A}_{(3 \times 5)}\) denotes a matrix with 3 rows and 5 columns. It can be viewed as a concatenation of five \((3 \times 1)\) column vectors:
\[\textbf{A}_{(3 \times 5)}=\begin{bmatrix} 1 & 9.0 & 6.2 & 1 & 0 \\ 1 & 3.2 & 1.4 & 0 & 1 \\ 1 & 4.1 & -0.6 & 0 & 0 \end{bmatrix}\]
\(\textbf{a}_{1} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}\ \ \ \ \ \textbf{a}_{2} = \begin{bmatrix} 9.0 \\ 3.2 \\ 4.1 \end{bmatrix}\ \ \ \ \textbf{a}_{3} = \begin{bmatrix} 6.2 \\ 1.4 \\ - 0.6 \end{bmatrix}\ \ \ \ \ \textbf{a}_{4} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}\ \ \ \ \ \textbf{a}_{5} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}\).
A matrix with as many rows as columns is called a square matrix.
Bold symbols are common, lowercase for vectors and uppercase for matrices, but there are some exceptions. When dealing with random variables, bold uppercase notation is used for a vector of random variables and bold lowercase notation is used for the vector of its realized values. For example, if \(Y_{1},\cdots,Y_{n}\) is a random sample of size \(n\), the vector of random variables is
\[\textbf{Y}_{(n \times 1)} = \begin{bmatrix} Y_{1} \\ \vdots \\ Y_{n} \end{bmatrix}\]
and the vector of realized values is
\[\textbf{y}_{(n \times 1)} = \begin{bmatrix} y_{1} \\ \vdots \\ y_{n} \end{bmatrix}\]
The difference is significant because \(\textbf{Y}\) is a random vector and \(\textbf{y}\) is a vector of constants. \(\textbf{Y}\) has a multivariate distribution with mean and variance; \(\textbf{y}\) is just a vector of numbers.
We follow the convention that all vectors are column vectors, so that \(\textbf{y}_{(n)}\) serves as a shorthand for \(\textbf{y}_{(n \times 1)}\).
The typical element of a matrix is written as a scalar with subscripts that refer to rows and columns. For example, the statement
\[\textbf{A}= \left\lbrack a_{ij} \right\rbrack\]
says that matrix \(\textbf{A}\) consists of the scalars \(a_{ij}\); for example, \(a_{23}\) is the scalar in row 2, column 3 of the matrix.
47.2 Special Matrices
Special Values
A few special vectors and matrices that are common in statistics and machine learning are
\(\textbf{1}_{n} = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}\), the unit vector of size \(n\); all its elements are 1.
\(\textbf{0}_{n} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}\), the zero vector of size \(n\); all its elements are 0.
\(\textbf{0}_{(n \times k)} = \begin{bmatrix} 0 & 0 & 0 \\ \vdots & \cdots & \vdots \\ 0 & 0 & 0 \end{bmatrix}\), the zero matrix of order \((n \times k)\). All its elements are 0.
\(\textbf{J}_{(n \times k)} = \begin{bmatrix} 1 & 1 & 1 \\ \vdots & \cdots & \vdots \\ 1 & 1 & 1 \end{bmatrix}\), the unit matrix of size \((n \times k)\). All its elements are 1.
\(\textbf{I}_{n} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & 1 \end{bmatrix}\), the identity matrix of size \((n \times n)\) with 1s on the diagonal and 0s elsewhere.
If the order of these matrices is obvious from the context, the subscripts can be omitted.
Special Patterns
Patterned matrices play an important role in statistical operations. By recognizing a particular structure in a matrix, we can reduce the amount of work required to find inverses, eigenvalues, determinants, and so on. For example, if \(\textbf{D}\) is a diagonal matrix with \(d_{ii} \ne 0\), then its inverse \(\textbf{D}^{-1}\) can be found by simply replacing each \(d_{ii}\) with \(1/d_{ii}\). Finding the inverse of a general matrix is a much more elaborate task.
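As a quick numerical illustration of this shortcut (a minimal NumPy sketch; the diagonal values are made up for the example):

```python
import numpy as np

# A diagonal matrix with nonzero diagonal entries
D = np.diag([2.0, 5.0, 0.5])

# Inverse obtained by simply taking reciprocals of the diagonal
D_inv = np.diag(1.0 / np.diag(D))

# Same result as the general-purpose inverse routine
print(np.allclose(D_inv, np.linalg.inv(D)))   # True
```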
Performing operations on patterned matrices can lead to patterns in the result of the operations. An example is the multiplication of partitioned matrices.
A matrix with the same number of rows and columns is a square matrix.
Square matrices for which the typical element \(a_{ij} = a_{ji}\) are symmetric matrices. For example, the matrix \[ \begin{bmatrix} 54 & 37 & 5 \\ 37 & 27 & 4 \\ 5 & 4 & 21 \end{bmatrix} \] is a symmetric matrix. The values in the off-diagonal cells are mirrored across the diagonal. Symmetric matrices play an important role in statistics; cross-product, variance-covariance, “hat”, and correlation matrices are examples.
A diagonal matrix is a symmetric matrix where all off-diagonal entries are zero. The identity matrix \(\textbf{I}_{n}\) is diagonal, as is the following \[ \begin{bmatrix} 54 & 0 & 0 \\ 0 & 27 & 0 \\ 0 & 0 & 21 \end{bmatrix} \]
A triangular matrix is a square matrix with zeros above (lower triangular) or below (upper triangular) the diagonal. For example, \[ \textbf{A}_u = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a_{22} & a_{23} \\ 0 & 0 & a_{33} \end{bmatrix} \] is upper-triangular and \[ \textbf{A}_l = \begin{bmatrix} a_{11} & 0 & 0 \\ a_{21} & a_{22} & 0 \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \] is lower-triangular. Lower and upper triangular matrices play an important role in solving linear systems by LU decomposition. The procedure derives its name from factorizing a matrix into a lower (L) and upper (U) triangular form.
A matrix is said to be partitioned, if it can be arranged in separate blocks of sub-matrices. Most important in statistics and data science are partitions of square matrices: \[ \textbf{B}_{(n \times n)} = \begin{bmatrix} \textbf{B}_{11} & \textbf{B}_{12} \\ \textbf{B}_{21} & \textbf{B}_{22} \end{bmatrix} \]
The sizes of the \(B_{ij}\) sub-matrices are \((n_i \times n_j)\). An important case of partitioning occurs in the variance-covariance matrix of two random vectors (more on this below). Briefly, if \(\textbf{Y}= [\textbf{Y}_1, \textbf{Y}_2]^\prime\), then the variance matrix of \(\textbf{Y}\) is partitioned as follows: \[ \text{Var}[\textbf{Y}] = \begin{bmatrix} \text{Var}[\textbf{Y}_1] & \text{Cov}[\textbf{Y}_1,\textbf{Y}_2] \\ \text{Cov}[\textbf{Y}_2,\textbf{Y}_1] & \text{Var}[\textbf{Y}_2] \end{bmatrix} \]
A partitioned matrix is block-diagonal, if the off-diagonal sub-matrices are zero matrices. This occurs with the variance matrix of \(\textbf{Y}= [\textbf{Y}_1, \textbf{Y}_2]^\prime\), if \(\textbf{Y}_1\) and \(\textbf{Y}_2\) are uncorrelated: \[ \text{Var}[\textbf{Y}] = \begin{bmatrix} \text{Var}[\textbf{Y}_1] & \textbf{0}\\ \textbf{0}& \text{Var}[\textbf{Y}_2] \end{bmatrix} \]
47.3 Basic Operations on Matrices and Vectors
The basic operations on matrices and vectors are addition, subtraction, multiplication, transposition, and inversion. These are standard operations in manipulating matrix and vector equations. Decompositions such as Cholesky roots, LU decomposition, eigenvalue and singular value decomposition are more advanced operations that are important in solving statistical estimation problems.
Transpose
The transpose of a matrix is obtained by exchanging rows and columns. If \(a_{ij}\) is the element in row \(i\), column \(j\) of matrix \(\textbf{A}\), the transpose of \(\textbf{A}\), denoted \(\textbf{A}^\prime\), has typical element \(a_{ji}\). In the case of the \((3 \times 5)\) matrix shown previously, the transpose is
\[\textbf{A}^\prime_{(5 \times 3)} = \begin{bmatrix} 1 & 1 & 1 \\ 9.0 & 3.2 & 4.1 \\ 6.2 & 1.4 & -0.6 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}\]
The transpose of a column vector is a row vector:
\[\textbf{a}^{\prime} = \begin{bmatrix} a_{1} \\ \vdots \\ a_{n} \end{bmatrix}^\prime = \left\lbrack a_{1},\cdots,a_{n} \right\rbrack\]
Transposing a transpose produces the original matrix, \(\left( \textbf{A}^{\prime} \right)^{\prime}\ = \textbf{A}\).
A matrix is symmetric if it is equal to its transpose, \(\textbf{A}^\prime = \textbf{A}\). Symmetric matrices are square matrices (same numbers of rows and columns). The matrices \(\textbf{A}^\prime\textbf{A}\) and \(\textbf{A}\textbf{A}^\prime\) are always symmetric. A symmetric matrix whose off-diagonal elements are zero is called a diagonal matrix.
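Here is a minimal NumPy sketch of these statements, using the \((3 \times 5)\) matrix shown earlier:

```python
import numpy as np

A = np.array([[1.0, 9.0, 6.2, 1.0, 0.0],
              [1.0, 3.2, 1.4, 0.0, 1.0],
              [1.0, 4.1, -0.6, 0.0, 0.0]])

print(np.allclose(A.T.T, A))        # (A')' = A
AtA = A.T @ A                       # (5 x 5) cross-product
AAt = A @ A.T                       # (3 x 3)
print(np.allclose(AtA, AtA.T))      # A'A is symmetric
print(np.allclose(AAt, AAt.T))      # AA' is symmetric
```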
Here are some properties of transpose operations:
\((a\textbf{A})^\prime = (\textbf{A}a)^\prime = \textbf{A}^\prime a = a\textbf{A}^\prime\)
\(\left(a\textbf{A}+ b\textbf{B}\right)^\prime = a\textbf{A}^\prime + b\textbf{B}^\prime\)
\(\left(\textbf{A}^\prime \right)^\prime = \textbf{A}\)
\((\textbf{A}\textbf{B})^\prime = \textbf{B}^\prime \textbf{A}^\prime\)
If \(\textbf{D}\) is diagonal, then \(\textbf{D}^\prime = \textbf{D}\).
Addition and Subtraction
The sum (difference) of two matrices is the matrix of the elementwise sums (differences) of their elements. These operations require that the matrices being summed or subtracted have the same order:
\[\textbf{A}_{(n \times k)} + \textbf{B}_{(n \times k)} = \left\lbrack a_{ij} + b_{ij} \right\rbrack\]
\[\textbf{A}_{(n \times k)} - \textbf{B}_{(n \times k)} = \left\lbrack a_{ij} - b_{ij} \right\rbrack\]
Suppose, for example, that \(\textbf{A}= \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}\) and \(\textbf{B}=\begin{bmatrix} - 1 & - 2 & - 3 \\ - 4 & - 5 & - 6 \end{bmatrix}\).
Then,
\[\textbf{A}+ \textbf{B}= \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}\]
\[\textbf{A}- \textbf{B}= \begin{bmatrix} 2 & 4 & 6 \\ 8 & 10 & 12 \end{bmatrix}\]
Since addition and subtraction are elementwise operations, they can be combined with transposition:
\[\left( \textbf{A}+ \textbf{B}\right)^\prime = \textbf{A}^{\prime} + \textbf{B}^{\prime}\]
Multiplication
Two matrices conform for addition (subtraction) if they have the same order, that is, the same number of rows and columns. Multiplication of matrices requires a different type of conformity; two matrices \(\textbf{A}\) and \(\textbf{B}\) can be multiplied as \(\textbf{A}\textbf{B}\) (or \(\textbf{A}\times\textbf{B}\)) if the number of columns in \(\textbf{A}\) equals the number of rows in \(\textbf{B}\). We say that in the product \(\textbf{A}\times\textbf{B}\), \(\textbf{A}\) is post-multiplied by \(\textbf{B}\) or that \(\textbf{A}\) is multiplied into \(\textbf{B}\). The result of multiplying an \((n \times k)\) matrix into a \((k \times p)\) matrix is an \((n \times p)\) matrix.
Before examining the typical elements in the result of a multiplication, let’s look at a special case. The inner product of two \((k \times 1)\) vectors \(\textbf{a}\) and \(\textbf{b}\), also called the dot product or the scalar product, is the result of multiplying the transpose of \(\textbf{a}\) into \(\textbf{b}\); it is a scalar value:
\[ \textbf{a}^\prime\textbf{b}= \left\lbrack a_{1}, \cdots,a_{k} \right\rbrack \begin{bmatrix} b_{1} \\ \vdots \\ b_{k} \end{bmatrix} = a_{1}b_{1} + \cdots + a_{k}b_{k} = \sum_{i = 1}^{k}{a_{i}b_{i}} \]
The square root of the dot product of a vector with itself is the Euclidean (\(L_{2}\)) norm of the vector,
\[ \left| \left| \textbf{a}\right| \right| = \sqrt{\textbf{a}^\prime\textbf{a}} = \sqrt{\sum_{i = 1}^{k}a_{i}^{2}} \]
The \(L_{2}\) norm plays an important role as a loss function in statistical models. The vector for which the norm is calculated is then often a vector of model errors.
Now let’s return to the problem of multiplying the \((n \times k)\) matrix \(\textbf{A}\) into the \((k \times p)\) matrix \(\textbf{B}\) and introduce one more piece of notation: the \(i\)th row of \(\textbf{A}\), written as a column vector, is denoted \(\boldsymbol{\alpha}_{i}\) and the \(j\)th column of \(\textbf{B}\) is denoted \(\textbf{b}_{j}\). Now we can finally write the product \(\textbf{A}\times\textbf{B}\) as a matrix whose typical element is the inner product of \(\boldsymbol{\alpha}_{i}\) and \(\textbf{b}_{j}\):
\[ \textbf{A}_{(n \times k)} \times \textbf{B}_{(k \times p)} = \left\lbrack \boldsymbol{\alpha}_{i}^\prime\ \textbf{b}_{j} \right\rbrack_{(n \times p)} \]
As an example, let \(\textbf{A}= \begin{bmatrix} 1 & 2 & 0 \\ 3 & 1 & - 3 \\ 4 & 1 & 2 \end{bmatrix}\) and \(\textbf{B}= \begin{bmatrix} 1 & 0 \\ 2 & 3 \\ 2 & 1 \end{bmatrix}\). The product \(\textbf{A}\times\textbf{B}\) is a \((3 \times 2)\) matrix with elements
\[ \textbf{A}\times\textbf{B}= \begin{bmatrix} 1 \times 1 + 2 \times 2 + 0 \times 2 & 1 \times 0 + 2 \times 3 + 0 \times 1 \\ 3 \times 1 + 1 \times 2 - 3 \times 2 & 3 \times 0 + 1 \times 3 - 3 \times 1 \\ 4 \times 1 + 1 \times 2 + 2 \times 2 & 4 \times 0 + 1 \times 3 + 2 \times 1 \end{bmatrix} = \begin{bmatrix} 5 & 6 \\ - 1 & 0 \\ 10 & 5 \end{bmatrix} \]
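The same product can be reproduced numerically; a minimal NumPy sketch of the example above:

```python
import numpy as np

A = np.array([[1, 2, 0],
              [3, 1, -3],
              [4, 1, 2]])
B = np.array([[1, 0],
              [2, 3],
              [2, 1]])

# (3 x 3) multiplied into (3 x 2) gives a (3 x 2) matrix
print(A @ B)
# [[ 5  6]
#  [-1  0]
#  [10  5]]
```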
Here are a few helpful rules for matrix multiplication:
\(c\textbf{A}= \left\lbrack ca_{ij} \right\rbrack\)
\(c\left( \textbf{A}+ \textbf{B}\right) = c\textbf{A}+ c\textbf{B}\)
\(\textbf{C}\left( \textbf{A}+ \textbf{B}\right) = \textbf{C}\textbf{A}+ \textbf{C}\textbf{B}\)
\(\left( \textbf{A}\textbf{B}\right)\textbf{C}= \textbf{A}(\textbf{B}\textbf{C})\)
\(\left( \textbf{A}+ \textbf{B}\right)\left( \textbf{C}+ \mathbf{D} \right) = \textbf{A}\textbf{C}+ \textbf{A}\mathbf{D} + \mathbf{BC} + \mathbf{BD}\)
\(\left( \textbf{A}\textbf{B}\right)^\prime = \textbf{B}^\prime\textbf{A}^\prime\)
\(\left( c\textbf{A}\right)^\prime = c\textbf{A}^\prime\)
Inversion and Rank
In scalar algebra, division and multiplication are inverse operations; dividing a non-zero scalar by itself yields the multiplicative identity, \(\frac{a}{a} = 1\). What is the equivalent of this operation for matrices? First, inversion of a matrix does not reduce it to a scalar; the multiplicative identity for matrices is the identity matrix \(\textbf{I}\), a diagonal matrix with 1s on the diagonal. Second, inversion is only defined for square matrices. If \(\textbf{A}\) is an \((n \times n)\) matrix, the matrix \(\textbf{B}\) for which
\[ \textbf{A}\textbf{B}= \textbf{I} \]
is called the inverse of \(\textbf{A}\), denoted as \(\textbf{A}^{- 1}\).
Full rank matrices
Inverse matrices do not have to exist, even for square matrices. If \(\textbf{A}\) has an inverse matrix, then \(\textbf{A}\) is called a nonsingular matrix. In that case, \(\textbf{A}^{- 1}\textbf{A}= \textbf{A}\textbf{A}^{- 1} = \textbf{I}\).
For the inverse of a square matrix to exist, that is, for the matrix to be non-singular, the matrix must be of full rank. The rank of a matrix, denoted \(r(\textbf{A})\), is the number of its linearly independent columns. What does that mean? Suppose we are dealing with an \((n \times k)\) matrix \(\textbf{B}\) whose column vectors are \(\textbf{b}_{1},\cdots,\textbf{b}_{k}\). A linear combination of the columns of \(\textbf{B}\) is
\[ c_{1}\textbf{b}_{1} + c_{2}\textbf{b}_{2} + \cdots + c_{k}\textbf{b}_{k} = \textbf{q} \]
If you can find a set of scalars \(c_{1},\cdots,c_{k}\), not all zero, such that \(\textbf{q}= \textbf{0}\), then the columns of \(\textbf{B}\) are linearly dependent. If the only set of scalars that yields \(\textbf{q}= \textbf{0}\) is
\[c_{1} = c_{2} = \cdots = c_{k} = 0\]
then the columns of \(\textbf{B}\) are linearly independent and the rank of \(\textbf{B}\) is \(k\).
Here are a few more useful results about the rank of a matrix:
\(r\left( \textbf{A}\right) = r\left( \textbf{A}^\prime \right) = r\left( \textbf{A}^\prime\textbf{A}\right) = r\left( \textbf{A}\textbf{A}^{\prime} \right)\)
\(r\left( \textbf{A}\textbf{B}\right) \leq \min\left\{ r\left( \textbf{A}\right),r\left( \textbf{B}\right) \right\}\)
\(r\left( \textbf{A}+ \textbf{B}\right) \leq r\left( \textbf{A}\right) + r(\textbf{B})\)
The first two results are particularly important in statistical models. In models with linear structure, it is common to collect the \(p\) input variables, together with the intercept as a column of ones, into a matrix \(\textbf{X}_{(n \times (p + 1))}\):
\[ \textbf{X}_{(n \times (p + 1))} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix} \]
Suppose we want to solve the linear system \(\textbf{Y}= \textbf{X}\textbf{c}\) for \(\textbf{c}\). Start by pre-multiplying both sides of the equation with the transpose of \(\textbf{X}\):
\[ \textbf{X}^{\prime}\textbf{Y}= \textbf{X}^{\prime}\textbf{X}\textbf{c} \]
If the inverse of \(\textbf{X}^\prime\textbf{X}\) exists, we can pre-multiply both sides with that inverse and isolate \(\textbf{c}\):
\[ \left( \textbf{X}^\prime\textbf{X}\right)^{\mathbf{- 1}}\textbf{X}^{\prime}\textbf{Y}= \left( \textbf{X}^\prime\textbf{X}\right)^{-1}\textbf{X}^\prime\textbf{X}\textbf{c}= \textbf{I}\textbf{c}= \textbf{c} \]
We have a solution to the system, namely \(\textbf{c}= \left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\textbf{X}^{\prime}\textbf{Y}\), only if the inverse \(\left( \textbf{X}^\prime\textbf{X}\right)^{-1}\) exists. And that requires that this \(((p + 1) \times (p + 1))\) matrix is of full rank, \(r\left( \textbf{X}^\prime\textbf{X}\right) = p + 1\). This, in turn, is equivalent to saying that \(\textbf{X}\) has full column rank \(p + 1\), because of the first rank property above.
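A short NumPy sketch of this argument, with a made-up full-rank model matrix (the data are simulated only for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + 2 inputs
Y = rng.normal(size=n)

XtX = X.T @ X
print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(XtX))  # both equal p + 1 = 3

# Solution c = (X'X)^{-1} X'Y exists because X'X is nonsingular
c = np.linalg.inv(XtX) @ X.T @ Y
print(c)
```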
Here are some useful results about inverse matrices:
\(\left( \textbf{A}^{- 1} \right)^\prime = \left( \textbf{A}^\prime \right)^{- 1}\)
\(\left( \textbf{A}^{- 1} \right)^{- 1} = \textbf{A}\)
\(\left( \textbf{A}\textbf{B}\right)^{- 1} = \textbf{B}^{-1}\textbf{A}^{-1}\)
\(r\left( \textbf{A}^{- 1} \right) = r(\textbf{A})\)
Inverse of partitioned matrices
Suppose that \(\textbf{B}_{(n \times n)}\) is a partitioned matrix \[ \textbf{B}_{(n \times n)} = \begin{bmatrix} \textbf{B}_{11} & \textbf{B}_{12} \\ \textbf{B}_{21} & \textbf{B}_{22} \end{bmatrix} \]
where \(\textbf{B}_{ij}\) is of dimension \((n_i \times n_j)\). If \(\textbf{A}= \textbf{B}^{-1}\), then \(\textbf{A}\) is also a partitioned matrix with \[ \begin{align*} \textbf{A}_{11} &= \left[\textbf{B}_{11} - \textbf{B}_{12}\textbf{B}_{22}^{-1}\textbf{B}_{21} \right]^{-1} \\ \textbf{A}_{12} &= -\textbf{B}_{11}^{-1}\textbf{B}_{12}\left[\textbf{B}_{22} - \textbf{B}_{21}\textbf{B}_{11}^{-1}\textbf{B}_{12}\right]^{-1} = -\textbf{B}_{11}^{-1}\textbf{B}_{12}\textbf{A}_{22} \\ \textbf{A}_{22} &= \left[\textbf{B}_{22} - \textbf{B}_{21}\textbf{B}_{11}^{-1}\textbf{B}_{12}\right]^{-1} \\ \textbf{A}_{21} &= -\textbf{B}_{22}^{-1}\textbf{B}_{21}\left[\textbf{B}_{11} - \textbf{B}_{12}\textbf{B}_{22}^{-1}\textbf{B}_{21}\right]^{-1} = -\textbf{B}_{22}^{-1}\textbf{B}_{21}\textbf{A}_{11} \end{align*} \]
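These block formulas can be checked numerically. The following sketch (with a made-up positive definite matrix, so that all required inverses exist) compares the \(\textbf{A}_{11}\) block with the corresponding block of the full inverse:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(5, 5))
B = M @ M.T + 5 * np.eye(5)         # positive definite, hence invertible
n1 = 2                              # split into a 2x2 / 3x3 partition
B11, B12 = B[:n1, :n1], B[:n1, n1:]
B21, B22 = B[n1:, :n1], B[n1:, n1:]

A = np.linalg.inv(B)
A11 = np.linalg.inv(B11 - B12 @ np.linalg.inv(B22) @ B21)
print(np.allclose(A[:n1, :n1], A11))   # True
```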
Rank-deficient matrices
The linear system \(\textbf{Y}= \textbf{X}\textbf{c}\) has a unique solution if \(\textbf{X}\) is an \((n\times n)\) nonsingular matrix. If \(\textbf{X}\) is not square or if \(\textbf{X}\) is singular, then we can still find a solution to the linear system, but it is not unique. If the matrix \(\textbf{X}\) is of less than full rank (singular), it is called a rank-deficient matrix. We can then make progress by using a generalized inverse matrix instead of the (non-existing) inverse.
Also called pseudo inverses or g-inverses, the generalized inverse of matrix \(\textbf{X}\) is denoted \(\textbf{X}^{-}\) and satisfies
\[ \textbf{X}\textbf{X}^{-}\textbf{X}= \textbf{X} \] Suppose we can find such a generalized inverse \(\left( \textbf{X}^{\prime}\textbf{X}\right)^{-}\) of \(\textbf{X}^\prime\textbf{X}\). What if we use that in the solution of the linear system,
\[ \textbf{c}= {\left( \textbf{X}^{\prime}\textbf{X}\right)^{-}\textbf{X}}^{\prime}\textbf{Y} \]
Unfortunately, whereas regular inverses are unique, there are (infinitely) many generalized inverses that satisfy the condition \((\textbf{X}^\prime\textbf{X})\left( \textbf{X}^\prime\textbf{X}\right)^{-}\textbf{X}^\prime\textbf{X}= \textbf{X}^\prime\textbf{X}\). So, there will be infinitely many possible solutions to the linear system. Fortunately, it turns out that generalized inverses have some nice properties; for example, \(\textbf{X}\left( \textbf{X}^\prime\textbf{X}\right)^{-}\textbf{X}^\prime\) is invariant to the choice of the generalized inverse. Even if the solution \(\textbf{c}\) is not unique, \(\textbf{X}\textbf{c}\) is unique. This result is important in linear models with rank-deficient design matrices, a condition that is common when the model contains classification variables. While the parameter estimates in such a model are not unique (because a generalized inverse is used to derive them), the predicted values are the same, no matter which generalized inverse we select.
Here are some properties of generalized inverses; they resemble the properties of the regular inverse:
If \(c\) is nonzero, then \((c\textbf{A})^- = (1/c)\textbf{A}^-\)
\(\textbf{A}\textbf{A}^-\textbf{A}= \textbf{A}\text{ and } \textbf{A}^-\textbf{A}\textbf{A}^- = \textbf{A}^-\)
\(\left(\textbf{A}^\prime\right)^- = \left(\textbf{A}^-\right)^\prime\)
\(\left(\textbf{A}^- \right)^- = \textbf{A}\)
\(r(\textbf{A}^-) = r(\textbf{A}) = r(\textbf{A}\textbf{A}^-) = r(\textbf{A}\textbf{A}^-\textbf{A})\)
If \(\textbf{A}\) is symmetric, so is \(\textbf{A}^-\)
A technique for constructing generalized inverses relies on the singular value decomposition and is covered below. For a diagonal matrix \(\textbf{D}\), constructing a generalized inverse is simple: replace the non-zero diagonal values \(d_{ii}\) with \(1/d_{ii}\), and leave the zero diagonal values in place.
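A minimal sketch of this recipe for a singular diagonal matrix; the last line checks the defining condition \(\textbf{D}\textbf{D}^{-}\textbf{D}= \textbf{D}\):

```python
import numpy as np

D = np.diag([4.0, 0.0, 2.0])                  # singular: one zero diagonal value
d = np.diag(D)
d_ginv = np.array([1.0 / x if x != 0 else 0.0 for x in d])
D_ginv = np.diag(d_ginv)

print(np.allclose(D @ D_ginv @ D, D))         # defining condition holds
```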
Determinant
The rank reduces a matrix to a single scalar value, the number of linearly independent columns of the matrix. Another value that reduces a square matrix to a single scalar is the determinant, written as \(\det(\textbf{A})\) or \(|\textbf{A}|\).
If a matrix \(\textbf{A}\) has an inverse, then it is square and its determinant is nonzero.
The determinant has a geometric interpretation which is not that relevant for our discussion. What matters is that the determinant appears frequently in expressions for multivariate probability distributions, so it pays to know how to manipulate determinants. Some useful results are:
\(|\textbf{A}| = |\textbf{A}^\prime|\)
\(|\textbf{I}| = 1\)
\(\left| c\textbf{A}\right| = c^{n}\left|\textbf{A}\right|\), where \(n\) is the order of \(\textbf{A}\)
If \(\textbf{A}\) is singular, then \(\left| \textbf{A}\right| = 0\)
If each element of a row (column) of \(\textbf{A}\) is zero, then \(\left| \textbf{A}\right| = 0\)
If two rows (columns) of \(\textbf{A}\) are identical, then \(\left| \textbf{A}\right| = 0\)
\(\left| \textbf{A}\textbf{B}\right| = \left| \textbf{A}\right|\ \left| \textbf{B}\right|\)
\(\left| \textbf{A}^{- 1} \right| = 1/|\textbf{A}|\)
If \(\textbf{A}\) is a triangular matrix, then \(|\textbf{A}| = \prod_{i = 1}^{n}a_{ii}\)
If \(\textbf{A}\) is a diagonal matrix, then \(|\textbf{A}| = \prod_{i = 1}^{n}a_{ii}\)
Trace
The trace operator, \(\text{tr}(\textbf{A})\), applies only to square matrices. The trace of an \(\textbf{A}_{(n \times n)}\) matrix is the sum of its diagonal elements:
\[ \text{tr}\left( \textbf{A}\right) = \sum_{i = 1}^{n}a_{ii} \]
The trace plays an important role in statistics in determining expected values of quadratic forms of random variables, for example, sums of squares in linear models. An important property of the trace is its invariance under cyclic permutations,
\[ \text{tr}\left( \mathbf{ABC} \right) = \text{tr}\left( \mathbf{BCA} \right) = \text{tr}(\mathbf{CAB}) \]
provided the matrices conform to multiplication. As a special case, suppose that \(\textbf{B}\) is an \((n \times n)\) nonsingular matrix, then \[ \text{tr}\left(\textbf{B}^{-1}\textbf{A}\textbf{B}\right) = \text{tr}(\textbf{A}) \]
Some other useful properties of the trace are
\(\text{tr}\left( \textbf{A}+ \textbf{B}\right) = \text{tr}\left( \textbf{A}\right) + \text{tr}\left( \textbf{B}\right)\)
\(\text{tr}(a\textbf{A}+ b\textbf{B}) = a\text{tr}(\textbf{A}) + b\text{tr}(\textbf{B})\)
\(\text{tr}\left( \textbf{A}\right) = \text{tr}\left( \textbf{A}^\prime \right)\)
\(\textbf{y}^\prime\textbf{A}\textbf{y}= \text{tr}\left( \textbf{y}^\prime\textbf{A}\textbf{y}\right) = \text{tr}\left( \textbf{A}\textbf{y}\textbf{y}^\prime \right)\), since a quadratic form is a scalar
If \(\textbf{A}\) is \((n \times m)\), then \(\text{tr}(\textbf{A}\textbf{A}^\prime) = \text{tr}(\textbf{A}^\prime\textbf{A}) = \sum_{i=1}^n\sum_{j=1}^m a_{ij}^2\)
\(\text{tr}\left( \textbf{A}\right) = r(\textbf{A})\) if \(\textbf{A}\) is symmetric and idempotent (\(\textbf{A}\textbf{A}= \textbf{A}\) and \(\textbf{A}= \textbf{A}^\prime\))
If \(\lambda_1, \cdots, \lambda_n\) are the eigenvalues of \(\textbf{A}\), then
- \(\text{tr}(\textbf{A}) = \sum_{i=1}^n \lambda_i\)
- \(\text{tr}(\textbf{A}^k) = \sum_{i=1}^n \lambda_i^k\)
- \(\text{tr}(\textbf{A}^{-1}) = \sum_{i=1}^n \lambda_i^{-1}\)
An important result about the trace, which plays a role in working with projection matrices in statistical models, is the following. If \(\textbf{A}\) is an \((m \times n)\) matrix of rank \(r\), then the matrix \(\textbf{B}= (\textbf{I}- \textbf{A}(\textbf{A}^\prime\textbf{A})^{-}\textbf{A}^\prime)\) has trace \(\text{tr}(\textbf{B}) = m - r\). The “hat” matrix, discussed in Section 47.6.1, presents a special case.
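The cyclic-permutation and eigenvalue results can be verified numerically; a minimal NumPy sketch with made-up matrices:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(4, 4))
B = rng.normal(size=(4, 4))
C = rng.normal(size=(4, 4))

# invariance under cyclic permutations
print(np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))      # True

# trace equals the sum of the eigenvalues (up to rounding)
print(np.isclose(np.trace(A), np.linalg.eigvals(A).sum().real))  # True
```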
47.4 Random Vectors
If the elements of a vector are random variables, the vector object itself is a random variable. You can think of random vectors as a convenient mechanism to collect random variables. Suppose we draw a random sample \(Y_{1},\cdots,Y_{n}\), then we can collect the \(n\) random variables in a single random vector
\[ \textbf{Y}= \begin{bmatrix} Y_{1} \\ \vdots \\ Y_{n} \end{bmatrix} \]
Expected Value
Since each \(Y_{i}\) has a probability distribution, a mean (expected value) \(\text{E}\left\lbrack Y_{i} \right\rbrack\), a variance \(\text{Var}\left\lbrack Y_{i} \right\rbrack\), and so forth, the same applies to their collection. The expected value (mean) of a random vector is the vector of the expected values of its elements:
\[ \text{E}\left\lbrack \textbf{Y}\right\rbrack = \begin{bmatrix} \text{E}\left\lbrack Y_{1} \right\rbrack \\ \vdots \\ \text{E}\left\lbrack Y_{n} \right\rbrack \end{bmatrix} \]
Suppose that \(\textbf{A}\), \(\textbf{B}\), and \(\textbf{C}\) are matrices of constants, \(\textbf{c}\) is a vector of constants, and \(\textbf{Y}\) and \(\mathbf{U}\) are random vectors. The following expectation rules are useful in this situation:
\(\text{E}\left\lbrack \textbf{A}\right\rbrack = \textbf{A}\)
\(\text{E}\left\lbrack \mathbf{AYB} + \mathbf{C} \right\rbrack = \textbf{A}\text{E}\left\lbrack \textbf{Y}\right\rbrack\textbf{B}+ \textbf{C}\)
\(\text{E}\left\lbrack \mathbf{AY} + \mathbf{c} \right\rbrack = \textbf{A}\text{E}\left\lbrack \textbf{Y}\right\rbrack + \textbf{c}\)
\(\text{E}\left\lbrack \mathbf{AY} + \mathbf{BU} \right\rbrack = \textbf{A}\text{E}\left\lbrack \textbf{Y}\right\rbrack + \textbf{B}\ \text{E}\lbrack\mathbf{U}\rbrack\)
Covariance Matrix
While the distribution of \(Y_{i}\) is univariate, \(\textbf{Y}\) has a multivariate (\(n\)-variate) distribution. The mean of the distribution is represented by a vector. The variance of the distribution is represented by a matrix, the variance-covariance matrix, a special case of a covariance matrix.
The covariance matrix between random vectors \(\textbf{Y}_{(k \times 1)}\) and \(\mathbf{U}_{(p \times 1)}\) is a \((k \times p)\) matrix whose typical elements are the covariances between the elements of \(\textbf{Y}\) and \(\mathbf{U}\):
\[ \text{Cov}\left\lbrack \textbf{Y},\mathbf{U} \right\rbrack = \left\lbrack \text{Cov}(Y_{i},U_{j}) \right\rbrack \]
The covariance matrix can be written in terms of expected values of \(\textbf{Y}\), \(\mathbf{U}\), and \(\textbf{Y}\mathbf{U}^\prime\)
\[ \text{Cov}\left\lbrack \textbf{Y},\mathbf{U} \right\rbrack = \text{E}\left\lbrack \left( \textbf{Y}- \text{E}\lbrack\textbf{Y}\rbrack \right)\left( \mathbf{U} - \text{E}\left\lbrack \mathbf{U} \right\rbrack \right)^\prime \right\rbrack = \text{E}\left\lbrack \textbf{Y}\mathbf{U}^\prime \right\rbrack - \text{E}\left\lbrack \textbf{Y}\right\rbrack \text{E}\left\lbrack \mathbf{U} \right\rbrack^{\prime} \]
Some useful rules to manipulate covariance matrices are:
\(\text{Cov}\left\lbrack \mathbf{AY},\mathbf{U} \right\rbrack = \textbf{A}\text{Cov}\lbrack\textbf{Y},\mathbf{U}\rbrack\)
\(\text{Cov}\left\lbrack \textbf{Y},\mathbf{BU} \right\rbrack = \text{Cov}\left\lbrack \textbf{Y},\mathbf{U} \right\rbrack\textbf{B}^\prime\)
\(\text{Cov}\left\lbrack \mathbf{AY},\mathbf{BU} \right\rbrack = \textbf{A}\text{Cov}\left\lbrack \textbf{Y},\mathbf{U} \right\rbrack\ \textbf{B}^\prime\)
\(\text{Cov}\left\lbrack a\textbf{Y}+ b\mathbf{U},c\mathbf{W} + d\textbf{V}\right\rbrack = ac\text{Cov}\left\lbrack \textbf{Y},\mathbf{W} \right\rbrack + bc\text{Cov}\left\lbrack \mathbf{U},\mathbf{W} \right\rbrack + ad\text{Cov}\left\lbrack \textbf{Y},\textbf{V}\right\rbrack + bd\text{Cov}\lbrack\mathbf{U},\textbf{V}\rbrack\)
Variance-covariance Matrix
The variance-covariance matrix (or variance matrix for short) of a random vector \(\textbf{Y}\) is the covariance matrix of \(\textbf{Y}\) with itself.
\[ \text{Var}\left\lbrack \textbf{Y}\right\rbrack = \text{Cov}\left\lbrack \textbf{Y},\textbf{Y}\right\rbrack = \text{E}\left\lbrack \left( \textbf{Y}- \text{E}\lbrack\textbf{Y}\rbrack \right)\left( \textbf{Y}-\text{E}\left\lbrack \textbf{Y}\right\rbrack \right)^\prime \right\rbrack = \text{E}\left\lbrack \textbf{Y}\textbf{Y}^\prime \right\rbrack - \text{E}\left\lbrack \textbf{Y}\right\rbrack \text{E}\left\lbrack \textbf{Y}\right\rbrack^{\prime} \]
The diagonal entries of the variance-covariance matrix contain the variances of the \(Y_{i}\). The off-diagonal cells contain the covariances \(\text{Cov}\left\lbrack Y_{i},Y_{j} \right\rbrack\). If the variance matrix is diagonal, the elements of random vector \(\textbf{Y}\) are uncorrelated. Two random vectors \(\textbf{Y}_1\) and \(\textbf{Y}_2\) are uncorrelated if their joint variance matrix is block-diagonal:
\[ \text{Var}\begin{bmatrix} \textbf{Y}_{1} \\ \textbf{Y}_{2} \end{bmatrix} = \begin{bmatrix} \text{Var}\lbrack\textbf{Y}_{1}\rbrack & \textbf{0}\\ \textbf{0}& \text{Var}\lbrack\textbf{Y}_{2}\rbrack \end{bmatrix} \]
A very special variance-covariance matrix in statistical models is the scaled identity matrix, \(\sigma^{2}\textbf{I}\). This is the variance matrix of uncorrelated observations drawn from the same distribution—a common assumption for the error terms in models.
The rules for working with covariances extend to working with variance matrices:
\(\text{Var}\left\lbrack \mathbf{AY} \right\rbrack = \textbf{A}\ \text{Var}\left\lbrack \textbf{Y}\right\rbrack\textbf{A}^\prime\)
\(\text{Var}\left\lbrack \textbf{Y}+ \textbf{c}\right\rbrack = \text{Var}\lbrack\textbf{Y}\rbrack\), where \(\textbf{c}\) is a vector of constants
\(\text{Var}\left\lbrack \textbf{A}^\prime\textbf{Y}\right\rbrack = \textbf{A}^\prime\text{Var}\left\lbrack \textbf{Y}\right\rbrack\textbf{A}\)
\(\text{Var}\left\lbrack a\textbf{Y}\right\rbrack = a^{2}\text{Var}\lbrack\textbf{Y}\rbrack\)
\(\text{Var}\left\lbrack a\textbf{Y}+ b\mathbf{U} \right\rbrack = a^{2}\text{Var}\left\lbrack \textbf{Y}\right\rbrack + b^{2}\text{Var}\left\lbrack \mathbf{U} \right\rbrack + 2ab\ \text{Cov}\lbrack\textbf{Y},\mathbf{U}\rbrack\)
Finally, an important result about expected values of quadratic forms, heavily used in decomposing variability, is
\[ \text{E}\left\lbrack \textbf{Y}^\prime\mathbf{AY} \right\rbrack = \text{tr}\left( \textbf{A}\text{Var}\left\lbrack \textbf{Y}\right\rbrack \right) + \text{E}\left\lbrack \textbf{Y}\right\rbrack^\prime\textbf{A}\ \text{E}\lbrack\textbf{Y}\rbrack \]
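A simulation sketch of this identity, with a made-up mean vector, variance matrix, and symmetric \(\textbf{A}\); the Monte Carlo average of \(\textbf{Y}^\prime\textbf{A}\textbf{Y}\) should be close to the theoretical value:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
mu = np.array([1.0, -2.0, 0.5])
L = rng.normal(size=(n, n))
V = L @ L.T + np.eye(n)              # a valid (positive definite) variance matrix
A = np.eye(n)                        # a symmetric matrix for the quadratic form

Y = rng.multivariate_normal(mu, V, size=200_000)
mc = np.mean(np.einsum('ij,jk,ik->i', Y, A, Y))   # Monte Carlo average of Y'AY
theory = np.trace(A @ V) + mu @ A @ mu
print(mc, theory)                    # the two values should be close
```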
Correlation Matrix
Suppose \(\textbf{Y}\) is an \((n \times 1)\) random vector with a positive definite (full rank) variance matrix \(\textbf{V}= \text{Var}[\textbf{Y}] = [v_{ij}]\). The correlation matrix is the \((n \times n)\) matrix with typical element \(\text{Corr}[Y_i,Y_j]\): \[ \textbf{R}= \left[\text{Corr}[Y_i,Y_j]\right] = \left[\frac{v_{ij}}{\sqrt{v_{ii} \, v_{jj}}}\right] \]
You can express \(\textbf{R}\) as a product of matrices, \[ \textbf{R}= \textbf{D}^{-1/2} \textbf{V}\textbf{D}^{-1/2} \] where \(\textbf{D}\) is a \((n \times n)\) diagonal matrix that contains the variances \(v_{ii}\) on its diagonal.
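A minimal NumPy sketch of this conversion, using a made-up variance matrix:

```python
import numpy as np

V = np.array([[4.0, 1.2, 0.5],
              [1.2, 9.0, -2.1],
              [0.5, -2.1, 1.0]])         # a made-up variance matrix

D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(V)))
R = D_inv_sqrt @ V @ D_inv_sqrt
print(R)                                  # unit diagonal, correlations off-diagonal
```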
47.5 Cross-product Matrix
Statistical models are often expressed in terms of a vector of length \(p\) derived from the input variables of the model: \(\textbf{x}= [x_1, x_2, \cdots, x_p]^\prime\). With continuous input variables, \(x_j\) refers to the value of the variable itself or a transformation such as the logarithm, square root, etc. The values of \(x_j\) can also be constructed from transformations and functions involving other variables, for example \(x_3 = x_1 * x_2\).
In the case of a qualitative factor with \(k\) levels, the factor can occupy \(k-1\) binary variables in the model vector. Often, a 1 is added at the beginning of the vector to accommodate an intercept term: \(\textbf{x}= [1, x_1, x_2, \cdots, x_p]^\prime\).
The vector of inputs for the \(i\)th observation is then \(\textbf{x}_i = [1, x_{i1}, x_{i2}, \cdots, x_{ip}]^\prime\).
When these input rows are stacked, we obtain the \((n \times (p+1))\) model matrix (sometimes called the design matrix) \[ \textbf{X}_{(n\times (p+1))} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \tag{47.1}\]
A cross-product matrix is the product of the transpose of a matrix with itself: \[ \textbf{X}^\prime\textbf{X} \] A cross-product matrix is square and symmetric. The cross-product of the model matrix plays an important role in estimating model parameters and in determining their properties. For the model matrix in Equation 47.1, the entries of the cross-product matrix take the following form \[ \textbf{X}^\prime\textbf{X}= \begin{bmatrix} n & \sum x_{i1} & \sum x_{i2} & \cdots & \sum x_{ip} \\ \sum x_{i1} & \sum x_{i1}^2 & \sum x_{i1}x_{i2} & \cdots & \sum x_{i1}x_{ip} \\ \sum x_{i2} & \sum x_{i2}x_{i1} & \sum x_{i2}^2 & \cdots & \sum x_{i2}x_{ip} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \sum x_{ip} & \cdots & \cdots & \cdots & \sum x_{ip}^2 \end{bmatrix} \]
The sums of squares of the columns of \(\textbf{X}\) are on the diagonal; the sums of cross-products are in the off-diagonal cells.
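A short NumPy sketch forming \(\textbf{X}^\prime\textbf{X}\) for a small simulated model matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 6, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # model matrix with intercept

XtX = X.T @ X
print(XtX[0, 0])                 # equals n, the top-left entry
print(np.allclose(XtX, XtX.T))   # the cross-product matrix is symmetric
```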
47.6 Projection Matrices
A projection matrix (projector) is a symmetric idempotent matrix. So first, what is an idempotent matrix?
A matrix \(\textbf{P}\) is idempotent if \(\textbf{P}^2 = \textbf{P}\). If \(\textbf{P}\) is idempotent, so is \(\textbf{I}- \textbf{P}\). This is easy to prove: \[ \begin{align*} (\textbf{I}- \textbf{P})(\textbf{I}-\textbf{P}) &= \textbf{I}\textbf{I}-\textbf{I}\textbf{P}- \textbf{P}\textbf{I}+ \textbf{P}\textbf{P}\\ &= \textbf{I}- \textbf{P}- \textbf{P}+ \textbf{P}\textbf{P}\\ &= \textbf{I}- \textbf{P}- \textbf{P}+ \textbf{P}= \textbf{I}- \textbf{P} \end{align*} \]
A symmetric idempotent matrix is called a projector, because it maps vectors from a higher-dimensional space to their corresponding points in a lower-dimensional subspace. Think of shining a light on an object and catching its shadow on the wall: the shadow is the projection of a 3-dimensional object onto a 2-dimensional space. (An idempotent matrix that is not symmetric is called an oblique projector.)
Suppose we want to find a solution for \(\boldsymbol{\beta}\) in the linear model \(\textbf{Y}_{(n \times 1)} = \textbf{X}\boldsymbol{\beta}_{((p + 1) \times 1)} + \boldsymbol{\epsilon}\). The vector \(\textbf{Y}\) is a vector in \(n\)-dimensional space \(\mathbb{R}^{n}\) and the model places a restriction on the predicted values \(\widehat{\textbf{Y}} = \textbf{X}\widehat{\boldsymbol{\beta}}\): the predicted values are confined to a \((p + 1)\)-dimensional subspace of \(\mathbb{R}^{n}\). Regardless of how we choose the estimator \(\widehat{\boldsymbol{\beta}}\), we are dealing with projecting the vector \(\textbf{Y}\) onto a subspace of \(\mathbb{R}^{n}\).
We can thus think of the problem of finding the best estimator in this model as the problem of finding the best projection onto the space generated by the columns of \(\textbf{X}\). In that case, why not choose the projection that minimizes the distance between the observed values \(\textbf{Y}\) and the predicted values \(\widehat{\textbf{Y}}\)? This is achieved by projecting \(\textbf{Y}\) perpendicularly (orthogonally) onto the subspace generated by \(\textbf{X}\). In other words, our solution is the vector \(\widehat{\boldsymbol{\beta}}\) for which the residual \(\textbf{Y}- \textbf{X}\widehat{\boldsymbol{\beta}}\) is orthogonal to every column of \(\textbf{X}\):
\[ \textbf{X}^\prime\left( \textbf{Y}- \textbf{X}\widehat{\boldsymbol{\beta}} \right) = \textbf{0} \]
Multiplying out and rearranging terms yields the normal equations
\[ \textbf{X}^\prime\textbf{X}\widehat{\boldsymbol{\beta}} = \textbf{X}^{\prime}\textbf{Y} \]
This is a linear system of the form \(\textbf{A}\textbf{z}= \textbf{b}\) with \(\textbf{A}=\textbf{X}^\prime\textbf{X}\), \(\textbf{z}= \widehat{\boldsymbol{\beta}}\), and \(\textbf{b}= \textbf{X}^\prime\textbf{Y}\).
If \(\textbf{X}\) is of full column rank, then \(\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\) exists and we can solve:
\[ \widehat{\boldsymbol{\beta}}=\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\textbf{X}^{\prime}\textbf{Y} \]
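A minimal NumPy sketch with simulated data: the normal equations are solved directly, and the result is compared with np.linalg.lstsq, which computes the least squares solution without forming \((\textbf{X}^\prime\textbf{X})^{-1}\) explicitly:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 50
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([2.0, -1.5])
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)         # solve X'X b = X'Y
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)   # least squares without an explicit inverse
print(beta_hat, beta_lstsq)                           # essentially identical
```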
The “Hat” Matrix
The ordinary least squares estimator is the orthogonal projection of \(\textbf{Y}\) onto the subspace created by the columns of \(\textbf{X}\). To see the projection matrix at work, compute the predicted values, \(\textbf{X}\widehat{\boldsymbol{\beta}}\):
\[ \textbf{X}\widehat{\boldsymbol{\beta}}=\textbf{X}\left( \textbf{X}^{\prime}\textbf{X}\right)^{- 1}\textbf{X}^{\prime}\textbf{Y}= \widehat{\textbf{Y}} \]
The matrix \(\textbf{X}\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\textbf{X}^{\prime}\) has a very special role in regression analysis. It is often called the “hat” matrix and denoted \(\textbf{H}\), because pre-multiplying \(\textbf{Y}\) with \(\textbf{H}\) puts the “hats” (carets) on \(\textbf{Y}\):
\[\textbf{H}\textbf{Y}= \widehat{\textbf{Y}}\]
Let’s verify that \(\textbf{H}\) is indeed a projection matrix, which requires that \(\textbf{H}\) is symmetric and idempotent, \(\textbf{H}^\prime = \textbf{H}\) and \(\textbf{H}\textbf{H}= \textbf{H}\). Since \(\textbf{X}^\prime\textbf{X}\) is symmetric, so is its inverse, and \[ \begin{align*} \textbf{H}^\prime &= \left(\textbf{X}\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\textbf{X}^{\prime}\right)^\prime \\ &= (\textbf{X}^\prime)^\prime \left(\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\right)^\prime \textbf{X}^\prime \\ &= \textbf{X}\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\textbf{X}^{\prime} = \textbf{H} \end{align*} \]
\[ \begin{align*} \textbf{H}\textbf{H}&= \textbf{X}\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\textbf{X}^{\prime} \times \textbf{X}\left( \textbf{X}^\prime\textbf{X}\right)^{-1}\textbf{X}^\prime \\ &= \textbf{X}\left( \textbf{X}^\prime\textbf{X}\right)^{- 1} \textbf{X}^\prime\textbf{X}\left( \textbf{X}^\prime\textbf{X}\right)^{-1}\textbf{X}^\prime \\ & =\textbf{X}\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\textbf{X}^{\prime} = \textbf{H} \end{align*} \]
The hat matrix in the regression model is indeed a projection matrix.
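A quick numerical confirmation with a made-up model matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(8), rng.normal(size=(8, 2))])
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(H, H.T))        # symmetric
print(np.allclose(H @ H, H))      # idempotent
print(np.isclose(np.trace(H), np.linalg.matrix_rank(X)))  # trace = rank = p + 1
```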
Projector Properties
Here are some useful results about projection matrices that come in handy when working out the properties of estimators in regression-type models:
Projection matrices are typically not of full rank. If an \((n \times n)\) idempotent matrix is of rank \(n\), then it is the identity matrix \(\textbf{I}\).
If \(\textbf{A}\) is (symmetric) idempotent, then \(\textbf{I}- \textbf{A}\) is (symmetric) idempotent.
If \(\textbf{A}\) is idempotent and \(\textbf{P}\) is non-singular, then \(\mathbf{PA}\textbf{P}^{-1}\) is an idempotent matrix.
If \(\textbf{P}\) is an \((n \times n)\) projection matrix of rank \(r\), then it has \(r\) eigenvalues that have value 1 and \(n-r\) eigenvalues that are zero.
If \(\textbf{P}\) is a projection matrix, then \(r(\textbf{P}) = \text{tr}(\textbf{P})\).
If \(\textbf{P}_1\) and \(\textbf{P}_2\) are projection matrices and \(\textbf{P}_1 - \textbf{P}_2\) is p.s.d., then \(\textbf{P}_1 - \textbf{P}_2\) is also a projection matrix. (See the next section on the concept of definiteness.)
You can use these properties to show that in the linear regression model with uncorrelated errors and equal variance the variance matrix of the model residuals \(\textbf{Y}- \widehat{\textbf{Y}}\) is
\[ \text{Var}\left\lbrack \textbf{Y}- \widehat{\textbf{Y}} \right\rbrack = \sigma^{2}(\textbf{I}- \textbf{H}) \]
A Special Case
Consider the special case where \(\textbf{X}= \textbf{1}_{n}\), a column vector of ones. The corresponding linear model is \(\textbf{Y}= \textbf{1}\beta + \boldsymbol{\epsilon}\), an intercept-only model. The hat matrix for this model is
\[ \textbf{1}\left( \textbf{1}^{\prime}\textbf{1}\right)^{- 1}\textbf{1}^\prime = \frac{1}{n}\textbf{1}\textbf{1}^\prime = \frac{1}{n}\textbf{J}= \begin{bmatrix} \frac{1}{n} & \cdots & \frac{1}{n} \\ \vdots & \ddots & \vdots \\ \frac{1}{n} & \cdots & \frac{1}{n} \end{bmatrix} \] a matrix filled with \(\frac{1}{n}\). We know from the previous discussion that this is a projection matrix. The projection of \(\textbf{Y}\) onto the space generated by \(\textbf{1}_{n}\) is
\[ \frac{1}{n}\mathbf{JY} = \begin{bmatrix} \overline{Y} \\ \vdots \\ \overline{Y} \end{bmatrix} \]
The predicted values are all the same, the sample mean. In other words, \(\widehat{\beta} = \overline{Y}\). Since the projector is idempotent, deriving the variance of the predicted values in the iid case is simple:
\[ \text{Var}\left\lbrack \textbf{H}\textbf{Y}\right\rbrack = \textbf{H}\text{Var}\left\lbrack \textbf{Y}\right\rbrack\textbf{H}^\prime = \sigma^{2}\textbf{H}\textbf{H}^{\prime} = \sigma^{2}\textbf{H}= \sigma^{2}\frac{1}{n}\textbf{J}=\begin{bmatrix} \frac{\sigma^{2}}{n} & \cdots & \frac{\sigma^{2}}{n} \\ \vdots & \ddots & \vdots \\ \frac{\sigma^{2}}{n} & \cdots & \frac{\sigma^{2}}{n} \end{bmatrix} \]
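A tiny numerical check that this projector reproduces the sample mean:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 5
y = rng.normal(size=n)
H = np.ones((n, n)) / n           # the hat matrix of the intercept-only model

print(H @ y)                       # every entry equals the sample mean
print(y.mean())
```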
Orthogonal Projection
Suppose \(\Omega\) is a subspace of \(n\)-dimensional Euclidean space and \(\Omega^\complement\) is its orthogonal complement. The orthogonal projector onto \(\Omega\), denoted \(\textbf{P}_\Omega\), maps any vector \(\textbf{y}\) to a vector \(\textbf{u}= \textbf{P}_\Omega\textbf{y}\) in \(\Omega\); this projector is unique, and \(\Omega\) equals the range space of \(\textbf{P}_\Omega\) (the space spanned by its columns).
Each vector \(\textbf{y}\) can be decomposed uniquely into two vectors \(\textbf{u}\) and \(\textbf{v}\), where \(\textbf{u}\) lies in the range space of \(\textbf{P}\) and \(\textbf{v}\) lies in its orthogonal complement: \[ \textbf{y}= \textbf{u}+ \textbf{v}= \textbf{P}_\Omega \textbf{y}+ \textbf{P}_{\Omega^\complement}\textbf{y} \]
The projector onto the complement of \(\Omega\) is \(\textbf{P}_{\Omega^\complement}=\textbf{I}_n - \textbf{P}_{\Omega}\) and it is easy to show that the two projectors are orthogonal: \[ \textbf{P}\,(\textbf{I}- \textbf{P}) = \textbf{P}- \textbf{P}\textbf{P}= \textbf{P}- \textbf{P}= \textbf{0} \]
A statistical application of these results is in the linear model \(\textbf{Y}= \textbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}\). The projector onto the column space of \(\textbf{X}\) is \(\textbf{P}_\Omega = \textbf{X}(\textbf{X}^\prime\textbf{X})^{-1}\textbf{X}^\prime\), the hat matrix \(\textbf{H}\). The orthogonal decomposition of \(\textbf{Y}\) is \[ \begin{align*} \textbf{Y}&= \widehat{\textbf{Y}} + (\textbf{Y}- \widehat{\textbf{Y}}) \\ &= \textbf{H}\textbf{Y}+ (\textbf{I}- \textbf{H})\textbf{Y} \end{align*} \]
Because the two projectors \(\textbf{H}\) and \(\textbf{I}- \textbf{H}\) are orthogonal, it is easy to show that the fitted values and the residuals in the linear model are uncorrelated, if \(\text{Var}[\textbf{Y}] = \sigma^2\textbf{I}\): \[ \begin{align*} \text{Cov}[\widehat{\textbf{Y}}, \textbf{Y}-\widehat{\textbf{Y}}] &= \text{Cov}[\textbf{H}\textbf{Y}, (\textbf{I}-\textbf{H})\textbf{Y}] \\ &= \textbf{H}\text{Cov}[\textbf{Y},\textbf{Y}](\textbf{I}-\textbf{H})^\prime \\ &= \sigma^2\textbf{I}\, \textbf{H}(\textbf{I}- \textbf{H}) = \sigma^2\textbf{I}(\textbf{H}-\textbf{H}\textbf{H}) \\ &= \sigma^2\textbf{I}\, \textbf{0} = \textbf{0} \end{align*} \]
47.7 Definite Matrices
A real symmetric \((n \times n)\) matrix \(\textbf{A}\) is said to be positive definite if \[ \textbf{x}^\prime\textbf{A}\textbf{x}> 0 \] for all non-zero vectors \(\textbf{x}\) in \(\mathbb{R}^n\). \(\textbf{A}\) is said to be positive semidefinite (or non-negative definite) if \[ \textbf{x}^\prime\textbf{A}\textbf{x}\ge 0 \] for all \(\textbf{x}\) in \(\mathbb{R}^n\). When a matrix \(\textbf{A}\) is positive definite, it is nonsingular, and \(\textbf{A}^{-1}\) is then also positive definite. Let’s use the abbreviations p.d. and p.s.d. for positive definite and positive semidefinite matrices. Then we can state the following:
If \(\textbf{A}\) is p.d. and \(c > 0\), then \(c\textbf{A}\) is p.d.
If \(\textbf{A}\) and \(\textbf{B}\) are p.d., then \(\textbf{A}+ \textbf{B}\) is p.d.
If \(\textbf{A}\) and \(\textbf{B}\) are p.s.d., then \(\textbf{A}+ \textbf{B}\) is p.s.d.
If \(\textbf{A}\) is p.d. and \(\textbf{B}\) is p.s.d., then \(\textbf{A}+ \textbf{B}\) is p.d.
If \(\textbf{A}\) and \(\textbf{B}\) are p.d. and conform to multiplication, then \(\textbf{A}\textbf{B}\textbf{A}\) and \(\textbf{B}\textbf{A}\textbf{B}\) are p.d.
If \(\textbf{B}\) is p.s.d., then \(\text{tr}(\textbf{B}) \ge 0\).
If \(\textbf{A}\) is p.d., then \(r(\textbf{C}\textbf{A}\textbf{C}^\prime) = r(\textbf{C})\)
If \(\textbf{X}\) is \((n \times p)\) of rank \(p\), then \(\textbf{X}^\prime\textbf{X}\) is p.d.
The diagonal elements of a p.d. matrix are all positive
Projection matrices are positive semidefinite.
- If \(\textbf{P}_1\) and \(\textbf{P}_2\) are projection matrices and \(\textbf{P}_1 - \textbf{P}_2\) is p.s.d, then \(\textbf{P}_1 - \textbf{P}_2\) is also a projection matrix.
The variance-covariance matrix (Section 47.4.3) of a random vector is a positive semi-definite matrix. It is positive definite if none of the elements of the random vector can be expressed as a linear combination of the other elements.
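In practice, definiteness is usually checked numerically through the eigenvalues (see Section 47.8) or by attempting a Cholesky factorization; a minimal NumPy sketch with a small made-up matrix:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

print(np.linalg.eigvalsh(A))   # all eigenvalues positive -> positive definite
np.linalg.cholesky(A)          # succeeds only for a positive definite matrix
```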
Quadratic Forms
The term \(\textbf{x}^\prime\textbf{A}\textbf{x}\) is called a quadratic form; it is a scalar quantity. \(\textbf{A}\) is referred to as the matrix of the quadratic form. Quadratic forms where \(\textbf{x}\) is a vector of random variables play an important role in statistical analysis. The sum of squares decomposition in an analysis of variance, for example, can be expressed in terms of quadratic forms.
Let \(\textbf{Y}\) denote an \((n \times 1)\) random vector with mean vector \(\text{E}[\textbf{Y}] = \boldsymbol{\mu}\) and variance matrix \(\text{Var}[\textbf{Y}]=\boldsymbol{\Sigma}\). If \(\textbf{A}\) is an \((n \times n)\) symmetric matrix, then the random variable \(\textbf{Y}^\prime\textbf{A}\textbf{Y}\) has mean \[ \text{E}[\textbf{Y}^\prime\textbf{A}\textbf{Y}] = \text{tr}(\textbf{A}\boldsymbol{\Sigma}) + \boldsymbol{\mu}^\prime\textbf{A}\boldsymbol{\mu} \] Note that the mean of the quadratic form is not just \(\boldsymbol{\mu}^\prime\textbf{A}\boldsymbol{\mu}\), because the quadratic form involves cross-products of the \(Y_j\). The trace term carries that information.
In the special case where the elements of \(\textbf{Y}\) are uncorrelated and have equal variance, \(\text{Var}[\textbf{Y}] = \sigma^2\textbf{I}\) and the mean of the quadratic form simplifies to \[ \text{E}[\textbf{Y}^\prime\textbf{A}\textbf{Y}] = \sigma^2\text{tr}(\textbf{A}) + \boldsymbol{\mu}^\prime\textbf{A}\boldsymbol{\mu} \]
47.8 Matrix Factorization
The factorization (or decomposition) of a matrix \(\textbf{A}\) expresses the matrix as a product of two or more matrices, for example, \(\textbf{A}= \textbf{B}\textbf{C}\). Matrix multiplication defines how to compute the product \(\textbf{B}\textbf{C}\); factorization reverses this and asks how a given matrix can be expressed as a product of matrices.
The reasons for factorizing a matrix are to simplify computation, to make working with the matrix easier, and to reveal important properties of \(\textbf{A}\). Consequently, not just any factorization will do; we are interested in very specific forms of writing \(\textbf{A}\) as a product. Two important matrix factorizations are the eigenvalue decomposition and the singular value decomposition.
The word eigen in eigenvalue has German roots. It means to own; the German noun Eigentum refers to possession (ownership). It also denotes an inherent property that is special to a person or entity, a peculiar property. To make the concept of an eigenvalue or an eigenvector understandable, the idea of an inherent property is appropriate. To see the connection we consider first linear transformations from \(\mathbb{R}^n\) to \(\mathbb{R}^m\).
Linear Transformations
If \(\textbf{A}_{(m \times n)}\) is a matrix and \(\textbf{x}\) is a vector in \(\mathbb{R}^n\), then \[ \textbf{y}= \textbf{A}\textbf{x} \] is a vector in \(\mathbb{R}^m\). The matrix \(\textbf{A}\) linearly transforms the \(n\)-vector \(\textbf{x}\) into the \(m\)-vector \(\textbf{y}\). You can say that \(\textbf{A}\) “moved” \(\textbf{x}\) to \(\textbf{y}\). The transform is called linear because the transformation \(\textbf{A}\) of a linear combination of two vectors, \(c_1\textbf{x}_1 + c_2\textbf{x}_2\), is identical to the linear combination of the transformed vectors: \[ \textbf{A}(c_1\textbf{x}_1 + c_2\textbf{x}_2) = c_1(\textbf{A}\textbf{x}_1) + c_2(\textbf{A}\textbf{x}_2) \]
Eigenvalue Decomposition
Suppose \(\textbf{A}\) is a square matrix, \((n \times n)\). Is there a vector \(\textbf{x}\) for which the transformation of \(\textbf{x}\) results in a multiple of itself? That is, can we find some vector \(\textbf{x}\) and some number \(\lambda\) for which \[ \textbf{A}\textbf{x}= \lambda\textbf{x} \]
Rearranging terms, the equation can be written as \[ (\textbf{A}- \lambda\textbf{I})\textbf{x}= \textbf{0} \]
Clearly, \(\textbf{x}=\textbf{0}\) satisfies the equation regardless of the value of \(\lambda\). A nonzero solution exists if and only if \(\det(\textbf{A}-\lambda\textbf{I}) = 0\). The values of \(\lambda\) for which this characteristic equation holds are called the eigenvalues of \(\textbf{A}\).
If \(\lambda_j\) is such a value, and \(\textbf{x}_j\) is a vector for which \[ \textbf{A}\textbf{x}_j = \lambda_j \textbf{x}_j \] then \(\textbf{x}_j\) is called the eigenvector of \(\textbf{A}\) corresponding to the eigenvalue \(\lambda_j\).
Some interesting results about eigenvalues:
If \(\textbf{A}_{(n\times n)}\) is singular, then it has at least one zero eigenvalue.
For \(\textbf{A}_{(n\times n)}\) and \(\textbf{C}_{(n \times n)}\) nonsingular, the following matrices have the same eigenvalues: \(\textbf{A}\), \(\textbf{C}^{-1}\textbf{A}\textbf{C}\), \(\textbf{C}\textbf{A}\textbf{C}^{-1}\).
\(\textbf{A}\) and \(\textbf{A}^\prime\) have the same eigenvalues.
If \(\textbf{A}\) is nonsingular and \(\lambda\) is one of its eigenvalues, then \(1/\lambda\) is an eigenvalue of \(\textbf{A}^{-1}\).
If \(\textbf{P}\) is an \((n \times n)\) projection matrix of rank \(r\), then it has \(r\) unit eigenvalues and \(n-r\) zero eigenvalues.
We have finally arrived at the following: the eigenvalue decomposition (or eigendecomposition) of a square \((n \times n)\) matrix \(\textbf{A}\) is a factorization in terms of eigenvalues and eigenvectors: \[ \textbf{A}= \textbf{Q}\boldsymbol{\Lambda}\textbf{Q}^{-1} \] where \(\boldsymbol{\Lambda}\) is a diagonal matrix containing the \(n\) eigenvalues \(\lambda_1, \cdots, \lambda_n\) and \(\textbf{Q}\) is an \((n \times n)\) matrix of eigenvectors.
If \(\textbf{A}\) is a real symmetric matrix, then \(\textbf{Q}\) is an orthogonal matrix, which implies that \(\textbf{Q}^{-1} = \textbf{Q}^\prime\). An orthogonal (or orthonormal) matrix is a square matrix of real numbers whose columns are mutually orthogonal (perpendicular) and have unit length. If \(\textbf{Q}\) is orthogonal, this property can be expressed as \(\textbf{Q}^\prime\textbf{Q}=\textbf{I}\), which implies that the transpose of the orthogonal matrix is its inverse.
We can then write the eigenvalue decomposition as \[ \textbf{A}= \textbf{Q}\boldsymbol{\Lambda}\textbf{Q}^\prime \]
One interpretation of the eigenvalues is the extent to which \(\textbf{A}\) shrinks or stretches vectors in the directions of its eigenvectors. For a real symmetric \(\textbf{A}\) we can write \[ \textbf{A}\textbf{q} = \lambda\textbf{q} \] where \(\textbf{q}\) is one of the orthonormal eigenvectors, \(||\textbf{q}||=1\). Then \[ ||\textbf{A}\textbf{q}|| = ||\lambda\textbf{q}|| = |\lambda| \, ||\textbf{q}|| = |\lambda| \]
The eigenvalues of a real symmetric matrix also tell us about its definiteness. Matrix \(\textbf{A}\) is positive definite if and only if all of its eigenvalues are positive. The matrix is positive semi-definite if and only if all of its eigenvalues are non-negative. That means, if you find a zero eigenvalue, the matrix cannot be positive definite and it cannot be nonsingular. The number of nonzero eigenvalues gives us the rank of the matrix.
Constructing an inverse
If \(\textbf{A}\) is of full rank, then an inverse can be constructed from the eigenvalue decomposition as \[ \textbf{A}^{-1} = \textbf{Q}\boldsymbol{\Lambda}^{-1}\textbf{Q}^{-1} \] and in the case of a real symmetric matrix: \[ \textbf{A}^{-1} = \textbf{Q}\boldsymbol{\Lambda}^{-1}\textbf{Q}^\prime \]
The number of nonzero diagonal values of \(\boldsymbol{\Lambda}\) equals the rank of the decomposed matrix. For a nonsingular (full-rank) \(\textbf{A}\), all eigenvalues are nonzero. The \(\lambda_j\) can be close to zero, however, so when computing eigenvalues with finite-precision arithmetic (computers!), we apply thresholds to declare a singularity.
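A minimal NumPy sketch of this construction for the symmetric matrix used earlier in this chapter (np.linalg.eigh returns the eigenvalues and orthonormal eigenvectors of a symmetric matrix):

```python
import numpy as np

A = np.array([[54.0, 37.0, 5.0],
              [37.0, 27.0, 4.0],
              [5.0, 4.0, 21.0]])      # the symmetric matrix from earlier in the chapter

lam, Q = np.linalg.eigh(A)             # eigenvalues and orthonormal eigenvectors
A_inv = Q @ np.diag(1.0 / lam) @ Q.T   # inverse built from the eigendecomposition

print(np.allclose(A_inv, np.linalg.inv(A)))   # True
```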
Singular Value Decomposition (SVD)
The matrix factorized in the eigendecomposition is a square \((n \times n)\) matrix. What can we say about arbitrary \(\textbf{A}_{(m \times n)}\) matrices? The eigendecomposition cannot be applied directly, because we need a square matrix. What happens if we consider the decomposition of the square symmetric \(\textbf{A}^\prime \textbf{A}\) instead? We know that there is an eigenvalue decomposition of the form \[ \textbf{A}^\prime \textbf{A}= \textbf{Q}\boldsymbol{\Lambda}\textbf{Q}^\prime \] The singular values of \(\textbf{A}\) are the square roots of the (non-negative) eigenvalues of \(\textbf{A}^\prime\textbf{A}\). They are the lengths of the vectors \(\textbf{A}\textbf{q}_1, \cdots, \textbf{A}\textbf{q}_n\), where \(\textbf{q}_j\) is the \(j\)th eigenvector of \(\textbf{A}^\prime\textbf{A}\).
The singular value decomposition (SVD) of the \((m \times n)\) matrix \(\textbf{A}\) is written as \[ \textbf{A}= \textbf{U}\boldsymbol{\Sigma}\textbf{V}^\prime \] where \(\boldsymbol{\Sigma}\) is a \((m \times n)\) diagonal matrix containing the singular values of \(\textbf{A}\) (the square roots of the eigenvalues of \(\textbf{A}^\prime \textbf{A}\)), \(\textbf{U}\) is a \((m \times m)\) orthogonal matrix of left singular vectors, and \(\textbf{V}\) is a \((n \times n)\) matrix of right singular vectors of \(\textbf{A}\).
The SVD is an important operation on matrices with many applications in statistics and data science:
- Principal component analysis
- Reducing the dimensionality of a high-dimensional problem
- Computing a low-rank approximation to a matrix
- Imputation of missing values
- Spectral analysis
- Image processing
- Computing inverse and generalized inverse matrices
Constructing an inverse
If \(\textbf{A}\) is square and nonsingular, we can find its inverse from the elements of the singular value decomposition. Since \[ \textbf{A}= \textbf{U}\boldsymbol{\Sigma}\textbf{V}^\prime \] \[ \textbf{A}^{-1} = \left(\textbf{V}^\prime\right)^{-1}\boldsymbol{\Sigma}^{-1}\textbf{U}^{-1} = \textbf{V}\boldsymbol{\Sigma}^{-1}\textbf{U}^\prime \] The first equality applies the rule for the inverse of a product, the second equality uses the fact that \(\textbf{U}\) and \(\textbf{V}\) are orthogonal matrices. As in the case of using the eigendecomposition for inversion, \(\boldsymbol{\Sigma}^{-1}\) is a diagonal matrix with the reciprocals of the singular values on the diagonal.
Constructing a generalized inverse
If \(\textbf{A}\) is singular, we cannot construct an inverse, but the previous approach can be modified to construct a generalized inverse as \[ \textbf{A}^{-} = \textbf{V}{\boldsymbol{\Sigma}^*}^{-1}\textbf{U}^\prime \] where \({\boldsymbol{\Sigma}^*}^{-1}\) is a diagonal matrix; its \(i\)th diagonal value is \(1/\sigma_i\) if \(\sigma_i\) is nonzero, and is zero otherwise. Since the singular values are typically arranged in descending order, \(\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n\), this amounts to setting the last \(n-r\) diagonal entries of \({\boldsymbol{\Sigma}^*}^{-1}\) to zero, where \(r\) is the rank of \(\textbf{A}\).
You can show that a matrix so constructed satisfies the conditions for a generalized inverse.
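A hedged sketch of the construction follows, using an invented rank-deficient matrix and an illustrative tolerance: the reciprocals of the numerically zero singular values are set to zero, and the result is compared with `numpy.linalg.pinv`, which is built on the same idea.

```python
import numpy as np

# A rank-deficient (4 x 3) matrix: the third column is the sum of the first two
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 3.0],
              [3.0, 3.0, 6.0],
              [1.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Reciprocals of the singular values; "zero" singular values stay at zero
tol = 1e-10 * s.max()                      # illustrative threshold
s_inv = np.zeros_like(s)
s_inv[s > tol] = 1.0 / s[s > tol]

# Generalized (Moore-Penrose) inverse constructed from the SVD
A_ginv = Vt.T @ np.diag(s_inv) @ U.T

print(np.allclose(A_ginv, np.linalg.pinv(A)))   # True
print(np.allclose(A @ A_ginv @ A, A))           # generalized-inverse condition A G A = A
```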
47.9 Matrix Differentiation
Estimation of parameters in statistical models often requires minimization or maximization of an objective function. For example, the ordinary least squares (OLS) principle finds the OLS estimator as the function of the data that minimizes the error sum of squares of the model. Maximum likelihood finds estimators of the parameters as the functions of the data that maximize the joint likelihood (the joint distribution function) of the data.
The parameters of the models appear as elements of vectors and matrices. Finding estimators of the parameters thus requires calculus on vectors and matrices. Consider matrix \(\textbf{A}\), whose elements depend on a scalar parameter \(\theta\), \(\textbf{A}= \left\lbrack a_{ij}(\theta) \right\rbrack\). The derivative of \(\textbf{A}\) with respect to \(\theta\) is the matrix of the derivatives of the typical elements \(a_{ij}(\theta)\) with respect to \(\theta\). We write this formally as
\[ \frac{\partial\textbf{A}}{\partial\theta} = \left\lbrack \frac{\partial a_{ij}(\theta)}{\partial\theta} \right\rbrack \]
Here are some useful results from vector and matrix calculus, where \(\textbf{A}\) and \(\textbf{B}\) are functions of \(\theta\), \(\textbf{X}\) does not depend on \(\theta\), and the last two results are derivatives with respect to the vector \(\textbf{x}\):
\(\frac{{\partial ln}\left| \textbf{A}\right|}{\partial\theta} = \frac{1}{\left| \textbf{A}\right|}\frac{\partial\left| \textbf{A}\right|}{\partial\theta} = tr\left( \textbf{A}^{- 1}\frac{\partial\textbf{A}}{\partial\theta} \right)\)
\(\frac{\partial\textbf{A}^{- 1}}{\partial\theta} = - \textbf{A}^{- 1}\frac{\partial\textbf{A}}{\partial\theta\ }\textbf{A}^{- 1}\)
\(\frac{\partial tr\left( \mathbf{AB} \right)}{\partial\theta} = tr\left( \frac{\mathbf{\partial}\textbf{A}}{\partial\theta}\textbf{B}\right) + tr\left( \textbf{A}\frac{\mathbf{\partial}\textbf{B}}{\partial\theta} \right)\)
\(\frac{\partial\textbf{X}^\prime\textbf{A}^{- 1}\textbf{X}}{\partial\theta} = - \textbf{X}^\prime\textbf{A}^{- 1}\frac{\partial\textbf{A}}{\partial\theta}\textbf{A}^{- 1}\textbf{X}\)
\(\frac{\partial\,\textbf{x}^\prime\textbf{A}\textbf{x}}{\partial\textbf{x}} = 2\textbf{A}\textbf{x}\) (for symmetric \(\textbf{A}\))

\(\frac{\partial\,\textbf{x}^\prime\textbf{A}}{\partial\textbf{x}} = \frac{\partial\,\textbf{A}^\prime\textbf{x}}{\partial\textbf{x}} = \textbf{A}\)
The first four results above refer to taking the derivative of a matrix (or vector) with respect to a scalar parameter; the matrix (or vector) of derivatives has the same dimension as the original. A different situation arises when we take derivatives with respect to the elements of a vector. The important case for our discussion is the derivative of a scalar function with respect to a vector. In statistical models and machine learning applications such a function is frequently the objective function of a minimization problem.
For example, consider a nonlinear model with one input variable (\(x\)) and two parameters \(\boldsymbol{\theta}= [\theta_1, \theta_2]\), \[ \text{E}[Y] = f(x; \boldsymbol{\theta}) = 1 - \theta_1 \exp \{-x^{\theta_2}\} \] The derivatives of the mean function \(f(x; \boldsymbol{\theta})\) with respect to the parameters are given by
\[\begin{align*} \frac{\partial f(x;\boldsymbol{\theta})}{\partial \theta_1} &= -\exp\{-x^{\theta_2}\} \\ \frac{\partial f(x;\boldsymbol{\theta})}{\partial \theta_2} &= \theta_1 \log(x) x^{\theta_2} \exp\{-x^{\theta_2}\} \end{align*}\]
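A quick way to sanity-check derivatives like these is to compare them with finite differences. The sketch below does this for the mean function above; the evaluation point and step size are chosen only for illustration.

```python
import numpy as np

def f(x, theta1, theta2):
    """Mean function E[Y] = 1 - theta1 * exp(-x**theta2)."""
    return 1.0 - theta1 * np.exp(-x**theta2)

def grad_f(x, theta1, theta2):
    """Analytic partial derivatives with respect to theta1 and theta2."""
    e = np.exp(-x**theta2)
    return np.array([-e, theta1 * np.log(x) * x**theta2 * e])

# Compare with central finite differences at an arbitrary point
x, t1, t2, h = 1.5, 0.8, 1.2, 1e-6
fd = np.array([
    (f(x, t1 + h, t2) - f(x, t1 - h, t2)) / (2 * h),
    (f(x, t1, t2 + h) - f(x, t1, t2 - h)) / (2 * h),
])
print(np.allclose(grad_f(x, t1, t2), fd, atol=1e-6))   # True
```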
Gradient Vector
The derivative of a function \(f(\boldsymbol{\theta})\) with respect to the vector \(\boldsymbol{\theta}_{(p \times 1)}\) is the vector of the partial derivatives of the function
\[ \frac{\partial f\left( \boldsymbol{\theta}\right)}{\partial\boldsymbol{\theta}} = \begin{bmatrix} \frac{\partial f\left( \boldsymbol{\theta}\right)}{\partial\theta_{1}} \\ \vdots \\ \frac{\partial f\left( \boldsymbol{\theta}\right)}{\partial\theta_{p}} \end{bmatrix} \]
This is called the gradient of \(f(\boldsymbol{\theta})\) with respect to \(\boldsymbol{\theta}\).
Returning to the two-parameter example above, suppose we wish to fit the model by (nonlinear) least squares. This requires minimization of the objective function \[ \text{SSE}(\boldsymbol{\theta}) = \sum_{i=1}^n \left( y_i - f(x_i;\boldsymbol{\theta})\right)^2 = \sum_{i=1}^n \left(y_i - 1 + \theta_1\exp\left\{-x_i^{\theta_2}\right\}\right)^2 \tag{47.2}\]
with respect to \(\theta_1\) and \(\theta_2\). The error (residual) sum of squares is a function of the parameter vector \(\boldsymbol{\theta}\). Function minimization involves finding the values \(\widehat{\theta}_1\) and \(\widehat{\theta}_2\) for which the derivative of Equation 47.2 is zero. The gradient of the objective function \(\text{SSE}(\boldsymbol{\theta})\) with respect to \(\boldsymbol{\theta}\) is \[ \frac{\partial \,\text{SSE}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \begin{bmatrix} -2\sum_{i=1}^n \left(y_i - f(x_i;\boldsymbol{\theta})\right)\frac{\partial f(x_i;\boldsymbol{\theta})}{\partial \theta_1} \\ -2\sum_{i=1}^n \left(y_i - f(x_i;\boldsymbol{\theta})\right)\frac{\partial f(x_i;\boldsymbol{\theta})}{\partial \theta_2} \\ \end{bmatrix} \]
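To make this concrete, here is a hedged sketch that simulates data from the model (the true parameter values, sample size, and noise level are invented), minimizes \(\text{SSE}(\boldsymbol{\theta})\) with `scipy.optimize.minimize` using the analytic gradient, and confirms that the gradient is essentially zero at the solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Simulated data from E[Y] = 1 - theta1*exp(-x**theta2); values are illustrative
theta_true = np.array([0.8, 1.5])
x = rng.uniform(0.2, 3.0, size=50)
y = 1 - theta_true[0] * np.exp(-x**theta_true[1]) + rng.normal(0, 0.05, size=50)

def sse(theta):
    """Error sum of squares SSE(theta)."""
    resid = y - (1 - theta[0] * np.exp(-x**theta[1]))
    return np.sum(resid**2)

def sse_grad(theta):
    """Gradient of SSE(theta): -2 * sum(residual * df/dtheta_j)."""
    e = np.exp(-x**theta[1])
    resid = y - (1 - theta[0] * e)
    df_dt1 = -e
    df_dt2 = theta[0] * np.log(x) * x**theta[1] * e
    return np.array([-2 * np.sum(resid * df_dt1),
                     -2 * np.sum(resid * df_dt2)])

fit = minimize(sse, x0=np.array([0.5, 1.0]), jac=sse_grad, method="BFGS")
print(fit.x)            # estimates of (theta1, theta2), close to the true values
print(sse_grad(fit.x))  # gradient approximately zero at the minimizer
```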
Hessian Matrix
Taking the derivative of a scalar function \(f\) with respect to a vector yields a vector. By the same token, taking second derivatives of \(f\) yields a matrix of second derivatives:
\[ \frac{\partial^2 f(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}\, \partial \boldsymbol{\theta}^\prime} = \left[\frac{\partial^2 f(\boldsymbol{\theta})}{\partial \theta_i \partial \theta_j}\right] \]
This states that the typical element of the matrix of second derivatives is the second derivative of the scalar function with respect to the \(i\)th and \(j\)th element of \(\boldsymbol{\theta}\):
\[ \frac{\partial^2 f(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}\partial \boldsymbol{\theta}^\prime} = \begin{bmatrix} \frac{\partial^2 f(\boldsymbol{\theta})}{\partial \theta_1 \partial \theta_1} & \frac{\partial^2 f(\boldsymbol{\theta})}{\partial \theta_1 \partial \theta_2} & \cdots & \frac{\partial^2 f(\boldsymbol{\theta})}{\partial \theta_1 \partial \theta_p} \\ \frac{\partial^2 f(\boldsymbol{\theta})}{\partial \theta_2 \partial \theta_1} & \frac{\partial^2 f(\boldsymbol{\theta})}{\partial \theta_2 \partial \theta_2} & \cdots & \frac{\partial^2 f(\boldsymbol{\theta})}{\partial \theta_2 \partial \theta_p} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(\boldsymbol{\theta})}{\partial \theta_p \partial \theta_1} & \frac{\partial^2 f(\boldsymbol{\theta})}{\partial \theta_p \partial \theta_2} & \cdots & \frac{\partial^2 f(\boldsymbol{\theta})}{\partial \theta_p \partial \theta_p} \\ \end{bmatrix} \]
The matrix of second derivatives is called the Hessian matrix. The Hessian is the Jacobian of the gradient.
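The statement that the Hessian is the Jacobian of the gradient can be verified numerically. The toy function below is invented solely for that check: its gradient is differentiated by central finite differences, column by column, and compared with the known Hessian.

```python
import numpy as np

# A toy scalar function of theta = (theta1, theta2) with known derivatives
def f(theta):
    t1, t2 = theta
    return (t1 - 1.0)**2 + t1 * t2 + 2.0 * t2**2

def grad(theta):
    t1, t2 = theta
    return np.array([2.0 * (t1 - 1.0) + t2, t1 + 4.0 * t2])

hess_analytic = np.array([[2.0, 1.0],
                          [1.0, 4.0]])

def hess_numeric(theta, h=1e-6):
    """Hessian as the Jacobian of the gradient, via central finite differences."""
    p = len(theta)
    H = np.zeros((p, p))
    for j in range(p):
        step = np.zeros(p)
        step[j] = h
        H[:, j] = (grad(theta + step) - grad(theta - step)) / (2 * h)
    return H

print(np.allclose(hess_numeric(np.array([0.3, -0.7])), hess_analytic, atol=1e-5))  # True
```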
47.10 Multivariate Gaussian Distribution
Definition
A scalar random variable \(Y\) has a Gaussian distribution function with mean \(\mu\) and variance \(\sigma^{2}\) if its probability density function is given by
\[ f(y) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{ - \frac{1}{2\sigma^{2}}(y - \mu)^{2} \right\} \]
We also say that \(Y\) is normally distributed with mean \(\mu\) and variance \(\sigma^{2}\) but prefer the name Gaussian over Normal for the reasons given in Section 45.5. The shorthand expressions \(Y \sim G(\mu,\sigma^{2})\) or \(Y \sim N(\mu,\sigma^{2})\) are common.
The generalization from a scalar random variable \(Y\) to a random vector \(\textbf{Y}_{(n \times 1)}\) with a multivariate Gaussian distribution is as follows. \(\textbf{Y}_{(n \times 1)}\) has a multivariate Gaussian (normal) distribution with mean \(\boldsymbol{\mu}\) and variance matrix \(\textbf{V}\), if its density is given by
\[ f\left( \textbf{Y}\right)=\frac{\left| \textbf{V}\right|^{- 1/2}}{(2\pi)^{\frac{n}{2}}}\exp\left\{ - \frac{1}{2}\left( \textbf{Y}- \boldsymbol{\mu}\right)^\prime\textbf{V}^{- 1}\left( \textbf{Y}- \boldsymbol{\mu}\right) \right\} \]
This is denoted with the shorthand \(\textbf{Y}\sim G_{n}(\boldsymbol{\mu},\textbf{V})\) or \(\textbf{Y}\sim N_{n}(\boldsymbol{\mu},\textbf{V})\). If the dimension of the distribution is clear from context, the subscript \(n\) can be omitted. A special case is the standard multivariate Gaussian distribution with mean \(\textbf{0}\) and variance matrix \(\textbf{I}\):
\[ f\left( \textbf{Y}\right)=\frac{1}{(2\pi)^{\frac{n}{2}}}\exp\left\{ - \frac{1}{2}\textbf{Y}^{\prime}\textbf{Y}\right\} = \frac{1}{(2\pi)^{\frac{n}{2}}}\exp\left\{ - \frac{1}{2}\sum_{i=1}^{n}y_{i}^{2} \right\} \]
But this is just the product of the \(n\) univariate densities of \(G(0,1)\) random variables:
\[ f\left( \textbf{Y}\right) = f\left( y_{1} \right) \times \cdots \times f\left( y_{n} \right) \]
where
\[ f\left( y_{i} \right) = \frac{1}{(2\pi)^{1/2}}\exp\left\{ - \frac{1}{2}y_{i}^{2} \right\} \]
If the variance matrix is diagonal—that is, the \(Y_{i}\) are uncorrelated—the multivariate normal distribution is the product of the univariate distributions. The random variables are then independent.
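The factorization is easy to verify numerically. The sketch below evaluates the joint density of a Gaussian vector with a diagonal variance matrix (mean, variances, and evaluation point are invented) and compares it to the product of the univariate densities.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Illustrative mean vector and diagonal variance matrix
mu = np.array([0.5, -1.0, 2.0])
variances = np.array([1.0, 0.25, 4.0])
V = np.diag(variances)

y = np.array([0.3, -0.8, 1.5])   # arbitrary point at which to evaluate the density

joint = multivariate_normal(mean=mu, cov=V).pdf(y)
product = np.prod(norm(loc=mu, scale=np.sqrt(variances)).pdf(y))

print(np.allclose(joint, product))   # True: with diagonal V the density factorizes
```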
Properties
Gaussian distributions have amazing (magical) properties.
Linear combinations are Gaussian
For example, a linear combination of Gaussian random variables also follows a Gaussian distribution. Formally, this can be expressed as follows: if \(\textbf{Y}\sim G_{n}\left( \boldsymbol{\mu},\textbf{V}\right)\) and \(\textbf{A}\) and \(\textbf{B}\) are a matrix and vector of constants (not random variables), respectively, then \(\mathbf{AY} + \textbf{B}\) follows a \(G(\textbf{A}\boldsymbol{\mu}+\textbf{B},\textbf{A}\textbf{V}\textbf{A}^\prime)\) distribution.
A special case of this result is that if \(\textbf{Y}\sim G_{n}\left( \boldsymbol{\mu},\textbf{V}\right)\), then \(\textbf{Y}- \boldsymbol{\mu}\) has a \(G\left( \textbf{0},\textbf{V}\right)\) distribution.
Because a linear function of a Gaussian random variable is Gaussian distributed, you can define all multivariate Gaussian distributions as linear transformations of the standard multivariate Gaussian distribution. If \(\textbf{Z}\sim G_{n}(\textbf{0},\textbf{I})\), and \(\textbf{V}= \textbf{C}^\prime\textbf{C}\), then \(\textbf{Y}= \mathbf{C}^\prime\textbf{Z}+ \boldsymbol{\mu}\) has a \(G(\boldsymbol{\mu},\textbf{V})\) distribution.
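This is how multivariate Gaussian draws are typically generated in practice: factor \(\textbf{V}\) (for example with the Cholesky decomposition), transform standard Gaussian draws, and shift by the mean. A minimal sketch with an invented \(\boldsymbol{\mu}\) and \(\textbf{V}\):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative mean vector and positive definite variance matrix
mu = np.array([1.0, -2.0, 0.5])
V = np.array([[2.0, 0.6, 0.3],
              [0.6, 1.0, 0.2],
              [0.3, 0.2, 1.5]])

# Cholesky factor: L is lower triangular with V = L L'. In the notation of the
# text, C = L', so that V = C'C and Y = C'Z + mu = L Z + mu.
L = np.linalg.cholesky(V)

# Transform many standard Gaussian vectors Z ~ G(0, I) into Y ~ G(mu, V)
Z = rng.standard_normal((3, 100_000))
Y = L @ Z + mu[:, None]

print(np.round(Y.mean(axis=1), 2))   # close to mu
print(np.round(np.cov(Y), 2))        # close to V
```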
Zero covariance implies independence
Another unusual property of Gaussian random variables is that if they are uncorrelated, they are also stochastically independent. We derived this above for the special case of \(\textbf{Y}\sim G(\textbf{0},\sigma^{2}\textbf{I})\).
You cannot in general conclude that random variables are independent based on their lack of correlation. For Gaussian random variables you can. This result can be extended to Gaussian random vectors. Suppose \(\textbf{Y}_{(n \times 1)} \sim G(\boldsymbol{\mu},\ \textbf{V})\) is partitioned into two sub-vectors of size \(s\) and \(k\), where \(n = s + k\). Then we can similarly partition the mean vector and variance matrix:
\[ \textbf{Y}_{(n \times 1)} = \begin{bmatrix} \textbf{Y}_{1(s \times 1)} \\ \textbf{Y}_{2(k \times 1)} \end{bmatrix},\qquad \boldsymbol{\mu}= \begin{bmatrix} \boldsymbol{\mu}_{1} \\ \boldsymbol{\mu}_{2} \end{bmatrix},\qquad \textbf{V}= \begin{bmatrix} \textbf{V}_{11} & \textbf{V}_{12} \\ \textbf{V}_{21} & \textbf{V}_{22} \end{bmatrix} \]
If \(\textbf{V}_{12} = \textbf{0}\), then \(\textbf{Y}_{1}\) and \(\textbf{Y}_{2}\) are independent. Also, each partition is Gaussian distributed, for example, \(\textbf{Y}_{1} \sim G(\boldsymbol{\mu}_{1},\ \textbf{V}_{11})\). We call the distribution of \(\textbf{Y}_{1}\) the marginal distribution.
It follows immediately that each element of \(\textbf{Y}\) follows a (univariate) Gaussian distribution, \(Y_{i} \sim G(\mu_{i},V_{ii})\)—all marginal univariate distributions are Gaussian.
Conditionals are Gaussian
The conditional distribution of \(\textbf{Y}_{1}\) given \(\textbf{Y}_{2}\) is also a Gaussian distribution, specifically:
\[ \textbf{Y}_{1}|\textbf{Y}_{2} \sim G\left( \boldsymbol{\mu}_{1}\mathbf{+}\textbf{V}_{12}\textbf{V}_{22}^{- 1}\left( \textbf{Y}_{2} - \boldsymbol{\mu}_{2} \right),\ \textbf{V}_{11} - \textbf{V}_{12}\textbf{V}_{22}^{- 1}\textbf{V}_{12}^\prime \right) \]
This result plays an important role when predicting Gaussian random variables, for example in mixed models.
Notice that the variance matrix of the conditional distribution does not depend on the particular value \(\textbf{Y}_{2} = \textbf{y}_{2}\) on which the distribution is conditioned. However, the mean of the conditional distribution does depend on \(\textbf{Y}_{2}\) unless \(\textbf{V}_{12} = \textbf{0}\), a condition established earlier for independence of \(\textbf{Y}_{1}\) and \(\textbf{Y}_{2}\).
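The conditional mean and variance follow directly from the partitioned \(\boldsymbol{\mu}\) and \(\textbf{V}\). The sketch below uses an invented three-dimensional example with \(s = 1\) and \(k = 2\), and checks the result against the definition of a conditional density, \(f(y_1 \mid \textbf{y}_2) = f(y_1, \textbf{y}_2)/f(\textbf{y}_2)\).

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Invented mean and variance for a 3-dimensional Gaussian, partitioned as
# Y1 = first element (s = 1) and Y2 = last two elements (k = 2)
mu = np.array([1.0, 2.0, 3.0])
V = np.array([[2.0, 0.5, 0.3],
              [0.5, 1.0, 0.2],
              [0.3, 0.2, 1.5]])

mu1, mu2 = mu[0], mu[1:]
V11, V12, V22 = V[0, 0], V[0, 1:], V[1:, 1:]

y2 = np.array([2.4, 2.7])    # arbitrary value to condition on

# Conditional mean and variance of Y1 | Y2 = y2
cond_mean = mu1 + V12 @ np.linalg.solve(V22, y2 - mu2)
cond_var = V11 - V12 @ np.linalg.solve(V22, V12)

# Compare with f(y1 | y2) = f(y1, y2) / f(y2) at an arbitrary y1
y1 = 0.8
ratio = (multivariate_normal(mu, V).pdf(np.r_[y1, y2])
         / multivariate_normal(mu2, V22).pdf(y2))
print(np.allclose(ratio, norm(cond_mean, np.sqrt(cond_var)).pdf(y1)))   # True
```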
Quadratic forms are Chi-square
If \(\textbf{Y}\) is multivariate Gaussian with mean \(\boldsymbol{\mu}\) and variance \(\sigma^2\textbf{I}\), and \(\textbf{P}\) is an \((n \times n)\) projector of rank \(r\), then the quadratic form \[ Q = \frac{1}{\sigma^2} \left(\textbf{Y}-\boldsymbol{\mu}\right)^\prime \textbf{P}\left(\textbf{Y}-\boldsymbol{\mu}\right) \]
follows a chi-square distribution with \(r\) degrees of freedom.
Working out the distribution of sums of squares in the analysis of variance relies heavily on this result.
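A small simulation makes the result tangible. The sketch below builds a projection matrix of rank 3 from an arbitrary matrix, computes the quadratic form for many simulated Gaussian vectors (mean, variance, and dimensions are invented), and compares the simulated moments and quantiles with those of a chi-square distribution with 3 degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)

# A projection matrix: the "hat" matrix of an arbitrary full-rank (10 x 3) matrix X
n, r = 10, 3
X = rng.standard_normal((n, r))
P = X @ np.linalg.solve(X.T @ X, X.T)        # idempotent, rank r = 3

mu = rng.standard_normal(n)                  # arbitrary mean vector
sigma = 2.0                                  # arbitrary standard deviation

# Simulate Y ~ G(mu, sigma^2 I) many times and compute Q for each draw
Y = mu + sigma * rng.standard_normal((20_000, n))
D = Y - mu
Q = np.einsum("ij,jk,ik->i", D, P, D) / sigma**2

print(np.round([Q.mean(), Q.var()], 2))           # close to r = 3 and 2r = 6
print(np.round(np.quantile(Q, [0.5, 0.95]), 2))   # close to the chi-square(3) quantiles below
print(np.round(chi2(r).ppf([0.5, 0.95]), 2))
```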
Maximum Likelihood Estimator
Suppose that we want to find an estimator for \(\boldsymbol{\beta}\) in the linear model \(\textbf{Y}= \textbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}\). If the model errors follow a Gaussian distribution with mean \(\textbf{0}\) and variance \(\textbf{V}\), then \(\textbf{Y}\) follows a Gaussian distribution because it is a linear function of \(\boldsymbol{\epsilon}\). The probability density function of \(\textbf{Y}\) is
\[ f\left( \textbf{Y}\right)=\frac{\left| \textbf{V}\right|^{- 1/2}}{(2\pi)^{\frac{n}{2}}}\exp\left\{ - \frac{1}{2}\left( \textbf{Y}- \textbf{X}\boldsymbol{\beta}\right)^\prime\textbf{V}^{- 1}\left( \textbf{Y}- \textbf{X}\boldsymbol{\beta}\right) \right\} \]
This joint distribution of the data can be used to derive the maximum likelihood estimator (MLE) of \(\boldsymbol{\beta}\). Maximum likelihood estimation considers this as a function of \(\boldsymbol{\beta}\) rather than a function of \(\textbf{Y}\). Maximizing this likelihood function \(\mathcal{l}(\boldsymbol{\beta};\textbf{Y})\) is equivalent to maximizing its logarithm, and working on the log scale is much simpler. The log-likelihood function for this problem is given by
\[ \ln\left\{ \mathcal{l}\left( \boldsymbol{\beta};\textbf{Y}\right)\right\} = l\left( \boldsymbol{\beta};\textbf{Y}\right) = - \frac{1}{2}\ln\left( \left| \textbf{V}\right| \right) - \frac{n}{2}\ln(2\pi) - \frac{1}{2}\left( \textbf{Y}- \textbf{X}\boldsymbol{\beta}\right)^\prime\textbf{V}^{- 1}\left( \textbf{Y}- \textbf{X}\boldsymbol{\beta}\right) \]
Finding the maximum of this function with respect to \(\boldsymbol{\beta}\) is equivalent to minimizing the quadratic form \[ \left( \textbf{Y}- \textbf{X}\boldsymbol{\beta}\right)^\prime\textbf{V}^{-1}\left( \textbf{Y}- \textbf{X}\boldsymbol{\beta}\right) \] with respect to \(\boldsymbol{\beta}\). Applying the results about matrix differentiation from above we obtain
\[\begin{align*} \frac{\partial}{\partial\boldsymbol{\beta}}\left( \textbf{Y}- \textbf{X}\boldsymbol{\beta}\right)^\prime\textbf{V}^{- 1}\left( \textbf{Y}- \textbf{X}\boldsymbol{\beta}\right) &= \frac{\partial}{\partial\boldsymbol{\beta}}\left\{ \textbf{Y}^\prime\textbf{V}^{- 1}\textbf{Y}- 2\textbf{Y}^\prime\textbf{V}^{- 1}\textbf{X}\boldsymbol{\beta}+ \boldsymbol{\beta}^\prime\textbf{X}^\prime\textbf{V}^{- 1}\textbf{X}\boldsymbol{\beta}\right\} \\ &= - 2\textbf{X}^\prime\textbf{V}^{- 1}\textbf{Y}+ 2\textbf{X}^\prime\textbf{V}^{- 1}\textbf{X}\boldsymbol{\beta} \end{align*}\]
The derivative is zero when \(\textbf{X}^\prime\textbf{V}^{- 1}\textbf{Y}= \textbf{X}^\prime\textbf{V}^{- 1}\textbf{X}\boldsymbol{\beta}\).
If \(\textbf{X}\) is of full column rank, then \(\textbf{X}^\prime\textbf{V}^{- 1}\textbf{X}\) is non-singular and its inverse exists. Pre-multiplying both sides of the equation with that inverse yields the solution
\[\begin{align*} \textbf{X}^\prime\textbf{V}^{- 1}\textbf{Y}&= \textbf{X}^\prime\textbf{V}^{- 1}\textbf{X}\widehat{\boldsymbol{\beta}} \\ \left( \textbf{X}^{\prime}\textbf{V}^{- 1}\textbf{X}\right)^{- 1}\textbf{X}^\prime\textbf{V}^{- 1}\textbf{Y}&= {\left( \textbf{X}^{\prime}\textbf{V}^{- 1}\textbf{X}\right)^{- 1}\textbf{X}}^\prime\textbf{V}^{- 1}\textbf{X}\widehat{\boldsymbol{\beta}}\\ \left( \textbf{X}^{\prime}\textbf{V}^{- 1}\textbf{X}\right)^{- 1}\textbf{X}^\prime\textbf{V}^{- 1}\textbf{Y}&= \widehat{\boldsymbol{\beta}} \end{align*}\]
The maximum likelihood estimator of \(\boldsymbol{\beta}\) is the generalized least squares estimator
\[ \widehat{\boldsymbol{\beta}}=\left( \textbf{X}^{\prime}\textbf{V}^{- 1}\textbf{X}\right)^{- 1}\textbf{X}^\prime\textbf{V}^{- 1}\textbf{Y} \]
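In practice the estimator is computed by solving the linear system \(\left(\textbf{X}^\prime\textbf{V}^{-1}\textbf{X}\right)\widehat{\boldsymbol{\beta}} = \textbf{X}^\prime\textbf{V}^{-1}\textbf{Y}\) rather than by forming inverses explicitly. A hedged sketch with simulated data follows; the design, coefficients, and AR(1)-style variance matrix are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated linear model Y = X beta + eps with correlated errors
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])

# An AR(1)-style variance matrix, purely for illustration
rho = 0.6
V = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

eps = np.linalg.cholesky(V) @ rng.standard_normal(n)
Y = X @ beta_true + eps

# Generalized least squares: solve (X'V^{-1}X) beta = X'V^{-1}Y
Vinv_X = np.linalg.solve(V, X)
Vinv_Y = np.linalg.solve(V, Y)
beta_gls = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_Y)

print(np.round(beta_gls, 2))   # close to beta_true
```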
A special case arises when the model errors \(\boldsymbol{\epsilon}\) are uncorrelated. Since the errors are Gaussian distributed, we know that the errors are then independent. The variance matrix \(\textbf{V}\) is then a diagonal matrix
\[ \textbf{V}= \begin{bmatrix} \sigma_{1}^{2} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_{n}^{2} \end{bmatrix} \]
A further special case arises when the diagonal entries are all the same,
\[ \textbf{V}= \begin{bmatrix} \sigma^{2}\ & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma^{2} \end{bmatrix} = \sigma^{2}\textbf{I} \]
We can write the error distribution in this case as \(\boldsymbol{\epsilon}\sim G\left(\textbf{0},\sigma^{2}\textbf{I}\right)\) and the model for \(\textbf{Y}\) as \(\textbf{Y}\sim G\left( \textbf{X}\boldsymbol{\beta},\sigma^{2}\textbf{I}\right)\). This is known as the iid case: the errors are independent and identically distributed. You will encounter the iid assumption a lot in statistical modeling and in machine learning because it allows you to write multivariate joint distributions as products of univariate distributions, and this greatly simplifies matters.
Under the iid assumption for the Gaussian linear model we can substitute \(\sigma^{2}\textbf{I}\) for \(\textbf{V}\) in the formula for \(\widehat{\boldsymbol{\beta}}\) and obtain
\[ \widehat{\boldsymbol{\beta}}=\left( \textbf{X}^{\prime}\textbf{X}\right)^{- 1}\textbf{X}^\prime\textbf{Y} \]
the ordinary least squares estimator. Notice that \(\sigma^{2}\) cancels out of the formula; the value of the OLS estimator does not depend on the intrinsic variability of the data. However, the variability of the OLS estimator does depend on \(\sigma^{2}\) (and on \(\textbf{X}\)).
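A short sketch with simulated iid-error data (coefficients and noise level invented) confirms that the normal-equations formula reproduces the estimate returned by a library least squares routine.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated linear model with iid Gaussian errors
n = 200
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
beta_true = np.array([0.5, -1.0, 2.0])
Y = X @ beta_true + rng.normal(0, 1.5, size=n)

# OLS via the normal equations: solve (X'X) beta = X'Y
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Same estimate from a library least squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(beta_ols, beta_lstsq))   # True
```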
47.11 Sherman, Morrison, Woodbury Formula
This remarkable formula is at the heart of many regression-type diagnostics and cross-validation techniques. A version of this formula was first given by Gauss in 1821. Around 1950, it appeared in papers by Sherman and Morrison (1949) and Woodbury (1950).
Suppose we are in a full-rank linear modeling context with design matrix \(\textbf{X}_{(n \times (p+1))}\), so that the inverse \(\left( \textbf{X}^\prime\textbf{X}\right)^{-1}\) exists. In diagnosing the quality of a model, we are interested in measuring the prediction error for the \(i\)th observation as if the data point had not contributed to the analysis. This is an example of a leave-one-out estimate: remove an observation from the data, redo the analysis, and measure how well the quantity of interest can be computed for the withheld observation.
If you do this in turn for all \(n\) observations, you must fit the model \(n + 1\) times, an overall fit to the training data with \(n\) observations, and \(n\) additional fits with training data sets of size \(n - 1\), leaving out each observation in turn. The computationally expensive part of fitting the linear model is building the cross-product matrix \(\textbf{X}^\prime\textbf{X}\) and computing its inverse \(\left( \textbf{X}^\prime\textbf{X}\right)^{-1}\).
The Sherman-Morrison-Woodbury formula allows us to compute the inverse of the cross-product matrix based on \(\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\) as if the \(i\)th observation had been removed.
Denote as \(\textbf{X}_{-i}\) the design matrix with the \(i\)th observation removed and as \(\textbf{X}_{i}\) the \(i\)th row of \(\textbf{X}\), written as a column vector. Then
\[ \left( \textbf{X}_{-i}^\prime\textbf{X}_{-i} \right)^{- 1} = \left( \textbf{X}^\prime\textbf{X}- \textbf{X}_{i}\textbf{X}_{i}^{\prime} \right)^{-1}\ = \left( \textbf{X}^\prime\textbf{X}\right)^{- 1} + \frac{\left( \textbf{X}^\prime\textbf{X}\right)^{-1}{\textbf{X}_{i}\textbf{X}_{i}^{\prime}\left( \textbf{X}^\prime\textbf{X}\right)}^{- 1}}{1 - \textbf{X}_{i}^{\prime}\left( \textbf{X}^\prime\textbf{X}\right)^{-1}\textbf{X}_{i}} \]
Because of this remarkable result, leave-one-out statistics can be calculated easily, without retraining any models, based on the fit to the full training data alone. Note that the quantity in the denominator on the right-hand side is the \(i\)th diagonal value of \(\textbf{I}- \textbf{H}\), where \(\textbf{H}= \textbf{X}\left(\textbf{X}^\prime\textbf{X}\right)^{-1}\textbf{X}^\prime\) is the hat matrix. If \(h_{ii}\) denotes the \(i\)th diagonal value of \(\textbf{H}\), we can write the update formula as
\[ \left( \textbf{X}^\prime\textbf{X}\right)^{- 1} + \frac{\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}{\textbf{X}_{i}\textbf{X}_{i}^{\prime}\left( \textbf{X}^\prime\textbf{X}\right)}^{- 1}}{1 - h_{ii}} \]
The leverage values \(h_{ii}\) play an important role in the computation of residual, influence, and case-deletion diagnostics in linear models.
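The update formula is easy to verify numerically. The sketch below simulates an arbitrary full-rank design matrix, removes one row, inverts the reduced cross-product matrix directly, and compares the result with the rank-one update based on the full-data \(\left(\textbf{X}^\prime\textbf{X}\right)^{-1}\) and leverage.

```python
import numpy as np

rng = np.random.default_rng(11)

# A full-rank design matrix with intercept, simulated for illustration
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T                      # hat matrix; leverages h_ii on the diagonal

i = 7                                      # arbitrary observation to leave out
xi = X[i]                                  # i-th row of X as a vector
h_ii = H[i, i]

# Sherman-Morrison-Woodbury update for (X_{-i}' X_{-i})^{-1}
update = XtX_inv + (XtX_inv @ np.outer(xi, xi) @ XtX_inv) / (1.0 - h_ii)

# Direct computation after actually deleting the i-th row
X_minus_i = np.delete(X, i, axis=0)
direct = np.linalg.inv(X_minus_i.T @ X_minus_i)

print(np.allclose(update, direct))   # True
```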