3 Linear Algebra Review
3.1 Basics
Command of linear algebra is essential in data science; models and estimators are often expressed in terms of tensors, matrices, and vectors. Scalar-based arithmetic becomes tedious very quickly as models become more complex. For example, the simple linear regression model and the straight-line-through-the-origin model can be written as
\[Y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i}\]
\[Y_{i} = \beta_{1}x_{i} + \epsilon_{i}\]
Using scalar algebra, the estimates of the slope are quite different:
\[{\widehat{\beta}}_{1} = \frac{\left( \sum_{i = 1}^{n}{\left( y_{i} - \overline{y} \right)\left( x_{i} - \overline{x} \right)} \right)}{\sum_{i = 1}^{n}\left( x_{i} - \overline{x} \right)^{2}}\]
\[{\widehat{\beta}}_{1} = \frac{\left( \sum_{i = 1}^{n}{y_{i}x_{i}} \right)}{\sum_{i = 1}^{n}x_{i}^{2}}\]
The formulas get messier as we add another input variable to the model. Using matrix-vector notation, the estimator of all the regression coefficients takes the same form, regardless of the size of the model:
\[\widehat{\boldsymbol{\beta}} = \left( \textbf{X}^{\prime}\textbf{X}\right)^{- 1}\textbf{X}^{\prime}\textbf{Y}\]
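To make the matrix form concrete, here is a minimal numpy sketch (an addition, not part of the original derivation) that computes \(\widehat{\boldsymbol{\beta}} = \left(\textbf{X}^\prime\textbf{X}\right)^{-1}\textbf{X}^\prime\textbf{Y}\) on simulated data; the sample size, coefficient values, and variable names are illustrative assumptions.

```python
# Minimal sketch of the matrix form of the OLS estimator on simulated data.
# The data-generating values (intercept 2, slope 1.5) are illustrative only.
import numpy as np

rng = np.random.default_rng(42)
n = 50
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])          # model matrix with an intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # solves (X'X) b = X'y without forming the inverse
print(beta_hat)                               # roughly [2, 1.5]
```

In practice one solves the normal equations (or uses a routine such as np.linalg.lstsq) rather than forming the inverse explicitly; the explicit inverse is shown in the text because it makes the algebra transparent.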
A scalar is a single real number. A vector is an array of scalars arranged in a single column (a column vector) or a single row (a row vector). A matrix is a two-dimensional array of scalars, and a tensor is a multi-dimensional array.
The order of a vector or matrix is specified as (rows x columns) and is sometimes used as a subscript for clarity. For example, \(\textbf{A}_{(3 \times 5)}\) denotes a matrix with 3 rows and 5 columns. It can be viewed as a concatenation of five \((3 \times 1)\) column vectors:
\[\textbf{A}_{(3 \times 5)}=\begin{bmatrix} 1 & 9.0 & 6.2 & 1 & 0 \\ 1 & 3.2 & 1.4 & 0 & 1 \\ 1 & 4.1 & -0.6 & 0 & 0 \end{bmatrix}\]
\(\textbf{a}_{1} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}\ \ \ \ \ \textbf{a}_{2} = \begin{bmatrix} 9.0 \\ 3.2 \\ 4.1 \end{bmatrix}\ \ \ \ \textbf{a}_{3} = \begin{bmatrix} 6.2 \\ 1.4 \\ - 0.6 \end{bmatrix}\ \ \ \ \ \textbf{a}_{4} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}\ \ \ \ \ \textbf{a}_{5} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}\).
A matrix with as many rows as columns is called a square matrix.
Bold symbols are common: lowercase for vectors and uppercase for matrices, with some exceptions. When dealing with random variables, bold uppercase notation is used for a vector of random variables and bold lowercase notation for a vector of the realized values. For example, if \(Y_{1},\cdots,Y_{n}\) is a random sample of size \(n\), the vector of random variables is
\[\textbf{Y}_{(n \times 1)} = \begin{bmatrix} Y_{1} \\ \vdots \\ Y_{n} \end{bmatrix}\]
and the vector of realized values is
\[\textbf{y}_{(n \times 1)} = \begin{bmatrix} y_{1} \\ \vdots \\ y_{n} \end{bmatrix}\]
The difference is significant because \(\textbf{Y}\) is a random vector and \(\textbf{y}\) is a vector of constants. \(\textbf{Y}\) has a multivariate distribution with a mean and a variance; \(\textbf{y}\) is just a vector of numbers.
We follow the convention that all vectors are column vectors, so that \(\textbf{y}_{(n)}\) serves as a shorthand for \(\textbf{y}_{(n \times 1)}\).
The typical element of a matrix is written as a scalar with subscripts that refer to rows and columns. For example, the statement
\[\textbf{A}= \left\lbrack a_{ij} \right\rbrack\]
says that matrix \(\textbf{A}\) consists of the scalars \(a_{ij}\); for example, \(a_{23}\) is the scalar in row 2, column 3 of the matrix.
3.2 Special Matrices
Special Values
A few special matrices and vectors, common in statistics and machine learning, are:
\(\textbf{1}_{n} = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}\), the unit vector of size \(n\); all its elements are 1.
\(\textbf{0}_{n} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}\), the zero vector of size \(n\); all its elements are 0.
\(\textbf{0}_{(n \times k)} = \begin{bmatrix} 0 & 0 & 0 \\ \vdots & \cdots & \vdots \\ 0 & 0 & 0 \end{bmatrix}\), the zero matrix of order \((n \times k)\). All its elements are 0.
\(\textbf{J}_{(n \times k)} = \begin{bmatrix} 1 & 1 & 1 \\ \vdots & \cdots & \vdots \\ 1 & 1 & 1 \end{bmatrix}\), the unit matrix of size \((n \times k)\). All its elements are 1.
\(\textbf{I}_{n} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & 1 \end{bmatrix}\), the identity matrix of size \((n \times n)\) with 1s on the diagonal and 0s elsewhere.
If the order of these matrices is obvious from the context, the subscripts can be omitted.
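If it helps to see these objects in code, here is a small numpy sketch (an addition, not from the text) that builds them; numpy is an assumed tool here, and the dimensions are arbitrary.

```python
# Building the special vectors and matrices above with numpy (illustrative).
import numpy as np

n, k = 3, 4
ones_vec = np.ones(n)       # 1_n, the unit vector
zero_vec = np.zeros(n)      # 0_n, the zero vector
Z = np.zeros((n, k))        # 0_(n x k), the zero matrix
J = np.ones((n, k))         # J_(n x k), the unit matrix (all ones)
I = np.eye(n)               # I_n, the identity matrix
```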
Special Patterns
Patterned matrices play an important role in statistical operations. By recognizing a particular structure of a matrix, we can reduce the amount of work required to find inverses, eigenvalues, determinants, etc. For example, if \(\textbf{D}\) is a diagonal matrix with \(d_{ii} \ne 0\), then the inverse \(\textbf{D}^{-1}\) can be found by simply replacing \(d_{ii}\) with \(1/d_{ii}\). Finding the inverse of a general matrix is a much more elaborate task.
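A quick numeric check of the diagonal-inverse shortcut, as a hedged numpy sketch (the matrix values are arbitrary):

```python
# Inverting a diagonal matrix by taking reciprocals of the diagonal entries.
import numpy as np

D = np.diag([2.0, 5.0, 0.5])
D_inv = np.diag(1.0 / np.diag(D))             # replace d_ii with 1/d_ii
print(np.allclose(D_inv, np.linalg.inv(D)))   # True
```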
Performing operations on patterned matrices can lead to patterns in the result of the operations. An example is the multiplication of partitioned matrices.
A matrix with the same number of rows and columns is a square matrix.
Square matrices for which the typical element satisfies \(a_{ij} = a_{ji}\) are symmetric matrices. For example, the matrix \[ \begin{bmatrix} 54 & 37 & 5 \\ 37 & 27 & 4 \\ 5 & 4 & 21 \end{bmatrix} \] is a symmetric matrix; the values in the off-diagonal cells are mirrored across the diagonal. Symmetric matrices play an important role in statistics; cross-product, variance-covariance, “hat”, and correlation matrices are examples.
A diagonal matrix is a symmetric matrix where all off-diagonal entries are zero. The identity matrix \(\textbf{I}_{n}\) is diagonal, as is the following \[ \begin{bmatrix} 54 & 0 & 0 \\ 0 & 27 & 0 \\ 0 & 0 & 21 \end{bmatrix} \]
A triangular matrix is a square matrix with zeros above (lower triangular) or below (upper triangular) the diagonal. For example, \[ \textbf{A}_u = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a_{22} & a_{23} \\ 0 & 0 & a_{33} \end{bmatrix} \] is upper-triangular and \[ \textbf{A}_l = \begin{bmatrix} a_{11} & 0 & 0 \\ a_{21} & a_{22} & 0 \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \] is lower-triangular. Lower and upper triangular matrices play an important role in solving linear systems by LU decomposition. The procedure derives its name from factorizing a matrix into a lower (L) and upper (U) triangular form.
A matrix is said to be partitioned, if it can be arranged in separate blocks of sub-matrices. Most important in statistics and data science are partitions of square matrices: \[ \textbf{B}_{(n \times n)} = \begin{bmatrix} \textbf{B}_{11} & \textbf{B}_{12} \\ \textbf{B}_{21} & \textbf{B}_{22} \end{bmatrix} \]
The sizes of the \(B_{ij}\) sub-matrices are \((n_i \times n_j)\). An important case of partitioning occurs in the variance-covariance matrix of two random vectors (more on this below). Briefly, if \(\textbf{Y}= [\textbf{Y}_1, \textbf{Y}_2]^\prime\), then the variance matrix of \(\textbf{Y}\) is partitioned as follows: \[ \text{Var}[\textbf{Y}] = \begin{bmatrix} \text{Var}[\textbf{Y}_1] & \text{Cov}[\textbf{Y}_1,\textbf{Y}_2] \\ \text{Cov}[\textbf{Y}_2,\textbf{Y}_1] & \text{Var}[\textbf{Y}_2] \end{bmatrix} \]
A partitioned matrix is block-diagonal, if the off-diagonal sub-matrices are zero matrices. This occurs with the variance matrix of \(\textbf{Y}= [\textbf{Y}_1, \textbf{Y}_2]^\prime\), if \(\textbf{Y}_1\) and \(\textbf{Y}_2\) are uncorrelated: \[ \text{Var}[\textbf{Y}] = \begin{bmatrix} \text{Var}[\textbf{Y}_1] & \textbf{0}\\ \textbf{0}& \text{Var}[\textbf{Y}_2] \end{bmatrix} \]
3.3 Basic Operations on Matrices and Vectors
The basic operations on matrices and vectors are addition, subtraction, multiplication, transposition, and inversion. These are standard operations in manipulating matrix and vector equations. Decompositions such as Cholesky roots, eigenvalue and singular value decompositions are more advanced operations that are important in solving estimation problems in statistics.
Transpose
The transpose of a matrix is obtained by exchanging rows and columns. If \(a_{ij}\) is the element in row \(i\), column \(j\) of matrix \(\textbf{A}\), the transpose of \(\textbf{A}\), denoted \(\textbf{A}^\prime\), has typical element \(a_{ji}\). In the case of the \((3 \times 5)\) matrix shown previously, the transpose is
\[\textbf{A}^\prime_{(5 \times 3)} = \begin{bmatrix} 1 & 1 & 1 \\ 9.0 & 3.2 & 4.1 \\ 6.2 & 1.4 & -0.6 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}\]
The transpose of a column vector is a row vector:
\[\textbf{a}^{\prime} = \begin{bmatrix} a_{1} \\ \vdots \\ a_{n} \end{bmatrix}^\prime = \left\lbrack a_{1},\cdots,a_{n} \right\rbrack\]
Transposing a transpose produces the original matrix, \(\left( \textbf{A}^{\prime} \right)^{\prime}\ = \textbf{A}\).
A matrix is symmetric if it is equal to its transpose, \(\textbf{A}^\prime = \textbf{A}\). Symmetric matrices are square matrices (same numbers of rows and columns). The matrices \(\textbf{A}^\prime\textbf{A}\) and \(\textbf{A}\textbf{A}^\prime\) are always symmetric. A symmetric matrix whose off-diagonal elements are zero is called a diagonal matrix.
Addition and Subtraction
The sum (difference) of two matrices is the matrix of the elementwise sums (differences) of their elements. These operations require that the matrices being summed or subtracted have the same order:
\[\textbf{A}_{(n \times k)} + \textbf{B}_{(n \times k)} = \left\lbrack a_{ij} + b_{ij} \right\rbrack\]
\[\textbf{A}_{(n \times k)} - \textbf{B}_{(n \times k)} = \left\lbrack a_{ij} - b_{ij} \right\rbrack\]
Suppose, for example, that \(\textbf{A}= \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}\) and \(\textbf{B}=\begin{bmatrix} - 1 & - 2 & - 3 \\ - 4 & - 5 & - 6 \end{bmatrix}\). Then,
\[\textbf{A}+ \textbf{B}= \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}\]
\[\textbf{A}- \textbf{B}= \begin{bmatrix} 2 & 4 & 6 \\ 8 & 10 & 12 \end{bmatrix}\]
Since addition and subtraction are elementwise operations, they can be combined with transposition:
\[\left( \textbf{A}+ \textbf{B}\right)^\prime = \textbf{A}^{\prime} + \textbf{B}^{\prime}\]
Multiplication
Two matrices conform for addition (subtraction) if they have the same order, that is, the same number of rows and columns. Multiplication of matrices requires a different type of conformity; two matrices \(\textbf{A}\) and \(\textbf{B}\) can be multiplied as \(\textbf{A}\textbf{B}\) (or \(\textbf{A}\times\textbf{B}\)) if the number of columns in \(\textbf{A}\) equals the number of rows in \(\textbf{B}\). We say that in the product \(\textbf{A}\times\textbf{B}\), \(\textbf{A}\) is post-multiplied by \(\textbf{B}\) or that \(\textbf{A}\) is multiplied into \(\textbf{B}\). The result of multiplying an \((n \times k)\) matrix into a \((k \times p)\) matrix is an \((n \times p)\) matrix.
Before examining the typical elements in the result of multiplication, let’s look at a special case. The inner product of two \((k \times 1)\) vectors \(\textbf{a}\) and \(\textbf{b}\), also called the dot product or the scalar product, is the result of multiplying the transpose of \(\textbf{a}\) into \(\textbf{b}\); it is a scalar value:
\[\textbf{a}^\prime\textbf{b}= \left\lbrack a_{1}, \cdots,a_{k} \right\rbrack\begin{bmatrix} b_{1} \\ \vdots \\ b_{k} \end{bmatrix} = a_{1}b_{1} + \cdots + a_{k}b_{k} = \sum_{i = 1}^{k}{a_{i}b_{i}}\]
The square root of the dot product of a vector with itself is the Euclidean \({(L}_{2})\) norm of the vector,
\[\left| \left| \textbf{a}\right| \right| = \sqrt{\textbf{a}^\prime\textbf{a}} = \sqrt{\sum_{i = 1}^{k}a_{i}^{2}}\]
The \(L_{2}\) norm plays an important role as a loss function in statistical models. The vector for which the norm is calculated is then often a vector of model errors.
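As a brief illustration (a numpy sketch added here, with arbitrary vectors), the inner product and the \(L_2\) norm can be computed as follows:

```python
# Inner (dot) product and the Euclidean (L2) norm of a vector.
import numpy as np

a = np.array([1.0, 2.0, 2.0])
b = np.array([3.0, 0.0, 4.0])

print(a @ b)              # a'b = 1*3 + 2*0 + 2*4 = 11
print(np.sqrt(a @ a))     # ||a|| = sqrt(9) = 3
print(np.linalg.norm(a))  # the same norm via numpy's helper
```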
Now let’s return to the problem of multiplying the \((n \times k)\) matrix \(\textbf{A}\) into the \((k \times p)\) matrix \(\textbf{B}\) and introduce one more piece of notation: the \(i\)th row of \(\textbf{A}\), written as a column vector, is denoted \(\boldsymbol{\alpha}_{i}\) and the \(j\)th column of \(\textbf{B}\) is denoted \(\textbf{b}_{j}\). Now we can finally write the product \(\textbf{A}\times\textbf{B}\) as a matrix whose typical element is the inner product of \(\boldsymbol{\alpha}_{i}\) and \(\textbf{b}_{j}\):
\[\textbf{A}_{(n \times k)} \times \textbf{B}_{(k \times p)} = \left\lbrack \boldsymbol{\alpha}_{i}^\prime\ \textbf{b}_{j} \right\rbrack_{(n \times p)}\ \]
As an example, let \(\textbf{A}= \begin{bmatrix} 1 & 2 & 0 \\ 3 & 1 & - 3 \\ 4 & 1 & 2 \end{bmatrix}\) and \(\textbf{B}= \begin{bmatrix} 1 & 0 \\ 2 & 3 \\ 2 & 1 \end{bmatrix}\). The product \(\textbf{A}\times\textbf{B}\) is a \((3 \times 2)\) matrix with elements
\[\textbf{A}\times\textbf{B}= \begin{bmatrix} 1 \times 1 + 2 \times 2 + 0 \times 2 & 1 \times 0 + 2 \times 3 + 0 \times 1 \\ 3 \times 1 + 1 \times 2 - 3 \times 2 & 3 \times 0 + 1 \times 3 - 3 \times 1 \\ 4 \times 1 + 1 \times 2 + 2 \times 2 & 4 \times 0 + 1 \times 3 + 2 \times 1 \end{bmatrix} = \begin{bmatrix} 5 & 6 \\ - 1 & 0 \\ 10 & 5 \end{bmatrix}\]
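The same product can be checked with a short numpy sketch (added for illustration):

```python
# Reproducing the matrix multiplication example above.
import numpy as np

A = np.array([[1, 2, 0],
              [3, 1, -3],
              [4, 1, 2]])
B = np.array([[1, 0],
              [2, 3],
              [2, 1]])
print(A @ B)
# [[ 5  6]
#  [-1  0]
#  [10  5]]
```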
Here are a few helpful rules for matrix multiplication:
\(c\textbf{A}= \left\lbrack ca_{ij} \right\rbrack\)
\(c\left( \textbf{A}+ \textbf{B}\right) = c\textbf{A}+ c\textbf{B}\)
\(\textbf{C}\left( \textbf{A}+ \textbf{B}\right) = \textbf{C}\textbf{A}+ \textbf{C}\textbf{B}\)
\(\left( \textbf{A}\textbf{B}\right)\textbf{C}= \textbf{A}(\textbf{B}\textbf{C})\)
\(\left( \textbf{A}+ \textbf{B}\right)\left( \textbf{C}+ \mathbf{D} \right) = \textbf{A}\textbf{C}+ \textbf{A}\mathbf{D} + \mathbf{BC} + \mathbf{BD}\)
\(\left( \textbf{A}\textbf{B}\right)^\prime = \textbf{B}^\prime\textbf{A}^\prime\)
\(\left( c\textbf{A}\right)^\prime = c\textbf{A}^\prime\)
Inversion and Rank
In scalar algebra, division and multiplication are inverse operations; dividing a non-zero scalar by itself yields the multiplicative identity: \(\frac{a}{a} = 1\). What is the equivalent of this operation for matrices? First, inversion of a matrix does not reduce it to a scalar; the multiplicative identity for matrices is the identity matrix \(\textbf{I}\), a diagonal matrix with 1s on the diagonal. Second, inversion is defined only for square matrices. If \(\textbf{A}\) is an \((n \times n)\) matrix, the matrix \(\textbf{B}\) for which
\[ \textbf{A}\textbf{B}= \textbf{I} \]
is called the inverse of \(\textbf{A}\), denoted as \(\textbf{A}^{- 1}\).
Full rank matrices
Inverse matrices do not have to exist, even for square matrices. If \(\textbf{A}\) has an inverse matrix, then \(\textbf{A}\) is called a nonsingular matrix. In that case, \(\textbf{A}^{- 1}\textbf{A}= \textbf{A}\textbf{A}^{- 1} = \textbf{I}\).
For the inverse of a square matrix to exist, that is, for the matrix to be nonsingular, the matrix must be of full rank. The rank of a matrix, denoted \(r(\textbf{A})\), is the number of its linearly independent columns. What does that mean? Suppose we are dealing with an \((n \times k)\) matrix \(\textbf{B}\) and its column vectors are \(\textbf{b}_{1},\cdots,\textbf{b}_{k}\). A linear combination of the columns of \(\textbf{B}\) is
\[ c_{1}\textbf{b}_{1} + c_{2}\textbf{b}_{2} + \cdots + c_{k}\textbf{b}_{k} = \textbf{q} \]
If you can find a set of scalars \(c_{1},\cdots,c_{k}\), not all zero, such that \(\textbf{q} = \textbf{0}\), then the columns of \(\textbf{B}\) are linearly dependent. If the only set of scalars that yields \(\textbf{q} = \textbf{0}\) is
\[c_{1} = c_{2} = \cdots = c_{k} = 0\]
then the columns of \(\textbf{B}\) are linearly independent and the rank of \(\textbf{B}\) is \(k\).
Here are a few more useful results about the rank of a matrix:
\(r\left( \textbf{A}\right) = r\left( \textbf{A}^\prime \right) = r\left( \textbf{A}^\prime\textbf{A}\right) = r\left( \textbf{A}\textbf{A}^{\prime} \right)\)
\(r\left( \textbf{A}\textbf{B}\right) \leq \min\left\{ r\left( \textbf{A}\right),r\left( \textbf{B}\right) \right\}\)
\(r\left( \textbf{A}+ \textbf{B}\right) \leq r\left( \textbf{A}\right) + r(\textbf{B})\)
The first two results are particularly important in statistical models. In models with linear structures, it is common to collect the \(p\) input variables in a linear model, including the intercept as a column of ones, into a matrix \(\textbf{X}_{(n\ \times p + 1)}\):
\[ \textbf{X}_{(n\ \times p + 1)} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix} \]
Suppose we want to solve the linear system \(\textbf{Y}= \textbf{X}\textbf{c}\) for \(\textbf{c}\). Start by pre-multiplying both sides of the equation with the transpose of \(\textbf{X}\):
\[ \textbf{X}^{\prime}\textbf{Y}= \textbf{X}^{\prime}\textbf{X}\textbf{c} \]
If an inverse of \(\textbf{X}^\prime\textbf{X}\) exists, we can pre-multiply both sides with that inverse and isolate \(\textbf{c}\):
\[ \left( \textbf{X}^\prime\textbf{X}\right)^{\mathbf{- 1}}\textbf{X}^{\prime}\textbf{Y}= \left( \textbf{X}^\prime\textbf{X}\right)^{-1}\textbf{X}^\prime\textbf{X}\textbf{c}= \textbf{I}\textbf{c}= \textbf{c} \]
We have a solution to the system, namely \(\textbf{c}=\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\textbf{X}^{\prime}\textbf{Y}\), only if the inverse \(\left( \textbf{X}^\prime\textbf{X}\right)^{-1}\) exists. And that requires the \((p + 1) \times (p + 1)\) matrix \(\textbf{X}^\prime\textbf{X}\) to be of full rank, \(r\left( \textbf{X}^\prime\textbf{X}\right) = p + 1\). This, in turn, is equivalent to saying that \(\textbf{X}\) has full column rank \(p + 1\), because of the first rank property above.
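A hedged numpy sketch of this argument on simulated data (the dimensions and the use of np.linalg.matrix_rank and np.linalg.solve are illustrative choices, not prescribed by the text):

```python
# Check that X has full column rank and solve the normal equations X'X c = X'y.
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept plus p inputs
y = rng.normal(size=n)

print(np.linalg.matrix_rank(X))          # p + 1 = 3 when X has full column rank
c = np.linalg.solve(X.T @ X, X.T @ y)    # the solution (X'X)^{-1} X'y
print(c)
```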
Here are some useful results about inverse matrices:
\(\left( \textbf{A}^{- 1} \right)^\prime = \left( \textbf{A}^\prime \right)^{- 1}\)
\(\left( \textbf{A}^{- 1} \right)^{- 1} = \textbf{A}\)
\(\left( \textbf{A}\textbf{B}\right)^{- 1} = \textbf{B}^{-1}\textbf{A}^{-1}\)
\(r\left( \textbf{A}^{- 1} \right) = r(\textbf{A})\)
Inverse of partitioned matrices
Suppose that \(\textbf{B}_{(n \times n)}\) is a partitioned matrix \[ \textbf{B}_{(n \times n)} = \begin{bmatrix} \textbf{B}_{11} & \textbf{B}_{12} \\ \textbf{B}_{21} & \textbf{B}_{22} \end{bmatrix} \]
where \(\textbf{B}_{ij}\) is of dimension \(n_i \times n_j\). If \(\textbf{A}= \textbf{B}^{-1}\), then \(\textbf{A}\) is also a partitioned matrix with \[ \begin{align*} \textbf{A}_{11} &= \left[\textbf{B}_{11} - \textbf{B}_{12}\textbf{B}_{22}^{-1}\textbf{B}_{21} \right]^{-1} \\ \textbf{A}_{12} &= -\textbf{B}_{11}^{-1}\textbf{B}_{12}\left[\textbf{B}_{22} - \textbf{B}_{21}\textbf{B}_{11}^{-1}\textbf{B}_{12}\right]^{-1} = -\textbf{B}_{11}^{-1}\textbf{B}_{12}\textbf{A}_{22} \\ \textbf{A}_{22} &= \left[\textbf{B}_{22} - \textbf{B}_{21}\textbf{B}_{11}^{-1}\textbf{B}_{12}\right]^{-1} \\ \textbf{A}_{21} &= -\textbf{B}_{22}^{-1}\textbf{B}_{21}\left[\textbf{B}_{11} - \textbf{B}_{12}\textbf{B}_{22}^{-1}\textbf{B}_{21}\right]^{-1} = -\textbf{B}_{22}^{-1}\textbf{B}_{21}\textbf{A}_{11} \end{align*} \]
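These formulas are easy to verify numerically; the following numpy sketch (an illustration with a randomly generated positive definite \(\textbf{B}\)) compares the block formulas with a direct inverse:

```python
# Numerical check of the partitioned-inverse formulas.
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 2, 3
B = rng.normal(size=(n1 + n2, n1 + n2))
B = B @ B.T + np.eye(n1 + n2)            # symmetric positive definite, hence nonsingular
B11, B12 = B[:n1, :n1], B[:n1, n1:]
B21, B22 = B[n1:, :n1], B[n1:, n1:]
inv = np.linalg.inv

A11 = inv(B11 - B12 @ inv(B22) @ B21)
A22 = inv(B22 - B21 @ inv(B11) @ B12)
A12 = -inv(B11) @ B12 @ A22
A21 = -inv(B22) @ B21 @ A11

A = np.block([[A11, A12], [A21, A22]])
print(np.allclose(A, np.linalg.inv(B)))  # True
```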
Rank-deficient matrices
The linear system \(\textbf{Y}= \textbf{X}\textbf{c}\) has a unique solution if \(\textbf{X}\) is an \((n\times n)\) nonsingular matrix. If \(\textbf{X}\) is not square or if \(\textbf{X}\) is singular, then we can still find a solution to the linear system, but it is not unique. If the matrix \(\textbf{X}\) is of less than full rank (singular), it is called a rank-deficient matrix. We can then make progress by using a generalized inverse matrix instead of the (non-existing) inverse.
Also called pseudo inverses or g-inverses, the generalized inverse of matrix \(\textbf{X}\) is denoted \(\textbf{X}^{-}\) and satisfies
\[ \textbf{X}\textbf{X}^{-}\textbf{X}= \textbf{X} \] Suppose we can find such a generalized inverse \(\left( \textbf{X}^{\prime}\textbf{X}\right)^{-}\) of \(\textbf{X}^\prime\textbf{X}\). What if we use that in the solution of the linear system,
\[ \textbf{c}= {\left( \textbf{X}^{\prime}\textbf{X}\right)^{-}\textbf{X}}^{\prime}\textbf{Y} \]
Unfortunately, whereas regular inverses are unique, there are (infinitely) many generalized inverses that satisfy the condition \((\textbf{X}^\prime\textbf{X})\left( \textbf{X}^\prime\textbf{X}\right)^{-}\textbf{X}^\prime\textbf{X}= \textbf{X}^\prime\textbf{X}\). So, there will be infinitely many possible solutions to the linear system. Fortunately, it turns out that generalized inverses have some nice properties; for example, \(\textbf{X}\left( \textbf{X}^\prime\textbf{X}\right)^{-}\textbf{X}^\prime\) is invariant to the choice of the generalized inverse. Even if the solution \(\textbf{c}\) is not unique, \(\textbf{X}\textbf{c}\) is unique. This result is important in linear models with rank-deficient design matrices, a condition that is common when the model contains classification variables. While the parameter estimates in such a model are not unique—because we need to use a generalized inverse to derive the estimates—the predicted values are the same, no matter which generalized inverse we selected.
Here are some properties of generalized inverses; they resemble the properties of the inverse:
If \(c\) is nonzero, then \((c\textbf{A})^- = (1/c)\textbf{A}^-\)
\(\textbf{A}\textbf{A}^-\textbf{A}= \textbf{A}\text{ and } \textbf{A}^-\textbf{A}\textbf{A}^- = \textbf{A}^-\)
\(\left(\textbf{A}^\prime\right)^- = \left(\textbf{A}^-\right)^\prime\)
\(\left(\textbf{A}^- \right)^- = \textbf{A}\)
\(r(\textbf{A}^-) = r(\textbf{A}) = r(\textbf{A}\textbf{A}^-) = r(\textbf{A}\textbf{A}^-\textbf{A})\)
If \(\textbf{A}\) is symmetric, so is \(\textbf{A}^-\)
A technique for constructing generalized inverses relies on the singular value decomposition and is covered below. For a diagonal matrix \(\textbf{D}\), constructing a generalized inverse is simple: replace the non-zero diagonal values \(d_{ii}\) with \(1/d_{ii}\), and leave the zero diagonal values in place.
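As a small sketch (illustrative values), the recipe for a diagonal matrix and the defining condition can be checked with numpy:

```python
# A generalized inverse of a diagonal matrix with a zero on the diagonal.
import numpy as np

d = np.array([2.0, 0.0, 5.0])
D = np.diag(d)

d_ginv = np.zeros_like(d)
nz = d != 0
d_ginv[nz] = 1.0 / d[nz]                 # invert only the nonzero diagonal values
D_ginv = np.diag(d_ginv)

print(np.allclose(D @ D_ginv @ D, D))    # the defining condition D D^- D = D holds
```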
Determinant
The rank reduces a matrix to a single scalar value, the number of linearly independent columns of the matrix. Another value that reduces a square matrix to a single scalar is the determinant, written as \(det(\textbf{A})\) or \(|\textbf{A}|\).
If a matrix \(\textbf{A}\) has an inverse, then it is square and its determinant is nonzero.
The determinant has a geometric interpretation that is not that relevant for our discussion. What matters is that the determinant appears frequently in expressions for multivariate probability distributions, so it pays to know how to manipulate determinants:
\(|\textbf{A}| = |\textbf{A}^\prime|\)
\(|\textbf{I}| = 1\)
\(\left| c\textbf{A}\right| = c^{n}\left|\textbf{A}\right|\)
If \(\textbf{A}\) is singular, then \(\left| \textbf{A}\right| = 0\)
If each element of a row (column) of \(\textbf{A}\) is zero, then \(\left| \textbf{A}\right| = 0\)
If two rows (columns) of \(\textbf{A}\) are identical, then \(\left| \textbf{A}\right| = 0\)
\(\left| \textbf{A}\textbf{B}\right| = \left| \textbf{A}\right|\ \left| \textbf{B}\right|\)
\(\left| \textbf{A}^{- 1} \right| = 1/|\textbf{A}|\)
If \(\textbf{A}\) is a triangular matrix, then \(|\textbf{A}| = \prod_{i = 1}^{n}a_{ii}\)
If \(\textbf{A}\) is a diagonal matrix, then \(|\textbf{A}| = \prod_{i = 1}^{n}a_{ii}\)
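Two of these rules, checked numerically in a short numpy sketch (the matrices are arbitrary examples):

```python
# Determinant of a triangular matrix and the product rule |AB| = |A||B|.
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [0.0, 3.0, 4.0],
              [0.0, 0.0, 5.0]])          # upper triangular
print(np.isclose(np.linalg.det(A), 2 * 3 * 5))   # product of the diagonal entries

B = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [3.0, 0.0, 1.0]])
print(np.isclose(np.linalg.det(A @ B),
                 np.linalg.det(A) * np.linalg.det(B)))  # |AB| = |A||B|
```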
Trace
The trace operator, \(tr(\textbf{A})\), applies only to square matrices. The trace of an \(\textbf{A}_{(n \times n)}\) matrix is the sum of its diagonal elements:
\[tr\left( \textbf{A}\right) = \sum_{i = 1}^{n}a_{ii}\]
The trace plays an important role in statistics in determining expected values of quadratic forms of random variables, for example, sums of squares in linear models. An important property of the trace is its invariance under cyclic permutations,
\[tr\left( \mathbf{ABC} \right) = tr\left( \mathbf{BCA} \right) = tr(\mathbf{CAB})\]
provided the matrices conform to multiplication.
Some other useful properties of the trace are
\(tr\left( \textbf{A}+ \textbf{B}\right) = tr\left( \textbf{A}\right) + tr\left( \textbf{B}\right)\)
\(tr\left( \textbf{A}\right) = tr\left( \textbf{A}^\prime \right)\)
\(\textbf{y}^\prime\textbf{A}\textbf{y}= tr\left( \textbf{y}^\prime\textbf{A}\textbf{y}\right)\)
\(tr\left( c\textbf{A}\right) = c \times tr\left( \textbf{A}\right)\)
\(tr\left( \textbf{A}\right) = r(\textbf{A})\) if \(\textbf{A}\) is symmetric and idempotent (\(\textbf{A}\textbf{A}= \textbf{A}\) and \(\textbf{A}= \textbf{A}^\prime\))
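The cyclic-permutation property is easy to verify numerically; here is an illustrative numpy sketch with conformable random matrices:

```python
# Invariance of the trace under cyclic permutations.
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(4, 2))

t1 = np.trace(A @ B @ C)
t2 = np.trace(B @ C @ A)
t3 = np.trace(C @ A @ B)
print(np.isclose(t1, t2) and np.isclose(t2, t3))  # True
```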
3.4 Random Vectors
If the elements of a vector are random variables, the vector object itself is a random variable. You can think of random vectors as a convenient mechanism to collect random variables. Suppose we draw a random sample \(Y_{1},\cdots,Y_{n}\), then we can collect the \(n\) random variables in a single random vector
\[\textbf{Y}= \begin{bmatrix} Y_{1} \\ \vdots \\ Y_{n} \end{bmatrix}\]
Expected Value
Since each \(Y_{i}\) has a probability distribution, a mean (expected value) \(\text{E}\left\lbrack Y_{i} \right\rbrack\), a variance \(\text{Var}\left\lbrack Y_{i} \right\rbrack\), and so forth, the same applies to their collection. The expected value (mean) of a random vector is the vector of the expected values of its elements:
\[\text{E}\left\lbrack \textbf{Y}\right\rbrack = \begin{bmatrix} \text{E}\left\lbrack Y_{1} \right\rbrack \\ \vdots \\ \text{E}\left\lbrack Y_{n} \right\rbrack \end{bmatrix}\]
Suppose that \(\textbf{A}\), \(\textbf{B}\), \(\textbf{C}\) are matrices of constants, \(\textbf{c}\) is a vector of constants, and \(\textbf{Y}\) and \(\mathbf{U}\) are random vectors. The following are useful expectation operations in this situation:
\(\text{E}\left\lbrack \textbf{A}\right\rbrack = \textbf{A}\)
\(\text{E}\left\lbrack \mathbf{AYB} + \mathbf{C} \right\rbrack = \textbf{A}\text{E}\left\lbrack \textbf{Y}\right\rbrack\textbf{B}+ \textbf{C}\)
\(\text{E}\left\lbrack \mathbf{AY} + \mathbf{c} \right\rbrack = \textbf{A}\text{E}\left\lbrack \textbf{Y}\right\rbrack + \textbf{c}\)
\(\text{E}\left\lbrack \mathbf{AY} + \mathbf{BU} \right\rbrack = \textbf{A}\text{E}\left\lbrack \textbf{Y}\right\rbrack + \textbf{B}\ \text{E}\lbrack\mathbf{U}\rbrack\)
Covariance Matrix
While the distribution of \(Y_{i}\) is univariate, \(\textbf{Y}\) has a multivariate (\(n\)-variate) distribution. The mean of the distribution is represented by a vector. The variance of the distribution is represented by a matrix, the variance-covariance matrix, a special case of a covariance matrix.
The covariance matrix between random vectors \(\textbf{Y}_{(k \times 1)}\) and \(\mathbf{U}_{(p \times 1)}\) is a \((k \times p)\) matrix whose typical elements are the covariances between the elements of \(\textbf{Y}\) and \(\mathbf{U}\):
\[\text{Cov}\left\lbrack \textbf{Y},\mathbf{U} \right\rbrack = \left\lbrack \text{Cov}(Y_{i},U_{j}) \right\rbrack\]
The covariance matrix can be written in terms of expected values of \(\textbf{Y}\), \(\mathbf{U}\), and \(\textbf{Y}\mathbf{U}^\prime\)
\[\text{Cov}\left\lbrack \textbf{Y},\mathbf{U} \right\rbrack = \text{E}\left\lbrack \left( \textbf{Y}- \text{E}\lbrack\textbf{Y}\rbrack \right)\left( \mathbf{U} - \text{E}\left\lbrack \mathbf{U} \right\rbrack \right)^\prime \right\rbrack = \text{E}\left\lbrack \textbf{Y}\mathbf{U}^\prime \right\rbrack - \text{E}\left\lbrack \textbf{Y}\right\rbrack \text{E}\left\lbrack \mathbf{U} \right\rbrack^{\prime}\]
Some useful rules to manipulate covariance matrices are:
\(\text{Cov}\left\lbrack \mathbf{AY},\mathbf{U} \right\rbrack = \textbf{A}\text{Cov}\lbrack\textbf{Y},\mathbf{U}\rbrack\)
\(\text{Cov}\left\lbrack \textbf{Y},\mathbf{BU} \right\rbrack = \text{Cov}\left\lbrack \textbf{Y},\mathbf{U} \right\rbrack\textbf{B}^\prime\)
\(\text{Cov}\left\lbrack \mathbf{AY},\mathbf{BU} \right\rbrack = \textbf{A}\text{Cov}\left\lbrack \textbf{Y},\mathbf{U} \right\rbrack\ \textbf{B}^\prime\)
\(\text{Cov}\left\lbrack a\textbf{Y}+ b\mathbf{U},c\mathbf{W} + d\textbf{V}\right\rbrack = ac\text{Cov}\left\lbrack \textbf{Y},\mathbf{W} \right\rbrack + bc\text{Cov}\left\lbrack \mathbf{U},\mathbf{W} \right\rbrack + ad\text{Cov}\left\lbrack \textbf{Y},\textbf{V}\right\rbrack + bd\text{Cov}\lbrack\mathbf{U},\textbf{V}\rbrack\)
Variance-covariance Matrix
The variance-covariance matrix (or variance matrix for short) of a random vector \(\textbf{Y}\) is the covariance matrix of \(\textbf{Y}\) with itself.
\[\text{Var}\left\lbrack \textbf{Y}\right\rbrack = \text{Cov}\left\lbrack \textbf{Y},\textbf{Y}\right\rbrack = \text{E}\left\lbrack \left( \textbf{Y}- \text{E}\lbrack\textbf{Y}\rbrack \right)\left( \textbf{Y}-\text{E}\left\lbrack \textbf{Y}\right\rbrack \right)^\prime \right\rbrack = \text{E}\left\lbrack \textbf{Y}\textbf{Y}^\prime \right\rbrack - \text{E}\left\lbrack \textbf{Y}\right\rbrack \text{E}\left\lbrack \textbf{Y}\right\rbrack^{\prime}\]
The diagonal entries of the variance-covariance matrix contain the variances of the \(Y_{i}\). The off-diagonal cells contain the covariances \(\text{Cov}\left\lbrack Y_{i},Y_{j} \right\rbrack\). If the variance matrix is diagonal, the elements of random vector \(\textbf{Y}\) are uncorrelated. Two random vectors \(\textbf{Y}_{1}\) and \(\textbf{Y}_{2}\) are uncorrelated if their joint variance matrix is block-diagonal:
\[\text{Var}\begin{bmatrix} \textbf{Y}_{1} \\ \textbf{Y}_{2} \end{bmatrix} = \begin{bmatrix} \text{Var}\lbrack\textbf{Y}_{1}\rbrack & \textbf{0}\\ \textbf{0}& \text{Var}\lbrack\textbf{Y}_{2}\rbrack \end{bmatrix}\]
A very special variance-covariance matrix in statistical models is the scaled identity matrix, \(\sigma^{2}\textbf{I}\). This is the variance matrix of uncorrelated observations drawn from the same distribution—a common assumption for the error terms in models.
The rules for working with covariances extend to working with variance matrices:
\(\text{Var}\left\lbrack \mathbf{AY} \right\rbrack = \textbf{A}\ \text{Var}\left\lbrack \textbf{Y}\right\rbrack\textbf{A}^\prime\)
\(\text{Var}\left\lbrack \textbf{Y}+ \textbf{a}\right\rbrack = \text{Var}\lbrack\textbf{Y}\rbrack\) for a vector of constants \(\textbf{a}\)
\(\text{Var}\left\lbrack \textbf{A}^\prime\textbf{Y}\right\rbrack = \textbf{A}^\prime\text{Var}\left\lbrack \textbf{Y}\right\rbrack\textbf{A}\)
\(\text{Var}\left\lbrack a\textbf{Y}\right\rbrack = a^{2}\text{Var}\lbrack\textbf{Y}\rbrack\)
\(\text{Var}\left\lbrack a\textbf{Y}+ b\mathbf{U} \right\rbrack = a^{2}\text{Var}\left\lbrack \textbf{Y}\right\rbrack + b^{2}\text{Var}\left\lbrack \mathbf{U} \right\rbrack + 2ab\ \text{Cov}\lbrack\textbf{Y},\mathbf{U}\rbrack\)
Finally, an important result about expected values of quadratic forms, heavily used in decomposing variability, is
\[\text{E}\left\lbrack \textbf{Y}^\prime\mathbf{AY} \right\rbrack = tr\left( \textbf{A}\text{Var}\left\lbrack \textbf{Y}\right\rbrack \right) + \text{E}\left\lbrack \textbf{Y}\right\rbrack^\prime\textbf{A}\ \text{E}\lbrack\textbf{Y}\rbrack\]
Correlation Matrix
Suppose \(\textbf{Y}_n\) is a random vector with a positive definite (full rank) variance matrix \(\textbf{V}= \text{Var}[\textbf{Y}] = [v_{ij}]\). The correlation matrix is the \((n \times n)\) matrix with typical element \[ \textbf{R}= \left[\text{Corr}[Y_i,Y_j]\right] = \left[\frac{v_{ij}}{\sqrt{v_{ii} \, v_{jj}}}\right] \]
You can express \(\textbf{R}\) as a product of matrices, \[ \textbf{R}= \textbf{D}^{-1/2} \textbf{V}\textbf{D}^{-1/2} \] where \(\textbf{D}\) is a \((n \times n)\) diagonal matrix that contains the variances \(v_{ii}\) on its diagonal.
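In code, the same construction looks as follows (a numpy sketch with an arbitrary positive definite variance matrix):

```python
# Correlation matrix as R = D^{-1/2} V D^{-1/2}.
import numpy as np

V = np.array([[4.0, 1.2, 0.8],
              [1.2, 9.0, 2.1],
              [0.8, 2.1, 1.0]])          # a positive definite variance matrix

D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(V)))
R = D_inv_sqrt @ V @ D_inv_sqrt
print(R)                                 # unit diagonal, correlations off the diagonal
```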
3.5 Cross-product Matrix
Statistical models are often expressed in terms of a vector of length \(p\) derived from the input variables of the model: \(\textbf{x}= [x_1, x_2, \cdots, x_p]^\prime\). With continuous input variables, \(x_j\) refers to the value of the variable itself or a transformation such as the logarithm, square root, etc. The values of \(x_j\) can also be constructed from transformations and functions involving other variables, for example \(x_3 = x_1 * x_2\).
In the case of a qualitative factor with \(k\) levels, the factor can occupy \(k-1\) binary variables in the model vector. Often, a 1 is added at the beginning of the vector to accommodate an intercept term: \(\textbf{x}= [1, x_1, x_2, \cdots, x_p]^\prime\).
The vector of inputs for the \(i\)th observation is then \(\textbf{x}_i = [1, x_{i1}, x_{i2}, \cdots, x_{ip}]^\prime\).
When these input rows are stacked, we obtain the \((n \times p+1)\) model matrix (sometimes called the design matrix) \[ \textbf{X}_{(n\times p+1)} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \tag{3.1}\]
A cross-product matrix is the product of the transpose of a matrix with itself: \[ \textbf{X}^\prime\textbf{X} \] A cross-product matrix is square and symmetric. The cross-product of the model matrix plays an important role in estimating model parameters and in determining their properties. For the model matrix in Equation 3.1, the entries of the cross-product matrix take the following form \[ \textbf{X}^\prime\textbf{X}= \begin{bmatrix} n & \sum x_{i1} & \sum x_{i2} & \cdots & \sum x_{ip} \\ \sum x_{i1} & \sum x_{i1}^2 & \sum x_{i1}x_{i2} & \cdots & \sum x_{i1}x_{ip} \\ \sum x_{i2} & \sum x_{i2}x_{i1} & \sum x_{i2}^2 & \cdots & \sum x_{i2}x_{ip} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \sum x_{ip} & \cdots & \cdots & \cdots & \sum x_{ip}^2 \end{bmatrix} \]
The sums of squares of the \(\textbf{X}\) columns are on the diagonal; the sums of cross-products are in the off-diagonal cells.
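A small numpy sketch (with an arbitrary model matrix) shows this structure directly:

```python
# Cross-product matrix X'X for a small model matrix with an intercept column.
import numpy as np

X = np.array([[1.0, 2.0, 1.0],
              [1.0, 0.0, 3.0],
              [1.0, 1.0, 2.0],
              [1.0, 3.0, 0.0]])

XtX = X.T @ X
print(XtX)   # (1,1) entry is n; diagonal holds sums of squares; off-diagonals hold cross-products
```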
3.6 Idempotent Matrices
A matrix is called idempotent if multiplying the matrix by itself yields the matrix: \(\textbf{A}\textbf{A}= \textbf{A}\). Because \(\textbf{A}\) must conform to itself for multiplication, idempotent matrices are square matrices. Idempotent matrices that are also symmetric play an important role in statistical estimation. Idempotent matrices are projection matrices; that means they map a vector from a space to a sub-space. Symmetric idempotent matrices are orthogonal projection matrices.
Among some of the interesting properties of idempotent matrices are the following:
The trace of an idempotent matrix is equal to its rank.
The trace of an idempotent matrix is an integer.
The eigenvalues of an idempotent matrix are either 0 or 1.
Projections
For example, suppose we want to find a solution for \(\boldsymbol{\beta}\) in the linear model \(\textbf{Y}_{(n \times 1)} = \textbf{X}\boldsymbol{\beta}_{(p + 1 \times 1)} + \mathbf{\epsilon}\). The vector \(\textbf{Y}\) is a vector in \(n\)-dimensional space \(\mathbb{R}^{n}\) and the model places a restriction on the predicted values \(\widehat{\textbf{Y}} = \textbf{X}\widehat{\boldsymbol{\beta}}\): the predicted values are confined to a \((p + 1)\)-dimensional sub-space of \(\mathbb{R}^{n}\). Regardless of how we choose the estimator \(\widehat{\boldsymbol{\beta}}\), we are dealing with projecting vector \(\textbf{Y}\) onto a sub-space of \(\mathbb{R}^{n}\).
We can thus think of the problem of finding the best estimator in this model as the problem of finding the best projection onto the space generated by the columns of \(\textbf{X}\). In that case, why not choose the projection that minimizes the distance between the observed values \(\textbf{Y}\) and the predicted values \(\widehat{\textbf{Y}}\)? This is achieved by projecting \(\textbf{Y}\) perpendicularly (orthogonally) onto the sub-space generated by \(\textbf{X}\). In other words, our solution is the vector \(\widehat{\boldsymbol{\beta}}\) that satisfies
\[\left( \textbf{Y}- \textbf{X}\widehat{\boldsymbol{\beta}} \right)^\prime\textbf{X}\widehat{\boldsymbol{\beta}} = 0\]
Multiplying out and rearranging terms yields
\[{\widehat{\boldsymbol{\beta}}}^{\prime}\textbf{X}^{\prime}\textbf{Y}= {\widehat{\boldsymbol{\beta}}}^{\prime}\textbf{X}^\prime\textbf{X}\widehat{\boldsymbol{\beta}}\]
which implies that
\[\textbf{X}^{\prime}\textbf{Y}= \textbf{X}^\prime\textbf{X}\widehat{\boldsymbol{\beta}}\]
If \(\textbf{X}\) is of full column rank, then \(\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\) exists and we can solve:
\[\widehat{\boldsymbol{\beta}}=\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\textbf{X}^{\prime}\textbf{Y}\]
The “Hat” Matrix
The ordinary least squares estimator is the orthogonal projection of \(\textbf{Y}\) onto the sub-space created by the columns of \(\textbf{X}\). To see the projection matrix at work, compute the predicted values, \(\textbf{X}\widehat{\boldsymbol{\beta}}\):
\[\textbf{X}\widehat{\boldsymbol{\beta}}=\textbf{X}\left( \textbf{X}^{\prime}\textbf{X}\right)^{- 1}\textbf{X}^{\prime}\textbf{Y}= \widehat{\textbf{Y}}\]
The matrix \(\textbf{X}\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\textbf{X}^{\prime}\) has a very special role in regression analysis. It is often called the “hat” matrix and denoted \(\textbf{H}\), because pre-multiplying \(\textbf{Y}\) with \(\textbf{H}\) puts the hats on \(\textbf{Y}\):
\[\textbf{H}\textbf{Y}= \widehat{\textbf{Y}}\]
Let’s verify that \(\textbf{H}\) is indeed a projection matrix, which requires that \(\textbf{H}\textbf{H}= \textbf{H}\):
\[\textbf{H}\textbf{H}= \textbf{X}\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\textbf{X}^{\prime} \times \textbf{X}\left( \textbf{X}^\prime\textbf{X}\right)^{-1}\textbf{X}^\prime=\textbf{X}\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\textbf{X}^{\prime} = \textbf{H}\]
Matrices with the property that \(\textbf{H}\textbf{H}= \textbf{H}\) are called idempotent matrices; these are projection matrices. If, in addition, \(\textbf{H}\) is symmetric, \(\textbf{H}^\prime = \textbf{H}\), the matrix is called symmetric idempotent—these are orthogonal projection matrices. (An idempotent matrix that is not symmetric is called an oblique projector.)
The hat matrix in the regression model is a symmetric idempotent matrix.
Here are some results about (symmetric) idempotent matrices that come in handy when working out the properties of estimators in regression-type models:
Projection matrices are typically not of full rank. If an \((n \times n)\) idempotent matrix is of rank \(n\), then it is the identity matrix \(\textbf{I}\).
If \(\textbf{A}\) is (symmetric) idempotent, then \(\textbf{I}- \textbf{A}\) is (symmetric) idempotent.
If \(\textbf{A}\) is idempotent and \(\mathbf{P}\) is non-singular, then \(\mathbf{P}\textbf{A}\mathbf{P}^{-1}\) is an idempotent matrix.
You can use these properties to show that in the linear regression model with uncorrelated errors and equal variance the variance matrix of the model residuals \(\textbf{Y}- \widehat{\textbf{Y}}\) is
\[\text{Var}\left\lbrack \textbf{Y}- \widehat{\textbf{Y}} \right\rbrack = \sigma^{2}(\textbf{I}- \textbf{H})\]
Consider the special case where \(\textbf{X}= \textbf{1}_{n}\), a column vector of ones. The corresponding linear model is \(\textbf{Y}= \textbf{1}\beta + \boldsymbol{\epsilon}\), an intercept-only model. The hat matrix for this model is
\[\textbf{1}\left( \textbf{1}^{\prime}\textbf{1}\right)^{- 1}\textbf{1}^\prime = \frac{1}{n}\textbf{1}\textbf{1}^\prime = \frac{1}{n}\textbf{J}= \begin{bmatrix} \frac{1}{n} & \cdots & \frac{1}{n} \\ \vdots & \ddots & \vdots \\ \frac{1}{n} & \cdots & \frac{1}{n} \end{bmatrix} \] a matrix filled with \(\frac{1}{n}\). The projection of \(\textbf{Y}\) onto the space generated by \(\textbf{1}_{n}\) is
\[\frac{1}{n}\mathbf{JY} = \begin{bmatrix} \overline{Y} \\ \vdots \\ \overline{Y} \end{bmatrix}\]
The predicted values are all the same, the sample mean. In other words, \(\widehat{\beta} = \overline{Y}\). Since the projector is idempotent, deriving the variance of the predicted values in the iid case is simple:
\[ \text{Var}\left\lbrack \textbf{H}\textbf{Y}\right\rbrack = \textbf{H}\text{Var}\left\lbrack \textbf{Y}\right\rbrack\textbf{H}^\prime = \sigma^{2}\textbf{H}\textbf{H}^{\prime} = \sigma^{2}\textbf{H}= \sigma^{2}\frac{1}{n}\textbf{J}=\begin{bmatrix} \frac{\sigma^{2}}{n} & \cdots & \frac{\sigma^{2}}{n} \\ \vdots & \ddots & \vdots \\ \frac{\sigma^{2}}{n} & \cdots & \frac{\sigma^{2}}{n} \end{bmatrix} \]
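The intercept-only case is easy to reproduce in a short numpy sketch (the data vector is arbitrary):

```python
# Hat matrix of the intercept-only model: H = J/n, symmetric idempotent,
# and H y is a vector whose entries all equal the sample mean of y.
import numpy as np

n = 5
y = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
one = np.ones((n, 1))

H = one @ np.linalg.inv(one.T @ one) @ one.T
print(np.allclose(H, np.ones((n, n)) / n))  # H equals (1/n) J
print(np.allclose(H @ H, H))                # idempotent
print(H @ y)                                # every entry equals the mean of y (2.8)
```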
3.7 Matrix Factorization
The factorization (or decomposition) of a matrix \(\textbf{A}\) expresses the matrix as a product of two or more matrices, for example, \(\textbf{A}= \textbf{B}\textbf{C}\). Matrix multiplication defines the result of \(\textbf{B}\textbf{C}\), but how can we reverse this and express the result as a product of matrices?
The reasons for matrix factorization are to simplify computation, to make working with the matrix easier, and to reveal important properties of \(\textbf{A}\). Consequently, not any factorization will do; we are interested in very specific forms of writing \(\textbf{A}\) as a product. Two important matrix factorizations are the eigenvalue decomposition and the singular value decomposition.
The word eigen in eigenvalue has German roots. It means to own; the German noun Eigentum refers to possession (ownership). It also means an inherent property that is special to a person or entity. Eigen can refer to a peculiar property. To make the concept of an eigenvalue or an eigenvector understandable, the idea of an inherent property of oneself is appropriate. To see the connection, we first consider linear transformations from \(\mathbb{R}^n\) to \(\mathbb{R}^m\).
Linear Transformations
If \(\textbf{A}_{(m \times n)}\) is a matrix and \(\textbf{x}\) is a vector in \(\mathbb{R}^n\), then \[ \textbf{y}= \textbf{A}\textbf{x} \] is a vector in \(\mathbb{R}^m\). The matrix \(\textbf{A}\) linearly transforms the \(n\)-vector \(\textbf{x}\) into the \(m\)-vector \(\textbf{y}\). You can say that \(\textbf{A}\) “moved” \(\textbf{x}\) to \(\textbf{y}\). The transform is called linear because the transformation \(\textbf{A}\) of a linear combination of two vectors—\(c_1\textbf{x}_1 + c_2\textbf{x}_2\)—is identical to the linear combination of the transformed vectors: \[ \textbf{A}(c_1\textbf{x}_1 + c_2\textbf{x}_2) = c_1(\textbf{A}\textbf{x}_1) + c_2(\textbf{A}\textbf{x}_2) \]
Eigenvalue Decomposition
Suppose \(\textbf{A}\) is a square matrix, \((n \times n)\). Is there a vector \(\textbf{x}\) for which the transformation of \(\textbf{x}\) results in a multiple of itself? That is, can we find some vector \(\textbf{x}\) and some number \(\lambda\) for which \[ \textbf{A}\textbf{x}= \lambda\textbf{x} \]
Rearranging terms, the equation can be written as \[ (\textbf{A}- \lambda\textbf{I})\textbf{x}= \textbf{0} \]
Clearly, \(\textbf{x}=\textbf{0}\) satisfies the equation regardless of the value of \(\lambda\). A nonzero solution exists if and only if \(det(\textbf{A}-\lambda\textbf{I}) = 0\). The values for \(\lambda\) for which this characteristic equation holds are called the eigenvalues of \(\textbf{A}\).
If \(\lambda_j\) is such a value, and \(\textbf{x}_j\) is a vector for which \[ \textbf{A}\textbf{x}_j = \lambda_j \textbf{x}_j \] then \(\textbf{x}_j\) is called the eigenvector of \(\textbf{A}\) corresponding to the eigenvalue \(\lambda_j\).
Some interesting results about eigenvalues:
If \(\textbf{A}_{(n\times n)}\) is singular, then it has at least one zero eigenvalue.
For \(\textbf{A}_{(n\times n)}\) and \(\textbf{C}_{(n \times n)}\) nonsingular, the following matrices have the same eigenvalues: \(\textbf{A}\), \(\textbf{C}^{-1}\textbf{A}\textbf{C}\), \(\textbf{C}\textbf{A}\textbf{C}^{-1}\).
\(\textbf{A}\) and \(\textbf{A}^\prime\) have the same eigenvalues.
If \(\textbf{A}\) is nonsingular and \(\lambda\) is one of its eigenvalues, then \(1/\lambda\) is an eigenvalue of \(\textbf{A}^{-1}\).
We have finally arrived at the following: the eigenvalue decomposition (or eigendecomposition) of a square \((n \times n)\) matrix \(\textbf{A}\) is a factorization in terms of eigenvalues and eigenvectors: \[ \textbf{A}= \textbf{Q}\boldsymbol{\Lambda}\textbf{Q}^{-1} \] where \(\boldsymbol{\Lambda}\) is a diagonal matrix containing the \(n\) eigenvalues \(\lambda_1, \cdots, \lambda_n\) and \(\textbf{Q}\) is an \((n \times n)\) matrix of eigenvectors.
If \(\textbf{A}\) is a real symmetric matrix, then \(\textbf{Q}\) is an orthogonal matrix, which implies that \(\textbf{Q}^{-1} = \textbf{Q}^\prime\). An orthogonal (or orthonormal) matrix is a square matrix of real numbers whose columns are mutually orthogonal (perpendicular to each other) and have unit length. If \(\textbf{Q}\) is orthogonal, this property can be expressed as \(\textbf{Q}^\prime\textbf{Q}=\textbf{I}\), which implies that the transpose of the orthogonal matrix is its inverse.
We can then write the eigenvalue decomposition as \[ \textbf{A}= \textbf{Q}\boldsymbol{\Lambda}\textbf{Q}^\prime \]
One interpretation of the eigenvalues is the extent to which \(\textbf{A}\) shrinks or stretches vectors in the direction of the eigenvectors. For a real symmetric \(\textbf{A}\) we can write \[ \textbf{A}\textbf{q} = \lambda\textbf{q} \] where \(\textbf{q}\) is one of the orthonormal eigenvectors, \(||\textbf{q}||=1\). Then \[ ||\textbf{A}\textbf{q}|| = ||\lambda\textbf{q}|| = |\lambda| \, ||\textbf{q}|| = |\lambda| \]
Constructing an inverse
If \(\textbf{A}\) is of full rank, then an inverse can be constructed from the eigenvalue decomposition as \[ \textbf{A}^{-1} = \textbf{Q}\boldsymbol{\Lambda}^{-1}\textbf{Q}^{-1} \] and in the case of a real symmetric matrix: \[ \textbf{A}^{-1} = \textbf{Q}\boldsymbol{\Lambda}^{-1}\textbf{Q}^\prime \]
The number of nonzero diagonal values of \(\boldsymbol{\Lambda}\) equals the rank of the decomposed matrix. For a nonsingular (full-rank) \(\textbf{A}\), all eigenvalues are nonzero. The \(\lambda_j\) can be close to zero, however, so when computing eigenvalues with finite-precision arithmetic (computers!), we apply thresholds to declare a singularity.
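For a concrete picture, here is a hedged numpy sketch using np.linalg.eigh for a real symmetric matrix (the matrix values are illustrative):

```python
# Eigendecomposition A = Q Lambda Q' and the inverse A^{-1} = Q Lambda^{-1} Q'.
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])                   # symmetric and nonsingular

lam, Q = np.linalg.eigh(A)                        # eigenvalues and orthonormal eigenvectors
print(np.allclose(Q @ np.diag(lam) @ Q.T, A))     # reconstruct A
A_inv = Q @ np.diag(1.0 / lam) @ Q.T              # inverse from the decomposition
print(np.allclose(A_inv, np.linalg.inv(A)))       # True
```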
Singular Value Decomposition (SVD)
The matrix factorized in the eigendecomposition is a square \((n \times n)\) matrix. What can we say about arbitrary \(\textbf{A}_{(m \times n)}\) matrices? The eigendecomposition cannot be applied directly, because we need a square matrix. What happens if we consider the decomposition of the square symmetric \(\textbf{A}^\prime \textbf{A}\) instead? We know that there is an eigenvalue decomposition of the form \[ \textbf{A}^\prime \textbf{A}= \textbf{Q}\boldsymbol{\Lambda}\textbf{Q}^\prime \] The singular values of \(\textbf{A}\) are the square roots of the eigenvalues of \(\textbf{A}^\prime\textbf{A}\). They are the lengths of the vectors \(\textbf{A}\textbf{q}_1, \cdots, \textbf{A}\textbf{q}_n\), where \(\textbf{q}_j\) is the \(j\)th eigenvector of \(\textbf{A}^\prime\textbf{A}\).
The singular value decomposition (SVD) of the \((m \times n)\) matrix \(\textbf{A}\) is written as \[ \textbf{A}= \textbf{U}\boldsymbol{\Sigma}\textbf{V}^\prime \] where \(\boldsymbol{\Sigma}\) is a \((m \times n)\) diagonal matrix containing the singular values of \(\textbf{A}\) (the square roots of the eigenvalues of \(\textbf{A}^\prime \textbf{A}\)), \(\textbf{U}\) is a \((m \times m)\) orthogonal matrix of left singular vectors, and \(\textbf{V}\) is a \((n \times n)\) matrix of right singular vectors of \(\textbf{A}\).
The SVD is an important operation on matrices with many applications in statistics and data science:
- Principal component analysis
- Reducing the dimensionality of a high-dimensional problem
- Computing a low-rank approximation to a matrix
- Imputation of missing values
- Spectral analysis
- Image processing
- Computing inverse and generalized inverse matrices
Constructing an inverse
If \(\textbf{A}\) is square and nonsingular, we can find its inverse from the elements of the singular value decomposition. Since \[ \textbf{A}= \textbf{U}\boldsymbol{\Sigma}\textbf{V}^\prime \] \[ \textbf{A}^{-1} = \left(\textbf{V}^\prime\right)^{-1}\boldsymbol{\Sigma}^{-1}\textbf{U}^{-1} = \textbf{V}\boldsymbol{\Sigma}^{-1}\textbf{U}^\prime \] The first equality applies the rule for the inverse of a product, the second equality uses the fact that \(\textbf{U}\) and \(\textbf{V}\) are orthogonal matrices. As in the case of using the eigendecomposition for inversion, \(\boldsymbol{\Sigma}^{-1}\) is a diagonal matrix with the reciprocals of the singular values on the diagonal.
Constructing a generalized inverse
If \(\textbf{A}\) is singular, we cannot construct an inverse, but the previous approach can be modified to construct a generalized inverse as \[ \textbf{A}^{-} = \textbf{V}{\boldsymbol{\Sigma}^*}^{-1}\textbf{U}^\prime \] where \({\boldsymbol{\Sigma}^*}^{-1}\) is a diagonal matrix; its \(i\)th diagonal value is \(1/\sigma_i\) if \(\sigma_i\) is nonzero, and zero otherwise. Since the singular values are often arranged in order, \(\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n\), this amounts to setting the last \(n-r\) diagonal entries of \({\boldsymbol{\Sigma}^*}^{-1}\) to zero, if \(r\) is the rank of \(\textbf{A}\).
You can show that a matrix so constructed satisfies the conditions for a generalized inverse.
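A numpy sketch of this construction for a rank-deficient matrix (the matrix and the tolerance are illustrative choices; np.linalg.pinv performs essentially the same computation):

```python
# Generalized inverse from the SVD: invert only the singular values above a threshold.
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0]])          # rank 2: the second row is twice the first

U, s, Vt = np.linalg.svd(X)
tol = 1e-10
s_inv = np.zeros_like(s)
s_inv[s > tol] = 1.0 / s[s > tol]
X_ginv = Vt.T @ np.diag(s_inv) @ U.T

print(np.allclose(X @ X_ginv @ X, X))    # the defining condition X X^- X = X holds
```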
3.8 Matrix Differentiation
Estimation of parameters in statistical models often requires minimization or maximization of an objective function. For example, the ordinary least squares (OLS) principle finds the OLS estimator as the function of the data that minimizes the error sum of squares of the model. Maximum likelihood finds estimators of the parameters as the functions of the data that maximize the joint likelihood (the joint distribution function) of the data.
The parameters of the models appear as elements of vectors and matrices. Finding estimators of the parameters thus requires calculus on vectors and matrices. Consider matrix \(\textbf{A}\), whose elements depend on a scalar parameter \(\theta\), \(\textbf{A}= \left\lbrack a_{ij}(\theta) \right\rbrack\). The derivative of \(\textbf{A}\) with respect to \(\theta\) is the matrix of the derivatives of the typical elements \(a_{ij}(\theta)\) with respect to \(\theta\). We write this formally as
\[\frac{\partial\textbf{A}}{\partial\theta} = \left\lbrack \frac{\partial a_{ij}(\theta)}{\partial\theta} \right\rbrack\]
The derivative of a function \(f(\boldsymbol{\theta})\) with respect to the vector \(\boldsymbol{\theta}_{(p \times 1)}\) is the vector of the partial derivatives of the function
\[\frac{\partial f\left( \boldsymbol{\theta}\right)}{\partial\boldsymbol{\theta}} = \begin{bmatrix} \frac{\partial f\left( \boldsymbol{\theta}\right)}{\partial\theta_{1}} \\ \vdots \\ \frac{\partial f\left( \boldsymbol{\theta}\right)}{\partial\theta_{p}} \end{bmatrix}\]
Here are some useful results from vector and matrix calculus, where \(\textbf{A}\) and \(\textbf{B}\) are functions of \(\theta\) and the vector \(\textbf{x}\) does not depend on \(\theta\):
\(\frac{{\partial ln}\left| \textbf{A}\right|}{\partial\theta} = \frac{1}{\left| \textbf{A}\right|}\frac{\partial\left| \textbf{A}\right|}{\partial\theta} = tr\left( \textbf{A}^{- 1}\frac{\partial\textbf{A}}{\partial\theta} \right)\)
\(\frac{\partial\textbf{A}^{- 1}}{\partial\theta} = - \textbf{A}^{- 1}\frac{\partial\textbf{A}}{\partial\theta\ }\textbf{A}^{- 1}\)
\(\frac{\partial tr\left( \mathbf{AB} \right)}{\partial\theta} = tr\left( \frac{\mathbf{\partial}\textbf{A}}{\partial\theta}\textbf{B}\right) + tr\left( \textbf{A}\frac{\mathbf{\partial}\textbf{B}}{\partial\theta} \right)\)
\(\frac{\partial\textbf{x}^\prime\textbf{A}^{- 1}\textbf{x}}{\partial\theta} = - \textbf{x}^\prime\textbf{A}^{- 1}\frac{\partial\textbf{A}}{\partial\theta}\textbf{A}^{- 1}\textbf{x}\)
\(\frac{\partial\textbf{x}^{\prime}\textbf{A}\textbf{x}}{\partial\textbf{x}} = 2\textbf{A}\textbf{x}\) (for symmetric \(\textbf{A}\))
\(\frac{\partial\textbf{x}^\prime\textbf{A}}{\partial\textbf{x}} = \frac{\partial\textbf{A}^\prime\textbf{x}}{\partial\textbf{x}} = \textbf{A}\)
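The quadratic-form rule can be checked numerically with a finite-difference gradient; the following numpy sketch (with an arbitrary symmetric matrix) is purely illustrative:

```python
# Finite-difference check of d(x'Ax)/dx = 2 A x for symmetric A.
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])               # symmetric
x = np.array([1.0, -2.0])

f = lambda v: v @ A @ v
eps = 1e-6
grad_numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                         for e in np.eye(2)])
print(grad_numeric)                      # close to 2 A x
print(2 * A @ x)                         # [ 2., -3.]
```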
3.9 Multivariate Gaussian Distribution
Definition
A scalar random variable \(Y\) has a Gaussian distribution function with mean \(\mu\) and variance \(\sigma^{2}\) if its probability density function is given by
\[f(y) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{ - \frac{1}{2\sigma^{2}}(y - \mu)^{2} \right\}\]
We also say that \(Y\) is normally distributed with mean \(\mu\) and variance \(\sigma^{2}\). The shorthand expressions \(Y \sim G(\mu,\sigma^{2})\) or \(Y \sim N(\mu,\sigma^{2})\) are common.
The generalization from a scalar random variable \(Y\) to a random vector \(\textbf{Y}_{(n \times 1)}\) with a multivariate Gaussian distribution is as follows. \(\textbf{Y}_{(n \times 1)}\) has a multivariate Gaussian (normal) distribution with mean \(\boldsymbol{\mu}\) and variance matrix \(\textbf{V}\), if its density is given by
\[f\left( \textbf{Y}\right)=\frac{\left| \textbf{V}\right|^{- 1/2}}{(2\pi)^{\frac{n}{2}}}\exp\left\{ - \frac{1}{2}\left( \textbf{Y}- \boldsymbol{\mu}\right)^\prime\textbf{V}^{- 1}\left( \textbf{Y}- \boldsymbol{\mu}\right) \right\}\]
This is denoted with the shorthand \(\textbf{Y}\sim G_{n}(\boldsymbol{\mu},\textbf{V})\) or \(\textbf{Y}\sim N_{n}(\boldsymbol{\mu},\textbf{V})\). If the dimension of the distribution is clear from context, the subscript \(n\) can be omitted. A special case is the standard multivariate Gaussian distribution with mean \(\textbf{0}\) and variance matrix \(\textbf{I}\):
\[f\left( \textbf{Y}\right)=\frac{1}{(2\pi)^{\frac{n}{2}}}\exp\left\{ - \frac{1}{2}\textbf{Y}^{\prime}\textbf{Y}\right\} = \frac{1}{(2\pi)^{\frac{n}{2}}}\exp\left\{ - \frac{1}{2}\sum_{i=1}^{n}y_{i}^{2} \right\}\]
But this is just the product of the \(n\) univariate densities of \(N(0,1)\) random variables:
\[f\left( \textbf{Y}\right) = f\left( y_{1} \right) \times \cdots \times f\left( y_{n} \right)\]
where
\[f\left( y_{i} \right) = \frac{1}{(2\pi)^{1/2}}\exp\left\{ - \frac{1}{2}y_{i}^{2} \right\}\]
If the variance matrix is diagonal—that is, the \(Y_{i}\) are uncorrelated—the multivariate normal distribution is the product of the univariate distributions. The random variables are independent.
Properties
Gaussian distributions have amazing (magical) properties.
Linear combinations are Gaussian
For example, a linear combination of Gaussian random variables also follows a Gaussian distribution. Formally, this can be expressed as follows: if \(\textbf{Y}\sim G_{n}\left( \boldsymbol{\mu},\textbf{V}\right)\) and \(\textbf{A}\) and \(\textbf{b}\) are a matrix and a vector of constants (not random variables), respectively, then \(\textbf{A}\textbf{Y}+ \textbf{b}\) follows a \(G(\textbf{A}\boldsymbol{\mu}+ \textbf{b},\ \textbf{A}\textbf{V}\textbf{A}^\prime)\) distribution.
A special case of this result is that if \(\textbf{Y}\sim G_{n}\left( \boldsymbol{\mu},\textbf{V}\right)\), then \(\textbf{Y}- \boldsymbol{\mu}\) has a \(G\left( \textbf{0},\textbf{V}\right)\) distribution.
Because a linear function of a Gaussian random variable is Gaussian distributed, you can define all multivariate Gaussian distributions as linear transformations of the standard multivariate Gaussian distribution. If \(\textbf{Z}\sim G_{n}(\textbf{0},\textbf{I})\), and \(\textbf{V}= \textbf{C}^\prime\textbf{C}\), then \(\textbf{Y}= \mathbf{C}^\prime\textbf{Z}+ \boldsymbol{\mu}\) has a \(G(\boldsymbol{\mu},\textbf{V})\) distribution.
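That construction is exactly how multivariate Gaussian draws are commonly simulated; here is a hedged numpy sketch using a Cholesky factor of \(\textbf{V}\) as one choice of \(\textbf{C}\) (the mean, variance matrix, and sample size are illustrative):

```python
# Simulating Y ~ G(mu, V) as Y = L Z + mu, where L is the lower Cholesky factor of V
# (so V = L L', i.e., C' = L in the notation above) and Z ~ G(0, I).
import numpy as np

rng = np.random.default_rng(7)
mu = np.array([1.0, -2.0])
V = np.array([[2.0, 0.6],
              [0.6, 1.0]])

L = np.linalg.cholesky(V)                # lower triangular, V = L L'
Z = rng.normal(size=(2, 100_000))        # columns are independent G(0, I) draws
Y = mu[:, None] + L @ Z

print(Y.mean(axis=1))                    # approximately mu
print(np.cov(Y))                         # approximately V
```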
Zero covariance implies independence
Another unusual property of Gaussian random variables is that if they are uncorrelated, they are also stochastically independent. We derived this above for the special case of \(\textbf{Y}\sim G(\textbf{0},\sigma^{2}\textbf{I})\).
You cannot in general conclude that random variables are independent based on their lack of correlation. For Gaussian random variables you can. This result can be extended to Gaussian random vectors. Suppose \(\textbf{Y}_{(n \times 1)} \sim G(\boldsymbol{\mu},\ \textbf{V})\) is partitioned into two sub-vectors of size \(s\) and \(k\), where \(n = s + k\). Then we can similarly partition the mean vector and variance matrix:
\[\textbf{Y}_{(n \times 1)} = \begin{bmatrix} \textbf{Y}_{1(s \times 1)} \\ \textbf{Y}_{2(k \times 1)} \end{bmatrix},\ \ \boldsymbol{\mu}= \begin{bmatrix} \boldsymbol{\mu}_{1} \\ \boldsymbol{\mu}_{2} \end{bmatrix},\ \ \ \ \ \ \textbf{V}= \begin{bmatrix} \textbf{V}_{11} & \textbf{V}_{12} \\ \textbf{V}_{21} & \textbf{V}_{22} \end{bmatrix}\]
If \(\textbf{V}_{12} = \textbf{0}\), then \(\textbf{Y}_{1}\) and \(\textbf{Y}_{2}\) are independent. Also, each partition is Gaussian distributed, for example, \(\textbf{Y}_{1} \sim G(\boldsymbol{\mu}_{1},\ \textbf{V}_{11})\). We call the distribution of \(\textbf{Y}_{1}\) the marginal distribution.
It follows immediately that each element of \(\textbf{Y}\) follows a (univariate) Gaussian distribution, \(Y_{i} \sim G(\mu_{i},V_{ii})\)—all marginal univariate distributions are Gaussian.
Conditionals are Gaussian
The conditional distribution of \(\textbf{Y}_{1}\) given \(\textbf{Y}_{2}\) is also a Gaussian distribution, specifically:
\[\textbf{Y}_{1}|\textbf{Y}_{2} \sim G\left( \boldsymbol{\mu}_{1}\mathbf{+}\textbf{V}_{12}\textbf{V}_{22}^{- 1}\left( \textbf{Y}_{2} - \boldsymbol{\mu}_{2} \right),\ \textbf{V}_{11} - \textbf{V}_{12}\textbf{V}_{22}^{- 1}\textbf{V}_{12}^\prime \right)\]
This result plays an important role when predicting Gaussian random variables, for example in mixed models.
Notice that the variance matrix of the conditional distribution does not depend on the particular value \(\textbf{Y}_{2} = \textbf{y}_{2}\) on which the distribution is conditioned. However, the mean of the conditional distribution does depend on \(\textbf{Y}_{2}\), unless \(\textbf{V}_{12} = \textbf{0}\), a condition established earlier for independence of \(\textbf{Y}_{1}\) and \(\textbf{Y}_{2}\).