3  Linear Algebra Review

3.1 Basics

Command of linear algebra is essential in data science; models and estimators are often expressed in terms of tensors, matrices, and vectors. Scalar-based arithmetic becomes tedious very quickly as models grow more complex. For example, the simple linear regression model and the no-intercept (regression through the origin) model can be written as

\[Y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i}\]

\[Y_{i} = \beta_{1}x_{i} + \epsilon_{i}\]

Using scalar algebra, the estimates of the slope are quite different:

\[{\widehat{\beta}}_{1} = \frac{\left( \sum_{i = 1}^{n}{\left( y_{i} - \overline{y} \right)\left( x_{i} - \overline{x} \right)} \right)}{\sum_{i = 1}^{n}\left( x_{i} - \overline{x} \right)^{2}}\]

\[{\widehat{\beta}}_{1} = \frac{\left( \sum_{i = 1}^{n}{y_{i}x_{i}} \right)}{\sum_{i = 1}^{n}x_{i}^{2}}\]

The formulas get messier as we add more input variables to the model. Using matrix-vector notation, the estimator of all the regression coefficients takes the same form regardless of the size of the model:

\[\widehat{\boldsymbol{\beta}} = \left( \textbf{X}^{\prime}\textbf{X}\right)^{- 1}\textbf{X}^{\prime}\textbf{Y}\]

A scalar is a single real number. A vector is an array of scalars arranged in a single column (a column vector) or a single row (a row vector). A matrix is a two-dimensional array of scalars, and a tensor is a multi-dimensional array.

The order of a vector or matrix is specified as (rows × columns) and is sometimes attached as a subscript for clarity. For example, \(\textbf{A}_{(3 \times 5)}\) denotes a matrix with 3 rows and 5 columns. It can be viewed as a concatenation of five \((3 \times 1)\) column vectors:

\[\textbf{A}_{(3 \times 5)}=\begin{bmatrix} 1 & 9.0 & 6.2 & 1 & 0 \\ 1 & 3.2 & 1.4 & 0 & 1 \\ 1 & 4.1 & -0.6 & 0 & 0 \end{bmatrix}\]

\(\textbf{a}_{1} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}\ \ \ \ \ \textbf{a}_{2} = \begin{bmatrix} 9.0 \\ 3.2 \\ 4.1 \end{bmatrix}\ \ \ \ \textbf{a}_{3} = \begin{bmatrix} 6.2 \\ 1.4 \\ - 0.6 \end{bmatrix}\ \ \ \ \ \textbf{a}_{4} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}\ \ \ \ \ \textbf{a}_{5} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}\).
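To make the notation concrete, here is a minimal NumPy sketch (not part of the text; the values are those of the example above) that builds \(\textbf{A}_{(3 \times 5)}\) and extracts one of its column vectors.

```python
import numpy as np

# The (3 x 5) matrix A from the text, written row by row
A = np.array([
    [1, 9.0,  6.2, 1, 0],
    [1, 3.2,  1.4, 0, 1],
    [1, 4.1, -0.6, 0, 0],
])

print(A.shape)   # (3, 5): 3 rows, 5 columns
a2 = A[:, 1]     # the second column vector, [9.0, 3.2, 4.1]
print(a2)
```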

A matrix with as many rows as columns is called a square matrix.

Bold symbols are common: lowercase for vectors and uppercase for matrices, with some exceptions. When dealing with vectors of random variables, bold uppercase denotes the vector of random variables and bold lowercase denotes the vector of realized values. For example, if \(Y_{1},\cdots,Y_{n}\) is a random sample of size \(n\), the vector of random variables is

\[\textbf{Y}_{(n \times 1)} = \begin{bmatrix} Y_{1} \\ \vdots \\ Y_{n} \end{bmatrix}\]

and the vector of realized values is

\[\textbf{y}_{(n \times 1)} = \begin{bmatrix} y_{1} \\ \vdots \\ y_{n} \end{bmatrix}\]

The difference is significant because \(\textbf{Y}\) is a random vector and \(\textbf{y}\) is a vector of constants: \(\textbf{Y}\) has a multivariate distribution with a mean and a variance, whereas \(\textbf{y}\) is just a vector of numbers.

We follow the convention that all vectors are column vectors, so that \(\textbf{y}_{(n)}\) serves as a shorthand for \(\textbf{y}_{(n \times 1)}\).

The typical element of a matrix is written as a scalar with subscripts that refer to rows and columns. For example, the statement

\[\textbf{A}= \left\lbrack a_{ij} \right\rbrack\]

says that matrix \(\textbf{A}\) consists of the scalars \(a_{ij}\); for example, \(a_{23}\) is the scalar in row 2, column 3 of the matrix.

3.2 Special Matrices

A few special matrices that are common in statistics and machine learning are

  • \(\textbf{1}_{n} = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}\), the unit vector of size \(n\); all its elements are 1.

  • \(\textbf{0}_{n} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}\), the zero vector of size \(n\); all its elements are 0.

  • \(\textbf{0}_{(n \times k)} = \begin{bmatrix} 0 & 0 & 0 \\ \vdots & \cdots & \vdots \\ 0 & 0 & 0 \end{bmatrix}\), the zero matrix of order \((n \times k)\). All its elements are 0.

  • \(\textbf{J}_{(n \times k)} = \begin{bmatrix} 1 & 1 & 1 \\ \vdots & \cdots & \vdots \\ 1 & 1 & 1 \end{bmatrix}\), the unit matrix of size \((n \times k)\). All its elements are 1.

  • \(\textbf{I}_{n} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & 1 \end{bmatrix}\), the identity matrix of size \((n \times n)\) with 1s on the diagonal and 0s elsewhere.

If the order of these matrices is obvious from the context, the subscripts tend to be omitted.
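These special matrices are easy to construct with NumPy; the following sketch (sizes and names chosen for illustration, not from the text) mirrors the list above.

```python
import numpy as np

n, k = 4, 3
ones_n  = np.ones((n, 1))    # unit vector 1_n: all elements are 1
zeros_n = np.zeros((n, 1))   # zero vector 0_n: all elements are 0
Z       = np.zeros((n, k))   # zero matrix 0_(n x k)
J       = np.ones((n, k))    # unit matrix J_(n x k)
I       = np.eye(n)          # identity matrix I_n: 1s on the diagonal, 0s elsewhere
```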

3.3 Basic Operations on Matrices and Vectors

The basic operations on matrices and vectors are addition, subtraction, multiplication, transposition, and inversion. These are standard operations in manipulating matrix and vector equations. Decompositions such as Cholesky roots, eigenvalue and singular value decompositions are more advanced operations that are important in solving estimation problems in statistics.

Transpose

The transpose of a matrix is obtained by exchanging rows and columns. If \(a_{ij}\) is the element in row \(i\), column \(j\) of matrix \(\textbf{A}\), the transpose of \(\textbf{A}\), denoted \(\textbf{A}^\prime\), has typical element \(a_{ji}\). For the \((3 \times 5)\) matrix shown previously, the transpose is

\[\textbf{A}^\prime_{(5 \times 3)} = \begin{bmatrix} 1 & 1 & 1 \\ 9.0 & 3.2 & 4.1 \\ 6.2 & 1.4 & -0.6 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}\]

The transpose of a column vector is a row vector:

\[\textbf{a}^{\prime} = \begin{bmatrix} a_{1} \\ \vdots \\ a_{n} \end{bmatrix}^\prime = \left\lbrack a_{1},\cdots,a_{n} \right\rbrack\]

Transposing a transpose produces the original matrix, \(\left( \textbf{A}^{\prime} \right)^{\prime}\ = \textbf{A}\).

A matrix is symmetric if it is equal to its transpose, \(\textbf{A}^\prime = \textbf{A}\). Symmetric matrices are square matrices (same numbers of rows and columns). The matrices \(\textbf{A}^\prime\textbf{A}\) and \(\textbf{A}\textbf{A}^\prime\) are always symmetric. A symmetric matrix whose off-diagonal elements are zero is called a diagonal matrix.
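A short NumPy check of these transpose facts, using the example matrix (a sketch, not from the text); `A.T` is NumPy's transpose.

```python
import numpy as np

A = np.array([[1, 9.0,  6.2, 1, 0],
              [1, 3.2,  1.4, 0, 1],
              [1, 4.1, -0.6, 0, 0]])

At = A.T                          # the (5 x 3) transpose
print(np.array_equal(At.T, A))    # (A')' = A  -> True
S = A.T @ A                       # A'A is (5 x 5) ...
print(np.allclose(S, S.T))        # ... and symmetric -> True
```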

Addition and Subtraction

The sum (difference) of two matrices is the matrix of the elementwise sums (differences) of their elements. These operations require that the matrices being summed or subtracted have the same order:

\[\textbf{A}_{(n \times k)} + \textbf{B}_{(n \times k)} = \left\lbrack a_{ij} + b_{ij} \right\rbrack\]

\[\textbf{A}_{(n \times k)} - \textbf{B}_{(n \times k)} = \left\lbrack a_{ij} - b_{ij} \right\rbrack\]

Suppose, for example, that \(\textbf{A}= \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}\) and \(\textbf{B}=\begin{bmatrix} - 1 & - 2 & - 3 \\ - 4 & - 5 & - 6 \end{bmatrix}\). Then,

\[\textbf{A}+ \textbf{B}= \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}\]

\[\textbf{A}- \textbf{B}= \begin{bmatrix} 2 & 4 & 6 \\ 8 & 10 & 12 \end{bmatrix}\]

Since addition and subtraction are elementwise operations, they can be combined with transposition:

\[\left( \textbf{A}+ \textbf{B}\right)^\prime = \textbf{A}^{\prime} + \textbf{B}^{\prime}\]

Multiplication

Two matrices conform for addition (subtraction) if they have the same order, that is, the same number of rows and columns. Multiplication of matrices requires a different type of conformity: two matrices \(\textbf{A}\) and \(\textbf{B}\) can be multiplied as \(\textbf{A}\textbf{B}\) (or \(\textbf{A} \times \textbf{B}\)) if the number of columns in \(\textbf{A}\) equals the number of rows in \(\textbf{B}\). We say that in the product \(\textbf{A} \times \textbf{B}\), \(\textbf{A}\) is post-multiplied by \(\textbf{B}\) or that \(\textbf{A}\) is multiplied into \(\textbf{B}\). The result of multiplying an \((n \times k)\) matrix into a \((k \times p)\) matrix is an \((n \times p)\) matrix.

Before examining the typical element of the result of a multiplication, let’s look at a special case. The inner product of two \((k \times 1)\) vectors \(\textbf{a}\) and \(\textbf{b}\), also called the dot product or scalar product, is the result of multiplying the transpose of \(\textbf{a}\) into \(\textbf{b}\); it is a scalar value:

\[\textbf{a}^\prime\textbf{b}= \left\lbrack a_{1}, \cdots,a_{k} \right\rbrack\begin{bmatrix} b_{1} \\ \vdots \\ b_{k} \end{bmatrix} = a_{1}b_{1} + \cdots + a_{k}b_{k} = \sum_{i = 1}^{k}{a_{i}b_{i}}\]

The square root of the dot product of a vector with itself is the Euclidean \({(L}_{2})\) norm of the vector,

\[\left| \left| \textbf{a}\right| \right| = \sqrt{\textbf{a}^\prime\textbf{a}} = \sqrt{\sum_{i = 1}^{k}a_{i}^{2}}\]

The \(L_{2}\) norm plays an important role as a loss function in statistical models. The vector for which the norm is calculated is then often a vector of model errors.
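A minimal sketch of the inner product and the \(L_{2}\) norm in NumPy (the vectors are illustrative); `np.linalg.norm` computes the same norm directly.

```python
import numpy as np

a = np.array([1.0, 2.0, 2.0])
b = np.array([3.0, 0.0, 4.0])

dot  = a @ b                # inner product a'b = 1*3 + 2*0 + 2*4 = 11
norm = np.sqrt(a @ a)       # Euclidean (L2) norm of a = sqrt(1 + 4 + 4) = 3
print(dot, norm, np.linalg.norm(a))   # np.linalg.norm(a) gives the same L2 norm
```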

Now let’s return to the problem of multiplying the \((n \times k)\) matrix \(\textbf{A}\) into the \((k \times p)\) matrix \(\textbf{B}\) and introduce one more piece of notation: the \(i\)th row of \(\textbf{A}\) is denoted \(\boldsymbol{\alpha}_{i}\) and the \(j\)th column of \(\textbf{B}\) is denoted \(\textbf{b}_{j}\). Now we can write the product \(\textbf{A} \times \textbf{B}\) as a matrix whose typical element is the inner product of \(\boldsymbol{\alpha}_{i}\) and \(\textbf{b}_{j}\):

\[\textbf{A}_{(n \times k)} \times \textbf{B}_{(k \times p)} = \left\lbrack \boldsymbol{\alpha}_{i}^\prime\ \textbf{b}_{j} \right\rbrack_{(n \times p)}\ \]

As an example, let \(\textbf{A}= \begin{bmatrix} 1 & 2 & 0 \\ 3 & 1 & - 3 \\ 4 & 1 & 2 \end{bmatrix}\) and \(\textbf{B}= \begin{bmatrix} 1 & 0 \\ 2 & 3 \\ 2 & 1 \end{bmatrix}\). The product \(\textbf{A}\times\textbf{B}\) is a \((3 \times 2)\) matrix with elements

\[\textbf{A}\times\textbf{B}= \begin{bmatrix} 1 \times 1 + 2 \times 2 + 0 \times 2 & 1 \times 0 + 2 \times 3 + 0 \times 1 \\ 3 \times 1 + 1 \times 2 - 3 \times 2 & 3 \times 0 + 1 \times 3 - 3 \times 1 \\ 4 \times 1 + 1 \times 2 + 2 \times 2 & 4 \times 0 + 1 \times 3 + 2 \times 1 \end{bmatrix} = \begin{bmatrix} 5 & 6 \\ - 1 & 0 \\ 10 & 5 \end{bmatrix}\]
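The same product, verified with NumPy (a sketch, not part of the text):

```python
import numpy as np

A = np.array([[1, 2,  0],
              [3, 1, -3],
              [4, 1,  2]])
B = np.array([[1, 0],
              [2, 3],
              [2, 1]])

AB = A @ B        # a (3 x 3) matrix times a (3 x 2) matrix gives a (3 x 2) matrix
print(AB)
# [[ 5  6]
#  [-1  0]
#  [10  5]]
```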

Here are a few helpful rules for matrix multiplication:

  1. \(c\textbf{A}= \left\lbrack ca_{ij} \right\rbrack\)

  2. \(c\left( \textbf{A}+ \textbf{B}\right) = c\textbf{A}+ c\textbf{B}\)

  3. \(\textbf{C}\left( \textbf{A}+ \textbf{B}\right) = \textbf{C}\textbf{A}+ \textbf{C}\textbf{B}\)

  4. \(\left( \textbf{A}\textbf{B}\right)\textbf{C}= \textbf{A}(\textbf{B}\textbf{C})\)

  5. \(\left( \textbf{A}+ \textbf{B}\right)\left( \textbf{C}+ \mathbf{D} \right) = \textbf{A}\textbf{C}+ \textbf{A}\mathbf{D} + \mathbf{BC} + \mathbf{BD}\)

  6. \(\left( \textbf{A}\textbf{B}\right)^\prime = \textbf{B}^\prime\textbf{A}^\prime\)

  7. \(\left( c\textbf{A}\right)^\prime = c\textbf{A}^\prime\)

Inversion and Rank

In scalar algebra, division and multiplication are inverse operations; dividing a non-zero scalar by itself yields the multiplicative identity: \(\frac{a}{a} = 1\). What is the equivalent operation for matrices? First, inversion of a matrix does not reduce it to a scalar; the multiplicative identity for matrices is the identity matrix \(\textbf{I}\), a diagonal matrix with 1s on the diagonal. Second, inversion is only defined for square matrices. If \(\textbf{A}\) is an \((n \times n)\) matrix, the matrix \(\textbf{B}\) for which

\[\textbf{A}\textbf{B}= \textbf{I}\]

is called the inverse of \(\textbf{A}\), denoted as \(\textbf{A}^{- 1}\).

Inverse matrices do not have to exist, even for square matrices. If \(\textbf{A}\) has an inverse matrix, then \(\textbf{A}\) is called a non-singular matrix. In that case, \(\textbf{A}^{- 1}\textbf{A}= \textbf{A}\textbf{A}^{- 1} = \textbf{I}\).

For the inverse of a square matrix to exist, that is, for the matrix to be non-singular, the matrix must be of full rank. The rank of a matrix, denoted \(r(\textbf{A})\), is the number of its linearly independent columns. What does that mean? Suppose we are dealing with an \((n \times k)\) matrix \(\textbf{B}\) whose column vectors are \(\textbf{b}_{1},\cdots,\textbf{b}_{k}\). A linear combination of the columns of \(\textbf{B}\) is

\[c_{1}\textbf{b}_{1} + c_{2}\textbf{b}_{2} + \cdots + c_{k}\textbf{b}_{k} = \textbf{q}\]

If you can find a set of scalars \(c_{1},\cdots,c_{k}\), not all zero, such that \(\textbf{q} = \textbf{0}\), then the columns of \(\textbf{B}\) are linearly dependent. If the only set of scalars that yields \(\textbf{q} = \textbf{0}\) is

\[c_{1} = c_{2} = \cdots = c_{k} = 0\]

then the columns of \(\textbf{B}\) are linearly independent and the rank of \(\textbf{B}\) is \(k\).

Here are a few more useful results about the rank of a matrix:

  1. \(r\left( \textbf{A}\right) = r\left( \textbf{A}^\prime \right) = r\left( \textbf{A}^\prime\textbf{A}\right) = r\left( \textbf{A}\textbf{A}^{\prime} \right)\)

  2. \(r\left( \textbf{A}\textbf{B}\right) \leq \min\left\{ r\left( \textbf{A}\right),r\left( \textbf{B}\right) \right\}\)

  3. \(r\left( \textbf{A}+ \textbf{B}\right) \leq r\left( \textbf{A}\right) + r(\textbf{B})\)

The first two results are particularly important in statistical models. In models with linear structure, it is common to collect the \(p\) input variables, along with the intercept as a column of ones, into a matrix \(\textbf{X}_{(n \times (p + 1))}\):

\[\textbf{X}_{(n \times (p + 1))} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix}\]

Suppose we want to solve the linear system \(\textbf{Y}= \textbf{X}\textbf{c}\) for \(\textbf{c}\). Start by pre-multiplying both sides of the equation with the transpose of \(\textbf{X}\):

\[\textbf{X}^{\prime}\textbf{Y}= \textbf{X}^{\prime}\textbf{X}\textbf{c}\]

If we had an inverse of \(\textbf{X}^\prime\textbf{X}\), we could pre-multiply both sides with that inverse and isolate \(\textbf{c}\):

\[{\left( \textbf{X}^\prime\textbf{X}\right)^{\mathbf{- 1}}\textbf{X}}^{\prime}\textbf{Y}= \left( \textbf{X}^\prime\textbf{X}\right)^{-1}\textbf{X}^\prime\textbf{X}\textbf{c}= \textbf{I}\textbf{c}= \textbf{c}\]

We have a solution to the system, namely \(\textbf{c}=\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\textbf{X}^{\prime}\textbf{Y}\), only if the inverse \(\left( \textbf{X}^\prime\textbf{X}\right)^{-1}\) exists. That requires that the \((p + 1) \times (p + 1)\) matrix \(\textbf{X}^\prime\textbf{X}\) is of full rank, \(r\left( \textbf{X}^\prime\textbf{X}\right) = p + 1\). This, in turn, is equivalent to saying that \(\textbf{X}\) has full column rank \(p + 1\), because of the first rank property above.
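A hedged NumPy sketch of this solution on synthetic data (the simulated `X`, `beta`, and noise level are illustrative assumptions): it checks that \(\textbf{X}^\prime\textbf{X}\) has full rank and then solves \(\textbf{X}^\prime\textbf{X}\textbf{c} = \textbf{X}^\prime\textbf{Y}\) without explicitly forming the inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # (n x (p+1)), intercept column of ones
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.1, size=n)                 # simulated responses

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))        # p + 1 = 3: full rank, so the inverse exists
c = np.linalg.solve(XtX, X.T @ y)        # solves X'X c = X'y
print(c)                                 # close to [1.0, 2.0, -0.5]
```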

Here are some useful results about inverse matrices:

  1. \(\left( \textbf{A}^{- 1} \right)^\prime = \left( \textbf{A}^\prime \right)^{- 1}\)

  2. \(\left( \textbf{A}^{- 1} \right)^{- 1} = \textbf{A}\)

  3. \(\left( \textbf{A}\textbf{B}\right)^{- 1} = \textbf{B}^{-1}\textbf{A}^{-1}\)

  4. \(r\left( \textbf{A}^{- 1} \right) = r(\textbf{A})\)

If the matrix \(\textbf{X}\) is of less than full rank, it is called a rank-deficient matrix. Can we still solve the linear system \(\textbf{Y}= \textbf{X}\textbf{c}\)? Not by using a (regular) inverse matrix, but there is a way out: a generalized inverse matrix. If a matrix \(\textbf{A}^{-}\) can be found that satisfies

\[\textbf{A}\textbf{A}^{-}\textbf{A}= \textbf{A}\]

then it is called a generalized inverse (or pseudo-inverse or g-inverse) of \(\textbf{A}\). Suppose we can find such a generalized inverse \(\left( \textbf{X}^{\prime}\textbf{X}\right)^{-}\) for \(\textbf{X}^\prime\textbf{X}\). What if we use it in the solution of the linear system,

\[\textbf{c}= {\left( \textbf{X}^{\prime}\textbf{X}\right)^{-}\textbf{X}}^{\prime}\textbf{Y}\]

Unfortunately, whereas regular inverses are unique, there are (infinitely) many generalized inverses that satisfy the condition \((\textbf{X}^\prime\textbf{X})\left( \textbf{X}^\prime\textbf{X}\right)^{-}\textbf{X}^\prime\textbf{X}= \textbf{X}^\prime\textbf{X}\), so there are infinitely many possible solutions to the linear system. Fortunately, generalized inverses have some nice properties; for example, \(\textbf{X}\left( \textbf{X}^\prime\textbf{X}\right)^{-}\textbf{X}^\prime\) is invariant to the choice of the generalized inverse. Even if the solution \(\textbf{c}\) is not unique, \(\textbf{X}\textbf{c}\) is unique. This result is important in linear models with rank-deficient design matrices, a condition that is common when the model contains classification variables. While the parameter estimates in such a model are not unique, because we need a generalized inverse to derive them, the predicted values are the same no matter which generalized inverse we choose.
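A minimal sketch of this invariance, assuming NumPy and a deliberately rank-deficient design (the duplicated column and seed are illustrative): `np.linalg.pinv` supplies one generalized inverse (the Moore-Penrose inverse), and adding a null-space vector of \(\textbf{X}\) produces a different solution of the normal equations with the same fitted values.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x, 2 * x])     # third column = 2 * second column: rank deficient
y = 1.0 + 3.0 * x + rng.normal(scale=0.1, size=n)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))               # 2 < 3: X'X is singular

c1 = np.linalg.pinv(XtX) @ X.T @ y              # one solution, via the Moore-Penrose g-inverse
v = np.array([0.0, 2.0, -1.0])                  # X v = 0 because column 3 = 2 * column 2
c2 = c1 + v                                     # another solution of the normal equations
print(np.allclose(X @ c1, X @ c2))              # True: X c is unique even though c is not
```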

Determinant

The rank reduces a matrix to a single scalar value, the number of its linearly independent columns. Another quantity that reduces a square matrix to a single scalar is the determinant, written as \(\det(\textbf{A})\) or \(|\textbf{A}|\). The determinant has a geometric interpretation that is not that relevant for our discussion. What matters more is that the determinant appears frequently in expressions for multivariate probability distributions, so knowing how to manipulate determinants is useful. For an \((n \times n)\) matrix \(\textbf{A}\):

  1. \(|\textbf{A}| = |\textbf{A}^\prime|\)

  2. \(|\textbf{I}| = 1\)

  3. \(\left| c\textbf{A}\right| = c^{n}\left| \textbf{A}\right|\)

  4. If \(\textbf{A}\) is singular, then \(\left| \textbf{A}\right| = 0\)

  5. If each element of a row (column) of \(\textbf{A}\) is zero, then \(\left| \textbf{A}\right| = 0\)

  6. If two rows (columns) of \(\textbf{A}\) are identical, then \(\left| \textbf{A}\right| = 0\)

  7. \(\left| \textbf{A}\textbf{B}\right| = \left| \textbf{A}\right|\ \left| \textbf{B}\right|\)

  8. \(\left| \textbf{A}^{- 1} \right| = 1/|\textbf{A}|\)

  9. If \(\textbf{A}\) is a triangular matrix, then \(|\textbf{A}| = \prod_{i = 1}^{n}a_{ii}\)

  10. If \(\textbf{A}\) is a diagonal matrix, then \(|\textbf{A}| = \prod_{i = 1}^{n}a_{ii}\)
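A quick numerical spot-check of several of these rules with NumPy (a sketch; the random matrices are illustrative and almost surely non-singular):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
c = 2.0

print(np.isclose(np.linalg.det(A), np.linalg.det(A.T)))                       # |A| = |A'|
print(np.isclose(np.linalg.det(c * A), c**3 * np.linalg.det(A)))              # |cA| = c^n |A|, n = 3
print(np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B)))  # |AB| = |A||B|
print(np.isclose(np.linalg.det(np.linalg.inv(A)), 1 / np.linalg.det(A)))      # |A^-1| = 1/|A|
```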

Trace

The trace operator, \(tr(\textbf{A})\), applies only to square matrices. The trace of an \(\textbf{A}_{(n \times n)}\) matrix is the sum of its diagonal elements:

\[tr\left( \textbf{A}\right) = \sum_{i = 1}^{n}a_{ii}\]

The trace plays an important role in statistics in determining expected values of quadratic forms of random variables, for example, sums of squares in linear models. An important property of the trace is its invariance under cyclic permutations,

\[tr\left( \mathbf{ABC} \right) = tr\left( \mathbf{BCA} \right) = tr(\mathbf{CAB})\]

provided the matrices conform to multiplication.

Some other useful properties of the trace are

  1. \(tr\left( \textbf{A}+ \textbf{B}\right) = tr\left( \textbf{A}\right) + tr\left( \textbf{B}\right)\)

  2. \(tr\left( \textbf{A}\right) = tr\left( \textbf{A}^\prime \right)\)

  3. \(\textbf{y}^\prime\textbf{A}\textbf{y} = tr\left( \textbf{y}^\prime\textbf{A}\textbf{y} \right)\) (a scalar equals its own trace)

  4. \(tr\left( c\textbf{A}\right) = c \times tr\left( \textbf{A}\right)\)

  5. \(tr\left( \textbf{A}\right) = r(\textbf{A})\) if \(\textbf{A}\) is symmetric and idempotent (\(\textbf{A}\textbf{A}= \textbf{A}\) and \(\textbf{A}= \textbf{A}^\prime\))
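A hedged NumPy sketch checking the cyclic-permutation property and the trace-equals-rank result for a symmetric idempotent matrix (the matrix sizes are illustrative; the projection matrix used here anticipates the hat matrix discussed later):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 5))
C = rng.normal(size=(5, 3))

# Invariance under cyclic permutation: tr(ABC) = tr(BCA) = tr(CAB)
t1 = np.trace(A @ B @ C)
t2 = np.trace(B @ C @ A)
t3 = np.trace(C @ A @ B)
print(np.isclose(t1, t2), np.isclose(t1, t3))

# For a symmetric idempotent matrix, the trace equals the rank
X = rng.normal(size=(10, 3))
H = X @ np.linalg.inv(X.T @ X) @ X.T          # a symmetric idempotent (projection) matrix
print(np.isclose(np.trace(H), np.linalg.matrix_rank(H)))   # both equal 3
```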

3.4 Random Vectors

If the elements of a vector are random variables, the vector itself is a random variable. You can think of random vectors as a convenient mechanism for collecting random variables. If we draw a random sample \(Y_{1},\cdots,Y_{n}\), we can collect the \(n\) random variables in a single random vector

\[\textbf{Y}= \begin{bmatrix} Y_{1} \\ \vdots \\ Y_{n} \end{bmatrix}\]

Expected Value

Since each \(Y_{i}\) has a probability distribution, a mean (expected value) \(\text{E}\left\lbrack Y_{i} \right\rbrack\), a variance \(\text{Var}\left\lbrack Y_{i} \right\rbrack\), and so forth, the same applies to their collection. The expected value (mean) of a random vector is the vector of the expected values of its elements:

\[\text{E}\left\lbrack \textbf{Y}\right\rbrack = \begin{bmatrix} \text{E}\left\lbrack Y_{1} \right\rbrack \\ \vdots \\ \text{E}\left\lbrack Y_{n} \right\rbrack \end{bmatrix}\]

Suppose that \(\textbf{A}\), \(\textbf{B}\), and \(\textbf{C}\) are matrices of constants, \(\textbf{c}\) is a vector of constants, and \(\textbf{Y}\) and \(\mathbf{U}\) are random vectors. The following expectation rules are useful in this situation:

  1. \(\text{E}\left\lbrack \textbf{A}\right\rbrack = \textbf{A}\)

  2. \(\text{E}\left\lbrack \mathbf{AYB} + \mathbf{C} \right\rbrack = \textbf{A}\text{E}\left\lbrack \textbf{Y}\right\rbrack\textbf{B}+ \textbf{C}\)

  3. \(\text{E}\left\lbrack \mathbf{AY} + \mathbf{c} \right\rbrack = \textbf{A}\text{E}\left\lbrack \textbf{Y}\right\rbrack + \textbf{c}\)

  4. \(\text{E}\left\lbrack \mathbf{AY} + \mathbf{BU} \right\rbrack = \textbf{A}\text{E}\left\lbrack \textbf{Y}\right\rbrack + \textbf{B}\ \text{E}\lbrack\mathbf{U}\rbrack\)

Covariance Matrix

While the distribution of \(Y_{i}\) is univariate, \(\textbf{Y}\) has a multivariate (\(n\)-variate) distribution. The mean of the distribution is represented by a vector. The variance of the distribution is represented by a matrix, the variance-covariance matrix, a special case of a covariance matrix.

The covariance matrix between random vectors \(\textbf{Y}_{(k \times 1)}\) and \(\mathbf{U}_{(p \times 1)}\) is a \((k \times p)\) matrix whose typical elements are the covariances between the elements of \(\textbf{Y}\) and \(\mathbf{U}\):

\[\text{Cov}\left\lbrack \textbf{Y},\mathbf{U} \right\rbrack = \left\lbrack \text{Cov}(Y_{i},U_{j}) \right\rbrack\]

The covariance matrix can be written in terms of expected values of \(\textbf{Y}\), \(\mathbf{U}\), and \(\textbf{Y}\mathbf{U}^\prime\)

\[\text{Cov}\left\lbrack \textbf{Y},\mathbf{U} \right\rbrack = \text{E}\left\lbrack \left( \textbf{Y}- \text{E}\lbrack\textbf{Y}\rbrack \right)\left( \mathbf{U} - \text{E}\left\lbrack \mathbf{U} \right\rbrack \right)^\prime \right\rbrack = \text{E}\left\lbrack \textbf{Y}\mathbf{U}^\prime \right\rbrack - \text{E}\left\lbrack \textbf{Y}\right\rbrack \text{E}\left\lbrack \mathbf{U} \right\rbrack^{\prime}\]

Some useful rules to manipulate covariance matrices are:

  1. \(\text{Cov}\left\lbrack \mathbf{AY},\mathbf{U} \right\rbrack = \textbf{A}\text{Cov}\lbrack\textbf{Y},\mathbf{U}\rbrack\)

  2. \(\text{Cov}\left\lbrack \textbf{Y},\mathbf{BU} \right\rbrack = \text{Cov}\left\lbrack \textbf{Y},\mathbf{U} \right\rbrack\textbf{B}^\prime\)

  3. \(\text{Cov}\left\lbrack \mathbf{AY},\mathbf{BU} \right\rbrack = \textbf{A}\text{Cov}\left\lbrack \textbf{Y},\mathbf{U} \right\rbrack\ \textbf{B}^\prime\)

  4. \(\text{Cov}\left\lbrack a\textbf{Y}+ b\mathbf{U},c\mathbf{W} + d\textbf{V}\right\rbrack = ac\text{Cov}\left\lbrack \textbf{Y},\mathbf{W} \right\rbrack + bc\text{Cov}\left\lbrack \mathbf{U},\mathbf{W} \right\rbrack + ad\text{Cov}\left\lbrack \textbf{Y},\textbf{V}\right\rbrack + bd\text{Cov}\lbrack\mathbf{U},\textbf{V}\rbrack\)

Variance-covariance Matrix

The variance-covariance matrix (or variance matrix for short) of a random vector \(\textbf{Y}\) is the covariance matrix of \(\textbf{Y}\) with itself.

\[\text{Var}\left\lbrack \textbf{Y}\right\rbrack = \text{Cov}\left\lbrack \textbf{Y},\textbf{Y}\right\rbrack = \text{E}\left\lbrack \left( \textbf{Y}- \text{E}\lbrack\textbf{Y}\rbrack \right)\left( \textbf{Y}-\text{E}\left\lbrack \textbf{Y}\right\rbrack \right)^\prime \right\rbrack = \text{E}\left\lbrack \textbf{Y}\textbf{Y}^\prime \right\rbrack - \text{E}\left\lbrack \textbf{Y}\right\rbrack \text{E}\left\lbrack \textbf{Y}\right\rbrack^{\prime}\]

The diagonal entries of the variance-covariance matrix contain the variances of the \(Y_{i}\). The off-diagonal cells contain the covariances \(\text{Cov}\left\lbrack Y_{i},Y_{j} \right\rbrack\). If the variance matrix is diagonal, the elements of the random vector \(\textbf{Y}\) are uncorrelated. Similarly, two random vectors \(\textbf{Y}_{1}\) and \(\textbf{Y}_{2}\) are uncorrelated if the variance matrix of their joint vector is block-diagonal:

\[\text{Var}\begin{bmatrix} \textbf{Y}_{1} \\ \textbf{Y}_{2} \end{bmatrix} = \begin{bmatrix} \text{Var}\lbrack\textbf{Y}_{1}\rbrack & \textbf{0}\\ \textbf{0}& \text{Var}\lbrack\textbf{Y}_{2}\rbrack \end{bmatrix}\]

A very special variance-covariance matrix in statistical models is the scaled identity matrix, \(\sigma^{2}\textbf{I}\). This is the variance matrix of uncorrelated observations drawn from the same distribution—a common assumption for the error terms in models.

The rules for working with covariances extend to working with variance matrices:

  1. \(\text{Var}\left\lbrack \mathbf{AY} \right\rbrack = \textbf{A}\ \text{Var}\left\lbrack \textbf{Y}\right\rbrack\textbf{A}^\prime\)

  2. \(\text{Var}\left\lbrack \textbf{Y}+ \textbf{A}\right\rbrack = \text{Var}\lbrack\textbf{Y}\rbrack\)

  3. \(\text{Var}\left\lbrack \textbf{A}^\prime\textbf{Y}\right\rbrack = \textbf{A}^\prime\text{Var}\left\lbrack \textbf{Y}\right\rbrack\textbf{A}\)

  4. \(\text{Var}\left\lbrack a\textbf{Y}\right\rbrack = a^{2}\text{Var}\lbrack\textbf{Y}\rbrack\)

  5. \(\text{Var}\left\lbrack a\textbf{Y}+ b\mathbf{U} \right\rbrack = a^{2}\text{Var}\left\lbrack \textbf{Y}\right\rbrack + b^{2}\text{Var}\left\lbrack \mathbf{U} \right\rbrack + 2ab\ \text{Cov}\lbrack\textbf{Y},\mathbf{U}\rbrack\)

Finally, an important result about expected values of quadratic forms, heavily used in decomposing variability, is

\[\text{E}\left\lbrack \textbf{Y}^\prime\mathbf{AY} \right\rbrack = tr\left( \textbf{A}\text{Var}\left\lbrack \textbf{Y}\right\rbrack \right) + \text{E}\left\lbrack \textbf{Y}\right\rbrack^\prime\textbf{A}\ \text{E}\lbrack\textbf{Y}\rbrack\]
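A Monte Carlo sketch of this quadratic-form identity (assuming NumPy; the mean vector, variance matrix, and number of draws are illustrative): the simulated average of \(\textbf{Y}^\prime\textbf{A}\textbf{Y}\) should be close, up to simulation error, to \(tr(\textbf{A}\,\text{Var}[\textbf{Y}]) + \text{E}[\textbf{Y}]^\prime\textbf{A}\,\text{E}[\textbf{Y}]\).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
mu = np.array([1.0, -2.0, 0.5])
L = rng.normal(size=(n, n))
V = L @ L.T                                  # a valid (positive definite) variance matrix
A = rng.normal(size=(n, n))

# Simulate many draws of Y ~ (mu, V) and average the quadratic form Y'AY
Y = rng.multivariate_normal(mu, V, size=200_000)
mc = np.mean(np.einsum('ij,jk,ik->i', Y, A, Y))

theory = np.trace(A @ V) + mu @ A @ mu
print(mc, theory)                            # close, up to Monte Carlo error
```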

3.5 Matrix Differentiation

Estimation of parameters in statistical models often requires minimization or maximization of an objective function. For example, the ordinary least squares (OLS) principle finds the OLS estimator as the function of the data that minimizes the error sum of squares of the model. Maximum likelihood finds estimators of the parameters as the functions of the data that maximize the joint likelihood (the joint distribution function) of the data.

The parameters of the models appear as elements of vectors and matrices. Finding estimators of the parameters thus requires calculus on vectors and matrices. Consider matrix \(\textbf{A}\), whose elements depend on a scalar parameter \(\theta\), \(\textbf{A}= \left\lbrack a_{ij}(\theta) \right\rbrack\). The derivative of \(\textbf{A}\) with respect to \(\theta\) is the matrix of the derivatives of the typical elements \(a_{ij}(\theta)\) with respect to \(\theta\). We write this formally as

\[\frac{\partial\textbf{A}}{\partial\theta} = \left\lbrack \frac{\partial a_{ij}(\theta)}{\partial\theta} \right\rbrack\]

The derivative of a function \(f(\boldsymbol{\theta})\) with respect to the vector \(\boldsymbol{\theta}_{(p \times 1)}\) is the vector of the partial derivatives of the function

\[\frac{\partial f\left( \boldsymbol{\theta}\right)}{\partial\boldsymbol{\theta}} = \begin{bmatrix} \frac{\partial f\left( \boldsymbol{\theta}\right)}{\partial\theta_{1}} \\ \vdots \\ \frac{\partial f\left( \boldsymbol{\theta}\right)}{\partial\theta_{p}} \end{bmatrix}\]

Here are some useful results from vector and matrix calculus, where \(\textbf{A}\) and \(\textbf{B}\) are functions of \(\theta\) and the vector \(\textbf{x}\) does not depend on \(\theta\):

  1. \(\frac{{\partial ln}\left| \textbf{A}\right|}{\partial\theta} = \frac{1}{\left| \textbf{A}\right|}\frac{\partial\left| \textbf{A}\right|}{\partial\theta} = tr\left( \textbf{A}^{- 1}\frac{\partial\textbf{A}}{\partial\theta} \right)\)

  2. \(\frac{\partial\textbf{A}^{- 1}}{\partial\theta} = - \textbf{A}^{- 1}\frac{\partial\textbf{A}}{\partial\theta\ }\textbf{A}^{- 1}\)

  3. \(\frac{\partial tr\left( \mathbf{AB} \right)}{\partial\theta} = tr\left( \frac{\mathbf{\partial}\textbf{A}}{\partial\theta}\textbf{B}\right) + tr\left( \textbf{A}\frac{\mathbf{\partial}\textbf{B}}{\partial\theta} \right)\)

  4. \(\frac{\partial\textbf{X}^\prime\textbf{A}^{- 1}\textbf{X}}{\partial\theta} = - \textbf{X}^\prime\textbf{A}^{- 1}\frac{\partial\textbf{A}}{\partial\theta}\textbf{A}^{- 1}\textbf{X}\)

  5. \(\frac{\partial\textbf{x}^{\prime}\textbf{A}\textbf{x}}{\partial\textbf{x}} = 2\textbf{A}\textbf{x}\) (for symmetric \(\textbf{A}\))

  6. \(\frac{\partial\textbf{x}^\prime\textbf{A}}{\partial\textbf{x}} = \frac{\partial\textbf{A}^\prime\textbf{x}}{\partial\textbf{x}} = \textbf{A}\)
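A finite-difference sketch of rule 5 (assuming NumPy; the symmetric matrix and step size are illustrative): the numerical gradient of \(\textbf{x}^\prime\textbf{A}\textbf{x}\) should match \(2\textbf{A}\textbf{x}\).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
M = rng.normal(size=(n, n))
A = (M + M.T) / 2                    # symmetric A, as rule 5 requires
x = rng.normal(size=n)

def f(x):
    return x @ A @ x                 # the quadratic form x'Ax

# Central finite-difference gradient of f at x
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])

print(np.allclose(grad_fd, 2 * A @ x, atol=1e-4))    # rule 5: the gradient is 2Ax
```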

3.6 Idempotent Matrices

A matrix is called idempotent if multiplying the matrix by itself yields the matrix: \(\textbf{A}\textbf{A}= \textbf{A}\). Because \(\textbf{A}\) must conform to itself for multiplication, idempotent matrices are square. Idempotent matrices that are also symmetric play an important role in statistical estimation. Idempotent matrices are projection matrices; that means they map a vector from a space into a sub-space. Symmetric idempotent matrices are orthogonal projection matrices.

Among some of the interesting properties of idempotent matrices are the following:

  1. The trace of an idempotent matrix is equal to its rank.

  2. The trace of an idempotent matrix is an integer.

  3. The eigenvalues of an idempotent matrix are either 0 or 1.

Projections

For example, suppose we want to find a solution for \(\boldsymbol{\beta}\) in the linear model \(\textbf{Y}_{(n \times 1)} = \textbf{X}\boldsymbol{\beta}_{((p + 1) \times 1)} + \boldsymbol{\epsilon}\). The vector \(\textbf{Y}\) is a vector in the \(n\)-dimensional space \(\mathbb{R}^{n}\), and the model places a restriction on the predicted values \(\widehat{\textbf{Y}} = \textbf{X}\widehat{\boldsymbol{\beta}}\): the predicted values are confined to a \((p + 1)\)-dimensional sub-space of \(\mathbb{R}^{n}\). Regardless of how we choose the estimator \(\widehat{\boldsymbol{\beta}}\), we are projecting the vector \(\textbf{Y}\) onto a sub-space of \(\mathbb{R}^{n}\).

We can thus think of the problem of finding the best estimator in this model as the problem of finding the best projection onto the space generated by the columns of \(\textbf{X}\). In that case, why not choose the projection that minimizes the distance between the observed values \(\textbf{Y}\) and the predicted values \(\widehat{\textbf{Y}}\)? This is achieved by projecting \(\textbf{Y}\) perpendicularly (orthogonally) onto the sub-space generated by \(\textbf{X}\). In other words, our solution is the vector \(\widehat{\boldsymbol{\beta}}\) for which the residual \(\textbf{Y}- \textbf{X}\widehat{\boldsymbol{\beta}}\) is orthogonal to every column of \(\textbf{X}\):

\[\textbf{X}^\prime\left( \textbf{Y}- \textbf{X}\widehat{\boldsymbol{\beta}} \right) = \textbf{0}\]

Multiplying out and rearranging terms yields

\[\textbf{X}^{\prime}\textbf{Y}- \textbf{X}^\prime\textbf{X}\widehat{\boldsymbol{\beta}} = \textbf{0}\]

which implies that

\[\textbf{X}^{\prime}\textbf{Y}= \textbf{X}^\prime\textbf{X}\widehat{\boldsymbol{\beta}}\]

If \(\textbf{X}\) is of full column rank, then \(\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\) exists and we can solve:

\[\widehat{\boldsymbol{\beta}}=\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\textbf{X}^{\prime}\textbf{Y}\]

The “Hat” Matrix

The ordinary least squares estimator is the orthogonal projection of \(\textbf{Y}\) onto the sub-space created by the columns of \(\textbf{X}\). To see the projection matrix at work, compute the predicted values, \(\textbf{X}\widehat{\boldsymbol{\beta}}\):

\[\textbf{X}\widehat{\boldsymbol{\beta}}=\textbf{X}\left( \textbf{X}^{\prime}\textbf{X}\right)^{- 1}\textbf{X}^{\prime}\textbf{Y}= \widehat{\textbf{Y}}\]

The matrix \(\textbf{X}\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\textbf{X}^{\prime}\) has a very special role in regression analysis. It is often called the “hat” matrix and denoted \(\textbf{H}\), because pre-multiplying \(\textbf{Y}\) with \(\textbf{H}\) puts the hats on \(\textbf{Y}\):

\[\textbf{H}\textbf{Y}= \widehat{\textbf{Y}}\]

Let’s verify that \(\textbf{H}\) is indeed a projection matrix, which requires that \(\textbf{H}\textbf{H}= \textbf{H}\):

\[\textbf{H}\textbf{H}= \textbf{X}\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\textbf{X}^{\prime} \times \textbf{X}\left( \textbf{X}^\prime\textbf{X}\right)^{-1}\textbf{X}^\prime=\textbf{X}\left( \textbf{X}^\prime\textbf{X}\right)^{- 1}\textbf{X}^{\prime} = \textbf{H}\]

Matrices with the property \(\textbf{H}\textbf{H}= \textbf{H}\) are called idempotent matrices; these are projection matrices. If, in addition, \(\textbf{H}\) is symmetric, \(\textbf{H}^\prime = \textbf{H}\), the matrix is called symmetric idempotent; these are orthogonal projection matrices. (An idempotent matrix that is not symmetric is called an oblique projector.)

The hat matrix in the regression model is a symmetric idempotent matrix.
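A minimal NumPy sketch (synthetic data, illustrative sizes) that forms the hat matrix and checks that it is idempotent, symmetric, and puts the hats on \(\textbf{Y}\):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(scale=0.2, size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T       # the "hat" matrix
print(np.allclose(H @ H, H))               # idempotent: HH = H
print(np.allclose(H, H.T))                 # symmetric: an orthogonal projector

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(H @ y, X @ beta_hat))    # HY equals the predicted values
```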

Here are some results about (symmetric) idempotent matrices that come in handy when working out the properties of estimators in regression-type models:

  1. Projection matrices are typically not of full rank. If an \((n \times n)\) idempotent matrix is of rank \(n\), then it is the identity matrix \(\textbf{I}\).

  2. If \(\textbf{A}\) is (symmetric) idempotent, then \(\textbf{I}- \textbf{A}\) is (symmetric) idempotent.

  3. If \(\textbf{A}\) is idempotent and \(\mathbf{P}\) is non-singular, then \(\mathbf{P}\textbf{A}\mathbf{P}^{-1}\) is an idempotent matrix.

You can use these properties to show that in the linear regression model with uncorrelated errors and equal variance the variance matrix of the model residuals \(\textbf{Y}- \widehat{\textbf{Y}}\) is

\[\text{Var}\left\lbrack \textbf{Y}- \widehat{\textbf{Y}} \right\rbrack = \sigma^{2}(\textbf{I}- \textbf{H})\]

A Special Case

Consider the special case where \(\textbf{X}= \textbf{1}_{n}\), a column vector of ones. The corresponding linear model is \(\textbf{Y}= \textbf{1}\beta + \boldsymbol{\epsilon}\), an intercept-only model. The hat matrix for this model is

\[\textbf{1}\left( \textbf{1}^{\prime}\textbf{1}\right)^{- 1}\textbf{1}^\prime = \frac{1}{n}\textbf{1}\textbf{1}^\prime = \frac{1}{n}\textbf{J}= \begin{bmatrix} \frac{1}{n} & \cdots & \frac{1}{n} \\ \vdots & \ddots & \vdots \\ \frac{1}{n} & \cdots & \frac{1}{n} \end{bmatrix}\]

a matrix filled with \(\frac{1}{n}\). The projection of \(\textbf{Y}\) onto the space generated by \(\textbf{1}_{n}\) is

\[\frac{1}{n}\mathbf{JY} = \begin{bmatrix} \overline{Y} \\ \vdots \\ \overline{Y} \end{bmatrix}\]

The predicted values are all the same, the sample mean. In other words, \(\widehat{\beta} = \overline{Y}\). Since the projector is idempotent, deriving the variance of the predicted values in the iid case is simple:

\[\text{Var}\left\lbrack \textbf{H}\textbf{Y}\right\rbrack = \textbf{H}\text{Var}\left\lbrack \textbf{Y}\right\rbrack\textbf{H}^\prime = \sigma^{2}\textbf{H}\textbf{H}^{\prime} = \sigma^{2}\textbf{H}= \sigma^{2}\frac{1}{n}\textbf{J}=\begin{bmatrix} \frac{\sigma^{2}}{n} & \cdots & \frac{\sigma^{2}}{n} \\ \vdots & \ddots & \vdots \\ \frac{\sigma^{2}}{n} & \cdots & \frac{\sigma^{2}}{n} \end{bmatrix}\]

3.7 Multivariate Gaussian Distribution

Definition

A scalar random variable \(Y\) has a Gaussian distribution function with mean \(\mu\) and variance \(\sigma^{2}\) if its probability density function is given by

\[f(y) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left\{ - \frac{1}{2\sigma^{2}}(y - \mu)^{2} \right\}\]

We also say that \(Y\) is normally distributed with mean \(\mu\) and variance \(\sigma^{2}\). The shorthand expressions \(Y \sim G(\mu,\sigma^{2})\) or \(Y \sim N(\mu,\sigma^{2})\) are common.

The generalization from a scalar random variable \(Y\) to a random vector \(\textbf{Y}_{(n \times 1)}\) with a multivariate Gaussian distribution is as follows. \(\textbf{Y}_{(n \times 1)}\) has a multivariate Gaussian (normal) distribution with mean \(\boldsymbol{\mu}\) and variance matrix \(\textbf{V}\), if its density is given by

\[f\left( \textbf{Y}\right)=\frac{\left| \textbf{V}\right|^{- 1/2}}{(2\pi)^{\frac{n}{2}}}\exp\left\{ - \frac{1}{2}\left( \textbf{Y}- \boldsymbol{\mu}\right)^\prime\textbf{V}^{- 1}\left( \textbf{Y}- \boldsymbol{\mu}\right) \right\}\]

This is denoted with the shorthand \(\textbf{Y}\sim G_{n}(\boldsymbol{\mu},\textbf{V})\) or \(\textbf{Y}\sim N_{n}(\boldsymbol{\mu},\textbf{V})\). If the dimension of the distribution is clear from context, the subscript \(n\) can be omitted. A special case is the standard multivariate Gaussian distribution with mean \(\textbf{0}\) and variance matrix \(\textbf{I}\):

\[f\left( \textbf{Y}\right)=\frac{1}{(2\pi)^{\frac{n}{2}}}\exp\left\{ - \frac{1}{2}\textbf{Y}^{\prime}\textbf{Y}\right\} = \frac{1}{(2\pi)^{\frac{n}{2}}}\exp\left\{ - \frac{1}{2}\sum_{i = 1}^{n}y_{i}^{2} \right\}\]

But this is just the product of the \(n\) univariate densities of \(N(0,1)\) random variables:

\[f\left( \textbf{Y}\right) = f\left( y_{1} \right) \times \cdots \times f\left( y_{n} \right)\]

where

\[f\left( y_{i} \right) = \frac{1}{(2\pi)^{1/2}}\exp\left\{ - \frac{1}{2}y_{i}^{2} \right\}\]

If the variance matrix is diagonal—that is, the \(Y_{i}\) are uncorrelated—the multivariate normal distribution is the product of the univariate distributions. The random variables are independent.

Properties

Gaussian distributions have amazing (magical) properties.

Linear combinations are Gaussian

For example, a linear combination of Gaussian random variables also follows a Gaussian distribution. Formally, this can be expressed as follows: if \(\textbf{Y}\sim G_{n}\left( \boldsymbol{\mu},\textbf{V}\right)\) and \(\textbf{A}\) and \(\textbf{b}\) are a matrix and a vector of constants (not random variables), respectively, then \(\mathbf{AY} + \textbf{b}\) follows a \(G(\textbf{A}\boldsymbol{\mu} + \textbf{b},\ \textbf{A}\textbf{V}\textbf{A}^\prime)\) distribution.

A special case of this result is that if \(\textbf{Y}\sim G_{n}\left( \boldsymbol{\mu},\textbf{V}\right)\), then \(\textbf{Y}- \boldsymbol{\mu}\) has a \(G\left( \textbf{0},\textbf{V}\right)\) distribution.

Because a linear function of a Gaussian random variable is Gaussian distributed, you can define all multivariate Gaussian distributions as linear transformations of the standard multivariate Gaussian distribution. If \(\textbf{Z}\sim G_{n}(\textbf{0},\textbf{I})\), and \(\textbf{V}= \textbf{C}^\prime\textbf{C}\), then \(\textbf{Y}= \mathbf{C}^\prime\textbf{Z}+ \boldsymbol{\mu}\) has a \(G(\boldsymbol{\mu},\textbf{V})\) distribution.
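A hedged sketch of this construction with NumPy: `np.linalg.cholesky` returns a lower-triangular factor \(\textbf{C}\) with \(\textbf{C}\textbf{C}^\prime = \textbf{V}\) (the roles of \(\textbf{C}\) and \(\textbf{C}^\prime\) are therefore swapped relative to the text), so \(\textbf{C}\textbf{Z} + \boldsymbol{\mu}\) has the desired mean and variance. The particular \(\textbf{V}\) and the sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 3
mu = np.array([1.0, 0.0, -1.0])
V = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])

C = np.linalg.cholesky(V)                 # lower-triangular C with C C' = V
Z = rng.normal(size=(n, 100_000))         # standard Gaussian draws, one column per sample
Y = (C @ Z).T + mu                        # Var[C Z] = C I C' = V, and the mean shifts to mu

print(Y.mean(axis=0))                     # close to mu
print(np.cov(Y, rowvar=False))            # close to V
```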

Zero covariance implies independence

Another unusual property of Gaussian random variables is that if they are uncorrelated, they are also stochastically independent. We derived this above for the special case of \(\textbf{Y}\sim G(\textbf{0},\sigma^{2}\textbf{I})\).

You cannot in general conclude that random variables are independent based on their lack of correlation. For Gaussian random variables you can. This result can be extended to Gaussian random vectors. Suppose \(\textbf{Y}_{(n \times 1)} \sim G(\boldsymbol{\mu},\ \textbf{V})\) is partitioned into two sub-vectors of size \(s\) and \(k\), where \(n = s + k\). Then we can similarly partition the mean vector and variance matrix:

\[\textbf{Y}_{(n \times 1)} = \begin{bmatrix} \textbf{Y}_{1(s \times 1)} \\ \textbf{Y}_{2(k \times 1)} \end{bmatrix},\ \ \boldsymbol{\mu}= \begin{bmatrix} \boldsymbol{\mu}_{1} \\ \boldsymbol{\mu}_{2} \end{bmatrix},\ \ \ \ \ \ \textbf{V}= \begin{bmatrix} \textbf{V}_{11} & \textbf{V}_{12} \\ \textbf{V}_{21} & \textbf{V}_{22} \end{bmatrix}\]

If \(\textbf{V}_{12} = \textbf{0}\), then \(\textbf{Y}_{1}\) and \(\textbf{Y}_{2}\) are independent. Also, each partition is Gaussian distributed, for example, \(\textbf{Y}_{1} \sim G(\boldsymbol{\mu}_{1},\ \textbf{V}_{11})\). We call the distribution of \(\textbf{Y}_{1}\) the marginal distribution.

It follows immediately that each element of \(\textbf{Y}\) follows a (univariate) Gaussian distribution, \(Y_{i} \sim G(\mu_{i},V_{ii})\)—all marginal univariate distributions are Gaussian.

Conditionals are Gaussian

The conditional distribution of \(\textbf{Y}_{1}\) given \(\textbf{Y}_{2}\) is also a Gaussian distribution, specifically:

\[\textbf{Y}_{1}|\textbf{Y}_{2} \sim G\left( \boldsymbol{\mu}_{1}\mathbf{+}\textbf{V}_{12}\textbf{V}_{22}^{- 1}\left( \textbf{Y}_{2} - \boldsymbol{\mu}_{2} \right),\ \textbf{V}_{11} - \textbf{V}_{12}\textbf{V}_{22}^{- 1}\textbf{V}_{12}^\prime \right)\]

This result plays an important role when predicting Gaussian random variables, for example in mixed models.

Notice that the variance matrix of the conditional distribution does not depend on the particular value \(\textbf{Y}_{2} = \textbf{y}_{2}\) on which the distribution is conditioned. However, the mean of the conditional distribution does depend on \(\textbf{y}_{2}\), unless \(\textbf{V}_{12} = \textbf{0}\), the condition established earlier for independence of \(\textbf{Y}_{1}\) and \(\textbf{Y}_{2}\).
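A small numerical sketch of the conditional-distribution formula (assuming NumPy; the partitioned mean and variance values below are illustrative):

```python
import numpy as np

# Partitioned mean and variance of (Y1, Y2), with Y1 scalar and Y2 two-dimensional
mu1 = np.array([1.0])
mu2 = np.array([0.0, 2.0])
V11 = np.array([[2.0]])
V12 = np.array([[0.8, 0.3]])
V22 = np.array([[1.0, 0.2],
                [0.2, 1.5]])

y2 = np.array([0.5, 1.0])                      # the observed value of Y2

V22_inv = np.linalg.inv(V22)
cond_mean = mu1 + V12 @ V22_inv @ (y2 - mu2)   # mu1 + V12 V22^{-1} (y2 - mu2)
cond_var  = V11 - V12 @ V22_inv @ V12.T        # V11 - V12 V22^{-1} V12'
print(cond_mean, cond_var)
```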