8  Vectors and Matrices

8.1 Introduction

Working with vectors and matrices is essential for statisticians. We express the mathematics of statistics in terms of scalars, vectors, and matrices. As you move into machine learning, in particular deep learning, the horizon expands from matrices to tensors (multi-dimensional arrays). For now we stick with one-dimensional (=vectors) and two-dimensional (=matrices) arrays of real numbers.

Creating Vectors and Matrices

Vectors and matrices in R are special cases of the array data type. An array can have one, two, or more dimensions, as indicated by its dim() attribute. One-dimensional arrays correspond to vectors, two-dimensional arrays to matrices.

If you create a one-dimensional array, you are automatically creating a vector

a <- c(1:5)
a
## [1] 1 2 3 4 5
is.vector(a)
## [1] TRUE

Frequently we need vectors of repeated or constant values; the rep function helps with that:

rep(1:4)
[1] 1 2 3 4
rep(1:4,times=2)
[1] 1 2 3 4 1 2 3 4
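
Two other useful variants (a quick aside): the times= argument replicates the whole vector, while each= replicates every element in turn.

rep(0,times=5)
[1] 0 0 0 0 0
rep(1:4,each=2)
[1] 1 1 2 2 3 3 4 4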

To create a regular sequence of values, use seq

seq(1:5)
## [1] 1 2 3 4 5
seq(from=1,to=10,by=0.5)
##  [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0
## [16]  8.5  9.0  9.5 10.0
seq(length.out=10)
##  [1]  1  2  3  4  5  6  7  8  9 10

To create a matrix you can use different approaches:

  • Use the matrix function
  • Convert a vector by changing its dimensions
  • Coerce a numeric object into a matrix with as.matrix

matrix function

B <- matrix(1:10,ncol=2) # default is ncol=1, byrow=FALSE
B
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
E <- matrix(1:10,ncol=2,byrow=TRUE)
E
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
[4,]    7    8
[5,]    9   10

Converting a vector into a matrix

Since a matrix is a two-dimensional array, and a vector is one-dimensional, a simple technique for creating a matrix from a vector is to make an assignment to the dimension attribute. In the following code, a one-dimensional vector of length 20 is shaped into a \((10 \times 2)\) matrix. Note that the elements of the matrix are filled in column order: the first 10 elements of the vector are assigned to the rows of column 1, the next 10 elements are assigned to the rows of column 2.

set.seed(876)
x <- round(rnorm(20),3)
x
##  [1]  0.171  0.782 -1.064 -0.264  0.114  1.940  0.461  1.226 -0.415  0.013
## [11]  1.154  0.837  0.247 -0.916  0.378 -1.406 -0.367  0.127 -0.848 -0.146
dim(x) <- c(10,2)
x
##         [,1]   [,2]
##  [1,]  0.171  1.154
##  [2,]  0.782  0.837
##  [3,] -1.064  0.247
##  [4,] -0.264 -0.916
##  [5,]  0.114  0.378
##  [6,]  1.940 -1.406
##  [7,]  0.461 -0.367
##  [8,]  1.226  0.127
##  [9,] -0.415 -0.848
## [10,]  0.013 -0.146
is.matrix(x)
## [1] TRUE

What if you want to create a \((10 \times 2)\) matrix but fill the matrix in row-order: the first two elements in row 1, the next two elements in row 2, and so forth? The solution is to assign the dimensions in reverse order and transpose the result.

x <- round(rnorm(20),3)
x
##  [1]  0.742  1.415  1.603 -0.124 -0.828 -0.138  0.152  0.425 -0.159 -0.837
## [11]  0.364 -0.373 -0.375  0.194  0.238 -0.740  2.435  1.573  1.117 -1.773
dim(x) <- c(2,10)
t(x)
##         [,1]   [,2]
##  [1,]  0.742  1.415
##  [2,]  1.603 -0.124
##  [3,] -0.828 -0.138
##  [4,]  0.152  0.425
##  [5,] -0.159 -0.837
##  [6,]  0.364 -0.373
##  [7,] -0.375  0.194
##  [8,]  0.238 -0.740
##  [9,]  2.435  1.573
## [10,]  1.117 -1.773

Converting a vector into a matrix by changing its dimensions has the advantage that the object is not copied, saving memory.

Coercion

R is good at coercion, the conversion of an object from one data type to another. Coercion happens implicitly when you pass an object of a different type to a function. Coercion can also be done explicitly using as.*-style functions.

By “as.*-style functions” I mean functions such as as.matrix, as.data.frame, as.Date, as.dendrogram, etc., not as*-style functions (without the dot).

For example, to coerce an R object into a matrix, use the as.matrix function. A common usage is to convert a dataframe of numerical data:

df <- data.frame(int=rep(1,4), x1=c(1,2,3,4), x2=rnorm(4))
is.matrix(df)
[1] FALSE
is.matrix(as.matrix(df))
[1] TRUE
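
One caveat with this kind of coercion: if the data frame contains any non-numeric column, as.matrix turns everything into character strings. A minimal sketch with a made-up data frame (df2 is not part of the examples above):

df2 <- data.frame(x1=c(1,2,3), group=c("a","b","c"))
class(as.matrix(df2)[1,1])
[1] "character"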

Basic Operations

When operating on matrices and vectors, we need to distinguish elementwise operations from true matrix operations. For example, take the \((5 \times 2)\) matrices B and E created earlier. Their elementwise product is the matrix with typical element \([b_{ij}*e_{ij}]\). In other words, elements of the matrices are matched up and the multiplication is performed separately in each cell.

The matrix product \(\textbf{B}* \textbf{E}\) is not possible because the matrices do not conform to matrix multiplication. However, the product \(\textbf{B}* \textbf{E}^\prime\) is possible; the result is a \((5 \times 5)\) matrix.

Elementwise operations

B
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10

E
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
## [4,]    7    8
## [5,]    9   10

# Elementwise addition
5 + B
##      [,1] [,2]
## [1,]    6   11
## [2,]    7   12
## [3,]    8   13
## [4,]    9   14
## [5,]   10   15
a + B
##      [,1] [,2]
## [1,]    2    7
## [2,]    4    9
## [3,]    6   11
## [4,]    8   13
## [5,]   10   15
B + E
##      [,1] [,2]
## [1,]    2    8
## [2,]    5   11
## [3,]    8   14
## [4,]   11   17
## [5,]   14   20

# Elementwise multiplication
5 * B
##      [,1] [,2]
## [1,]    5   30
## [2,]   10   35
## [3,]   15   40
## [4,]   20   45
## [5,]   25   50
a * B
##      [,1] [,2]
## [1,]    1    6
## [2,]    4   14
## [3,]    9   24
## [4,]   16   36
## [5,]   25   50
B * E
##      [,1] [,2]
## [1,]    1   12
## [2,]    6   28
## [3,]   15   48
## [4,]   28   72
## [5,]   45  100

Note that when the dimensions of the two arrays do not match up, values are recycled (repeated) as necessary. For example, in 5+B, the scalar 5 is applied to each cell of B. The operation is essentially the same as

matrix(rep(5,10),5,2) + B
     [,1] [,2]
[1,]    6   11
[2,]    7   12
[3,]    8   13
[4,]    9   14
[5,]   10   15

Similarly, in a+B, where a is a vector of length 5, the vector is added elementwise to the first column of B, then to the second column of B.
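
We can make this recycling explicit by building a matrix from a and adding it to B; the following sketch reproduces the a+B result from above.

matrix(rep(a,times=2),5,2) + B
     [,1] [,2]
[1,]    2    7
[2,]    4    9
[3,]    6   11
[4,]    8   13
[5,]   10   15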

You can also perform elementwise logical operations, creating vectors and matrices of TRUE/FALSE values.

(B==3) | (B==2)
      [,1]  [,2]
[1,] FALSE FALSE
[2,]  TRUE FALSE
[3,]  TRUE FALSE
[4,] FALSE FALSE
[5,] FALSE FALSE
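
Logical matrices like this are handy for counting or extracting cells, for example (a small sketch):

sum((B==3) | (B==2))  # how many cells equal 2 or 3
[1] 2
B[(B==3) | (B==2)]    # extract those cells as a vector
[1] 2 3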

Matrix operations

Transposition

A matrix is transposed, that is, its rows and columns are exchanged, with the t() function. Transposition exchanges the dimensions of the underlying array.

B
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
dim(B)
## [1] 5 2
t(B)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
dim(t(B))
## [1] 2 5

Multiplication

Matrix multiplication is performed with the %*% operator. For this operation to succeed, the matrices have to conform to multiplication, that is, the number of columns of the matrix on the left side of the multiplication must equal the number of rows on the right side.

This will fail:

B %*% E
Error in B %*% E: non-conformable arguments

And this product succeeds, since \(\textbf{B}\) and \(\textbf{E}^\prime\) conform to multiplication.

B %*% t(E)
     [,1] [,2] [,3] [,4] [,5]
[1,]   13   27   41   55   69
[2,]   16   34   52   70   88
[3,]   19   41   63   85  107
[4,]   22   48   74  100  126
[5,]   25   55   85  115  145

Diagonal matrices

In statistics we frequently encounter diagonal matrices, square matrices with zeros in off-diagonal cells. Programmatically, two important situations arise:

  • Extracting the diagonal elements of a matrix
  • Forming a diagonal matrix from a vector

The diag function handles both cases

# The (3 x 3) identity matrix
diag(1,3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
# A (3 x 3) diagonal matrix with 1, 2, 3 on the diagonal
diag(1:3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    2    0
[3,]    0    0    3
 diag(B %*% t(E))
## [1]  13  34  63 100 145

C_ <- matrix(1:9 + rnorm(9,0,1e-3),ncol=3)
C_
##          [,1]     [,2]     [,3]
## [1,] 0.999239 4.000457 7.000228
## [2,] 1.999412 4.999930 7.999253
## [3,] 3.000382 6.000139 9.000183
diag(C_)
## [1] 0.999239 4.999930 9.000183

The trace of a matrix is the sum of its diagonal elements. You can compute the trace by combining sum and diag functions:

# The trace of matrix C_
sum(diag(C_))
[1] 14.99935

Crossproduct matrix

The crossproduct of matrices \(\textbf{A}\) and \(\textbf{B}\) is \(\textbf{A}^\prime \textbf{B}\) provided that \(\textbf{A}^\prime\) and \(\textbf{B}\) are conformable for multiplication. The most important crossproduct matrices in statistics are crossproducts of a matrix with itself: \(\textbf{A}^\prime\textbf{A}\). These crossproducts are square, symmetric matrices.

You can calculate a crossproduct matrix directly using matrix multiplication, or a dedicated function (base::crossprod or Matrix::crossprod). The dedicated functions are slightly faster than computing the product directly, but I have found the difference to be pretty negligible, even for large matrices.

X <- matrix(rnorm(300),nrow=100,ncol=3)

# Computing X`X by direct multiplication
XpX <- t(X) %*% X
XpX
          [,1]      [,2]       [,3]
[1,]  81.91635 12.696119 -13.566906
[2,]  12.69612 83.278958  -1.086481
[3,] -13.56691 -1.086481 102.548803
# Computing X`X using the crossprod() function
crossprod(X)
          [,1]      [,2]       [,3]
[1,]  81.91635 12.696119 -13.566906
[2,]  12.69612 83.278958  -1.086481
[3,] -13.56691 -1.086481 102.548803
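
If you want to check the timing claim for yourself, an informal comparison along the following lines will do; Xbig is a made-up larger matrix and the timings depend on your machine and BLAS library, so no output is shown.

# Informal timing comparison of the two ways to form X'X
Xbig <- matrix(rnorm(1e6*5),nrow=1e6,ncol=5)
system.time(t(Xbig) %*% Xbig)
system.time(crossprod(Xbig))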

Inverse matrix

The inverse of a square matrix \(\textbf{A}\), denoted \(\textbf{A}^{-1}\), is the matrix that, if it exists, yields the identity when multiplied with \(\textbf{A}\): \[ \textbf{A}^{-1}\textbf{A}= \textbf{A}\textbf{A}^{-1} = \textbf{I} \] The inverse exists if \(\textbf{A}\) is of full rank; we then say that the matrix is non-singular.

library(Matrix)
rankMatrix(XpX)[1]
[1] 3

The matrix \(\textbf{X}^\prime\textbf{X}\) in our example has rank 3, which equals its number of rows (columns). The matrix is thus of full rank and can be inverted. Computing the inverse matrix is a special case of using the solve function. solve(a,b,...) solves a linear system of equations of the form \[ \textbf{A}\textbf{X}= \textbf{B} \] When the function is called without the b argument, it returns the inverse of a:

XpX_inverse <- solve(XpX)

XpX_inverse
##             [,1]          [,2]          [,3]
## [1,]  0.01278294 -0.0019270000  0.0016707297
## [2,] -0.00192700  0.0122999860 -0.0001246209
## [3,]  0.00167073 -0.0001246209  0.0099711670

round(XpX %*% XpX_inverse,4)
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

8.2 Least Squares from Scratch

Coding algorithms in statistical programming often starts with reproducing the formulas on paper in a programming language.

Suppose we wish to fit a multiple linear regression model with target variable (output) \(Y\) and predictor variables (inputs) \(X_1, \cdots, X_p\). The \(n\) data points for this analysis are arranged in an \((n \times 1)\) vector \[ \textbf{Y} = [Y_1, \cdots, Y_n]^\prime \] an \((n \times (p+1))\) matrix \[ \textbf{X} = \left [ \begin{array}{ccc} 1 & x_{11} & \cdots & x_{p1} \\ 1 & x_{12} & \cdots & x_{p2} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{1n} & \cdots & x_{pn} \end{array}\right] \] and an \((n \times 1)\) vector of error terms \[ \boldsymbol{\epsilon} = [\epsilon_1, \cdots, \epsilon_n]^\prime \]

The complete regression model can be written as \[ \textbf{Y} = \textbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \qquad \boldsymbol{\epsilon} \sim (\textbf{0},\sigma^2 \textbf{I}) \] The statement on the right says that the model errors have mean zero, equal variance \(\sigma^2\), and are uncorrelated. This is also sometimes called the iid assumption (independently and identically distributed), although lack of correlation does not strictly imply independence.

The parameter vector \(\boldsymbol{\beta}\) in this model is typically estimated by ordinary least squares (OLS), the solution is \[ \widehat{\boldsymbol{\beta}} = \left (\textbf{X}^\prime\textbf{X}\right)^{-1}\textbf{X}^\prime\textbf{Y} \] provided that \(\textbf{X}\) is of full column rank (which implies \(\textbf{X}^\prime\textbf{X}\) is non-singular and (\(\textbf{X}^\prime\textbf{X})^{-1}\) exists) and the predicted values are \[ \widehat{\textbf{Y}} = \textbf{X}\widehat{\boldsymbol{\beta}} \]

Let’s use a data set and compute the OLS estimates and the predicted values using matrix–vector operations, then compare the results to the standard output of the linear modeling function lm() in R.

The data set for this exercise is the fitness data set. The data comprise measurements of aerobic capacity and other attributes on 31 men involved in a physical fitness course at N.C. State University.

Aerobic capacity is the ability of the heart and lungs to provide the body with oxygen. It is a measure of fitness and expressed as the oxygen intake in ml per kg body weight per minute. Measuring aerobic capacity is expensive and time consuming compared to attributes such as age, weight, and pulse. The question is whether aerobic capacity can be predicted from the easily measurable attributes. If so, a predictive equation can reduce time and effort to assess aerobic capacity.

The variables are

  • Age: age in years

  • Weight: weight in kg

  • Oxygen: oxygen intake rate (ml per kg body weight per minute)

  • RunTime: time to run 1.5 miles (minutes)

  • RestPulse: heart rate while resting

  • RunPulse: heart rate while running (measured at the same time as the oxygen intake rate)

  • MaxPulse: maximum heart rate recorded while running

The linear model we have in mind is \[ \text{Oxygen}_i = \beta_0 + \beta_1\text{Age}_i + \beta_2\text{Weight}_i + \beta_3\text{RunTime}_i + \beta_4\text{RestPulse}_i + \beta_5\text{RunPulse}_i + \beta_6\text{MaxPulse}_i + \epsilon_i \] \(i=1,\cdots,31\).

The following code makes a connection to the ads DuckDB database, loads the fitness table into an R dataframe, displays the first 6 observations, and closes the connection to the database again.

library("duckdb")
con <- dbConnect(duckdb(),dbdir = "ads.ddb",read_only=TRUE)
fit <- dbGetQuery(con, "SELECT * FROM fitness")

head(fit)
  Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse
1  44  89.47 44.609   11.37        62      178      182
2  40  75.07 45.313   10.07        62      185      185
3  44  85.84 54.297    8.65        45      156      168
4  42  68.15 59.571    8.17        40      166      172
5  38  89.02 49.874    9.22        55      178      180
6  47  77.45 44.811   11.63        58      176      176
dbDisconnect(con)

The target variable for the linear model is Oxygen, the remaining variables are inputs to the regression. The next statements create the \(\textbf{y}\) vector and the \(\textbf{X}\) matrix for the model. Note that the first column of \(\textbf{X}\) is a vector of ones, representing the intercept \(\beta_0\).

y <- as.matrix(fit[,which(names(fit)=="Oxygen")])
X <- as.matrix(cbind(Intcpt=rep(1,nrow(fit)), 
                     fit[,which(names(fit)!="Oxygen")]))
head(X)
     Intcpt Age Weight RunTime RestPulse RunPulse MaxPulse
[1,]      1  44  89.47   11.37        62      178      182
[2,]      1  40  75.07   10.07        62      185      185
[3,]      1  44  85.84    8.65        45      156      168
[4,]      1  42  68.15    8.17        40      166      172
[5,]      1  38  89.02    9.22        55      178      180
[6,]      1  47  77.45   11.63        58      176      176

Next we build the \(\textbf{X}^\prime\textbf{X}\) matrix and compute its inverse, \((\textbf{X}^\prime\textbf{X})^{-1}\), with the solve() function. t() transposes a matrix and %*% indicates matrix multiplication rather than elementwise multiplication.

XpX <- t(X) %*% X
XpXInv <- solve(XpX)

We can verify that XpXInv is indeed the inverse of XpX by multiplying the two. This should yield the identity matrix:

round(XpX %*% XpXInv,3)
          Intcpt Age Weight RunTime RestPulse RunPulse MaxPulse
Intcpt         1   0      0       0         0        0        0
Age            0   1      0       0         0        0        0
Weight         0   0      1       0         0        0        0
RunTime        0   0      0       1         0        0        0
RestPulse      0   0      0       0         1        0        0
RunPulse       0   0      0       0         0        1        0
MaxPulse       0   0      0       0         0        0        1

Next we compute the OLS estimate of \(\boldsymbol{\beta}\) and the predicted values \(\widehat{\textbf{y}} = \textbf{X}\widehat{\boldsymbol{\beta}}\).

beta_hat <- XpXInv %*% t(X) %*% y
beta_hat
                  [,1]
Intcpt    102.93447948
Age        -0.22697380
Weight     -0.07417741
RunTime    -2.62865282
RestPulse  -0.02153364
RunPulse   -0.36962776
MaxPulse    0.30321713
y_hat <- X %*% beta_hat

The estimate of the intercept is \(\widehat{\beta}_0\) = 102.9345, the estimate of the coefficient for Age is \(\widehat{\beta}_1\) = -0.227 and so on.

We could have also used the solve function for its intended purpose, solving a system of linear equations. Above we abused the behavior of solve a bit by calling it with only one argument, in which case it returns the inverse of that argument.

The linear system to solve in the ordinary least squares problem is \[ \textbf{X}^\prime\textbf{X}\boldsymbol{\beta}= \textbf{X}^\prime\textbf{Y} \] This system of equations is called the normal equations. The solution can be computed with the solve function:

solve(XpX,crossprod(X,y))
                  [,1]
Intcpt    102.93447948
Age        -0.22697380
Weight     -0.07417741
RunTime    -2.62865282
RestPulse  -0.02153364
RunPulse   -0.36962776
MaxPulse    0.30321713

The results match beta_hat computed earlier.

The residuals \(\widehat{\boldsymbol{\epsilon}} = \textbf{y}- \widehat{\textbf{y}}\) and the error sum of squares

\[ \text{SSE} = (\textbf{y}- \widehat{\textbf{y}} )^\prime (\textbf{y}- \widehat{\textbf{y}}) = \sum_{i=1}^n \left(y_i - \widehat{y}_i\right)^2\] and the estimate of the residual variance \[\widehat{\sigma}^2 = \frac{1}{n-r(\textbf{X})} \, \text{SSE} \]

are computed as

residuals <- y - y_hat
SSE <- sum(residuals^2)
n <- nrow(fit)
rankX <- rankMatrix(XpX)[1]
sigma2_hat <- SSE/(n - rankX)

SSE
[1] 128.8379
sigma2_hat
[1] 5.368247

We used the rankMatrix function in the Matrix package to compute the rank of \(\textbf{X}\), which is identical to the rank of \(\textbf{X}^\prime\textbf{X}\). With these quantities available, the variance-covariance matrix of \(\widehat{\boldsymbol{\beta}}\), \[\ \text{Var}[\widehat{\boldsymbol{\beta}}] = \sigma^2 (\textbf{X}^\prime\textbf{X})^{-1} \] can be estimated by substituting \(\widehat{\sigma}^2\) for \(\sigma^2\). The standard errors of the regression coefficient estimates are the square roots of the diagonal values of this matrix.

Var_beta_hat <- sigma2_hat * XpXInv
se_beta_hat <- sqrt(diag(Var_beta_hat))
se_beta_hat
     Intcpt         Age      Weight     RunTime   RestPulse    RunPulse 
12.40325810  0.09983747  0.05459316  0.38456220  0.06605428  0.11985294 
   MaxPulse 
 0.13649519 

Now let’s compare our results to the output from the lm() function in R.

linmod <- lm(Oxygen ~ ., data=fit)
summary(linmod)

Call:
lm(formula = Oxygen ~ ., data = fit)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.4026 -0.8991  0.0706  1.0496  5.3847 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 102.93448   12.40326   8.299 1.64e-08 ***
Age          -0.22697    0.09984  -2.273  0.03224 *  
Weight       -0.07418    0.05459  -1.359  0.18687    
RunTime      -2.62865    0.38456  -6.835 4.54e-07 ***
RestPulse    -0.02153    0.06605  -0.326  0.74725    
RunPulse     -0.36963    0.11985  -3.084  0.00508 ** 
MaxPulse      0.30322    0.13650   2.221  0.03601 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.317 on 24 degrees of freedom
Multiple R-squared:  0.8487,    Adjusted R-squared:  0.8108 
F-statistic: 22.43 on 6 and 24 DF,  p-value: 9.715e-09

Based on the quantities calculated earlier, the following code reproduces the lm summary.

tvals <- beta_hat/se_beta_hat
pvals <- 2*(1-pt(abs(tvals),n-rankX))
result <- cbind(beta_hat, se_beta_hat, tvals, pvals)
colnames(result) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
round(result,5)
           Estimate Std. Error  t value Pr(>|t|)
Intcpt    102.93448   12.40326  8.29899  0.00000
Age        -0.22697    0.09984 -2.27343  0.03224
Weight     -0.07418    0.05459 -1.35873  0.18687
RunTime    -2.62865    0.38456 -6.83544  0.00000
RestPulse  -0.02153    0.06605 -0.32600  0.74725
RunPulse   -0.36963    0.11985 -3.08401  0.00508
MaxPulse    0.30322    0.13650  2.22145  0.03601
cat("\nResidual standard error: ", sqrt(sigma2_hat)," on ", n-rankX, "degrees of freedom\n")

Residual standard error:  2.316948  on  24 degrees of freedom
SST <- sum( (y -mean(y))^2 )
cat("Multiple R-squared: ", 1-SSE/SST, 
    "Adjusted R-squared: ", 1 - (SSE/SST)*(n-1)/(n-rankX), "\n")
Multiple R-squared:  0.8486719 Adjusted R-squared:  0.8108399 
Fstat <- ((SST-SSE)/(rankX-1)) / (SSE/(n-rankX))
cat("F-statistic: ", Fstat, "on ", 
    rankX-1, "and", n-rankX, "DF, p-value:", 1-pf(Fstat,rankX-1,n-rankX))
F-statistic:  22.43263 on  6 and 24 DF, p-value: 9.715305e-09

8.3 Building a Model Matrix

In the previous example, the model matrix \(\textbf{X}\) was formed in code by prepending a \((31 \times 1)\) vector of ones to the \((31 \times 6)\) matrix of input variables:

X <- as.matrix(cbind(Intcpt=rep(1,nrow(fit)), 
                     fit[,which(names(fit)!="Oxygen")]))

The lm function used a special syntax to specify the model, called a model formula: Oxygen ~ .. The formula specifies the target variable (the dependent variable) on the left side of the tilde and the input (predictor, independent) variables on the right-hand side of the tilde. The dot is shorthand for including all variables in the data frame other than Oxygen as inputs on the right-hand side of the model. The intercept is included automatically in model formulas, because not having an intercept is a special case in statistical modeling (a sketch of how to drop it follows below).

You can generate a model matrix easily by using the model.matrix function with a model formula. In the fitness example,

X_ <- model.matrix(Oxygen ~ ., data=fit)

The two matrices contain identical elements (only the name of the intercept column differs: Intcpt versus (Intercept)), as you can see with

sum(X - X_)
[1] 0

If we wanted to repeat the regression calculations for a model that contains only Age and MaxPulse as predictor variables, it would be easy to construct the model matrix with

X_small_model <- model.matrix(Oxygen ~ Age + MaxPulse, data=fit)
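
To drop the intercept mentioned earlier, remove it in the formula with - 1 (or, equivalently, 0 +). A quick sketch (X_noint is a hypothetical name):

# Remove the intercept column via the formula
X_noint <- model.matrix(Oxygen ~ . - 1, data=fit)
colnames(X_noint)
[1] "Age"       "Weight"    "RunTime"   "RestPulse" "RunPulse"  "MaxPulse" 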

Using model.matrix is very convenient when you work with factors, classification input variables that are not represented in the model by their actual values (which could be strings) but by a set of 0/1 indicator columns in \(\textbf{X}\). A factor with \(k\) unique levels contributes \(k\) such columns, or \(k-1\) columns when the model contains an intercept (R's default treatment contrasts drop one level as the reference). The indicator columns encode which level of the classification variable applies to each observation.
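
The fitness data contains no factors, so here is a small made-up example (hypothetical data, not from the ads database). With an intercept and R's default treatment contrasts, the three-level factor trt is encoded by two 0/1 columns:

d <- data.frame(y=c(5.1,4.8,6.0,6.2,7.1,6.9),
                trt=factor(c("a","a","b","b","c","c")))
head(model.matrix(y ~ trt, data=d))  # head() drops the printed contrast attributes
  (Intercept) trtb trtc
1           1    0    0
2           1    0    0
3           1    1    0
4           1    1    0
5           1    0    1
6           1    0    1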

8.4 Principal Component Analysis from Scratch

Principal Component Analysis (PCA) is one of the oldest statistical techniques and still of great importance today, not only in statistics. Machine learning methods frequently use PCA to visualize data, to summarize data, to impute missing values by matrix completion, and most important, to reduce the dimension of an estimation problem.

The principal components of a set of data \(X_1,\cdots,X_p\) are linear combinations of the \(p\) inputs. The elements of the \(j\)th component score vector are calculated as follows: \[ z_{ij} = \psi_{1j} x_{i1} + \psi_{2j} x_{i2} + \cdots + \psi_{pj} x_{ip} \quad i=1,\cdots,n; \;\; j=1,\cdots,p \] The coefficients \(\psi_{kj}\) are called the rotations or loadings of the principal components. In constructing the linear combination, the \(X\)s are always centered to have a mean of zero and are usually scaled to have standard deviation one.

The scores \(z_{ij}\) are then collected into a score vector \(\textbf{z}_j = [z_{1j}, \cdots, z_{nj}]\) for the \(j\)th component. Note that for each observation \(i\) there are \(p\) inputs and \(p\) scores. So what have we gained? Instead of the \(p\) vectors \(\textbf{x}_1, \cdots, \textbf{x}_p\) we now have the \(p\) vectors \(\textbf{z}_1, \cdots, \textbf{z}_p\). Because of the way in which the \(\psi_{kj}\) are constructed, the PCA scores have very special properties:

  • They have zero mean (if the data was centered):\(\sum_{i=1}^n z_{ij} = 0\)

  • They are uncorrelated: \(\text{Corr}[z_j, z_k] = 0, \forall j \ne k\)

  • \(\text{Var}\left[\sum_{j=1}^p z_j\right] = \sum_{j=1}^p \text{Var}[z_j] = p\) (if data was scaled)

  • The components are ordered in terms of their variance \(\Rightarrow \text{Var}[z_1] > \text{Var}[z_2] > \cdots > \text{Var}[z_p]\)

Consider the following data set with \(n=10\) and \(p=4\).

Table 8.1: Example data for principal component analysis.
Obs \(X_1\) \(X_2\) \(X_3\) \(X_4\)
1 0.344 0.364 0.806 0.160
2 0.363 0.354 0.696 0.249
3 0.196 0.189 0.437 0.248
4 0.200 0.212 0.590 0.160
5 0.227 0.229 0.437 0.187
6 0.204 0.233 0.518 0.090
7 0.197 0.209 0.499 0.169
8 0.165 0.162 0.536 0.267
9 0.138 0.116 0.434 0.362
10 0.151 0.151 0.483 0.223

We now perform the principal component analysis with the prcomp function in R and then compute the results from scratch using linear algebra.

df <- data.frame("x1"=c(0.344,0.363,0.196,0.200,0.227,
                        0.204,0.197,0.165,0.138,0.151),
                 "x2"=c(0.364,0.354,0.189,0.212,0.229,
                        0.233,0.209,0.162,0.116,0.151),
                 "x3"=c(0.806,0.696,0.437,0.590,0.437,
                        0.518,0.499,0.536,0.434,0.483),
                 "x4"=c(0.160,0.249,0.248,0.160,0.187,
                        0.090,0.169,0.267,0.362,0.223)
                 )
df
      x1    x2    x3    x4
1  0.344 0.364 0.806 0.160
2  0.363 0.354 0.696 0.249
3  0.196 0.189 0.437 0.248
4  0.200 0.212 0.590 0.160
5  0.227 0.229 0.437 0.187
6  0.204 0.233 0.518 0.090
7  0.197 0.209 0.499 0.169
8  0.165 0.162 0.536 0.267
9  0.138 0.116 0.434 0.362
10 0.151 0.151 0.483 0.223

prcomp in R

You can compute a principal component analysis in R with the prcomp or the princomp function. We use prcomp here, as it is based on the singular value decomposition and is numerically more stable. The results will match the calculations based on the eigenvalue decomposition below. The scale.=TRUE option requests that the data are scaled to have standard deviation 1 (not the default); centering to mean zero is the default (center=TRUE). The retx=TRUE option requests that the matrix of PCA scores is returned.

pca <- prcomp(df,scale.=TRUE,retx=TRUE)
summary(pca)
Importance of components:
                          PC1    PC2    PC3     PC4
Standard deviation     1.7199 0.9183 0.4450 0.02452
Proportion of Variance 0.7395 0.2108 0.0495 0.00015
Cumulative Proportion  0.7395 0.9504 0.9999 1.00000

prcomp reports the standard deviations (pca$sdev) and the variances associated with the four principal components. The sum of the variances is \(p=4\). This allows a decomposition of the total variability by component:

pca$sdev
## [1] 1.71993171 0.91827030 0.44498671 0.02452314

sum(pca$sdev^2)
## [1] 4

(pca$sdev^2)/sum(pca$sdev^2)
## [1] 0.7395412727 0.2108050872 0.0495032941 0.0001503461
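
The Cumulative Proportion row in the summary is simply the running total of these proportions:

cumsum(pca$sdev^2)/sum(pca$sdev^2)
## [1] 0.7395413 0.9503464 0.9998497 1.0000000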

You can see the matrix of loadings (the rotations) \(\boldsymbol{\Psi}\) as pca$rotation.

pca$rotation
          PC1        PC2         PC3         PC4
x1 -0.5550597 0.23468359  0.46010018 -0.65202771
x2 -0.5740837 0.05358751  0.33570890  0.74488643
x3 -0.5298211 0.20737384 -0.82062990 -0.05340623
x4  0.2857029 0.94818146  0.04662547  0.13096507

The first principal component, with roughly equal loadings for \(X_1\), \(X_2\), and \(X_3\), explains 73.9% of the overall variance. The second component, which is dominated by \(X_4\), explains an additional 21% of the variance. Together, the first two components explain 95% of the total variability in the \(X\)s.

And here are the scores for the four principal components.

round(pca$x,5)
           PC1      PC2      PC3      PC4
 [1,] -3.25197  0.27659 -0.43983  0.02831
 [2,] -2.50685  1.26276  0.42476 -0.02385
 [3,]  0.99638  0.18799  0.46349 -0.00004
 [4,] -0.19065 -0.63335 -0.49528 -0.04232
 [5,]  0.25530 -0.45773  0.77902 -0.00367
 [6,] -0.32314 -1.60994  0.05447  0.02628
 [7,]  0.27979 -0.68519  0.08867  0.01106
 [8,]  1.05828  0.48108 -0.48662  0.00707
 [9,]  2.38236  1.39061 -0.09924  0.02476
[10,]  1.30050 -0.21282 -0.28944 -0.02760
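
To tie these numbers back to the score formula \(z_{ij} = \psi_{1j}x_{i1} + \cdots + \psi_{pj}x_{ip}\), here is a quick check computing the score of the first observation on the first component from the centered and scaled data:

# z_11 by hand: loadings on PC1 times the scaled inputs of observation 1
sum(scale(df)[1,] * pca$rotation[,"PC1"])
[1] -3.251965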

Eigenvalue Decomposition

The PCA can be computed with a singular value decomposition (SVD) of the matrix \(\textbf{X}\) or with an eigendecomposition of the covariance or correlation matrix of \(\textbf{X}\). The SVD is numerically more stable than the eigendecomposition, but we can demonstrate the computations easily using eigenvalues and eigenvectors.

The eigendecomposition of a square \(n \times n\) matrix \(\textbf{A}\) is \[ \textbf{A}= \textbf{Q}\boldsymbol{\Lambda}\textbf{Q}^{-1} \] where \(\textbf{Q}\) is the \(n \times n\) matrix containing the eigenvectors of \(\textbf{A}\) and \(\boldsymbol{\Lambda}\) is a diagonal matrix with the eigenvalues of \(\textbf{A}\) on the diagonal. If \(\textbf{A}\) is symmetric, then \(\textbf{Q}\) is orthogonal and \(\textbf{Q}^{-1} = \textbf{Q}^\prime\). The eigendecomposition of a real symmetric matrix is thus

\[ \textbf{A}= \textbf{Q}\boldsymbol{\Lambda}\textbf{Q}^\prime \] How can we interpret eigenvectors and eigenvalues? The \(n\times 1\) vector \(\textbf{q}\) is an eigenvector of \(\textbf{A}\), if it satisfies the linear equation

\[ \textbf{A}\textbf{q} = \lambda\textbf{q} \]

for a scalar \(\lambda\). This scalar is called the eigenvalue corresponding to \(\textbf{q}\). There are \(n\) eigenvectors and when they are arranged as the columns of a matrix you get \(\textbf{Q}\).

Example: Eigenanalysis of Correlation Matrix

In R you can obtain the eigenvectors and eigenvalues of a square matrix with the eigen function.

There are ten observations of four variables. Let’s take as input for the eigendecomposition the correlation matrix of the data:

c <- cor(df)
c
           x1         x2         x3         x4
x1  1.0000000  0.9835190  0.8362382 -0.2772801
x2  0.9835190  1.0000000  0.8545558 -0.4391878
x3  0.8362382  0.8545558  1.0000000 -0.2895615
x4 -0.2772801 -0.4391878 -0.2895615  1.0000000
e <- eigen(c)

e$values 
## [1] 2.9581650906 0.8432203486 0.1980131763 0.0006013844

sum(e$values)
## [1] 4

sqrt(e$values)
## [1] 1.71993171 0.91827030 0.44498671 0.02452314

e$vectors
##            [,1]       [,2]        [,3]        [,4]
## [1,] -0.5550597 0.23468359 -0.46010018  0.65202771
## [2,] -0.5740837 0.05358751 -0.33570890 -0.74488643
## [3,] -0.5298211 0.20737384  0.82062990  0.05340623
## [4,]  0.2857029 0.94818146 -0.04662547 -0.13096507

The eigen function returns the diagonal elements of \(\boldsymbol{\Lambda}\) in e$values and the matrix \(\textbf{Q}\) of eigenvectors in e$vectors. The input matrix of the decomposition can be reconstructed from those:

A <- e$vectors %*% diag(e$values) %*% t(e$vectors)
round((c - A),4)
   x1 x2 x3 x4
x1  0  0  0  0
x2  0  0  0  0
x3  0  0  0  0
x4  0  0  0  0
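
Each column of \(\textbf{Q}\) also satisfies the defining equation \(\textbf{A}\textbf{q} = \lambda\textbf{q}\); a quick check for the first eigenpair:

# A q - lambda q is numerically zero for the first eigenvector/eigenvalue
round(c %*% e$vectors[,1] - e$values[1] * e$vectors[,1], 10)
   [,1]
x1    0
x2    0
x3    0
x4    0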

What happens if we multiply the \(10 \times 4\) data matrix with the \(4\times 4\) matrix of eigenvectors of the correlation matrix? First, we center and scale the data. The result is a \(10 \times 4\) matrix of linear combinations of the inputs: these are the scores of the PCA.

scores <- scale(df) %*% e$vectors
scores
            [,1]       [,2]        [,3]          [,4]
 [1,] -3.2519650  0.2765895  0.43983079 -2.831053e-02
 [2,] -2.5068508  1.2627631 -0.42476284  2.385301e-02
 [3,]  0.9963834  0.1879868 -0.46348719  3.885523e-05
 [4,] -0.1906528 -0.6333530  0.49527588  4.231872e-02
 [5,]  0.2552987 -0.4577253 -0.77901629  3.672417e-03
 [6,] -0.3231438 -1.6099388 -0.05447168 -2.628130e-02
 [7,]  0.2797933 -0.6851941 -0.08867423 -1.106065e-02
 [8,]  1.0582766  0.4810756  0.48662389 -7.069551e-03
 [9,]  2.3823630  1.3906130  0.09923769 -2.475865e-02
[10,]  1.3004973 -0.2128168  0.28944399  2.759768e-02

The scores have mean 0, their variances equal the eigenvalues of the correlation matrix, their variances sum to \(p=4\), and they are uncorrelated. Note that they agree with pca$x up to the signs of the third and fourth columns; eigenvectors, and hence the scores, are determined only up to a sign flip.

apply(scores,2,mean)
[1] -4.218847e-16 -2.109424e-16  1.498801e-16 -6.696033e-17
apply(scores,2,var)
[1] 2.9581650906 0.8432203486 0.1980131763 0.0006013844
sum(apply(scores,2,var))
[1] 4
round(cor(scores),4)
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    1    0    0
[3,]    0    0    1    0
[4,]    0    0    0    1