8 Vectors and Matrices
8.1 Introduction
Working with vectors and matrices is essential for statisticians. We express the mathematics of statistics in terms of scalars, vectors, and matrices. As you move into machine learning, in particular deep learning, the horizon expands from matrices to tensors (multi-dimensional arrays). For now we stick with one-dimensional (vectors) and two-dimensional (matrices) arrays of real numbers.
Creating Vectors and Matrices
Vectors and matrices in R are special cases of the array data type. An array can have one, two, or more dimensions as indicated by its dim() attribute. Arrays with one dimension are vectors; two-dimensional arrays are matrices.
If you create a one-dimensional array, you are automatically creating a vector:
a <- c(1:5)
a
## [1] 1 2 3 4 5
is.vector(a)
## [1] TRUE
Frequently we create vectors of constant values; the rep function helps with that:
rep(1:4)
[1] 1 2 3 4
rep(1:4,times=2)
[1] 1 2 3 4 1 2 3 4
To create a regular sequence of values, use seq
seq(1:5)
## [1] 1 2 3 4 5
seq(from=1,to=10,by=0.5)
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
## [16] 8.5 9.0 9.5 10.0
seq(length.out=10)
## [1] 1 2 3 4 5 6 7 8 9 10
To create a matrix you can use different approaches:
- Use the matrix function
- Convert a vector by changing its dimensions
- Coerce a numeric object into a matrix with as.matrix
The matrix function
B <- matrix(1:10,ncol=2) # default is ncol=1, byrow=FALSE
B
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
E <- matrix(1:10,ncol=2,byrow=TRUE)
E
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
[4,] 7 8
[5,] 9 10
Converting a vector into a matrix
Since a matrix is a two-dimensional array, and a vector is one-dimensional, a simple technique for creating a matrix from a vector is to make an assignment to the dimension attribute. In the following code, a one-dimensional vector of length 20 is shaped into a \((10 \times 2)\) matrix. Note that the elements of the matrix are filled in column order: the first 10 elements of the vector are assigned to the rows of column 1, the next 10 elements are assigned to the rows of column 2.
set.seed(876)
x <- round(rnorm(20),3)
x
## [1] 0.171 0.782 -1.064 -0.264 0.114 1.940 0.461 1.226 -0.415 0.013
## [11] 1.154 0.837 0.247 -0.916 0.378 -1.406 -0.367 0.127 -0.848 -0.146
dim(x) <- c(10,2)
x
## [,1] [,2]
## [1,] 0.171 1.154
## [2,] 0.782 0.837
## [3,] -1.064 0.247
## [4,] -0.264 -0.916
## [5,] 0.114 0.378
## [6,] 1.940 -1.406
## [7,] 0.461 -0.367
## [8,] 1.226 0.127
## [9,] -0.415 -0.848
## [10,] 0.013 -0.146
is.matrix(x)
## [1] TRUE
What if you want to create a \((10 \times 2)\) matrix but fill the matrix in row-order: the first two elements in row 1, the next two elements in row 2, and so forth? The solution is to assign the dimensions in reverse order and transpose the result.
x <- round(rnorm(20),3)
x
## [1] 0.742 1.415 1.603 -0.124 -0.828 -0.138 0.152 0.425 -0.159 -0.837
## [11] 0.364 -0.373 -0.375 0.194 0.238 -0.740 2.435 1.573 1.117 -1.773
dim(x) <- c(2,10)
t(x)
## [,1] [,2]
## [1,] 0.742 1.415
## [2,] 1.603 -0.124
## [3,] -0.828 -0.138
## [4,] 0.152 0.425
## [5,] -0.159 -0.837
## [6,] 0.364 -0.373
## [7,] -0.375 0.194
## [8,] 0.238 -0.740
## [9,] 2.435 1.573
## [10,] 1.117 -1.773
Converting a vector into a matrix by changing its dimensions has the advantage that the object is not copied, saving memory.
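If you want to check this yourself, tracemem() prints a message whenever R copies the traced object. A minimal sketch (the names v and m are made up for illustration, and the exact behavior can differ between R versions and front ends):
v <- rnorm(20)
tracemem(v)             # start tracking copies of v
dim(v) <- c(10,2)       # reshape by assigning dimensions
m <- matrix(v, ncol=2)  # matrix() builds a separate object from the data in v
untracemem(v)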
Coercion
R is good at coercion, the implicit conversion from one data type to another. Coercion can happen implicitly when you pass an object of a different type to a function. Coercion can also be done explicitly using as.*-style functions.
I really meant "as.*-style functions" as in as.matrix, as.data.frame, as.Date, as.dendrogram, etc. Not as*-style functions.
For example, to coerce an R object into a matrix, use the as.matrix function. A common usage is to convert a dataframe of numerical data:
df <- data.frame(int=rep(1,4), x1=c(1,2,3,4), x2=rnorm(4))
is.matrix(df)
[1] FALSE
is.matrix(as.matrix(df))
[1] TRUE
Basic Operations
When operating on matrices and vectors, we need to distinguish elementwise operations from true matrix operations. For example, take the \((5 \times 2)\) matrices B and E created earlier. Their elementwise product is the matrix with typical element \([b_{ij}*e_{ij}]\). In other words, elements of the matrices are matched up and the multiplication is performed separately in each cell.
The matrix product \(\textbf{B}*\textbf{E}\) is not possible because the matrices do not conform to matrix multiplication. However, the product \(\textbf{B}*\textbf{E}^\prime\) is possible; the result is a \((5 \times 5)\) matrix.
Elementwise operations
B
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
E
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
## [3,] 5 6
## [4,] 7 8
## [5,] 9 10
# Elementwise addition
5 + B
## [,1] [,2]
## [1,] 6 11
## [2,] 7 12
## [3,] 8 13
## [4,] 9 14
## [5,] 10 15
a + B
## [,1] [,2]
## [1,] 2 7
## [2,] 4 9
## [3,] 6 11
## [4,] 8 13
## [5,] 10 15
B + E
## [,1] [,2]
## [1,] 2 8
## [2,] 5 11
## [3,] 8 14
## [4,] 11 17
## [5,] 14 20
# Elementwise multiplication
5 * B
## [,1] [,2]
## [1,] 5 30
## [2,] 10 35
## [3,] 15 40
## [4,] 20 45
## [5,] 25 50
a * B
## [,1] [,2]
## [1,] 1 6
## [2,] 4 14
## [3,] 9 24
## [4,] 16 36
## [5,] 25 50
B * E
## [,1] [,2]
## [1,] 1 12
## [2,] 6 28
## [3,] 15 48
## [4,] 28 72
## [5,] 45 100
Note that when the dimensions of the two arrays do not match up, values are repeated as necessary. For example, in 5+B, the scalar 5 is applied to each cell of B. The operation is essentially the same as
matrix(rep(5,10),5,2) + B
[,1] [,2]
[1,] 6 11
[2,] 7 12
[3,] 8 13
[4,] 9 14
[5,] 10 15
Similarly, in a+B, where a is a vector of length 5, the vector is added elementwise to the first column of B, then to the second column of B.
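Written out explicitly, this columnwise recycling is equivalent to binding the vector into both columns first (a one-line check):
cbind(a,a) + B   # identical to a + B above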
You can also perform elementwise logical operations, creating vectors and matrices of TRUE/FALSE values.
(B==3) | (B==2)
[,1] [,2]
[1,] FALSE FALSE
[2,] TRUE FALSE
[3,] TRUE FALSE
[4,] FALSE FALSE
[5,] FALSE FALSE
Matrix operations
Transposition
A matrix is transposed, its rows and columns exchanged, with the t() function. Transposition exchanges the dimensions of the underlying array.
B
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
dim(B)
## [1] 5 2
t(B)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10
dim(t(B))
## [1] 2 5
Multiplication
Matrix multiplication is performed with the %*% operator. For this operation to succeed, the matrices have to conform to multiplication, that is, the number of columns of the matrix on the left side of the multiplication must equal the number of rows on the right side.
This will fail:
B %*% E
Error in B %*% E: non-conformable arguments
And this product succeeds, since \(\textbf{B}\) and \(\textbf{E}^\prime\) conform to multiplication.
B %*% t(E)
[,1] [,2] [,3] [,4] [,5]
[1,] 13 27 41 55 69
[2,] 16 34 52 70 88
[3,] 19 41 63 85 107
[4,] 22 48 74 100 126
[5,] 25 55 85 115 145
Diagonal matrices
In statistics we frequently encounter diagonal matrices, square matrices with zeros in off-diagonal cells. Programmatically, two important situations arise:
- Extracting the diagonal elements of a matrix
- Forming a diagonal matrix from a vector
The diag function handles both cases:
# The (3 x 3) identity matrix
diag(1,3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
# A (3 x 3) diagonal matrix with 1, 2, 3 on the diagonal
diag(1:3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 2 0
[3,] 0 0 3
diag(B %*% t(E))
## [1] 13 34 63 100 145
C_ <- matrix(1:9 + rnorm(9,0,1e-3),ncol=3)
C_
## [,1] [,2] [,3]
## [1,] 0.999239 4.000457 7.000228
## [2,] 1.999412 4.999930 7.999253
## [3,] 3.000382 6.000139 9.000183
diag(C_)
## [1] 0.999239 4.999930 9.000183
The trace of a matrix is the sum of its diagonal elements. You can compute the trace by combining the sum and diag functions:
# The trace of matrix C_
sum(diag(C_))
[1] 14.99935
Crossproduct matrix
The crossproduct of matrices \(\textbf{A}\) and \(\textbf{B}\) is \(\textbf{A}^\prime \textbf{B}\) provided that \(\textbf{A}^\prime\) and \(\textbf{B}\) are conformable for multiplication. The most important crossproduct matrices in statistics are crossproducts of a matrix with itself: \(\textbf{A}^\prime\textbf{A}\). These crossproducts are square, symmetric matrices.
You can calculate a crossproduct matrix directly using matrix multiplication, or with a dedicated function (base::crossprod or Matrix::crossprod). The dedicated functions are slightly faster than computing the product directly, but I have found the difference to be pretty negligible, even for large matrices.
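If you want to check the timing on your own machine, here is a minimal sketch using system.time(); the matrix X_big and its dimensions are made up for illustration, and results will vary with hardware and the BLAS library in use:
# A larger random matrix so the timings are measurable
X_big <- matrix(rnorm(2e6 * 5), nrow=2e6, ncol=5)

system.time(t(X_big) %*% X_big)  # direct multiplication
system.time(crossprod(X_big))    # dedicated crossproduct function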
X <- matrix(rnorm(300),nrow=100,ncol=3)

# Computing X`X by direct multiplication
XpX <- t(X) %*% X
XpX
[,1] [,2] [,3]
[1,] 81.91635 12.696119 -13.566906
[2,] 12.69612 83.278958 -1.086481
[3,] -13.56691 -1.086481 102.548803
# Computing X`X using the crossprod() function
crossprod(X)
[,1] [,2] [,3]
[1,] 81.91635 12.696119 -13.566906
[2,] 12.69612 83.278958 -1.086481
[3,] -13.56691 -1.086481 102.548803
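As noted above, the crossproduct of a matrix with itself is symmetric; a one-line check with base R's isSymmetric():
isSymmetric(XpX)   # TRUE, within numerical tolerance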
Inverse matrix
The inverse of a square matrix \(\textbf{A}\), denoted \(\textbf{A}^{-1}\), if it exists, is the matrix whose product with \(\textbf{A}\) yields the multiplicative identity: \[ \textbf{A}^{-1}\textbf{A}= \textbf{A}\textbf{A}^{-1} = \textbf{I} \] The inverse exists if \(\textbf{A}\) is of full rank; we then say that the matrix is non-singular.
library(Matrix)
rankMatrix(XpX)[1]
[1] 3
The matrix \(\textbf{X}^\prime\textbf{X}\) in our example has rank 3, which equals its number of rows (columns). The matrix is thus of full rank and can be inverted. Computing the inverse matrix is a special case of using the solve function. solve(a,b,...) solves a linear system of equations of the form \[
\textbf{A}\textbf{X}= \textbf{B}
\] When the function is called without the b argument, it returns the inverse of a:
XpX_inverse <- solve(XpX)
XpX_inverse
## [,1] [,2] [,3]
## [1,] 0.01278294 -0.0019270000 0.0016707297
## [2,] -0.00192700 0.0122999860 -0.0001246209
## [3,] 0.00167073 -0.0001246209 0.0099711670
round(XpX %*% XpX_inverse,4)
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
8.2 Least Squares from Scratch
Coding algorithms in statistical programming often starts with reproducing the formulas on paper in a programming language.
Suppose we wish to fit a multiple linear regression model with target variable (output) \(Y\) and predictor variables (inputs) \(X_1, \cdots, X_p\). The \(n\) data points for this analysis are arranged in an \((n \times 1)\) vector \[ \textbf{Y} = [Y_1, \cdots, Y_n]^\prime \] an \((n \times (p+1))\) matrix \[ \textbf{X} = \left [ \begin{array}{ccc} 1 & x_{11} & \cdots & x_{p1} \\ 1 & x_{12} & \cdots & x_{p2} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{1n} & \cdots & x_{pn} \end{array}\right] \] and an \((n \times 1)\) vector of error terms \[ \boldsymbol{\epsilon} = [\epsilon_1, \cdots, \epsilon_n]^\prime \]
The complete regression model can be written as \[ \textbf{Y} = \textbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \qquad \boldsymbol{\epsilon} \sim (\textbf{0},\sigma^2 \textbf{I}) \] The statement on the right says that the model errors have mean zero, equal variance \(\sigma^2\) and are uncorrelated—this is also sometimes called the iid assumption (identically and independently distributed), although lack of correlation does not strictly imply independence.
The parameter vector \(\boldsymbol{\beta}\) in this model is typically estimated by ordinary least squares (OLS), the solution is \[ \widehat{\boldsymbol{\beta}} = \left (\textbf{X}^\prime\textbf{X}\right)^{-1}\textbf{X}^\prime\textbf{Y} \] provided that \(\textbf{X}\) is of full column rank (which implies \(\textbf{X}^\prime\textbf{X}\) is non-singular and (\(\textbf{X}^\prime\textbf{X})^{-1}\) exists) and the predicted values are \[ \widehat{\textbf{Y}} = \textbf{X}\widehat{\boldsymbol{\beta}} \]
Let’s use a data set and compute the OLS estimates and the predicted values using matrix-vector operations, then compare the results to the standard output of the linear modeling function lm() in R.
The data set for this exercise is the fitness data set. The data comprise measurements of aerobic capacity and other attributes on 31 men involved in a physical fitness course at N.C. State University.
Aerobic capacity is the ability of the heart and lungs to provide the body with oxygen. It is a measure of fitness and expressed as the oxygen intake in ml per kg body weight per minute. Measuring aerobic capacity is expensive and time consuming compared to attributes such as age, weight, and pulse. The question is whether aerobic capacity can be predicted from the easily measurable attributes. If so, a predictive equation can reduce time and effort to assess aerobic capacity.
The variables are
Age: age in years
Weight: weight in kg
Oxygen: oxygen intake rate (ml per kg body weight per minute)
RunTime: time to run 1.5 miles (minutes)
RestPulse: heart rate while resting
RunPulse: heart rate while running (same time Oxygen rate measured)
MaxPulse: maximum heart rate recorded while running
The linear model we have in mind is \[ \text{Oxygen}_i = \beta_0 + \beta_1\text{Age}_i + \beta_2\text{Weight}_i + \beta_3\text{RunTime}_i + \beta_4\text{RestPulse}_i + \beta_5\text{RunPulse}_i + \beta_6\text{MaxPulse}_i + \epsilon_i \] \(i=1,\cdots,31\).
The following code makes a connection to the ads DuckDB database, loads the fitness table into an R dataframe, displays the first 6 observations, and closes the connection to the database again.
library("duckdb")
con <- dbConnect(duckdb(),dbdir = "ads.ddb",read_only=TRUE)
fit <- dbGetQuery(con, "SELECT * FROM fitness")
head(fit)
Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse
1 44 89.47 44.609 11.37 62 178 182
2 40 75.07 45.313 10.07 62 185 185
3 44 85.84 54.297 8.65 45 156 168
4 42 68.15 59.571 8.17 40 166 172
5 38 89.02 49.874 9.22 55 178 180
6 47 77.45 44.811 11.63 58 176 176
dbDisconnect(con)
The target variable for the linear model is Oxygen; the remaining variables are inputs to the regression. The next statements create the \(\textbf{y}\) vector and the \(\textbf{X}\) matrix for the model. Note that the first column of \(\textbf{X}\) is a vector of ones, representing the intercept \(\beta_0\).
y <- as.matrix(fit[,which(names(fit)=="Oxygen")])
X <- as.matrix(cbind(Intcpt=rep(1,nrow(fit)),
                     fit[,which(names(fit)!="Oxygen")]))
head(X)
Intcpt Age Weight RunTime RestPulse RunPulse MaxPulse
[1,] 1 44 89.47 11.37 62 178 182
[2,] 1 40 75.07 10.07 62 185 185
[3,] 1 44 85.84 8.65 45 156 168
[4,] 1 42 68.15 8.17 40 166 172
[5,] 1 38 89.02 9.22 55 178 180
[6,] 1 47 77.45 11.63 58 176 176
Next we build the \(\textbf{X}^\prime\textbf{X}\) matrix and compute its inverse, \((\textbf{X}^\prime\textbf{X})^{-1}\), with the solve() function. t() transposes a matrix and %*% indicates that we are performing matrix multiplication rather than elementwise multiplication.
XpX <- t(X) %*% X
XpXInv <- solve(XpX)
We can verify that XpXInv is indeed the inverse of XpX by multiplying the two. This should yield the identity matrix:
round(XpX %*% XpXInv,3)
Intcpt Age Weight RunTime RestPulse RunPulse MaxPulse
Intcpt 1 0 0 0 0 0 0
Age 0 1 0 0 0 0 0
Weight 0 0 1 0 0 0 0
RunTime 0 0 0 1 0 0 0
RestPulse 0 0 0 0 1 0 0
RunPulse 0 0 0 0 0 1 0
MaxPulse 0 0 0 0 0 0 1
Next we compute the OLS estimate of \(\boldsymbol{\beta}\) and the predicted values \(\widehat{\textbf{y}} = \textbf{X}\widehat{\boldsymbol{\beta}}\).
beta_hat <- XpXInv %*% t(X) %*% y
beta_hat
[,1]
Intcpt 102.93447948
Age -0.22697380
Weight -0.07417741
RunTime -2.62865282
RestPulse -0.02153364
RunPulse -0.36962776
MaxPulse 0.30321713
y_hat <- X %*% beta_hat
The estimate of the intercept is \(\widehat{\beta}_0\) = 102.9345, the estimate of the coefficient for Age is \(\widehat{\beta}_1\) = -0.227, and so on.
We could have also used the solve function in its intended application, to solve a system of linear equations. We abused the behavior of solve a bit by using it with only one argument; it will then return the inverse matrix of the argument.
The linear system to solve in the ordinary least squares problem is \[
\textbf{X}^\prime\textbf{X}\boldsymbol{\beta}= \textbf{X}^\prime\textbf{Y}
\] This system of equations is called the normal equations. The solution can be computed with the solve function:
solve(XpX,crossprod(X,y))
[,1]
Intcpt 102.93447948
Age -0.22697380
Weight -0.07417741
RunTime -2.62865282
RestPulse -0.02153364
RunPulse -0.36962776
MaxPulse 0.30321713
The results match beta_hat computed earlier.
The residuals \(\widehat{\boldsymbol{\epsilon}} = \textbf{y}- \widehat{\textbf{y}}\) and the error sum of squares
\[ \text{SSE} = (\textbf{y}- \widehat{\textbf{y}} )^\prime (\textbf{y}- \widehat{\textbf{y}}) = \sum_{i=1}^n \left(y_i - \widehat{y}_i\right)^2\] and the estimate of the residual variance \[\widehat{\sigma}^2 = \frac{1}{n-r(\textbf{X})} \, \text{SSE} \]
are computed as
residuals <- y - y_hat
SSE <- sum(residuals^2)
n <- nrow(fit)
rankX <- rankMatrix(XpX)[1]
sigma2_hat <- SSE/(n - rankX)
SSE
[1] 128.8379
sigma2_hat
[1] 5.368247
We used the rankMatrix function in the Matrix package to compute the rank of \(\textbf{X}\), which is identical to the rank of \(\textbf{X}^\prime\textbf{X}\). With these quantities available, the variance-covariance matrix of \(\widehat{\boldsymbol{\beta}}\), \[
\text{Var}[\widehat{\boldsymbol{\beta}}] = \sigma^2 (\textbf{X}^\prime\textbf{X})^{-1}
\] can be estimated by substituting \(\widehat{\sigma}^2\) for \(\sigma^2\). The standard errors of the regression coefficient estimates are the square roots of the diagonal values of this matrix.
Var_beta_hat <- sigma2_hat * XpXInv
se_beta_hat <- sqrt(diag(Var_beta_hat))
se_beta_hat
Intcpt Age Weight RunTime RestPulse RunPulse
12.40325810 0.09983747 0.05459316 0.38456220 0.06605428 0.11985294
MaxPulse
0.13649519
Now let’s compare our results to the output from the lm() function in R.
linmod <- lm(Oxygen ~ ., data=fit)
summary(linmod)
Call:
lm(formula = Oxygen ~ ., data = fit)
Residuals:
Min 1Q Median 3Q Max
-5.4026 -0.8991 0.0706 1.0496 5.3847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 102.93448 12.40326 8.299 1.64e-08 ***
Age -0.22697 0.09984 -2.273 0.03224 *
Weight -0.07418 0.05459 -1.359 0.18687
RunTime -2.62865 0.38456 -6.835 4.54e-07 ***
RestPulse -0.02153 0.06605 -0.326 0.74725
RunPulse -0.36963 0.11985 -3.084 0.00508 **
MaxPulse 0.30322 0.13650 2.221 0.03601 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.317 on 24 degrees of freedom
Multiple R-squared: 0.8487, Adjusted R-squared: 0.8108
F-statistic: 22.43 on 6 and 24 DF, p-value: 9.715e-09
Based on the quantities calculated earlier, the following code reproduces the lm summary.
tvals <- beta_hat/se_beta_hat
pvals <- 2*(1-pt(abs(tvals),n-rankX))
result <- cbind(beta_hat, se_beta_hat, tvals, pvals)
colnames(result) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
round(result,5)
Estimate Std. Error t value Pr(>|t|)
Intcpt 102.93448 12.40326 8.29899 0.00000
Age -0.22697 0.09984 -2.27343 0.03224
Weight -0.07418 0.05459 -1.35873 0.18687
RunTime -2.62865 0.38456 -6.83544 0.00000
RestPulse -0.02153 0.06605 -0.32600 0.74725
RunPulse -0.36963 0.11985 -3.08401 0.00508
MaxPulse 0.30322 0.13650 2.22145 0.03601
cat("\nResidual standard error: ", sqrt(sigma2_hat)," on ", n-rankX, "degrees of freedom\n")
Residual standard error: 2.316948 on 24 degrees of freedom
SST <- sum( (y - mean(y))^2 )
cat("Multiple R-squared: ", 1-SSE/SST,
    "Adjusted R-squared: ", 1 - (SSE/SST)*(n-1)/(n-rankX), "\n")
Multiple R-squared: 0.8486719 Adjusted R-squared: 0.8108399
Fstat <- ((SST-SSE)/(rankX-1)) / (SSE/(n-rankX))
cat("F-statistic: ", Fstat, "on ",
    rankX-1, "and", n-rankX, "DF, p-value:", 1-pf(Fstat,rankX-1,n-rankX))
F-statistic: 22.43263 on 6 and 24 DF, p-value: 9.715305e-09
8.3 Building a Model Matrix
In the previous example, the model matrix \(\textbf{X}\) was formed in code by appending a \((31 \times 6)\) matrix of input variables to a \((31 \times 1)\) vector of ones:
X <- as.matrix(cbind(Intcpt=rep(1,nrow(fit)),
                     fit[,which(names(fit)!="Oxygen")]))
The lm function used a special syntax to specify the model, called a model formula: Oxygen ~ . The formula specifies the target variable (the dependent variable) on the left side of the tilde and the input (predictor, independent) variables on the right hand side of the tilde. The special dot syntax means that all variables in the data frame (except for Oxygen) are included as input variables on the right hand side of the model. Also, the intercept is automatically included in model formulas, because not having an intercept is a special case in statistical modeling.
You can generate a model matrix easily by using the model.matrix function with a model formula. In the fitness example,
X_ <- model.matrix(Oxygen ~ ., data=fit)
The two matrices are identical, as you can see with
sum(X - X_)
[1] 0
If we wanted to repeat the regression calculations for a model that contains only Age and MaxPulse as predictor variables, it would be easy to construct the model matrix with
X_small_model <- model.matrix(Oxygen ~ Age + MaxPulse, data=fit)
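As noted above, the intercept is included automatically. If you do need a model matrix without it, the usual formula idiom is to add - 1 (or + 0) on the right hand side; a quick sketch (the name X_noint is made up and not used further):
# Model matrix without the intercept column
X_noint <- model.matrix(Oxygen ~ . - 1, data=fit)
colnames(X_noint)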
Using model.matrix is very convenient when you work with factors, classification input variables that are not represented in the model by their actual values (which could be strings) but by converting a variable with \(k\) unique values into \(k\) columns in \(\textbf{X}\). These columns use 0/1 values to encode which level of the classification input variable matches the value for an observation.
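Here is a hedged sketch of this encoding; the data frame dd and the factor group are made up for illustration and are not part of the fitness data:
# A made-up factor with k = 3 levels
dd <- data.frame(y = rnorm(6),
                 group = factor(c("a","b","c","a","b","c")))

# Without an intercept, each of the k levels gets its own 0/1 column
model.matrix(y ~ 0 + group, data=dd)

# With the default intercept, treatment coding drops the first level,
# leaving k - 1 indicator columns
model.matrix(y ~ group, data=dd)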
8.4 Principal Component Analysis from Scratch
Principal Component Analysis (PCA) is one of the oldest statistical techniques and still of great importance today, not only in statistics. Machine learning methods frequently use PCA to visualize data, to summarize data, to impute missing values by matrix completion, and, most importantly, to reduce the dimension of an estimation problem.
The principal components of a set of data \(X_1,\cdots,X_p\) are linear combinations of the \(p\) inputs. The elements of the \(j\)th component score vector are calculated as follows: \[ z_{ij} = \psi_{1j} x_{i1} + \psi_{2j} x_{i2} + \cdots + \psi_{pj} x_{ip} \quad i=1,\cdots,n; \;\; j=1,\cdots,p \] The coefficients \(\psi_{kj}\) are called the rotations or loadings of the principal components. In constructing the linear combination, the \(X\)s are always centered to have a mean of zero and are usually scaled to have standard deviation one.
The scores \(z_{ij}\) are then collected into a score vector \(\textbf{z}_j = [z_{1j}, \cdots, z_{nj}]\) for the \(j\)th component. Note that for each observation \(i\) there are \(p\) inputs and \(p\) scores. So what have we gained? Instead of the \(p\) vectors \(\textbf{x}_1, \cdots, \textbf{x}_p\) we now have the \(p\) vectors \(\textbf{z}_1, \cdots, \textbf{z}_p\). Because of the way in which the \(\psi_{kj}\) are constructed, the PCA scores have very special properties:
They have zero mean (if the data was centered): \(\sum_{i=1}^n z_{ij} = 0\)
They are uncorrelated: \(\text{Corr}[z_j, z_k] = 0, \forall j \ne k\)
\(\text{Var}\left[\sum_{j=1}^p z_j\right] = \sum_{j=1}^p \text{Var}[z_j] = p\) (if data was scaled)
The components are ordered in terms of their variance \(\Rightarrow \text{Var}[z_1] > \text{Var}[z_2] > \cdots > \text{Var}[z_p]\)
Consider the following data set with \(n=10\) and \(p=4\).
Obs | \(X_1\) | \(X_2\) | \(X_3\) | \(X_4\) |
---|---|---|---|---|
1 | 0.344 | 0.364 | 0.806 | 0.160 |
2 | 0.363 | 0.354 | 0.696 | 0.249 |
3 | 0.196 | 0.189 | 0.437 | 0.248 |
4 | 0.200 | 0.212 | 0.590 | 0.160 |
5 | 0.227 | 0.229 | 0.437 | 0.187 |
6 | 0.204 | 0.233 | 0.518 | 0.090 |
7 | 0.197 | 0.209 | 0.499 | 0.169 |
8 | 0.165 | 0.162 | 0.536 | 0.267 |
9 | 0.138 | 0.116 | 0.434 | 0.362 |
10 | 0.151 | 0.151 | 0.483 | 0.223 |
We now perform the principal component analysis with the prcomp function in R and then compute the results from scratch using linear algebra.
<- data.frame("x1"=c(0.344,0.363,0.196,0.200,0.227,
df 0.204,0.197,0.165,0.138,0.151),
"x2"=c(0.364,0.354,0.189,0.212,0.229,
0.233,0.209,0.162,0.116,0.151),
"x3"=c(0.806,0.696,0.437,0.590,0.437,
0.518,0.499,0.536,0.434,0.483),
"x4"=c(0.160,0.249,0.248,0.160,0.187,
0.090,0.169,0.267,0.362,0.223)
) df
x1 x2 x3 x4
1 0.344 0.364 0.806 0.160
2 0.363 0.354 0.696 0.249
3 0.196 0.189 0.437 0.248
4 0.200 0.212 0.590 0.160
5 0.227 0.229 0.437 0.187
6 0.204 0.233 0.518 0.090
7 0.197 0.209 0.499 0.169
8 0.165 0.162 0.536 0.267
9 0.138 0.116 0.434 0.362
10 0.151 0.151 0.483 0.223
prcomp in R
You can compute a principal component analysis in R with the prcomp or the princomp function. We use prcomp here, as it is based on the singular value decomposition and is numerically more stable. The results will match the calculations based on eigenvalue decomposition below. The scale.=TRUE option requests that the data are centered (default) and scaled (not the default) to have standard deviation 1. The retx=TRUE option requests that the matrix of PCA scores is returned.
pca <- prcomp(df,scale.=TRUE,retx=TRUE)
summary(pca)
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.7199 0.9183 0.4450 0.02452
Proportion of Variance 0.7395 0.2108 0.0495 0.00015
Cumulative Proportion 0.7395 0.9504 0.9999 1.00000
prcomp reports the standard deviations (pca$sdev) and the variances associated with the four principal components. The sum of the variances is \(p=4\). This allows a decomposition of the total variability by component:
pca$sdev
## [1] 1.71993171 0.91827030 0.44498671 0.02452314
sum(pca$sdev^2)
## [1] 4
(pca$sdev^2)/sum(pca$sdev^2)
## [1] 0.7395412727 0.2108050872 0.0495032941 0.0001503461
You can see the matrix of loadings (the rotations) \(\boldsymbol{\Psi}\) as pca$rotation.
pca$rotation
PC1 PC2 PC3 PC4
x1 -0.5550597 0.23468359 0.46010018 -0.65202771
x2 -0.5740837 0.05358751 0.33570890 0.74488643
x3 -0.5298211 0.20737384 -0.82062990 -0.05340623
x4 0.2857029 0.94818146 0.04662547 0.13096507
The first principal component, with somewhat equal loading coefficients for \(X_1\), \(X_2\), and \(X_3\), explains 73.9% of the overall variance. The second component, which is dominated by \(X_4\), explains an additional 21% of the variance. Together, the first two components explain 95% of the total variability in the \(X\)s.
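The cumulative proportions quoted here can be computed directly from the component variances (a quick check):
cumsum(pca$sdev^2) / sum(pca$sdev^2)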
And here are the scores for the four principal components.
round(pca$x,5)
PC1 PC2 PC3 PC4
[1,] -3.25197 0.27659 -0.43983 0.02831
[2,] -2.50685 1.26276 0.42476 -0.02385
[3,] 0.99638 0.18799 0.46349 -0.00004
[4,] -0.19065 -0.63335 -0.49528 -0.04232
[5,] 0.25530 -0.45773 0.77902 -0.00367
[6,] -0.32314 -1.60994 0.05447 0.02628
[7,] 0.27979 -0.68519 0.08867 0.01106
[8,] 1.05828 0.48108 -0.48662 0.00707
[9,] 2.38236 1.39061 -0.09924 0.02476
[10,] 1.30050 -0.21282 -0.28944 -0.02760
Eigenvalue Decomposition
The PCA can be computed with a singular value decomposition (SVD) of the matrix \(\textbf{X}\) or with an eigendecomposition of the covariance or correlation matrix of \(\textbf{X}\). The SVD is numerically more stable than the eigendecomposition, but we can demonstrate the computations easily using eigenvalues and eigenvectors.
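For reference, here is a minimal sketch of the SVD route, applied to the centered and scaled data; the columns of v match the prcomp rotations up to sign:
sv <- svd(scale(df))
sv$v                        # loadings, up to sign: compare to pca$rotation
sv$d / sqrt(nrow(df) - 1)   # component standard deviations: compare to pca$sdev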
The eigendecomposition of a square \(n \times n\) matrix \(\textbf{A}\) is \[ \textbf{A}= \textbf{Q}\boldsymbol{\Lambda}\textbf{Q}^{-1} \] where \(\textbf{Q}\) is the \(n \times n\) matrix containing the eigenvectors of \(\textbf{A}\) and \(\boldsymbol{\Lambda}\) is a diagonal matrix with the eigenvalues of \(\textbf{A}\) on the diagonal. If \(\textbf{A}\) is symmetric, then \(\textbf{Q}\) is orthogonal and \(\textbf{Q}^{-1} = \textbf{Q}^\prime\). The eigendecomposition of a real symmetric matrix is thus
\[ \textbf{A}= \textbf{Q}\boldsymbol{\Lambda}\textbf{Q}^\prime \] How can we interpret eigenvectors and eigenvalues? The \(n\times 1\) vector \(\textbf{q}\) is an eigenvector of \(\textbf{A}\), if it satisfies the linear equation
\[ \textbf{A}\textbf{q} = \lambda\textbf{q} \]
for a scalar \(\lambda\). This scalar is called the eigenvalue corresponding to \(\textbf{q}\). There are \(n\) eigenvectors and when they are arranged as the columns of a matrix you get \(\textbf{Q}\).
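As a quick numerical check of this definition, here is a sketch with a small symmetric matrix made up for illustration (the names A_demo and e_demo are hypothetical):
A_demo <- matrix(c(2,1,1,3), nrow=2)  # a small symmetric matrix
e_demo <- eigen(A_demo)

# The defining relation A q = lambda q, checked for the first pair
A_demo %*% e_demo$vectors[,1]
e_demo$values[1] * e_demo$vectors[,1]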
Example: Eigenanalysis of Correlation Matrix
In R you can obtain the eigenvectors and eigenvalues of a square matrix with the eigen function.
There are ten observations of four variables. Let’s take as input for the eigendecomposition the correlation matrix of the data:
c <- cor(df)
c
x1 x2 x3 x4
x1 1.0000000 0.9835190 0.8362382 -0.2772801
x2 0.9835190 1.0000000 0.8545558 -0.4391878
x3 0.8362382 0.8545558 1.0000000 -0.2895615
x4 -0.2772801 -0.4391878 -0.2895615 1.0000000
e <- eigen(c)

e$values
## [1] 2.9581650906 0.8432203486 0.1980131763 0.0006013844
sum(e$values)
## [1] 4
sqrt(e$values)
## [1] 1.71993171 0.91827030 0.44498671 0.02452314
e$vectors
## [,1] [,2] [,3] [,4]
## [1,] -0.5550597 0.23468359 -0.46010018 0.65202771
## [2,] -0.5740837 0.05358751 -0.33570890 -0.74488643
## [3,] -0.5298211 0.20737384 0.82062990 0.05340623
## [4,] 0.2857029 0.94818146 -0.04662547 -0.13096507
The eigen function returns the diagonal elements of \(\boldsymbol{\Lambda}\) in e$values and the matrix \(\textbf{Q}\) of eigenvectors in e$vectors. The input matrix of the decomposition can be reconstructed from those:
A <- e$vectors %*% diag(e$values) %*% t(e$vectors)
round((c - A),4)
x1 x2 x3 x4
x1 0 0 0 0
x2 0 0 0 0
x3 0 0 0 0
x4 0 0 0 0
What happens if we multiply the \(10 \times 4\) data matrix with the \(4\times 4\) matrix of eigenvectors of the correlation matrix? First, we center and scale the data. The result is a \(10 \times 4\) matrix of linear transformations. These are the scores of the PCA.
scores <- scale(df) %*% e$vectors
scores
[,1] [,2] [,3] [,4]
[1,] -3.2519650 0.2765895 0.43983079 -2.831053e-02
[2,] -2.5068508 1.2627631 -0.42476284 2.385301e-02
[3,] 0.9963834 0.1879868 -0.46348719 3.885523e-05
[4,] -0.1906528 -0.6333530 0.49527588 4.231872e-02
[5,] 0.2552987 -0.4577253 -0.77901629 3.672417e-03
[6,] -0.3231438 -1.6099388 -0.05447168 -2.628130e-02
[7,] 0.2797933 -0.6851941 -0.08867423 -1.106065e-02
[8,] 1.0582766 0.4810756 0.48662389 -7.069551e-03
[9,] 2.3823630 1.3906130 0.09923769 -2.475865e-02
[10,] 1.3004973 -0.2128168 0.28944399 2.759768e-02
The scores have mean 0, their variances equal the eigenvalues of the correlation matrix, their variances sum to \(p=4\), and they are uncorrelated.
apply(scores,2,mean)
[1] -4.218847e-16 -2.109424e-16 1.498801e-16 -6.696033e-17
apply(scores,2,var)
[1] 2.9581650906 0.8432203486 0.1980131763 0.0006013844
sum(apply(scores,2,var))
[1] 4
round(cor(scores),4)
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 0 0 1 0
[4,] 0 0 0 1