24  Principal Component Analysis (PCA)

24.1 Introduction

Principal component analysis (PCA) is one of the most important methods of unsupervised learning with many applications:

  • Summarization: summarize the \(p\) variables of a data matrix \(\textbf{X}_{(n\times p)}\) through a lower-dimensional matrix \(\textbf{Z}_{(n\times m)}\). \(m\) is usually much smaller than \(p\), and in many cases \(m=2\), especially if the results are to be presented graphically.

  • Visualization: Instead of all \({p\choose 2}\) two-way scatterplots between \(p\) variables, display a single scatterplot of the first two principal components.

  • Dimension reduction: Use the first \(m\) principal components as inputs to a regression model (see PCR, Section 8.3.2). Apply a clustering algorithm such as \(k\)-means clustering to the first \(m\) principal components.

  • Imputation: Use principal components iteratively to complete a numerical matrix that contains missing values.

Given the many applications it is no surprise that PCA is a key method in the toolbox of data scientists and machine learners. It goes back to 1901, when it was invented by Karl Pearson.

A principal component is a linear combination of the \(p\) inputs. The elements of the \(j\)th component score vector are calculated as follows: \[ z_{ij} = \psi_{1j} x_{i1} + \psi_{2j} x_{i2} + \cdots + \psi_{pj} x_{ip} \quad i=1,\cdots,n; j=1,\cdots,p \] The coefficients \(\psi_{kj}\) are called the rotations or loadings of the principal components. In constructing the linear combination, the \(x\)s are always centered to have a mean of zero and are usually scaled to have standard deviation one. More on whether to scale the data for PCA follows below.

The scores \(z_{ij}\) are then collected into a score vector \(\textbf{z}_j = [z_{1j}, \cdots, z_{nj}]\) for the \(j\)th component. Note that for each observation \(i\) there are \(p\) inputs and \(p\) scores. So what have we gained? Instead of the \(p\) vectors \(\textbf{x}_1, \cdots, \textbf{x}_p\) we now have the \(p\) vectors \(\textbf{z}_1, \cdots, \textbf{z}_p\). Because of the way in which the \(\psi_{kj}\) are found, the PCA scores have very special properties:

  • They have zero mean (if the data was centered): \(\sum_{i=1}^n z_{ij} = 0\)

  • They are uncorrelated: \(\text{Corr}[z_j, z_k] = 0, \forall j \ne k\)

  • \(\text{Var}\left[\sum_{j=1}^p z_j\right] = \sum_{j=1}^p \text{Var}[z_j] = p\) (if data was scaled)

  • The components are ordered in terms of their variance \(\Rightarrow \text{Var}[z_1] > \text{Var}[z_2] > \cdots > \text{Var}[z_p]\)

This is useful in many ways:

  1. The components provide a decomposition of the variability in the data. The first component explains the greatest proportion of the variability, the second component explains the next largest proportion of variability and so forth.

  2. The contributions of the components are independent; each component is orthogonal to the other components. For example, the second component is a linear combination of the \(X\)s that captures a relationship not already captured by the first (or any other) component.

  3. The loadings for each component inform about the particular linear relationship. The magnitude of the loading coefficients reflects which inputs are driving the linear relationship. For example, when all coefficients are of similar magnitude, the component expresses some overall, average trend. When a few coefficients stand out it suggests that the associated attributes explain most of the variability captured by the component.

  4. To summarize high-dimensional data, we can ignore those components that explain a small amount of variability and turn a high-dimensional problem into a lower-dimensional one. Often, only the first two principal components are used to visualize the data and for further analysis.

A principle refers to a fundamental law or rule, for example, “the principles of Christianity” or “the principle of judicial review”. You might agree with someone in principle but disagree with their approach. A principal, on the other hand, refers to someone or something that is head, chief, main, or leading. The principal of the school is the head of the school; they lead the school. The principal reason for doing something is the main reason for doing it.

PCA is principal component analysis because the leading components account for the main share of the variability.

An alternative interpretation of the principal components is to find \(p\) vectors \(\textbf{z}_1,\cdots,\textbf{z}_p\) as linear combinations of the \(\textbf{x}_1,\cdots,\textbf{x}_p\) such that

  • \(\textbf{z}_1\) explains the most variability of the \(X\)s

  • \(\textbf{z}_2\) explains the second-most variability of the \(X\)s

  • … and so forth

  • and the \(\textbf{z}\) are orthogonal: \(\textbf{z}_j^\prime\textbf{z}_k = 0, \forall j \ne k\).

  • \(\textbf{z}_1\) is a projection of the \(X\)s in the direction that explains the most variability in the data.

  • \(\textbf{z}_2\) is a projection of the \(X\)s in the direction that explains the second-most variability in the data.

  • the \(\textbf{z}_j\) represent jointly perpendicular directions through the space of the original variables.

Viewing the components as projections into perpendicular directions hints at one technique for calculating the rotation matrix \(\boldsymbol{\Psi} = [\psi_{kj}]\) and the variance decomposition: through a matrix factorization into eigenvalues and eigenvectors.

24.2 Eigenvalue Decomposition

The PCA can be computed with a singular value decomposition (SVD) of the matrix \(\textbf{X}\) or with an eigendecomposition of the covariance or correlation matrix of \(\textbf{X}\). The SVD is numerically more stable than the eigendecomposition, but we can demonstrate the computations easily using eigenvalues and eigenvectors.

The eigendecomposition of a square \(n \times n\) matrix \(\textbf{A}\) is \[ \textbf{A}= \textbf{Q}\boldsymbol{\Lambda}\textbf{Q}^{-1} \] where \(\textbf{Q}\) is the \(n \times n\) matrix containing the eigenvectors of \(\textbf{A}\) and \(\boldsymbol{\Lambda}\) is a diagonal matrix with the eigenvalues of \(\textbf{A}\) on the diagonal. If \(\textbf{A}\) is symmetric, then \(\textbf{Q}\) is orthogonal and \(\textbf{Q}^{-1} = \textbf{Q}^\prime\). The eigendecomposition of a real symmetric matrix is thus

\[ \textbf{A}= \textbf{Q}\boldsymbol{\Lambda}\textbf{Q}^\prime \] How can we interpret eigenvectors and eigenvalues? The \(n\times 1\) vector \(\textbf{q}\) is an eigenvector of \(\textbf{A}\), if it satisfies the linear equation \[ \textbf{A}\textbf{q} = \lambda\textbf{q} \] for a scalar \(\lambda\). This scalar is called the eigenvalue corresponding to \(\textbf{q}\). There are \(n\) eigenvectors and when they are arranged as the columns of a matrix you get \(\textbf{Q}\).
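
A quick numerical check of these definitions in R; the \(2\times 2\) symmetric matrix below is chosen purely for illustration:

S  <- matrix(c(2, 1,
               1, 2), nrow=2, byrow=TRUE)
es <- eigen(S)
S %*% es$vectors[,1]             # equals ...
es$values[1] * es$vectors[,1]    # ... lambda_1 times the first eigenvector
es$vectors %*% diag(es$values) %*% t(es$vectors)   # reconstructs S as Q Lambda Q'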

Example: Eigenanalysis of Correlation Matrix

In R you can obtain the eigenvectors and eigenvalues of a square matrix with the eigen function. Principal components can be computed with the prcomp or the princomp functions, both included in the base stats package. prcomp uses the singular value decomposition of \(\textbf{X}\); princomp relies on the eigendecomposition of the covariance matrix.

The following data were used to illustrate PCA scores in Section 8.3.2.

data <- data.frame("x1"=c(0.344167 ,0.363478,0.196364,0.2     ,0.226957,
                          0.204348 ,0.196522,0.165   ,0.138333,0.150833),
                   "x2"=c(0.363625 ,0.353739,0.189405,0.212122,0.22927 ,
                          0.233209 ,0.208839,0.162254,0.116175,0.150888),
                   "x3"=c(0.805833 ,0.696087,0.437273,0.590435,0.436957,
                          0.518261 ,0.498696,0.535833,0.434167,0.482917),
                   "x4"=c(0.160446 ,0.248539,0.248309,0.160296,0.1869  ,
                          0.0895652,0.168726,0.266804,0.36195 ,0.223267)
                   )
data
         x1       x2       x3        x4
1  0.344167 0.363625 0.805833 0.1604460
2  0.363478 0.353739 0.696087 0.2485390
3  0.196364 0.189405 0.437273 0.2483090
4  0.200000 0.212122 0.590435 0.1602960
5  0.226957 0.229270 0.436957 0.1869000
6  0.204348 0.233209 0.518261 0.0895652
7  0.196522 0.208839 0.498696 0.1687260
8  0.165000 0.162254 0.535833 0.2668040
9  0.138333 0.116175 0.434167 0.3619500
10 0.150833 0.150888 0.482917 0.2232670

There are ten observations of four variables. Let’s take as input for the eigendecomposition the correlation matrix of the data:

c <- cor(data)
c
           x1         x2         x3         x4
x1  1.0000000  0.9832580  0.8359297 -0.2761887
x2  0.9832580  1.0000000  0.8539085 -0.4397088
x3  0.8359297  0.8539085  1.0000000 -0.2890121
x4 -0.2761887 -0.4397088 -0.2890121  1.0000000
e <- eigen(c)
e$values 
[1] 2.9570820972 0.8438197077 0.1985106437 0.0005875513
e$vectors
           [,1]       [,2]        [,3]        [,4]
[1,] -0.5550620 0.23539598  0.45961271  0.65211273
[2,] -0.5741828 0.05247306  0.33623520 -0.74465196
[3,] -0.5297817 0.20752636 -0.82067111  0.05256496
[4,]  0.2855723 0.94803382  0.04691458 -0.13220959

The eigen function returns the diagonal elements of \(\boldsymbol{\Lambda}\) in e$values and the matrix \(\textbf{Q}\) of eigenvectors in e$vectors. The input matrix of the decomposition can be reconstructed from those:

A <- e$vectors %*% diag(e$values) %*% t(e$vectors)
round((c - A),4)
   x1 x2 x3 x4
x1  0  0  0  0
x2  0  0  0  0
x3  0  0  0  0
x4  0  0  0  0

What happens if we multiply the \(10 \times 4\) data matrix with the \(4\times 4\) matrix of eigenvectors of the correlation matrix? First, we center and scale the data. The result is a \(10 \times 4\) matrix of linear combinations of the inputs. These are the scores of the PCA.

data_c <- scale(data,center=TRUE,scale=TRUE)
scores <- data_c %*% e$vectors
scores
            [,1]       [,2]        [,3]          [,4]
 [1,] -3.2478623  0.2815532 -0.43998889 -0.0292581389
 [2,] -2.5101127  1.2574623  0.42484650  0.0254701453
 [3,]  0.9925370  0.1935176  0.46544557 -0.0005376776
 [4,] -0.1908526 -0.6286757 -0.49817675  0.0413147515
 [5,]  0.2550067 -0.4593212  0.77973866  0.0014573403
 [6,] -0.3285776 -1.6137133  0.05490332 -0.0226596733
 [7,]  0.2862026 -0.6907719  0.08654684 -0.0123166161
 [8,]  1.0581238  0.4785600 -0.48496730 -0.0088830663
 [9,]  2.3797940  1.3913628 -0.09775326 -0.0229367993
[10,]  1.3057412 -0.2099739 -0.29059469  0.0283497342

The scores have mean 0, their variances equal the eigenvalues of the correlation matrix, their variances sum to \(p=4\), and they are uncorrelated.

apply(scores,2,mean)
[1]  8.881784e-17 -4.690692e-16 -3.330669e-17 -1.134509e-16
apply(scores,2,var)
[1] 2.9570820972 0.8438197077 0.1985106437 0.0005875513
sum(apply(scores,2,var))
[1] 4
round(cor(scores),4)
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    1    0    0
[3,]    0    0    1    0
[4,]    0    0    0    1

24.3 Computing the PCA in R

We prefer the prcomp function to compute the PCA over the princomp function because it is based on the singular value decomposition, which is numerically more stable than the eigendecomposition. (The singular values of the centered data matrix are the square roots of the eigenvalues of its cross-product matrix; since eigenvalues can be very close to zero, working with square roots stabilizes the calculations.)
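
To see the connection, here is a minimal sketch using the centered and scaled matrix data_c from the previous section; the singular values divided by \(\sqrt{n-1}\) give the standard deviations of the principal components, and the v matrix is the rotation (up to sign):

sv <- svd(data_c)                # data_c = U D V'
sv$d / sqrt(nrow(data_c) - 1)    # PCA standard deviations (compare with prcomp below)
sv$v                             # rotation matrix, up to the sign of the columns
sv$u %*% diag(sv$d)              # PCA scores, up to the sign of the columns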

By default, prcomp centers the data but does not scale it. Add scale.=TRUE to scale the data as well.

pca <- prcomp(data,retx=TRUE,scale.=TRUE)
pca
Standard deviations (1, .., p=4):
[1] 1.71961685 0.91859660 0.44554533 0.02423946

Rotation (n x k) = (4 x 4):
          PC1        PC2         PC3         PC4
x1 -0.5550620 0.23539598  0.45961271 -0.65211273
x2 -0.5741828 0.05247306  0.33623520  0.74465196
x3 -0.5297817 0.20752636 -0.82067111 -0.05256496
x4  0.2855723 0.94803382  0.04691458  0.13220959

prcomp reports the standard deviations (pca$sdev) and the rotation matrix \(\boldsymbol{\Psi}\) (pca$rotation). You will recognize the rotation as the matrix of eigenvectors from the previous analysis (up to the sign of the columns) and the standard deviations as the square roots of the variances of the scores computed earlier. When retx=TRUE is specified, the function returns the PCA scores in pca$x.
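
As a quick cross-check, the prcomp results agree with the eigenanalysis of the correlation matrix from the previous section; absolute values are compared because individual columns can differ in sign:

round(pca$sdev^2 - e$values, 10)               # variances equal the eigenvalues
round(abs(pca$rotation) - abs(e$vectors), 10)  # loadings equal the eigenvectors (up to sign)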

The sum of the variances of the principal components is \(p=4\) and it can be used to break down the variance into proportions explained by the components.

summary(pca)
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7196 0.9186 0.44555 0.02424
Proportion of Variance 0.7393 0.2109 0.04963 0.00015
Cumulative Proportion  0.7393 0.9502 0.99985 1.00000

The first principal component, with loading coefficients of similar magnitude for \(X_1\), \(X_2\), and \(X_3\), explains 73.927% of the overall variance. The second component, which is dominated by \(X_4\), explains an additional 21.095% of the variance.
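
The proportions reported by summary can also be computed directly from the standard deviations, a quick check of the variance decomposition:

pve <- pca$sdev^2 / sum(pca$sdev^2)   # proportion of variance explained
round(pve, 5)
round(cumsum(pve), 5)                 # cumulative proportion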

24.4 Predicting in a PCA

You can predict a new observation using principal components by calculating the component scores for the new observation. For example, suppose the input values for the new data point are the square roots of the values for the first observation:

xx <- sqrt(data[1,]) # the "new" observation
predict(pca, newdata=xx)
        PC1      PC2      PC3       PC4
1 -6.202193 4.362812 1.553425 0.5424985

The predicted value is a vector, with one element for each of the components. To reproduce the prediction manually, we can apply the rotation matrix to the new observation. Before doing so, the values need to be centered and scaled to match the derivation of the rotations. It is thus important to apply the same centering and scaling values that were used on the training data:

xx_centered_and_scaled <- (xx - pca$center)/pca$scale
as.matrix(xx_centered_and_scaled) %*% pca$rotation
        PC1      PC2      PC3       PC4
1 -6.202193 4.362812 1.553425 0.5424985

24.5 Scree Plot

How should we choose the number of principal components to use in subsequent analyses? If the goal is summarization or visualization \(m=2\) is common. If the components are input to a regression model, cross-validation can be used in PCR (Section 8.3.2). In other situations you have to choose \(m\) such that a “fair” amount of variation is explained. To help with that, a scree plot is commonly used. Scree describes the small rock material that accumulates at the base of a mountain or steep rock formation (Figure 24.1). The scree forms a sharp transition from the rocky cliff to the area of the rock fragments.

Figure 24.1: Scree at the base of a mountain. Source: Wikipedia

In principal component analysis the scree plot displays the proportion of variance explained against the number of components.

Example: Hitters Data (ISLR2)

For the Hitters baseball data set from James et al. (2021), the PCA scree plot is obtained as follows:

library(ISLR2)

pca_hit <- prcomp(Hitters[,1:13],scale.=TRUE) 

plot(summary(pca_hit)$importance[2,],
     type="b",
     ylab="Proportion of Variance Explained",
     xlab="Principal Component",
     las=1)

Scree plot for first thirteen attributes of Hitters data.

The sharp “knee” (“elbow”) at \(m=2\) shows that most of the variability is captured by the first principal component.
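
A companion numerical rule is to choose the smallest \(m\) whose cumulative proportion of variance exceeds a threshold; a minimal sketch, where the 90% cutoff is an arbitrary choice for illustration:

pve_hit <- summary(pca_hit)$importance[2,]   # proportions of variance explained
which(cumsum(pve_hit) >= 0.9)[1]             # smallest m reaching 90%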

24.6 Biplot

The biplot of a PCA displays the scores and loadings for two principal components in the same visualization. Typically, it is produced using the first two components as these account for the most variability. The scores \(z_{ij}\) are shown in the biplot as a scatter of points; the rotations are displayed as arrows. Points that are close to each other on the biplot have similar values of the inputs. The length of an arrow depicts the contribution of the variable; long arrows correspond to large coefficients in the rotation matrix. The angles between the arrows reflect the correlations with the other variables:

  • uncorrelated variables have perpendicular arrows
  • highly correlated variables have arrows that point in the same direction

Example: PCA for Iris Data

For the Iris data, the first two principal components account for about 96% of the variability in the four flower measurements.

pca_iris <- prcomp(iris[,1:4],retx=TRUE,scale.=TRUE)
pca_iris
Standard deviations (1, .., p=4):
[1] 1.7083611 0.9560494 0.3830886 0.1439265

Rotation (n x k) = (4 x 4):
                    PC1         PC2        PC3        PC4
Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971
summary(pca_iris)
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion  0.7296 0.9581 0.99482 1.00000

The raw data for the biplot (Figure 24.2) of the first two components are the first two columns of the rotation matrix and the first two columns of the score matrix:

pca_iris$rotation[,1:2]
                    PC1         PC2
Sepal.Length  0.5210659 -0.37741762
Sepal.Width  -0.2693474 -0.92329566
Petal.Length  0.5804131 -0.02449161
Petal.Width   0.5648565 -0.06694199
pca_iris$x[1:20,1:2]
            PC1         PC2
 [1,] -2.257141 -0.47842383
 [2,] -2.074013  0.67188269
 [3,] -2.356335  0.34076642
 [4,] -2.291707  0.59539986
 [5,] -2.381863 -0.64467566
 [6,] -2.068701 -1.48420530
 [7,] -2.435868 -0.04748512
 [8,] -2.225392 -0.22240300
 [9,] -2.326845  1.11160370
[10,] -2.177035  0.46744757
[11,] -2.159077 -1.04020587
[12,] -2.318364 -0.13263400
[13,] -2.211044  0.72624318
[14,] -2.624309  0.95829635
[15,] -2.191399 -1.85384655
[16,] -2.254661 -2.67731523
[17,] -2.200217 -1.47865573
[18,] -2.183036 -0.48720613
[19,] -1.892233 -1.40032757
[20,] -2.335545 -1.12408360
biplot(pca_iris,cex=0.75)
Figure 24.2: Biplot for first two principal components of Iris data.

There are two systems of axes on the biplot. The axes on the bottom and on the left are for the scores of the principal components. The axes on the right and on the top are for the variable arrows (the rotations). Note that the Sepal.Width rotation for the first two components is [-0.2693474, -0.9232957] but the arrow does not point at that coordinate. The biplot function applies a scaling to the scores and the arrows, which you can suppress by adding scale=0 to the function call.
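
For example, to draw the arrows at the raw rotation coordinates you could call:

biplot(pca_iris, scale=0, cex=0.75)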

We see from the biplot that the scores cluster in distinct groups. The arrow for Sepal.Width points away from the arrows of the other variables, which suggests that it is not highly correlated with the other flower measurements. The length and width of the petals are very highly correlated, however; their arrows point in nearly the same direction. This can be verified by computing the correlation matrix of the measurements:

cor(iris[,1:4])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

Using the autoplot function in the ggfortify library allows you to color the scores in the biplot by another variable. We see that the distinct group of scores low in PC1 corresponds to the Iris setosa species.

library(ggfortify)
autoplot(pca_iris, data=iris, color="Species", 
        loadings=TRUE, loadings.label=TRUE)

I. setosa has large sepal widths, and the negative rotation coefficient for Sepal.Width in PC1 separates it from the other species in that dimension.

library(dplyr)
iris %>% group_by(Species) %>% 
         summarize(min=min(Sepal.Width), 
                   max=max(Sepal.Width),
                   mean=mean(Sepal.Width))
# A tibble: 3 × 4
  Species      min   max  mean
  <fct>      <dbl> <dbl> <dbl>
1 setosa       2.3   4.4  3.43
2 versicolor   2     3.4  2.77
3 virginica    2.2   3.8  2.97

Looking at the sign of the rotation coefficients is helpful to determine which input variables increase or decrease a component relative to the other inputs. However, you cannot interpret the sign as increasing or decreasing a target variable. The rotation matrix of a principal component analysis is unique only up to sign: two software packages might produce the same rotation matrices, but with different signs of the coefficients. The princomp function in R has a fix_sign= argument that, when TRUE, chooses the signs so that the first element of each loading vector is non-negative.
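
If reproducible signs matter, for example when comparing results across runs or packages, one simple convention is to flip any component whose first loading is negative. A minimal sketch of that idea applied to the prcomp result (the flipping convention here is our own, not a prcomp option):

flip <- ifelse(pca_iris$rotation[1,] < 0, -1, 1)        # -1 for components to be flipped
rot_fixed    <- sweep(pca_iris$rotation, 2, flip, "*")  # flip columns of the rotation
scores_fixed <- sweep(pca_iris$x,        2, flip, "*")  # flip the matching score columns

Flipping the sign of a loading vector together with its scores leaves all reconstructions and variance proportions unchanged.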

24.7 To Scale or Not To Scale

The variables should always be centered to have mean 0, so that the component scores are constructed about a common origin. Should the variables also be scaled to have the same variance?

In general, scaling is recommended whenever the results of an analysis depend on the units of measurement. For example, when calculating distances it matters whether time is measured in days, hours, minutes, or seconds. The choice of units should not affect the outcome of the analysis. Similarly, large things tend to have greater variability than small things. Since the principal components are arranged in the order of variability explained, variables that have very large variances (because of their units of measurement or size) will dominate the components unless scaling is applied.

Example: U.S. Arrests

The USArrests data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 U.S. states in 1973. Also given is the percent of the population living in urban areas. The data set comes with R.

head(USArrests)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7
apply(USArrests,2,sd)
   Murder   Assault  UrbanPop      Rape 
 4.355510 83.337661 14.474763  9.366385 

The standard deviations of the four variables are quite different; that of Assault is about 20 times larger than that of Murder.

We now perform PCA without and with scaling of the data.

pca_unscaled <- prcomp(USArrests,retx=TRUE,scale.=FALSE)
pca_unscaled
Standard deviations (1, .., p=4):
[1] 83.732400 14.212402  6.489426  2.482790

Rotation (n x k) = (4 x 4):
                PC1         PC2         PC3         PC4
Murder   0.04170432 -0.04482166  0.07989066 -0.99492173
Assault  0.99522128 -0.05876003 -0.06756974  0.03893830
UrbanPop 0.04633575  0.97685748 -0.20054629 -0.05816914
Rape     0.07515550  0.20071807  0.97408059  0.07232502
summary(pca_unscaled)
Importance of components:
                           PC1      PC2    PC3     PC4
Standard deviation     83.7324 14.21240 6.4894 2.48279
Proportion of Variance  0.9655  0.02782 0.0058 0.00085
Cumulative Proportion   0.9655  0.99335 0.9991 1.00000

The first principal component in the unscaled analysis explains 96.55% of the variability. The component is dominated by the Assault variable, with a loading coefficient of 0.9952.

pca_scaled <- prcomp(USArrests,retx=TRUE,scale.=TRUE)
pca_scaled
Standard deviations (1, .., p=4):
[1] 1.5748783 0.9948694 0.5971291 0.4164494

Rotation (n x k) = (4 x 4):
                PC1        PC2        PC3         PC4
Murder   -0.5358995 -0.4181809  0.3412327  0.64922780
Assault  -0.5831836 -0.1879856  0.2681484 -0.74340748
UrbanPop -0.2781909  0.8728062  0.3780158  0.13387773
Rape     -0.5434321  0.1673186 -0.8177779  0.08902432
summary(pca_scaled)
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.5749 0.9949 0.59713 0.41645
Proportion of Variance 0.6201 0.2474 0.08914 0.04336
Cumulative Proportion  0.6201 0.8675 0.95664 1.00000

In the scaled analysis the variables are on an equal footing with respect to their dispersion. The loading coefficients are more evenly sized in the first principal component, which explains 62.01% of the variability in the scaled analysis. The Assault variable no longer dominates the analysis by virtue of its large variance relative to the other variables.

The biplots for the scaled and unscaled analyses are markedly different. The dominance of Assault in the unscaled analysis makes it difficult, if not impossible, to meaningfully interpret the biplot (Figure 24.3).

par(mfrow=c(1,2))
par(mar = c(1, 2, 1, 2))
biplot(pca_unscaled,cex=0.6)
biplot(pca_scaled,cex=0.6)
Figure 24.3: Biplots in unscaled (left) and scaled (right) analysis.

In the next section we encounter a situation where we center but do not scale the \(\textbf{X}\) matrix, when values of the inputs are reconstructed.

24.8 Imputation through Matrix Completion

In the introduction to this chapter we expressed the PCA scores as a linear function of the inputs: \[ z_{ij} = \psi_{1j} x_{i1} + \psi_{2j} x_{i2} + \cdots + \psi_{pj} x_{ip} = \sum_{k=1}^p \psi_{kj}x_{ik} \]

This relationship can be reversed, expressing the inputs as a linear combination of the PCA scores and the rotations: \[ x_{ij} = \sum_{k=1}^p z_{ik}\psi_{jk} \]

To demonstrate, we can reconstruct the values for the first observation for the Iris data set from its PCA results. First, we only center the data and do not scale:

pca_iris_ns <- prcomp(iris[,1:4],retx=TRUE,scale.=FALSE)
pca_iris_ns$rotation
##                      PC1         PC2         PC3        PC4
## Sepal.Length  0.36138659 -0.65658877  0.58202985  0.3154872
## Sepal.Width  -0.08452251 -0.73016143 -0.59791083 -0.3197231
## Petal.Length  0.85667061  0.17337266 -0.07623608 -0.4798390
## Petal.Width   0.35828920  0.07548102 -0.54583143  0.7536574

sum(pca_iris_ns$x[1,] * pca_iris_ns$rotation[1,]) + pca_iris_ns$center[1]
## Sepal.Length 
##          5.1
sum(pca_iris_ns$x[1,] * pca_iris_ns$rotation[2,]) + pca_iris_ns$center[2]
## Sepal.Width 
##         3.5
sum(pca_iris_ns$x[1,] * pca_iris_ns$rotation[3,]) + pca_iris_ns$center[3]
## Petal.Length 
##          1.4
sum(pca_iris_ns$x[1,] * pca_iris_ns$rotation[4,]) + pca_iris_ns$center[4]
## Petal.Width 
##         0.2
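
The same reconstruction can be done for all observations at once; this is just the matrix form \(\textbf{X} = \textbf{Z}\boldsymbol{\Psi}^\prime\) of the identity above, plus the centering values:

recon <- pca_iris_ns$x %*% t(pca_iris_ns$rotation)    # Z Psi'
recon <- sweep(recon, 2, pca_iris_ns$center, "+")     # add back the variable means
all.equal(recon, as.matrix(iris[,1:4]), check.attributes=FALSE)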

If the PCA was centered and scaled, the reconstruction of the variables needs to take the scaling factor into account:

sum(pca_iris$x[1,] * pca_iris$rotation[1,])*pca_iris$scale[1] + pca_iris$center[1]
## Sepal.Length 
##          5.1
sum(pca_iris$x[1,] * pca_iris$rotation[2,])*pca_iris$scale[2] + pca_iris$center[2]
## Sepal.Width 
##         3.5
sum(pca_iris$x[1,] * pca_iris$rotation[3,])*pca_iris$scale[3] + pca_iris$center[3]
## Petal.Length 
##          1.4
sum(pca_iris$x[1,] * pca_iris$rotation[4,])*pca_iris$scale[4] + pca_iris$center[4]
## Petal.Width 
##         0.2

These calculations match the first observation:

iris[1,]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa

If only a subset of the principal components is used, we obtain an approximation of the input value. If \(M < p\), then

\[ x^*_{ij} = \sum_{k=1}^M z_{ik}\psi_{jk} \approx x_{ij} \] In fact, the \(x^*_{ij}\) are the best \(M\)-dimensional approximation to the \(x_{ij}\), in the sense of minimizing the sum of squared (Euclidean) distances to the observed values.

The sepal width of the first observation can be approximated based on the first two principal components as

sum(pca_iris_ns$x[1,1:2] * pca_iris_ns$rotation[2,1:2]) + pca_iris_ns$center[2]
Sepal.Width 
   3.517414 

If principal component analysis can be used to approximate the observed input values, then it can also be used to predict unobserved values. This is the idea behind using PCA to impute missing values, a technique known as matrix completion. The advantage of using PCA to fill in missing values is that it draws on the correlations between the variables; imputing missing values with column means ignores the information in the other variables.

It is important to point out, however, that this form of imputation is only appropriate if the missingness process is missing completely at random (MCAR): the missingness is unrelated to any study variable, including the target variable. For example, if a measurement cannot be taken because an instrument randomly stopped functioning, the missing value occurred completely at random. If the measurement cannot be taken because the instrument cannot record objects of that size, then the process is not MCAR.

An iterative algorithm for matrix completion is described in §12.3 of James et al. (2021) and implemented in the eimpute library in R; a minimal sketch of the iteration follows the list below. The algorithm starts by imputing missing values with the means of the observed data and iterates the following steps:

  1. Perform a low-rank PCA, that is, a PCA with \(M < p\).
  2. Replace the unobserved values with approximations based on the PCA scores and rotation.
  3. Compute the mean squared error between observed \(x_{ij}\) and predicted \(x^*_{ij}\). As long as the mean squared error continues to decrease, return to step 1.
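
The following is a minimal sketch of this iteration, using prcomp for the low-rank PCA step; the function name, arguments, and the simple stopping rule are our own choices for illustration and not the eimpute implementation:

pca_complete <- function(Xmiss, M = 2, maxit = 100, tol = 1e-7) {
  # Xmiss: numeric matrix with NA marking the unobserved entries
  miss <- is.na(Xmiss)
  Xhat <- Xmiss
  # Start by filling missing entries with the column means of the observed data
  Xhat[miss] <- colMeans(Xmiss, na.rm = TRUE)[col(Xmiss)[miss]]
  mss_old <- Inf
  for (it in 1:maxit) {
    # Step 1: low-rank PCA of the completed matrix (centered, not scaled)
    pc <- prcomp(Xhat, center = TRUE, scale. = FALSE)
    # Step 2: rank-M approximation x*_ij = sum_k z_ik psi_jk, plus the centering
    Xapprox <- pc$x[, 1:M, drop = FALSE] %*% t(pc$rotation[, 1:M, drop = FALSE])
    Xapprox <- sweep(Xapprox, 2, pc$center, "+")
    Xhat[miss] <- Xapprox[miss]
    # Step 3: mean squared error on the observed entries; stop when it no longer decreases
    mss <- mean((Xmiss[!miss] - Xapprox[!miss])^2)
    if (mss_old - mss < tol) break
    mss_old <- mss
  }
  Xhat
}

For a numeric matrix with artificially deleted entries, such as the Iris measurements below, pca_complete(x, M=2) plays the same role as eimpute(x, r=2).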

Figure 24.4 displays observed and imputed values for the Iris data set when 20% of the observations are randomly set to missing (by assigning NA). Two methods of imputation are used: using the column means, and a low-rank PCA approximation with 2 principal components using eimpute(x=iris_miss,r=2).

There are only four possible imputed values when using the column means: all missing values for a particular variable are replaced with the same column mean. The PCA imputation produces much more variety in the imputed values since it draws on the correlations between the variables.

Because missing values were introduced artificially, we have access to the true values and can compute the correlation between original and imputed values. This correlation is 0.83 for the imputation based on column means and 0.93 for the imputation based on the PCA.

Figure 24.4: Comparison of original and imputed values for Iris data with 20% missingness.