Principal component analysis (PCA) is one of the most important methods of unsupervised learning with many applications:
Summarization: summarize a \(p\)-dimensional matrix \(\textbf{X}_{(n\times p)}\) through a lower-dimensional matrix \(\textbf{Z}_{(n\times m)}\). \(m\) is usually much smaller than \(p\), and in many cases \(m=2\), especially if the results are to be presented graphically.
Visualization: Instead of all \({p\choose 2}\) two-way scatterplots between \(p\) variables, display a single scatterplot of the first two principal components.
Dimension reduction: Use the first \(m\) principal components as inputs to a regression model (see PCR, Section 8.3.2). Apply a clustering algorithm such as \(k\)-means clustering to the first \(m\) principal components.
Imputation: Use principal components iteratively to complete a numerical matrix that contains missing values.
Given the many applications, it is no surprise that PCA is a key method in the toolbox of data scientists and machine learners. It goes back to 1901, when it was invented by Karl Pearson.
A principal component is a linear combination of the \(p\) inputs. The elements of the \(j\)th component score vector are calculated as follows: \[
z_{ij} = \psi_{1j} x_{i1} + \psi_{2j} x_{i2} + \cdots + \psi_{pj} x_{ip} \quad i=1,\cdots,n; j=1,\cdots,p
\] The coefficients \(\psi_{kj}\) are called the rotations or loadings of the principal components. In constructing the linear combination, the \(x\)s are always centered to have a mean of zero and are usually scaled to have standard deviation one. More on whether to scale the data for PCA follows below.
The scores \(z_{ij}\) are then collected into a score vector \(\textbf{z}_j = [z_{1j}, \cdots, z_{nj}]\) for the \(j\)th component. Note that for each observation \(i\) there are \(p\) inputs and \(p\) scores. So what have we gained? Instead of the \(p\) vectors \(\textbf{x}_1, \cdots, \textbf{x}_p\) we now have the \(p\) vectors \(\textbf{z}_1, \cdots, \textbf{z}_p\). Because of the way in which the \(\psi_{kj}\) are found, the PCA scores have very special properties:
They have zero mean (if the data was centered): \(\sum_{i=1}^n z_{ij} = 0\)
They are uncorrelated: \(\text{Corr}[z_j, z_k] = 0, \forall j \ne k\)
\(\text{Var}\left[\sum_{j=1}^p z_j\right] = \sum_{j=1}^p \text{Var}[z_j] = p\) (if data was scaled)
The components are ordered in terms of their variance \(\Rightarrow \text{Var}[z_1] \ge \text{Var}[z_2] \ge \cdots \ge \text{Var}[z_p]\)
This is useful in many ways:
The components provide a decomposition of the variability in the data. The first component explains the greatest proportion of the variability, the second component explains the next largest proportion of variability and so forth.
The contributions of the components are independent; each component is orthogonal to the other components. For example, the second component is a linear combination of the \(X\)s that models a relationship not captured by the first (or any other) component.
The loadings for each component inform about the particular linear relationship. The magnitude of the loading coefficients reflects which inputs are driving the linear relationship. For example, when all coefficients are of similar magnitude, the component expresses some overall, average trend. When a few coefficients stand out it suggests that the associated attributes explain most of the variability captured by the component.
To summarize high-dimensional data, we can ignore those components that explain a small amount of variability and turn a high-dimensional problem into a lower-dimensional one. Often, only the first two principal components are used to visualize the data and for further analysis.
Principal versus Principle.
A principle refers to a fundamental law or rule, for example, “the principles of Christianity” or “the principle of judicial review”. You might agree with someone in principle, but disagree with their approach. A principal, on the other hand, refers to something head, chief, main, or leading. The principal of the school is the head of the school. They are leading the school. The principal reason for doing something is the main reason for doing it.
PCA is principal component analysis because the leading components account for the main share of the variability.
An alternative interpretation of the principal components is to find \(p\) vectors \(\textbf{z}_1,\cdots,\textbf{z}_p\) as linear combinations of the \(\textbf{x}_1,\cdots,\textbf{x}_p\) such that
\(\textbf{z}_1\) explains the most variability of the \(X\)s
\(\textbf{z}_2\) explains the second-most variability of the \(X\)s
… and so forth
and the \(\textbf{z}\) are orthogonal: \(\textbf{z}_j^\prime\textbf{z}_k = 0, \forall j \ne k\).
\(\textbf{z}_1\) is a projection of the \(X\)s in the direction that explains the most variability in the data.
\(\textbf{z}_2\) is a projection of the \(X\)s in the direction that explains the second-most variability in the data.
the \(\textbf{z}_j\) represent jointly perpendicular directions through the space of the original variables.
Viewing the components as projections into perpendicular directions hints at one technique for calculating the rotation matrix \(\boldsymbol{\Psi} = [\psi_{kj}]\) and the variance decomposition: through a matrix factorization into eigenvalues and eigenvectors.
24.2 Eigenvalue Decomposition
The PCA can be computed with a singular value decomposition (SVD) of the matrix \(\textbf{X}\) or with an eigendecomposition of the covariance or correlation matrix of \(\textbf{X}\). The SVD is numerically more stable than the eigendecomposition, but we can demonstrate the computations easily using eigenvalues and eigenvectors.
The eigendecomposition of a square \(n \times n\) matrix \(\textbf{A}\) is \[
\textbf{A}= \textbf{Q}\boldsymbol{\Lambda}\textbf{Q}^{-1}
\] where \(\textbf{Q}\) is the \(n \times n\) matrix containing the eigenvectors of \(\textbf{A}\) and \(\boldsymbol{\Lambda}\) is a diagonal matrix with the eigenvalues of \(\textbf{A}\) on the diagonal. If \(\textbf{A}\) is symmetric, then \(\textbf{Q}\) is orthogonal and \(\textbf{Q}^{-1} = \textbf{Q}^\prime\). The eigendecomposition of a real symmetric matrix is thus
\[
\textbf{A}= \textbf{Q}\boldsymbol{\Lambda}\textbf{Q}^\prime
\] How can we interpret eigenvectors and eigenvalues? The \(n\times 1\) vector \(\textbf{q}\) is an eigenvector of \(\textbf{A}\), if it satisfies the linear equation \[
\textbf{A}\textbf{q} = \lambda\textbf{q}
\] for a scalar \(\lambda\). This scalar is called the eigenvalue corresponding to \(\textbf{q}\). There are \(n\) eigenvectors and when they are arranged as the columns of a matrix you get \(\textbf{Q}\).
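A quick numerical check of this definition in R, using an arbitrary small symmetric matrix (the values are illustrative):
S  <- matrix(c(2, 1, 1, 3), nrow=2)   # a small symmetric matrix
es <- eigen(S)
S %*% es$vectors[,1]                  # left-hand side A q
es$values[1] * es$vectors[,1]         # right-hand side lambda q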
Example: Eigenanalysis of Correlation Matrix
In R you can obtain the eigenvectors and eigenvalues of a square matrix with the eigen function. Principal components can be computed with the prcomp or the princomp functions, both included in the base stats package. prcomp uses the singular value decomposition of \(\textbf{X}\); princomp relies on the eigendecomposition of the covariance matrix.
The following data were used to illustrate PCA scores in Section 8.3.2.
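A sketch of the eigenanalysis, assuming the example data are stored in a data frame named data:
c <- cor(data)   # correlation matrix of the inputs
e <- eigen(c)    # eigendecomposition: e$values and e$vectors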
The eigen function returns the diagonal elements of \(\boldsymbol{\Lambda}\) in e$values and the matrix \(\textbf{Q}\) of eigenvectors in e$vectors. The input matrix of the decomposition can be reconstructed from those:
A <- e$vectors %*% diag(e$values) %*% t(e$vectors)   # Q Lambda Q'
round(c - A, 4)                                      # differences from the correlation matrix are zero
What happens if we multiply the \(10 \times 4\) data matrix with the \(4\times 4\) matrix of eigenvectors of the correlation matrix? First, we center and scale the data. The result is a \(10 \times 4\) matrix of linear combinations of the inputs: these are the scores of the PCA.
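A sketch of that multiplication, reusing data and the eigendecomposition e from above:
X <- scale(data, center=TRUE, scale=TRUE)   # center and scale the inputs
scores <- X %*% e$vectors                   # 10 x 4 matrix of PCA scores
apply(scores, 2, var)                       # the variances equal the eigenvalues e$values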
We prefer the prcomp function over princomp for computing the PCA because it is based on the singular value decomposition, which is numerically more stable than the eigendecomposition. (Singular values are the square roots of the eigenvalues, and since eigenvalues can be very close to zero, working with square roots stabilizes the calculations.)
By default, prcomp centers the data but does not scale it. Add scale.=TRUE to scale the data as well.
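A sketch of the call for the example data:
pca <- prcomp(data, scale.=TRUE, retx=TRUE)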
prcomp reports the standard deviations (pca$sdev) and the rotation matrix \(\boldsymbol{\Psi}\) (pca$rotation). You will recognize the rotation as the matrix of eigenvectors from the previous analysis and the standard deviations as the square roots of the variances of the scores computed earlier. When retx=TRUE is specified (the default), the function returns the PCA scores in pca$x.
The sum of the variances of the principal components is \(p=4\) and it can be used to break down the variance into proportions explained by the components.
summary(pca)
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.7196 0.9186 0.44555 0.02424
Proportion of Variance 0.7393 0.2109 0.04963 0.00015
Cumulative Proportion 0.7393 0.9502 0.99985 1.00000
The first principal component, with somewhat equal loading coefficients for \(X_1\), \(X_2\), and \(X_3\), explains 73.927% of the overall variance. The second component, which is dominated by \(X_4\), explains an additional 21.095% of the variance.
24.4 Predicting in a PCA
You can predict a new observation using principal components by calculating the component scores for the new observation. For example, suppose the input values for the new data point are the square roots of the values for the first observation:
xx <- sqrt(data[1,])        # the "new" observation
predict(pca, newdata=xx)
The predicted value is a vector, with one element for each of the components. To reconstruct the predicted value, we can apply the rotation matrix to the new observation. Before doing so, the values need to be centered and scaled to match the derivation of the rotations. It is thus important to apply the same centering and scaling values that were used on the training data:
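A sketch of this manual calculation, using the centering and scaling values stored in the pca object:
xx_std <- scale(xx, center=pca$center, scale=pca$scale)   # standardize with the training values
xx_std %*% pca$rotation                                   # matches predict(pca, newdata=xx)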
How should we choose the number of principal components to use in subsequent analyses? If the goal is summarization or visualization \(m=2\) is common. If the components are input to a regression model, cross-validation can be used in PCR (Section 8.3.2). In other situations you have to choose \(m\) such that a “fair” amount of variation is explained. To help with that, a scree plot is commonly used. Scree describes the small rock material that accumulates at the base of a mountain or steep rock formation (Figure 24.1). The scree forms a sharp transition from the rocky cliff to the area of the rock fragments.
Figure 24.1: Scree at the base of a mountain. Source: Wikipedia
In principal component analysis the scree plot displays the proportion of variance explained against the number of components.
Example: Hitters Data (ISLR2)
For the Hitters baseball data set from James et al. (2021), the PCA scree plot is obtained as follows:
library(ISLR2)
pca_hit <- prcomp(Hitters[,1:13], scale.=TRUE)
plot(summary(pca_hit)$importance[2,], type="b",
     ylab="Proportion of Variance Explained",
     xlab="Principal Component", las=1)
Scree plot for first thirteen attributes of Hitters data.
The sharp “knee” (“elbow”) at \(m=2\) shows that most of the variability is captured by the first principal component.
24.6 Biplot
The biplot of a PCA displays the scores and the loadings for two principal components in the same visualization. Typically, it is produced for the first two components, as these account for the most variability. The scores \(z_{ij}\) are shown in the biplot as a scatter of points; the rotations are displayed as arrows. Points that are close to each other on the biplot have similar values of the inputs. The length of an arrow depicts the contribution of a variable; long arrows correspond to large coefficients in the rotation matrix. The angles between the arrows reflect the correlations among the variables:
uncorrelated variables have perpendicular arrows
highly correlated variables have arrows that point in the same direction
Example: PCA for Iris Data
For the Iris data, the first two principal components account for about 96% of the variability in the four flower measurements.
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
The raw data for the biplot (Figure 24.2) of the first two components are the first two columns of the rotation matrix and the first two columns of the score matrix:
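A sketch of the underlying computation (object names are illustrative):
pca_iris <- prcomp(iris[,1:4], scale.=TRUE)
pca_iris$rotation[, 1:2]   # loadings: the arrows in the biplot
head(pca_iris$x[, 1:2])    # scores: the points in the biplot
biplot(pca_iris)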
Figure 24.2: Biplot for first two principal components of Iris data.
There are two systems of axes on the biplot. The axes on the bottom and on the left are for the scores of the principal components. The axes on the right and on the top are for the variable arrows (the rotations). Note that the Sepal.Width rotation for the first two components is [-0.2693474, -0.9232957] but the arrow does not point at that coordinate. The biplot function applies a scaling to the scores and the arrows, which you can suppress by adding scale=0 to the function call.
We see from the biplot that the scores cluster in distinct groups. The arrow for Sepal.Width is pointing away from the arrows of the other variables, which suggests that it is not highly correlated with the other flower measurements. The length and width of the petals are very highly correlated, however; their arrows point into nearly the same direction. This can be verified by computing the correlation matrix of the measurements:
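cor(iris[,1:4])   # correlations among the four flower measurements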
Using the autoplot function in the ggfortify library allows you to color the scores in the biplot by another variable, as in the sketch below. We see that the distinct group of scores low in PC1 corresponds to the Iris setosa species.
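A sketch, assuming the pca_iris object from above:
library(ggfortify)
autoplot(pca_iris, data=iris, colour="Species",
         loadings=TRUE, loadings.label=TRUE)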
I. setosa has large sepal widths and the negative rotation coefficient for Sepal.Width in PC1 separates the species in that dimension from the other species.
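The ranges of Sepal.Width by species shown below can be computed, for example, with dplyr:
library(dplyr)
iris %>%
    group_by(Species) %>%
    summarize(min=min(Sepal.Width), max=max(Sepal.Width), mean=mean(Sepal.Width))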
# A tibble: 3 × 4
Species min max mean
<fct> <dbl> <dbl> <dbl>
1 setosa 2.3 4.4 3.43
2 versicolor 2 3.4 2.77
3 virginica 2.2 3.8 2.97
Looking at the sign of the rotation coefficients is helpful to determine which input variables increase or decrease a component relative to the other inputs. However, you cannot interpret the sign as increasing or decreasing a target variable. The rotation matrix of a principal component analysis is unique only up to sign; two software packages might produce the same rotation matrices but with different signs of the coefficients. The princomp function in R has a fix_sign= option that forces the sign of the first loading coefficient of each component to be positive.
24.7 To Scale or Not To Scale
The variables should always be centered to have mean 0, so that the loadings have the same origin. Should the variables also be scaled to have the same variance?
In general, scaling is recommended whenever the results of an analysis depend on the units of measurement. For example, when calculating distances it matters whether time is measured in days, hours, minutes, or seconds. The choice of units should not affect the outcome of the analysis. Similarly, large things tend to have greater variability than small things. Since the principal components are ordered by the amount of variability they explain, variables with very large variances (because of their units of measurement or their size) will dominate the components unless scaling is applied.
Example: U.S. Arrests
The USArrests data set contains statistics, in arrests per 100,000 residents for assault, murder, and rape in each of the 50 U.S. states in 1973. Also given is the percent of the population living in urban areas. The data set comes with R.
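A sketch of the unscaled analysis (prcomp centers but does not scale by default; object names are illustrative):
pca_unscaled <- prcomp(USArrests)
summary(pca_unscaled)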
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 83.7324 14.21240 6.4894 2.48279
Proportion of Variance 0.9655 0.02782 0.0058 0.00085
Cumulative Proportion 0.9655 0.99335 0.9991 1.00000
The first principal component in the unscaled analysis explains 96.55% of the variability. The component is dominated by the Assault variable, with a loading coefficient of 0.9952.
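Repeating the analysis with scaling:
pca_scaled <- prcomp(USArrests, scale.=TRUE)
summary(pca_scaled)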
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.5749 0.9949 0.59713 0.41645
Proportion of Variance 0.6201 0.2474 0.08914 0.04336
Cumulative Proportion 0.6201 0.8675 0.95664 1.00000
In the scaled analysis the variables are on equal footing with respect to their dispersion. The loading coefficients are more evenly sized in the first principal component, which explains 62.01% of the variability in the scaled analysis. Despite its large variance relative to the other variables, Assault is no longer allowed to dominate the component analysis.
The biplots for the scaled and unscaled analyses are markedly different. The dominance of Assault in the unscaled analysis makes it difficult, if not impossible, to meaningfully interpret the biplot (Figure 24.3).
Figure 24.3: Biplots in unscaled (left) and scaled (right) analysis.
In the next section, where input values are reconstructed from the principal components, we encounter a situation in which we center but do not scale the \(\textbf{X}\) matrix.
24.8 Imputation through Matrix Completion
In the introduction to this chapter we expressed the PCA scores as a linear function of the inputs: \[
z_{ij} = \psi_{1j} x_{i1} + \psi_{2j} x_{i2} + \cdots + \psi_{pj} x_{ip} = \sum_{k=1}^p \psi_{kj}x_{ik}
\]
This relationship can be reversed, expressing the inputs as a linear combination of the PCA scores and the rotations: \[
x_{ij} = \sum_{k=1}^p z_{ik}\psi_{jk}
\]
To demonstrate, we can reconstruct the values for the first observation for the Iris data set from its PCA results. First, we only center the data and do not scale:
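pca_c <- prcomp(iris[,1:4], center=TRUE, scale.=FALSE)   # center only, no scaling
# a sketch: the scores times the transposed rotation undo the projection; add the means back
pca_c$x[1,] %*% t(pca_c$rotation) + pca_c$center
iris[1, 1:4]                                             # the original values, for comparison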
If only a subset of the principal components is used, we obtain an approximation of the input value. If \(M < p\), then
\[
x^*_{ij} = \sum_{k=1}^M z_{ik}\psi_{jk} \approx x_{ij}
\] In fact, \(x^*_{ij}\) is the best \(M\)-dimensional approximation to \(x_{ij}\) in terms of having the smallest Euclidean distance.
The sepal width of the first observation can be approximated based on the first two principal components as follows:
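# a sketch, reusing the centered-only pca_c from above; Sepal.Width is variable j = 2
sum(pca_c$x[1, 1:2] * pca_c$rotation[2, 1:2]) + pca_c$center[2]
iris[1, "Sepal.Width"]   # observed value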
If principal component analysis can be used to approximate the observed input values, then it can be used to predict unobserved values. This is the idea behind using PCA to impute missing values, a technique known as matrix completion. The advantage of using PCA to fill in missing values in matrices is to draw on the correlations between the variables, as compared to imputing missing values by column means, which ignores information from other variables.
It is important to point out, however, that this form of imputation is only appropriate if the missingness process is missing completely at random (MCAR): the missingness is unrelated to any study variable, including the target variable. For example, if a measurement cannot be taken because an instrument randomly stopped functioning, the missing value occurred completely at random. If the measurement cannot be taken because the instrument cannot record objects of that size, then the process is not MCAR.
An iterative algorithm for matrix completion is described in Section 12.3 of James et al. (2021) and implemented in the eimpute library in R. The algorithm starts by imputing missing values with the means of the observed data and then iterates the following steps (a sketch of the iteration follows the list):
Perform a low-rank PCA, that is, a PCA with \(M < p\).
Replace the unobserved values with approximations based on the PCA scores and rotation.
Compute the mean squared error between observed \(x_{ij}\) and predicted \(x^*_{ij}\). As long as the mean squared error continues to decrease, return to step 1.
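The following is a minimal sketch of this iteration written directly with prcomp; it is not the eimpute implementation, and the function name pca_complete and its arguments are illustrative. X is assumed to be a numeric matrix containing NAs.
pca_complete <- function(X, M=2, maxit=100, tol=1e-7) {
    miss <- is.na(X)
    Xhat <- X
    # start: fill missing entries with the column means of the observed data
    Xhat[miss] <- matrix(colMeans(X, na.rm=TRUE), nrow(X), ncol(X), byrow=TRUE)[miss]
    mse_old <- Inf
    for (it in 1:maxit) {
        # step 1: low-rank PCA of the completed matrix (centered only)
        pc <- prcomp(Xhat, center=TRUE, scale.=FALSE)
        # step 2: rank-M approximation x*_ij; replace the unobserved entries
        approx <- pc$x[, 1:M] %*% t(pc$rotation[, 1:M]) +
                  matrix(pc$center, nrow(X), ncol(X), byrow=TRUE)
        Xhat[miss] <- approx[miss]
        # step 3: mean squared error over the observed entries
        mse <- mean((X[!miss] - approx[!miss])^2)
        if (mse_old - mse < tol) break
        mse_old <- mse
    }
    Xhat
}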
Figure 24.4 displays observed and imputed values for the Iris data set when 20% of the observations are randomly set to missing (by assigning NA). Two methods of imputation are used: using the column means, and a low-rank PCA approximation with 2 principal components using eimpute(x=iris_miss,r=2).
There are only four possible imputed values when using the column means: all missing values for a particular variable are replaced with the same column mean. The PCA imputation produces much more variety in the imputed values since it draws on the correlations between the variables.
Because missing values were introduced artificially, we have access to the true values and can compute the correlation between original and imputed values. This correlation is 0.83 for the imputation based on column means and 0.93 for the imputation based on the PCA.
Figure 24.4: Comparison of original and imputed values for Iris data with 20% missingness.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R, 2nd ed. Springer. https://www.statlearning.com/.