# The intuition of Principal Component Analysis

Because PCA and linear autoencoders are closely related, this post revisits PCA as a powerful dimensionality reduction tool, skipping most of the mathematical proofs.

PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components.

Wikipedia

### PCA intuition

Suppose that our data has $D$ dimensions. We want to transform the data to $K$ dimensions ($K < D$) so that it keeps the most important information.

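$$
\begin{aligned}
[D \times N] &= [D \times D]\,[D \times N] \\
&= [D \times (K + D - K)]\,[(K + D - K) \times N] \\
&= [D \times K]\,[K \times N] + [D \times (D - K)]\,[(D - K) \times N]
\end{aligned}
$$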

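Consider a new orthogonal coordinate system, described by the $D \times D$ matrix in the decomposition above, and take the matrix formed by its first $K$ column vectors. The objective of PCA is to find an orthogonal coordinate system in which most of the information falls in the first term, $[D \times K][K \times N]$ (green in the original figure), while the second term, $[D \times (D-K)][(D-K) \times N]$ (red), is replaced by a matrix (a bias) that is independent of the original data.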
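The shape bookkeeping above can be checked numerically. The following is a minimal sketch, assuming random data and an arbitrary orthogonal matrix obtained from a QR factorization. The names `X`, `U`, `Z`, `U_K` and `U_rest` and the sizes are chosen here for illustration and are not from the original post.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, K = 4, 10, 2            # illustrative sizes (assumption)

X = rng.normal(size=(D, N))   # data matrix, one column per observation

# Any orthogonal D x D matrix satisfies the identity; QR provides one.
U, _ = np.linalg.qr(rng.normal(size=(D, D)))

Z = U.T @ X                   # coordinates of X in the new system, shape (D, N)

U_K, U_rest = U[:, :K], U[:, K:]   # first K columns and remaining D-K columns
kept = U_K @ Z[:K, :]              # the [D x K][K x N] term
dropped = U_rest @ Z[K:, :]        # the [D x (D-K)][(D-K) x N] term

print(np.allclose(X, U @ Z))            # True
print(np.allclose(X, kept + dropped))   # True
```

The identity holds for every orthogonal matrix. PCA's job is to pick the one for which the `dropped` term carries as little of the data's variation as possible.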

[Figure: Objective]

### PCA procedure

1. Find the expectation (mean) vector of the data.
2. Subtract the mean from every data point.
3. Calculate the covariance matrix.
4. Calculate the eigenvectors and eigenvalues of the covariance matrix.
   It is important to notice that these eigenvectors are all unit eigenvectors, i.e., their lengths are all 1. This is very important for PCA, but luckily, most maths packages, when asked for eigenvectors, will give you unit eigenvectors.
5. Choose components and form a feature vector.
   Once the eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives you the components in order of significance. Now, if you like, you can decide to ignore the components of lesser significance. You do lose some information, but if the eigenvalues are small, you don’t lose much.
6. Derive the new data set.

### Applying PCA to the Iris dataset

We use PCA from sklearn on the iris dataset. The dataset contains 150 records with five attributes: four features (sepal length, sepal width, petal length and petal width) and the species, which serves as the target.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
%matplotlib inline

Load the data and standardize the features using StandardScaler:

iris = datasets.load_iris()
features = ['sepal length', 'sepal width', 'petal length', 'petal width']
# Separating out the features
x = iris.data[:, :4]
# Separating out the target
y = iris.target
# Standardizing the features
x = StandardScaler().fit_transform(x)

After scaling, every feature in x has zero mean and unit variance. The first 5 rows:

array([[-0.90068117,  1.01900435, -1.34022653, -1.3154443 ],
[-1.14301691, -0.13197948, -1.34022653, -1.3154443 ],
[-1.38535265,  0.32841405, -1.39706395, -1.3154443 ],
[-1.50652052,  0.09821729, -1.2833891 , -1.3154443 ],
[-1.02184904,  1.24920112, -1.34022653, -1.3154443 ]])
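To make the scaling step concrete, here is a small sketch that reproduces it by hand. It assumes that `StandardScaler` divides by the population standard deviation (`ddof=0`), which is NumPy's default for `std`. The variable names `raw`, `manual` and `scaled` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

raw = datasets.load_iris().data            # shape (150, 4)

# Subtract the per-feature mean and divide by the per-feature standard deviation
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)
scaled = StandardScaler().fit_transform(raw)

print(np.allclose(manual, scaled))             # True
print(np.allclose(scaled.mean(axis=0), 0.0))   # True
print(np.allclose(scaled.std(axis=0), 1.0))    # True
```

Standardizing matters for PCA because a feature measured on a larger scale would otherwise dominate the covariance matrix.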

Now we transform 4-D data into 2-D data using PCA.

pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents
, columns = ['PC 1', 'PC 2'])
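As a cross-check, the six steps of the PCA procedure can be reproduced by hand in NumPy and compared with sklearn's result. This is a sketch rather than a description of sklearn's internals. It assumes the covariance is normalized by N - 1 (the `np.cov` default), and the names `X_centered`, `components` and `X_manual` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(datasets.load_iris().data)  # (150, 4), rows are samples

# Steps 1-2: compute the mean vector and subtract it
mean = X.mean(axis=0)
X_centered = X - mean

# Step 3: covariance matrix of the four features
cov = np.cov(X_centered, rowvar=False)

# Step 4: eigen-decomposition; eigh is intended for symmetric matrices
# and returns unit-length eigenvectors as columns
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 5: sort by eigenvalue, highest first, and keep the top K eigenvectors
order = np.argsort(eigvals)[::-1]
K = 2
components = eigvecs[:, order[:K]]      # shape (4, K)

# Step 6: project the centered data onto the kept components
X_manual = X_centered @ components      # shape (150, K)

# Compare with sklearn; each component is only defined up to its sign
X_sklearn = PCA(n_components=K).fit_transform(X)
print(np.allclose(np.abs(X_manual), np.abs(X_sklearn)))   # True
```

Absolute values are compared because an eigenvector multiplied by -1 is still a valid eigenvector, so two implementations may legitimately flip an axis.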

Plot the transformed data:

# Attach the species label to each projected point
finalDf = pd.concat([principalDf, pd.DataFrame(y, columns=['target'])], axis=1)
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(1, 1, 1)  # create the axes that the calls below draw on
ax.set_xlabel('Principal Component 1', fontsize=15)
ax.set_ylabel('Principal Component 2', fontsize=15)
ax.set_title('2 component PCA', fontsize=20)
targets = [0, 1, 2]
colors = ['r', 'g', 'b']
# One scatter call per species so each gets its own color and legend entry
for target, color in zip(targets, colors):
    indicesToKeep = finalDf['target'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'PC 1'],
               finalDf.loc[indicesToKeep, 'PC 2'],
               c=color,
               s=50)
ax.legend(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
ax.grid()

### How much information is preserved

To measure how much information the new coordinate system retains, we can use PCA's explained variance ratio.

evr = pca.explained_variance_ratio_
np.sum(evr)
#evr= [0.72962445 0.22850762]
#sum = 0.9581320720000165

We can see that the first principal component contains 72.96% of the variance and the second principal component contains 22.85% of the variance. Together, the two components retain 95.81% of the variance, so we lose only 4.19% of the variance of the original data, which is not so bad.
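The same ratios can guide the choice of how many components to keep. The sketch below fits PCA with all four components, confirms that each ratio is an eigenvalue divided by the sum of all eigenvalues, and picks the smallest number of components that reaches a chosen threshold. The 0.95 threshold and the names `pca_full`, `cumulative` and `threshold` are assumptions made for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(datasets.load_iris().data)

# Fit with all 4 components to see the full variance breakdown
pca_full = PCA(n_components=4).fit(X)
evr_full = pca_full.explained_variance_ratio_
cumulative = np.cumsum(evr_full)

# Each ratio is a covariance eigenvalue divided by the sum of all eigenvalues
eigvals = pca_full.explained_variance_
print(np.allclose(evr_full, eigvals / eigvals.sum()))   # True

# Smallest K whose cumulative ratio reaches the threshold (0.95 is arbitrary)
threshold = 0.95
K = int(np.searchsorted(cumulative, threshold) + 1)
print(K)   # 2
```

With a stricter threshold such as 0.99, the same code would select three components, which shows how the cut-off trades dimensionality against retained variance.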

### Conclusion

Both PCA and linear autoencoders use a linear transformation for dimensionality reduction. To be effective, there needs to be an underlying low-dimensional structure in the feature space, i.e., the features should have some linear relationship with each other. For non-linear relationships, however, autoencoders with non-linear activation functions are more flexible.