Since PCA and the linear autoencoder are closely related, this post revisits PCA as a powerful dimensionality-reduction tool while skipping most of the mathematical proofs.
> PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components.
>
> — Wikipedia
PCA intuition
Suppose our data matrix has D dimensions and N samples, i.e. it is [DxN]. We want to transform the data to K dimensions (K < D). Expressing the data in a full orthogonal basis and splitting it into the first K directions and the remaining D-K:

[DxN] = [DxD][DxN]
= [Dx(K + D-K)][(K + D-K)xN]
= [DxK][KxN] + [Dx(D-K)][(D-K)xN]
Consider a new orthogonal coordinate system. The objective of PCA is to find the orthogonal coordinate system in which most of the information in the data is captured by the first K coordinates, so that the remaining D-K coordinates can be discarded with little loss.
Objective
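Concretely, the first principal direction can be written as the unit vector that maximizes the variance of the projected data (a standard formulation, stated here for completeness):

```latex
\mathbf{w}_1 = \arg\max_{\lVert \mathbf{w} \rVert = 1} \mathbf{w}^\top \mathbf{S}\, \mathbf{w}
```

where S is the covariance matrix of the mean-subtracted data. The maximizer is the unit eigenvector of S with the largest eigenvalue, and subsequent components are the remaining eigenvectors in decreasing order of eigenvalue.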
PCA procedure
- Find the expectation (mean) vector:
- Subtract the mean
- Calculate the covariance matrix
- Calculate the eigenvectors and eigenvalues of the covariance matrix:
It is important to notice that these eigenvectors are unit eigenvectors, i.e. their lengths are 1. This is very important for PCA, but luckily, most math packages, when asked for eigenvectors, will give you unit eigenvectors.
- Choose components and form a feature vector
Once the eigenvectors of the covariance matrix are found, the next step is to order them by eigenvalue, highest to lowest. This gives the components in order of significance. You can then decide to ignore the components of lesser significance. You do lose some information, but if their eigenvalues are small, you don't lose much.
- Derive the new data set
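The steps above can be sketched in plain NumPy (an illustrative sketch on random data, not how sklearn implements PCA internally):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))           # 100 samples, D = 4 features
K = 2                                   # target dimensionality

# 1. Find the mean vector and subtract it
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix (D x D)
cov = np.cov(X_centered, rowvar=False)

# 3. Eigenvectors and eigenvalues (eigh returns unit eigenvectors
#    for symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Order components by eigenvalue, highest to lowest, keep the top K
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:K]]      # D x K feature vector

# 5. Derive the new data set by projecting onto the components
X_reduced = X_centered @ components     # N x K
```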
Applying PCA to the Iris dataset
We use PCA from sklearn
on the iris dataset. The dataset contains 150 records, each with four features – sepal length, sepal width, petal length and petal width – plus the species label.
Dataset Order | Sepal length (cm) | Sepal width (cm) | Petal length (cm) | Petal width (cm) | Species |
---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | I. setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | I. setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | I. setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | I. setosa |
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
%matplotlib inline
Load and transform data using StandardScaler
iris = datasets.load_iris()
features = ['sepal length', 'sepal width', 'petal length', 'petal width']
# Separating out the features
x = iris.data[:, :4]
# Separating out the target
y = iris.target
# Standardizing the features
x = StandardScaler().fit_transform(x)
The new x
now holds standardized values (zero mean, unit variance per feature). The first 5 rows:
array([[-0.90068117, 1.01900435, -1.34022653, -1.3154443 ],
[-1.14301691, -0.13197948, -1.34022653, -1.3154443 ],
[-1.38535265, 0.32841405, -1.39706395, -1.3154443 ],
[-1.50652052, 0.09821729, -1.2833891 , -1.3154443 ],
[-1.02184904, 1.24920112, -1.34022653, -1.3154443 ]])
Now we transform 4-D data into 2-D data using PCA.
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data=principalComponents, columns=['PC 1', 'PC 2'])
Plot the transformed data:
finalDf = pd.concat([principalDf, pd.DataFrame(y, columns=['target'])], axis = 1)
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1)
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)
targets = [0, 1, 2]
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    indicesToKeep = finalDf['target'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'PC 1'],
               finalDf.loc[indicesToKeep, 'PC 2'],
               c=color, s=50)
ax.legend(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
ax.grid()
How much information is preserved?
To measure how much information the new coordinate system retains, we can use PCA's explained variance ratio.
evr = pca.explained_variance_ratio_
np.sum(evr)
#evr= [0.72962445 0.22850762]
#sum = 0.9581320720000165
We can see that the first principal component contains 72.96% of the variance and the second principal component contains 22.85% of the variance. Together, the two components contain 95.81% of the information. We lose 4.19% of the information of the original data which is not so bad.
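To see the information loss directly, we can map the 2-D representation back to 4-D with `inverse_transform` and measure the reconstruction error; for standardized data, the relative error matches the unexplained variance ratio:

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the iris features and reduce to 2 components, as above
x = StandardScaler().fit_transform(datasets.load_iris().data)
pca = PCA(n_components=2)
reduced = pca.fit_transform(x)

# Map the 2-D representation back to 4-D and measure what was lost
reconstructed = pca.inverse_transform(reduced)
mse = np.mean((x - reconstructed) ** 2)

# Relative reconstruction error equals the unexplained variance ratio
print(mse / np.mean(x ** 2))  # ≈ 0.042, i.e. 1 - 0.9581
```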
Conclusion
Both PCA and the linear autoencoder use a linear transformation for dimensionality reduction. To be effective, there needs to be an underlying low-dimensional structure in the feature space, i.e. the features should have some linear relationship with each other. For non-linear relationships, however, autoencoders are more flexible, since they can use non-linear activation functions.
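For illustration, a minimal linear autoencoder can be trained with plain NumPy gradient descent (a sketch; the layer sizes, learning rate, and iteration count are arbitrary choices, not from the post):

```python
import numpy as np

# A linear autoencoder: encoder W1 (D x K) and decoder W2 (K x D),
# both purely linear, trained to minimize reconstruction error.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X = X - X.mean(axis=0)                    # center the data, as in PCA

D, K = X.shape[1], 2
W1 = rng.normal(scale=0.1, size=(D, K))   # encoder weights
W2 = rng.normal(scale=0.1, size=(K, D))   # decoder weights
lr = 1e-3                                 # assumed learning rate

for _ in range(2000):
    Z = X @ W1                            # encode to K dimensions
    err = Z @ W2 - X                      # reconstruction residual
    # Gradients of the squared loss (constant factors absorbed into lr)
    grad_W2 = Z.T @ err
    grad_W1 = X.T @ (err @ W2.T)
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

loss = np.mean((X @ W1 @ W2 - X) ** 2)
```

With no activation function, the best this network can do is recover the same K-dimensional subspace that PCA finds, which is exactly the close relation noted above.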