A.I, Data and Software Engineering

The intuition of Principal Component Analysis


As PCA and the linear autoencoder are closely related, this post revisits PCA as a powerful dimensionality reduction tool, while skipping most of the mathematical proofs.

PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components.


PCA intuition

Suppose that our data has D-dimensions. We want to transform the data to K-dimensions (K < D) so that it keeps the most important information.

[DxN] = [DxD][DxN]
      = [Dx(K + (D-K))][(K + (D-K))xN]
      = [DxK][KxN] + [Dx(D-K)][(D-K)xN]

Consider a new orthogonal coordinate system U = [U_K, \bar{U}_K], where U_K is the matrix formed by the first K column vectors of U.

    \[ \left[ \begin{matrix} \mathbf{Z} \\ \mathbf{Y} \end{matrix} \right] = \left[ \begin{matrix} \mathbf{U}_K^T \\ \bar{\mathbf{U}}_K^T \end{matrix} \right]\mathbf{X} \Rightarrow \begin{matrix} \mathbf{Z} = \mathbf{U}_K^T \mathbf{X} \\ \mathbf{Y} = \bar{\mathbf{U}}_K^T\mathbf{X} \end{matrix} ~~~ (1)\]

The objective of PCA is to find an orthogonal coordinate system such that most of the information is captured by the kept part U_K Z, while the discarded part \bar{U}_K Y can be replaced by a bias matrix that is independent of the original data.


    \[\mathbf{X} \approx \tilde{\mathbf{X}} = \mathbf{U}_K \mathbf{Z} + \bar{\mathbf{U}}_K \bar{\mathbf{U}}_K^T\bar{\mathbf{x}}\mathbf{1}^T ~~~ (3)\]
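The identity behind this split can be sketched in NumPy. Here a random orthogonal U (from a QR decomposition) stands in for the one PCA would actually choose, and all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, K = 4, 10, 2

# Toy data matrix X: D rows (features), N columns (samples)
X = rng.normal(size=(D, N))
x_bar = X.mean(axis=1, keepdims=True)

# Any orthogonal U satisfies X = U U^T X; take one from a QR decomposition
U, _ = np.linalg.qr(rng.normal(size=(D, D)))
U_K, U_K_bar = U[:, :K], U[:, K:]

Z = U_K.T @ X                    # K x N: the kept coordinates
Y = U_K_bar.T @ X                # (D-K) x N: the discarded coordinates

# Exact decomposition: X = U_K Z + U_K_bar Y
assert np.allclose(U_K @ Z + U_K_bar @ Y, X)

# PCA's approximation: replace the discarded part with a data-independent bias
X_tilde = U_K @ Z + U_K_bar @ (U_K_bar.T @ x_bar) @ np.ones((1, N))
print(np.linalg.norm(X - X_tilde))   # reconstruction error
```

For a random U the error is whatever happens to fall in the discarded directions; PCA picks U so that this error is minimal.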

PCA procedure

  1. Find expectation (mean) vectors:

        \[\bar{x} = \frac{1}{N}\sum_{n=1}^Nx_n\]

  2. Subtract the mean

        \[\hat{x}_n = x_n - \bar{x}\]

  3. Calculate the covariance matrix

        \[\mathbf{S} = \frac{1}{N}\hat{\mathbf{X}}\hat{\mathbf{X}}^T\]

  4. Calculate the eigenvectors and eigenvalues of the covariance matrix:
    It is important to notice that these eigenvectors are unit eigenvectors, i.e. their lengths are 1. This is very important for PCA, but luckily, most maths packages, when asked for eigenvectors, will give you unit eigenvectors.
  5. Choosing components and forming a feature vector
    Once the eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives you the components in order of significance. Now, if you like, you can decide to ignore the components of lesser significance. You do lose some information, but if the eigenvalues are small, you don’t lose much.
  6. Deriving the new data set

        \[\mathbf{Z} = \mathbf{U}_K^T\hat{\mathbf{X}}\]
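The six steps above can be sketched with NumPy on toy data (illustrative only; note that the text divides the covariance by N, whereas sklearn uses N−1):

```python
import numpy as np

rng = np.random.default_rng(42)
D, N, K = 4, 150, 2
X = rng.normal(size=(D, N))          # columns are data points

# 1-2. Mean vector and centred data
x_bar = X.mean(axis=1, keepdims=True)
X_hat = X - x_bar

# 3. Covariance matrix
S = (X_hat @ X_hat.T) / N

# 4. Eigenvectors/eigenvalues; eigh returns unit eigenvectors
eigvals, eigvecs = np.linalg.eigh(S)

# 5. Sort by eigenvalue, highest first, and keep the top K
order = np.argsort(eigvals)[::-1]
U_K = eigvecs[:, order[:K]]

# 6. Project to get the new K x N data set
Z = U_K.T @ X_hat
print(Z.shape)   # (2, 150)
```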

PCA procedure summary

Applying PCA to the Iris dataset

We use PCA from sklearn on the Iris dataset. The dataset contains 150 records with four features – sepal length, sepal width, petal length and petal width – plus the species label.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
%matplotlib inline

Load the data and transform it using StandardScaler:

iris = datasets.load_iris()
features = ['sepal length', 'sepal width', 'petal length', 'petal width']
# Separating out the features
x = iris.data[:, :4]
# Separating out the target
y = iris.target
# Standardizing the features
x = StandardScaler().fit_transform(x)

The new x holds the scaled values. The first five rows:

array([[-0.90068117,  1.01900435, -1.34022653, -1.3154443 ],
       [-1.14301691, -0.13197948, -1.34022653, -1.3154443 ],
       [-1.38535265,  0.32841405, -1.39706395, -1.3154443 ],
       [-1.50652052,  0.09821729, -1.2833891 , -1.3154443 ],
       [-1.02184904,  1.24920112, -1.34022653, -1.3154443 ]])
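As a quick sanity check (a hypothetical snippet, not part of the original walkthrough), each standardized column should now have zero mean and unit variance:

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

x = StandardScaler().fit_transform(datasets.load_iris().data)

# Each column now has (approximately) zero mean and unit variance
print(np.abs(x.mean(axis=0)).max())   # close to 0
print(x.std(axis=0))                  # close to [1, 1, 1, 1]
```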

Now we transform 4-D data into 2-D data using PCA.

pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents
             , columns = ['PC 1', 'PC 2'])

Plot the transformed data:

finalDf = pd.concat([principalDf, pd.DataFrame(y, columns=['target'])], axis = 1)
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1)
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)
targets = [0, 1, 2]
colors = ['r', 'g', 'b']
for target, color in zip(targets,colors):
    indicesToKeep = finalDf['target'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'PC 1']
               , finalDf.loc[indicesToKeep, 'PC 2']
               , c = color
               , s = 50)
ax.legend(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
Principal component analysis of the Iris data in two dimensions

How much information is preserved?

To measure how much information the new coordinate system keeps, we can use the explained variance ratio from PCA.

evr = pca.explained_variance_ratio_
#evr= [0.72962445 0.22850762]
#sum = 0.9581320720000165

We can see that the first principal component contains 72.96% of the variance and the second principal component contains 22.85% of the variance. Together, the two components contain 95.81% of the information. We lose only 4.19% of the information in the original data, which is not so bad.
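If you prefer to fix a variance budget instead of a component count, sklearn's PCA also accepts a float for n_components: it then keeps the smallest number of components whose cumulative explained variance reaches that fraction. For the standardized Iris data a 95% budget yields two components, matching the numbers above:

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

x = StandardScaler().fit_transform(datasets.load_iris().data)

# Keep as many components as needed to reach >= 95% of the variance
pca95 = PCA(n_components=0.95)
z = pca95.fit_transform(x)
print(pca95.n_components_)                        # 2
print(pca95.explained_variance_ratio_.sum())      # ~0.9581
```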


Both PCA and the linear autoencoder use a linear transformation for dimensionality reduction. To be effective, there needs to be an underlying low-dimensional structure in the feature space, i.e. the features should have some linear relationship with each other. For non-linear relationships, however, autoencoders are more flexible thanks to non-linear activation functions.
