
Since PCA and the linear autoencoder are closely related, this post revisits PCA as a powerful dimensionality reduction tool, skipping most mathematical proofs.

> PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. — Wikipedia

### PCA intuition

Suppose that our data has \(D\) dimensions. We want to transform it to \(K\) dimensions (\(K < D\)) while keeping the most important information.

In terms of matrix dimensions:

$$\begin{eqnarray}
[D \times N] &=& [D \times D]\,[D \times N] \\
&=& [D \times (K + (D-K))]\,[(K + (D-K)) \times N] \\
&=& [D \times K]\,[K \times N] + [D \times (D-K)]\,[(D-K) \times N]
\end{eqnarray}$$

Consider a new orthogonal coordinate system \(\mathbf{U} = [\mathbf{U}_K, \bar{\mathbf{U}}_K]\), where \(\mathbf{U}_K\) consists of the first \(K\) columns of \(\mathbf{U}\) and \(\bar{\mathbf{U}}_K\) of the remaining \(D-K\) columns.

$$\begin{eqnarray}
\left[
\begin{matrix}
\mathbf{Z} \\ \mathbf{Y}
\end{matrix}
\right] =
\left[
\begin{matrix}
\mathbf{U}_K^T \\ \bar{\mathbf{U}}_K^T
\end{matrix}
\right]\mathbf{X}
\Rightarrow
\begin{matrix}
\mathbf{Z} = \mathbf{U}_K^T \mathbf{X} \\
\mathbf{Y} = \bar{\mathbf{U}}_K^T\mathbf{X}
\end{matrix}
\end{eqnarray}$$

The objective of PCA is to find an orthogonal coordinate system such that most of the information is captured by \(\mathbf{U}_K\mathbf{Z}\), while \(\bar{\mathbf{U}}_K\mathbf{Y}\) is replaced by a bias matrix that is independent of the original data.
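For any orthogonal coordinate system \(\mathbf{U}\), this split is exact — \(\mathbf{X} = \mathbf{U}_K\mathbf{Z} + \bar{\mathbf{U}}_K\mathbf{Y}\). A quick `numpy` check (a sketch with random data, not part of the original derivation) illustrates the point:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, K = 4, 10, 2
X = rng.normal(size=(D, N))

# A random orthogonal coordinate system U, split into U_K and U_K_bar
U, _ = np.linalg.qr(rng.normal(size=(D, D)))
U_K, U_K_bar = U[:, :K], U[:, K:]

Z = U_K.T @ X        # (K, N): the part PCA keeps
Y = U_K_bar.T @ X    # (D-K, N): the part PCA replaces with a bias

# The decomposition is exact; PCA only chooses U so that Z carries the most information
print(np.allclose(X, U_K @ Z + U_K_bar @ Y))  # True
```

PCA's job is therefore not to make this equality hold — it always does — but to pick the \(\mathbf{U}\) that concentrates the variance into \(\mathbf{Z}\).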

**Objective:**

$$\mathbf{X} \approx \tilde{\mathbf{X}} = \mathbf{U}_K \mathbf{Z} + \bar{\mathbf{U}}_K \bar{\mathbf{U}}_K^T\bar{\mathbf{x}}\mathbf{1}^T$$

### PCA procedure

**Find the mean vector:**

$$\bar{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^N \mathbf{x}_n$$

**Subtract the mean:**

$$\hat{\mathbf{x}}_n = \mathbf{x}_n - \bar{\mathbf{x}}$$

**Calculate the covariance matrix:**

$$\mathbf{S} = \frac{1}{N}\hat{\mathbf{X}}\hat{\mathbf{X}}^T$$

**Calculate the eigenvectors and eigenvalues of the covariance matrix:**

It is important to notice that these eigenvectors are unit eigenvectors, i.e., each has length 1. This is very important for PCA, but luckily most math packages, when asked for eigenvectors, will give you unit eigenvectors.

**Choose components and form a feature vector:**

Once the eigenvectors of the covariance matrix are found, the next step is to order them by eigenvalue, highest to lowest. This gives you the components in order of significance. Now, if you like, you can decide to ignore the components of lesser significance. You do lose some information, but if the eigenvalues are small, you don't lose much.

**Derive the new data set:**

$$\mathbf{Z} = \mathbf{U}_K^T\hat{\mathbf{X}}$$
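The steps above can be sketched from scratch in `numpy` (a minimal illustration; the function name and variable names are my own, and samples are stored as columns as in the equations above):

```python
import numpy as np

def pca_manual(X, K):
    """PCA via the covariance matrix. X has shape (D, N): one column per sample."""
    # Step 1: mean vector
    x_bar = X.mean(axis=1, keepdims=True)       # (D, 1)
    # Step 2: subtract the mean
    X_hat = X - x_bar                           # (D, N)
    # Step 3: covariance matrix
    S = X_hat @ X_hat.T / X.shape[1]            # (D, D)
    # Step 4: eigenvectors and eigenvalues (eigh returns them in ascending order)
    eigvals, eigvecs = np.linalg.eigh(S)
    # Step 5: keep the K eigenvectors with the largest eigenvalues
    U_K = eigvecs[:, ::-1][:, :K]               # (D, K)
    # Step 6: project the centered data onto the new coordinate system
    Z = U_K.T @ X_hat                           # (K, N)
    return Z, U_K

# Usage with random stand-in data (4-D, 150 samples, as in the Iris example below)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 150))
Z, U_K = pca_manual(X, 2)
print(Z.shape)  # (2, 150)
```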

### Applying PCA to the Iris dataset

We use PCA from `sklearn` on the Iris dataset. The dataset contains 150 records with four features (sepal length, sepal width, petal length, petal width) and the species label.

| Dataset Order | Sepal length | Sepal width | Petal length | Petal width | Species |
|---|---|---|---|---|---|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | I. setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | I. setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | I. setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | I. setosa |

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

%matplotlib inline
```

Load and transform data using StandardScaler

```python
iris = datasets.load_iris()
features = ['sepal length', 'sepal width', 'petal length', 'petal width']

# Separating out the features
x = iris.data[:, :4]
# Separating out the target
y = iris.target
# Standardizing the features
x = StandardScaler().fit_transform(x)
```

The new `x` now holds the standardized values. The first 5 rows:

```python
array([[-0.90068117,  1.01900435, -1.34022653, -1.3154443 ],
       [-1.14301691, -0.13197948, -1.34022653, -1.3154443 ],
       [-1.38535265,  0.32841405, -1.39706395, -1.3154443 ],
       [-1.50652052,  0.09821729, -1.2833891 , -1.3154443 ],
       [-1.02184904,  1.24920112, -1.34022653, -1.3154443 ]])
```

Now we transform 4-D data into 2-D data using PCA.

```python
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['PC 1', 'PC 2'])
```

Plot the transformed data:

```python
finalDf = pd.concat([principalDf, pd.DataFrame(y, columns=['target'])], axis=1)

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('Principal Component 1', fontsize=15)
ax.set_ylabel('Principal Component 2', fontsize=15)
ax.set_title('2 component PCA', fontsize=20)

targets = [0, 1, 2]
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    indicesToKeep = finalDf['target'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'PC 1'],
               finalDf.loc[indicesToKeep, 'PC 2'],
               c=color, s=50)
ax.legend(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
ax.grid()
```

### How much information is preserved

To measure how much information the new coordinate system retains, we can use PCA's explained variance ratio.

```python
evr = pca.explained_variance_ratio_
np.sum(evr)
# evr = [0.72962445 0.22850762]
# sum = 0.9581320720000165
```

We can see that the first principal component contains 72.96% of the variance and the second principal component contains 22.85%. Together, the two components preserve 95.81% of the information. We lose only 4.19% of the information in the original data, which is not bad.
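As a sanity check, the same ratios can be recovered directly from the eigenvalues of the covariance matrix of the standardized data (a sketch, not part of the original post; the normalization constant of the covariance cancels in the ratio):

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

x = StandardScaler().fit_transform(datasets.load_iris().data)
S = np.cov(x, rowvar=False)                      # covariance of the standardized features
eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]   # eigenvalues, descending
ratio = eigvals / eigvals.sum()
print(ratio[:2])  # ≈ [0.7296 0.2285], matching explained_variance_ratio_
```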

### Conclusion

Both PCA and the linear autoencoder use a linear transformation for dimensionality reduction. To be effective, there needs to be an underlying low-dimensional structure in the feature space, i.e., the features should have some *linear relationship* with each other. For non-linear relationships, however, autoencoders are more flexible thanks to their non-linear activation functions.
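To make the relationship concrete, here is a minimal linear autoencoder trained with plain gradient descent on toy data (a sketch with made-up data and hyperparameters; `W_e` and `W_d` are my names for the encoder and decoder weights). With squared-error loss, the trained decoder spans the same subspace as the top principal components:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data with low-dimensional linear structure: 2 latent factors embedded in 4-D
# (rows are samples here, following the sklearn convention)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 4)) + 0.05 * rng.normal(size=(200, 4))
X = X - X.mean(axis=0)                     # center the data, as in PCA

K = 2
W_e = rng.normal(scale=0.5, size=(4, K))   # encoder: z = x W_e
W_d = rng.normal(scale=0.5, size=(K, 4))   # decoder: x_hat = z W_d
lr = 0.01

for _ in range(5000):
    Z = X @ W_e                            # encode
    err = Z @ W_d - X                      # reconstruction error
    # Gradients of the mean squared reconstruction loss
    grad_Wd = Z.T @ err / len(X)
    grad_We = X.T @ (err @ W_d.T) / len(X)
    W_d -= lr * grad_Wd
    W_e -= lr * grad_We

mse = np.mean((X @ W_e @ W_d - X) ** 2)    # small relative to the data variance
```

Unlike PCA, gradient descent does not force the learned directions to be orthonormal or ordered by variance; it only finds the same optimal subspace.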