How many principal components to take?

I know that principal component analysis does an SVD on a matrix and then generates an eigenvalue matrix. To select the principal components we have to take only the first few eigenvalues. Now, how do we decide on the number of eigenvalues that we should take from the eigenvalue matrix?

Improvisation answered 22/8, 2012 at 6:31 Comment(0)

To decide how many eigenvalues/eigenvectors to keep, you should consider your reason for doing PCA in the first place. Are you doing it to reduce storage requirements, to reduce dimensionality for a classification algorithm, or for some other reason? If you don't have any strict constraints, I recommend plotting the cumulative sum of the eigenvalues (assuming they are in descending order). If you divide each value by the total sum of the eigenvalues prior to plotting, your plot will show the fraction of total variance retained vs. the number of eigenvalues. The plot then gives a good indication of when you hit the point of diminishing returns (i.e., little additional variance is gained by retaining more eigenvalues).
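
For illustration, here is a minimal Matlab sketch of that plot (a sketch only: it assumes data is your n-by-p observation matrix and that the Statistics Toolbox function pca is available; its third output, latent, holds the eigenvalues in descending order):

% Cumulative fraction of variance retained vs. number of eigenvalues kept
[coeff, score, latent] = pca(data);   % latent: eigenvalues, descending order
explained = latent / sum(latent);     % fraction of total variance per eigenvalue
plot(cumsum(explained), '-o');
xlabel('Number of eigenvalues kept');
ylabel('Fraction of total variance retained');

The point where this curve flattens out, or first crosses a threshold you are comfortable with, is a reasonable cut-off.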

Finedraw answered 22/8, 2012 at 13:11 Comment(1)
Great article with a more detailed explanation and code for the calculation described in the answer below: mikulskibartosz.name/pca-how-to-choose-the-number-of-components – Arium

There is no single correct answer; it lies somewhere between 1 and n.

Think of a principal component as a street in a town you have never visited before. How many streets should you take to get to know the town?

Well, you should obviously visit the main street (the first component), and maybe some of the other big streets too. Do you need to visit every street to know the town well enough? Probably not.

To know the town perfectly, you should visit all of the streets. But what if you could visit, say, 10 out of the 50 streets and have a 95% understanding of the town? Is that good enough?

Basically, you should select enough components to explain as much of the variance as you are comfortable with.

Showman answered 4/9, 2012 at 15:29 Comment(0)

As others said, it doesn't hurt to plot the explained variance.

If you use PCA as a preprocessing step for a supervised learning task, you should cross-validate the whole data processing pipeline and treat the number of PCA dimensions as a hyperparameter to select with a grid search on the final supervised score (e.g. F1 score for classification or RMSE for regression).

If a cross-validated grid search on the whole dataset is too costly, try it on two subsamples, e.g. one with 1% of the data and the second with 10%, and see whether you come up with the same optimal value for the PCA dimensions.
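
As a rough illustration of the grid-search idea, here is a minimal Matlab sketch. It is only a sketch under assumptions: X is an n-by-p feature matrix, y a column vector of numeric class labels, the Statistics Toolbox is available, and the candidate dimensions and the kNN classifier are arbitrary placeholders (plain accuracy is used instead of F1/RMSE for brevity).

% Treat the number of retained components as a hyperparameter and keep the
% value with the best cross-validated accuracy. PCA is refit on each training
% fold so that no information leaks from the corresponding test fold.
candidates = [2 5 10 20 50];
cv = cvpartition(y, 'KFold', 5);
acc = zeros(size(candidates));
for i = 1:numel(candidates)
    k = candidates(i);
    foldAcc = zeros(cv.NumTestSets, 1);
    for f = 1:cv.NumTestSets
        Xtr = X(training(cv, f), :);  ytr = y(training(cv, f));
        Xte = X(test(cv, f), :);      yte = y(test(cv, f));
        [coeff, ~, ~, ~, ~, mu] = pca(Xtr, 'NumComponents', k);
        Ztr = (Xtr - mu) * coeff;     % project both folds with the loadings
        Zte = (Xte - mu) * coeff;     % and mean estimated on the training fold
        mdl = fitcknn(Ztr, ytr);      % stand-in for your real classifier
        foldAcc(f) = mean(predict(mdl, Zte) == yte);
    end
    acc(i) = mean(foldAcc);
end
[~, best] = max(acc);
bestNumComponents = candidates(best);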

Receptacle answered 22/8, 2012 at 21:8 Comment(0)

There are a number of heuristics used for that.

E.g. taking the first k eigenvectors that capture at least 85% of the total variance.
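
That heuristic is a one-liner in Matlab once the eigenvalues are available (a sketch, assuming latent holds the eigenvalues in descending order, e.g. the third output of pca):

cumVar = cumsum(latent) / sum(latent);   % cumulative fraction of total variance
k = find(cumVar >= 0.85, 1, 'first');    % smallest k that captures at least 85%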

However, for high-dimensional data, these heuristics are usually not very good.

Sliwa answered 22/8, 2012 at 8:42 Comment(2)
Thanks. Just a small doubt: the eigenvectors would be arranged in decreasing order, right? Do you mean the first k eigenvalues that capture 85% of the total sum of the eigenvalues? – Improvisation
Yes, the eigenvalues correspond to the relative variance. But it is questionable whether high variance = high importance. It makes sense in low dimensions, e.g. physical x, y, z. But when the dimensions have different meanings and scales, it doesn't really make sense anymore. – Sliwa

Depending on your situation, it may be useful to define the maximum relative error you are willing to accept when projecting your data onto ndim dimensions.

Matlab example

I will illustrate this with a small Matlab example. Just skip the code if you are not interested in it.

I will first generate a random data matrix of n samples (rows) and p features (columns) containing exactly 100 nonzero principal components.

n = 200;                      % number of samples (rows)
p = 119;                      % number of features (columns)
data = zeros(n, p);
for i = 1:100                 % sum of 100 rank-1 matrices, so rank(data) <= 100
  data = data + rand(n, 1)*rand(1, p);
end

The image will look similar to:

[image: the generated data matrix]

For this sample data, one can calculate the relative error made by projecting the input data onto ndim dimensions as follows:

[coeff, score] = pca(data, 'Economy', true);

relativeError = zeros(p, 1);
for ndim = 1:p
    % Reconstruct the data from the first ndim components; pca centers the
    % data, so the column means are added back for the reconstruction.
    reconstructed = repmat(mean(data,1), n, 1) + score(:,1:ndim)*coeff(:,1:ndim)';
    residuals = data - reconstructed;
    relativeError(ndim) = max(max(residuals ./ data));   % worst-case elementwise relative error
end

Plotting the relative error as a function of the number of dimensions (principal components) results in the following graph:

[plot: maximum relative error vs. number of principal components]

Based on this graph, you can decide how many principal components you need to take into account. In this synthetic example, taking 100 components results in an exact representation of the data, so taking more than 100 components is useless. If you want, for example, at most 5% error, you should take about 40 principal components.

Disclaimer: the values obtained here are only valid for my artificial data. Do not use the proposed values blindly in your own situation; instead, perform the same analysis and make a trade-off between the error you make and the number of components you need.

Code reference

Scirrhous answered 22/6, 2017 at 14:42 Comment(2)
Forgive my ignorance, but what language is your code in? – Gallice
@Gallice The code is written in Matlab. I have updated the answer to indicate this clearly and provided the correct code highlighting. – Scirrhous

I highly recommend the following paper by Gavish and Donoho: The Optimal Hard Threshold for Singular Values is 4/sqrt(3).

I posted a longer summary of this on CrossValidated (stats.stackexchange.com). Briefly, they obtain an optimal procedure in the limit of very large matrices. The procedure is very simple, does not require any hand-tuned parameters, and seems to work very well in practice.
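
For a flavour of the result, here is a minimal Matlab sketch of the headline rule for a square n-by-n data matrix with known noise level sigma (a sketch only; the paper and its supplement also cover rectangular matrices and unknown noise, and X, n and sigma are placeholder names here):

% Hard-threshold the singular values at (4/sqrt(3)) * sqrt(n) * sigma and
% keep only the components whose singular values exceed that threshold.
[U, S, V] = svd(X, 'econ');
s = diag(S);
tau = (4 / sqrt(3)) * sqrt(n) * sigma;   % optimal hard threshold (square matrix, known noise)
r = sum(s > tau);                        % number of singular values / components to keep
Xdenoised = U(:,1:r) * diag(s(1:r)) * V(:,1:r)';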

They have a nice code supplement here: https://purl.stanford.edu/vg705qn9070

Wolfram answered 7/1, 2016 at 1:32 Comment(0)
