How to compare predictive power of PCA and NMF
I would like to compare the output of an algorithm run on differently preprocessed data: NMF and PCA. To get a comparable result, instead of simply choosing the same number of components for both PCA and NMF, I would like to pick the number that explains, e.g., 95% of retained variance.

I was wondering if it's possible to identify the variance retained in each component of NMF.

For instance, using PCA this would be given by: retainedVariance(i) = eigenvalue(i) / sum(eigenvalues)
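In sklearn this per-component ratio can be computed directly from the fitted eigenvalues; a minimal sketch using the Iris dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA().fit(X)  # keep all components
# eigenvalues of the covariance matrix are exposed as explained_variance_
retained = pca.explained_variance_ / pca.explained_variance_.sum()
# sklearn precomputes the same ratio:
assert np.allclose(retained, pca.explained_variance_ratio_)
```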

Any ideas?

Tenorrhaphy answered 8/1, 2018 at 10:46 Comment(3)
I have the same question. Is there any relationship between PCA and NMF? – Bronchus
Just looking at one metric may not be enough. What if the relationships discovered in the output of the dimensionality reduction algorithm do not make any sense, even if it covers 95% of variance? This can happen if you have outliers or noise and don't use robust methods. – Pachisi
My main question is: how do I select the number of components in order to retain a large amount of variance when using NMF? In NMF there are no eigenvalues or singular values, only X = W*H. – Bronchus

TL;DR

You should loop over different n_components and estimate the explained_variance_score of the decoded X at each iteration. This will show you how many components you need to explain 95% of variance.

Now I will explain why.

Relationship between PCA and NMF

NMF and PCA, like many other unsupervised learning algorithms, aim to do two things:

  • encode input X into a compressed representation H;
  • decode H back to X', which should be as close to X as possible.

They do it in a somewhat similar way:

  • Decoding is similar in PCA and NMF: they output X' = dot(H, W), where W is a learned matrix parameter.
  • Encoding is different. In PCA, it is also linear: H = dot(X, V), where V is also a learned parameter. In NMF, H = argmin(loss(X, H, W)) (with respect to H only), where loss is mean squared error between X and dot(H, W), plus some additional penalties. Minimization is performed by coordinate descent, and result may be nonlinear in X.
  • Training is also different. PCA learns sequentially: the first component minimizes MSE without constraints, and each subsequent kth component minimizes the residual MSE subject to being orthogonal to the previous components. NMF minimizes the same loss(X, H, W) as when encoding, but now with respect to both H and W.
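The shared encode/decode interface described above can be sketched with sklearn, where transform is the encoder and inverse_transform is the decoder (a sketch on the Iris data, which is non-negative, so NMF applies; max_iter is raised only to ensure convergence):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, NMF

X, _ = load_iris(return_X_y=True)  # non-negative, so NMF is applicable

for model in (PCA(n_components=2), NMF(n_components=2, max_iter=500)):
    H = model.fit_transform(X)            # encode: X -> compressed H
    X_prime = model.inverse_transform(H)  # decode: H -> reconstruction X'
    # the reconstruction lives in the original feature space
    assert X_prime.shape == X.shape
```

For both models the decoding step is linear in H (for PCA the stored mean is added back), while the encoding step differs exactly as described above.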

How to measure performance of dimensionality reduction

If you want to measure performance of an encoding/decoding algorithm, you can follow the usual steps:

  1. Train your encoder+decoder on X_train
  2. To measure in-sample performance, compare X_train'=decode(encode(X_train)) with X_train using your preferred metric (e.g. MAE, RMSE, or explained variance)
  3. To measure out-of-sample performance (generalizing ability) of your algorithm, do step 2 with the unseen X_test.

Let's try it with PCA and NMF!

from sklearn import decomposition, datasets, model_selection, preprocessing, metrics
# use the well-known Iris dataset
X, _ = datasets.load_iris(return_X_y=True)
# split the dataset, to measure overfitting
X_train, X_test = model_selection.train_test_split(X, test_size=0.5, random_state=1)
# I scale the data in order to give equal importance to all its dimensions
# NMF does not allow negative input, so I don't center the data
scaler = preprocessing.StandardScaler(with_mean=False).fit(X_train)
X_train_sc = scaler.transform(X_train)
X_test_sc = scaler.transform(X_test)
# train both decomposers
pca = decomposition.PCA(n_components=2).fit(X_train_sc)
nmf = decomposition.NMF(n_components=2).fit(X_train_sc)
print(sum(pca.explained_variance_ratio_))

This prints an explained variance ratio of 0.9536930834362043 - the default metric of PCA, estimated from its eigenvalues. We can measure it more directly - by applying a metric to the actual and "predicted" values:

def get_score(model, data, scorer=metrics.explained_variance_score):
    """ Estimate performance of the model on the data """
    prediction = model.inverse_transform(model.transform(data))
    return scorer(data, prediction)

print('train set performance')
print(get_score(pca, X_train_sc))
print(get_score(nmf, X_train_sc))

print('test set performance')
print(get_score(pca, X_test_sc))
print(get_score(nmf, X_test_sc))

which gives

train set performance
0.9536930834362043 # same as before!
0.937291711378812 
test set performance
0.9597828443047842
0.9590555069007827

You can see that on the training set PCA performs better than NMF, but on the test set their performance is almost identical. This happens because NMF applies lots of regularization:

  • H and W (the learned parameters) must be non-negative
  • H should be as small as possible (L1 and L2 penalties)
  • W should be as small as possible (L1 and L2 penalties)

These regularizations make NMF fit the training data less closely than it otherwise could, but they can improve its generalizing ability, which is what happened in our case.

How to choose the number of components

In PCA, it is simple, because its components h_1, h_2, ... h_k are learned sequentially. If you add a new component h_(k+1), the first k will not change. Thus, you can estimate the performance of each component, and these estimates will not depend on the number of components. This makes it possible for PCA to output the explained_variance_ratio_ array after only a single fit to the data.

NMF is more complex, because all its components are trained at the same time, and each one depends on all the rest. Thus, if you add the (k+1)th component, the first k components will change, and you cannot match each particular component with its explained variance (or any other metric).
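This nesting property of PCA (and its absence in NMF) is easy to check empirically; a sketch on the Iris data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, NMF

X, _ = load_iris(return_X_y=True)
# PCA components are nested: the first axis is the same for k=2 and k=3
pca2 = PCA(n_components=2).fit(X)
pca3 = PCA(n_components=3).fit(X)
assert np.allclose(pca2.components_[0], pca3.components_[0])
# NMF components are not nested: refitting with more components
# generally changes all of them
nmf2 = NMF(n_components=2, max_iter=1000, random_state=0).fit(X)
nmf3 = NMF(n_components=3, max_iter=1000, random_state=0).fit(X)
print(np.allclose(nmf2.components_[0], nmf3.components_[0]))  # typically False
```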

But what you can do is fit a new instance of NMF for each number of components and compare the total explained variance:

ks = [1,2,3,4]
perfs_train = []
perfs_test = []
for k in ks:
    nmf = decomposition.NMF(n_components=k).fit(X_train_sc)
    perfs_train.append(get_score(nmf, X_train_sc))
    perfs_test.append(get_score(nmf, X_test_sc))
print(perfs_train)
print(perfs_test)

which would give

[0.3236945680665101, 0.937291711378812, 0.995459457205891, 0.9974027602663655]
[0.26186701106012833, 0.9590555069007827, 0.9941424954209546, 0.9968456603914185]

Thus, three components (judging by the train set performance) or two components (by the test set) are required to explain at least 95% of variance. Please notice that this case is unusual and caused by the small size of the training and test data: usually performance degrades a little on the test set, but here it actually improved slightly.
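The selection itself can then be automated; a small hypothetical helper (smallest_k is my own name, and the scores are rounded from the run above) that picks the smallest k reaching a target score:

```python
# hypothetical helper: smallest k whose score reaches the target, else None
def smallest_k(ks, perfs, target=0.95):
    return next((k for k, p in zip(ks, perfs) if p >= target), None)

ks = [1, 2, 3, 4]
perfs_test = [0.2619, 0.9591, 0.9941, 0.9968]  # rounded scores from above
print(smallest_k(ks, perfs_test))  # -> 2
```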

Guarino answered 11/3, 2018 at 7:48 Comment(2)
thank you. Is there any direct connection between PCA or SVD and NMF? For example,I know that PCA can be performed via SVD if the initial matrix is centered. But what about PCA and NMF?Bronchus
No, they are not connected directly. NMF has non-negativity constraints (and, in the sklearn implementation, L1 penalties as well) that make analytical solutions such as SVD generally impossible. In some cases the hyperplane of NMF may coincide with that of PCA (if the PCA solution is non-negative), but this is pure chance. And the components themselves may differ even in the case of identical hyperplanes (that is, identical encoding and decoding), because of different rotations and stretches of the coordinate system. – Guarino
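The PCA-via-SVD relationship mentioned in the comment is straightforward to verify numerically; a sketch comparing numpy's SVD of the centered matrix with sklearn's PCA:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
Xc = X - X.mean(axis=0)  # centering is what makes SVD equivalent to PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pca = PCA(n_components=2).fit(X)
# the principal axes match the right singular vectors up to sign
for i in range(2):
    assert np.allclose(np.abs(Vt[i]), np.abs(pca.components_[i]), atol=1e-8)
```

No analogous closed-form route exists for NMF, which is why it must be fitted iteratively.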
