Reconstructing new data using sklearn NMF components vs. inverse_transform does not match

I fit a model using scikit-learn NMF model on my training data. Now I perform an inverse transform of new data using

result_1 = model.inverse_transform(model.transform(new_data))

Then I compute the inverse transform of my data manually, taking the components from the NMF model and using the equation on Slide 15 here.

temp = np.dot(model.components_, model.components_.T)
transform = np.dot(np.dot(model.components_.T, np.linalg.pinv(temp)), model.components_)
result_2 = np.dot(new_data, transform)

I would like to understand why the two results do not match. What am I doing wrong when computing the inverse transform and reconstructing the data?

Sample code:

import numpy as np
from sklearn.decomposition import NMF

data = np.array([[0,0,1,1,1],[0,1,1,0,0],[0,1,0,0,0],[1,0,0,1,0]])
print(data)
# array([[0, 0, 1, 1, 1],
#        [0, 1, 1, 0, 0],
#        [0, 1, 0, 0, 0],
#        [1, 0, 0, 1, 0]])


model = NMF(alpha=0.0, init='random', l1_ratio=0.0, max_iter=200, n_components=2, random_state=0, shuffle=False, solver='cd', tol=0.0001, verbose=0)
model.fit(data)
# NMF(alpha=0.0, beta_loss='frobenius', init='random', l1_ratio=0.0,
#     max_iter=200, n_components=2, random_state=0, shuffle=False, solver='cd',
#     tol=0.0001, verbose=0)

new_data = np.array([[0,0,1,0,0], [1,0,0,0,0]])
print(new_data)
# array([[0, 0, 1, 0, 0],
#        [1, 0, 0, 0, 0]])

result_1 = model.inverse_transform(model.transform(new_data))
print(result_1)
# array([[ 0.09232497,  0.38903892,  0.36668712,  0.23067627,  0.1383513 ],
#        [ 0.0877082 ,  0.        ,  0.12131779,  0.21914115,  0.13143295]])

temp = np.dot(model.components_, model.components_.T)
transform = np.dot(np.dot(model.components_.T, np.linalg.pinv(temp)), model.components_)
result_2 = np.dot(new_data, transform)
print(result_2)
# array([[ 0.09232484,  0.389039  ,  0.36668699,  0.23067595,  0.13835111],
#        [ 0.09193481, -0.05671439,  0.09232484,  0.22970145,  0.13776664]])

Note: Although this is not the best data to illustrate my issue, the code is essentially the same. In my actual case, result_1 and result_2 differ from each other far more, and data and new_data are also large arrays.

Coleencolella asked 17/3, 2018 at 18:39 Comment(8)
You can check the implementation to find the differences.Ppm
The scikit-learn implementation computes the dot product between the transformed data and the components: return np.dot(W, self.components_)Ppm
Of course I did, and I still don't get it. During the transform method, components_ remains the same (from the fit method) and only the new_data is projected onto the latent space. This should be equivalent to what I am doing in the first two lines of the above code. Finally, there is the product with components_ in inverse_transform, which I am also doing. Hence my confusion about why the results are not similar.Coleencolella
@VivekKumar - Yes, it is calculating the dot product, which I am also doing.Coleencolella
Can you include a minimal working example? Maybe even with some random data?Deck
@Charlie - Edited the post for a minimal example.Coleencolella
Thank you! Also, can you please explain what you expect between result_1 and result_2? Do you expect them to be exactly equal? Equal within machine accuracy? Or equal within some specified error?Deck
@Charlie - I was expecting them to be equal within some specified error.Coleencolella

What happens

In scikit-learn, NMF does more than simple matrix multiplication: it optimizes!

Decoding (inverse_transform) is linear: the model computes X_decoded = dot(W, H), where W is the encoded matrix and H = model.components_ is the learned matrix of model parameters.
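
As a quick check (a sketch that assumes the fitted model and new_data from the question's example are in scope), inverse_transform is nothing more than this matrix product:

W = model.transform(new_data)
X_decoded = np.dot(W, model.components_)                    # X_decoded = dot(W, H)
print(np.allclose(X_decoded, model.inverse_transform(W)))   # True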

Encoding (transform), however, is nonlinear: it performs W = argmin(loss(X_original, H, W)) (with respect to W only), where loss is the mean squared error between X_original and dot(W, H), plus some additional penalties (L1 and L2 norms of W), and with the constraint that W must be non-negative. The minimization is performed by coordinate descent, and the result may be nonlinear in X_original. Thus, you cannot obtain W simply by multiplying matrices.
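
To make this concrete, here is a rough sketch of what transform() amounts to when there is no regularization (alpha=0.0, as in the question): a non-negative least-squares problem per row rather than a fixed projection. It uses scipy.optimize.nnls, which is not part of the original code and relies on a different algorithm than scikit-learn's coordinate descent, so the two results only agree approximately:

from scipy.optimize import nnls

H = model.components_
# for each row x of new_data, find w >= 0 minimizing ||x - dot(w, H)||
W_nnls = np.array([nnls(H.T, x)[0] for x in new_data])
print(W_nnls)   # close to model.transform(new_data)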

Why it is so weird

NMF has to perform such strange calculations because, otherwise, the model might produce negative results. Indeed, in your own example, you could try to perform the transform by plain matrix multiplication

 print(np.dot(new_data, np.dot(model.components_.T, np.linalg.pinv(temp))))

and get a result W that contains negative numbers:

[[ 0.17328927  0.39649966]
 [ 0.1725572  -0.05780202]]

However, the coordinate descent within NMF avoids this problem by slightly modifying the matrix:

 print(model.transform(new_data))

gives a non-negative result

[[0.17328951 0.39649958]
 [0.16462405 0.        ]]

You can see that it does not simply clip the W matrix from below, but also modifies the positive elements in order to improve the fit (and obey the regularization penalties).
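
A quick way to see this (again a sketch, assuming model and new_data from the question are in scope): compare the reconstruction error of the clipped pseudo-inverse solution with that of the W returned by transform(). The coordinate-descent solution should give an error that is at least as small:

H = model.components_
W_pinv = np.dot(new_data, np.dot(H.T, np.linalg.pinv(np.dot(H, H.T))))
W_clipped = np.clip(W_pinv, 0, None)     # naive fix: just zero out the negatives
W_nmf = model.transform(new_data)

print(np.linalg.norm(new_data - np.dot(W_clipped, H)))   # clipped projection
print(np.linalg.norm(new_data - np.dot(W_nmf, H)))       # NMF transform: no larger (up to tolerance)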

Peasant answered 20/3, 2018 at 11:46 Comment(9)
It seems that you have W and H reversed in your answer vs. the notation used in scikit-learn. Also, during the transform/encoding, the error minimized is between X_new (not X_original) and the product of W and H. Isn't that right?Coleencolella
Yes, they were reversed; fixed that. No, during the encoding the loss is minimized between X_original and X_new, where X_new exactly equals dot(W, H).Peasant
X_original and X_new need not match along the first dimension (i.e., the number of samples). So how exactly can the loss between them be minimized?Coleencolella
In NMF, isn't the non-negativity constraint only on W and H (i.e., the latent factors), or are the transformed and inverse-transformed/reconstructed matrices also required to be non-negative?Coleencolella
Sorry, I misunderstood your notation. By X_original I mean the matrix that is plugged into the transform() method, and it has nothing to do with the matrix that was used to train the model (to calculate W).Peasant
H IS THE transformed matrix! So the non-negativity constraint applies to it as well.Peasant
I suppose that W is the transformed matrix and H is the latent representation for the features.Coleencolella
Oh, yes. I started from the opposite meaning of H and W. Nevertheless, one of them is the latent representation and the other is the transformed matrix. Both of them must be non-negative. The latent representation is calculated during the training stage. The transformed matrix is recalculated non-linearly for every input matrix.Peasant
Let us continue this discussion in chat.Coleencolella
