How to deal with missing values in Python scikit-learn NMF

I am trying to apply NMF to my dataset using Python's scikit-learn. The dataset contains both zero values and missing values, but scikit-learn does not allow NaN values in the data matrix (a minimal reproduction is shown below the questions). Some posts suggest replacing the missing values with zeros.

My questions are:

  • If I replace missing values with zeros, how can the algorithm tell missing values apart from real zero values?

  • Are there other NMF implementations that can deal with missing values?

  • Or are there other matrix factorization algorithms that can do missing-value prediction?
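
For reference, this is the constraint I mean (toy data; the exact error message depends on the scikit-learn version):

import numpy as np
from sklearn.decomposition import NMF

X = np.array([[1.0, np.nan],
              [2.0, 3.0]])
NMF(n_components=1).fit(X)  # raises ValueError: input contains NaN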

Acronym answered 7/9, 2016 at 10:37 Comment(4)
The replacement of missing values with zero (or the column mean, or the row mean, or ...) is not known to the classifier; it will treat these numbers like any other, which might be okay (we are always assuming a low-rank model exists with these methods). / In general I would say that missing-value prediction is a harder problem (which needs stronger assumptions) than finding a low-rank factorization of a matrix without missing values. As an alternative: write an SGD-based optimizer for some common NMF problem (and you can sample from the known values only)Pisci
Thanks, it seems ignoring missing values when applying SGD is the solution.Acronym
Facing the same problem. Have you written your own SGD implementation? If yes, how is it performing? So far I have not been able to achieve anything that performs similarly to NMF.Dyne
@Dyne Yes, I have tried my own SGD implementation. It has performance similar to the sklearn implementation, but is much slower.Acronym

There is a thread about this on the scikit-learn GitHub, and a version seems to be available but not yet committed to the main code.

https://github.com/scikit-learn/scikit-learn/pull/8474

Mithras answered 25/10, 2017 at 20:11 Comment(1)
Please add more info. Links expire.Cermet

SGD will do the job here, but scikit-learn does not provide an implementation that can be applied to this task. Writing your own will do the job, but will be really slow, since matrix-factorization SGD cannot be directly parallelised. Check the distributed SGD algorithm described here. It is not that hard to implement, and it speeds things up significantly.
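
As a rough illustration, an SGD loop of this kind only ever visits the observed entries, so imputation is unnecessary and zeros are never confused with gaps. Below is a minimal single-threaded sketch (the function sgd_mf and its hyper-parameters are illustrative, not from any library), with a projection step to keep the factors non-negative:

import numpy as np

def sgd_mf(X, k=2, lr=0.01, reg=0.01, epochs=500, seed=0):
    """Projected SGD for X ~ W @ H, trained on non-NaN entries only."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k))
    H = rng.random((k, m))
    obs = np.argwhere(~np.isnan(X))   # indices of the known entries
    for _ in range(epochs):
        rng.shuffle(obs)              # visit known entries in random order
        for i, j in obs:
            err = X[i, j] - W[i] @ H[:, j]
            w_old = W[i].copy()
            W[i] += lr * (err * H[:, j] - reg * W[i])
            H[:, j] += lr * (err * w_old - reg * H[:, j])
        np.maximum(W, 0.0, out=W)     # project onto the non-negative orthant
        np.maximum(H, 0.0, out=H)
    return W, H

# usage: the factors are fit on known ratings only, and W @ H predicts the gaps
# W, H = sgd_mf(ratings, k=2)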

Dyne answered 31/3, 2017 at 6:52 Comment(1)
The link seems broken. Is this the same one as your original? citeseerx.ist.psu.edu/viewdoc/…Thitherto

For some use cases, it may be useful to replace the missing ratings with some sort of average value (e.g., the average of nearest neighbors) prior to using sklearn.decomposition.NMF (which does not accept NaN values).
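
For the nearest-neighbor variant, one option is scikit-learn's sklearn.impute.KNNImputer (available since scikit-learn 0.22); a minimal, self-contained sketch on a toy matrix:

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[5.0, np.nan, 1.0],
              [4.0, 2.0, np.nan],
              [np.nan, 2.5, 1.5]])
# each NaN is filled from the 2 most similar rows (nan-aware Euclidean distance)
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)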

For example, consider item-item collaborative filtering as a low-rank matrix-factorization problem, to be solved using non-negative matrix factorization.

The next code snippet

  • first randomly generates a toy user-item ratings dataset with missing entries,
  • then imputes the missing ratings for each item using the item's average rating,
  • and finally reconstructs the matrix using NMF.

It achieves a low RMSE on the reconstruction.

import numpy as np
from sklearn import decomposition

nusers = 10
nitems = 5

# generate an nusers x nitems matrix of ratings in 1..5
np.random.seed(1)
ratings = np.random.choice(range(1, 6), nusers*nitems).reshape(nusers, nitems).astype(np.float64)
# mark ~60% of the entries as missing
missing = np.random.choice(range(0, 2), nusers*nitems, p=[0.4, 0.6]).reshape(nusers, nitems).astype(bool)
ratings[missing] = np.nan

ratings  # user-item rating matrix with missing entries
#array([[nan, nan, nan,  2., nan],
#       [ 1., nan, nan,  5.,  5.],
#       [ 2.,  3., nan,  3.,  5.],
#       [nan,  5., nan,  5., nan],
#       [nan,  2., nan, nan, nan],
#       [ 2., nan, nan, nan, nan],
#       [nan, nan,  1.,  4., nan],
#       [ 2.,  1., nan,  2., nan],
#       [nan, nan, nan, nan,  4.],
#       [ 5., nan, nan, nan, nan]])

Now fill the missing values with mean item ratings

# fill NaN values with mean item ratings
mean_item_rating = np.nanmean(ratings, axis=0)
inds = np.where(np.isnan(ratings))
ratings[inds] = np.take(mean_item_rating, inds[1])

Fit NMF model on the imputed ratings and reconstruct.

k = 2  # number of latent features
nmf = decomposition.NMF(
    n_components=k,
    random_state=0,
    init="nndsvda",
    beta_loss="frobenius",
    alpha_W=0.001,
    alpha_H=0.001,
)

W1 = nmf.fit_transform(ratings)
H1 = nmf.components_
# nmf.reconstruction_err_
pred = W1 @ H1

Here is the reconstructed matrix:

pred # reconstructed ratings matrix with nmf
#array([[2.7538703 , 2.21530206, 0.91599971, 2.74505676, 4.24230709],
#       [1.25524759, 3.59251295, 0.97033684, 4.66032356, 4.61479129],
#       [2.35280766, 2.78126364, 0.97279116, 3.51816793, 4.54689907],
#       [1.84354306, 3.98096118, 1.14786757, 5.13481057, 5.4330296 ],
#       [2.53186319, 2.57236077, 0.95680549, 3.23078732, 4.45635212],
#       [2.15528539, 2.83261418, 0.95209144, 3.59878839, 4.46086756],
#       [2.33351068, 2.97179146, 1.01047576, 3.77090889, 4.73106912],
#       [2.68816785, 1.77452768, 0.81111479, 2.16752937, 3.73840089],
#       [2.19156078, 2.64733105, 0.91825346, 3.3518664 , 4.29411891],
#       [4.28014936, 2.45802613, 1.21283129, 2.96621938, 5.57095048]])

Finally, compare against the initial ratings matrix with its missing values.

pred[missing] = np.nan  # mask the reconstructed missing ratings, for ease of comparison
np.sqrt(np.sum((ratings[~missing] - pred[~missing])**2) / np.sum(~missing))  # rmse
# 0.49721783850318846

pred
#array([[       nan,        nan,        nan, 2.74505676,        nan],
#       [1.25524759,        nan,        nan, 4.66032356, 4.61479129],
#       [2.35280766, 2.78126364,        nan, 3.51816793, 4.54689907],
#       [       nan, 3.98096118,        nan, 5.13481057,        nan],
#       [       nan, 2.57236077,        nan,        nan,        nan],
#       [2.15528539,        nan,        nan,        nan,        nan],
#       [       nan,        nan, 1.01047576, 3.77090889,        nan],
#       [2.68816785, 1.77452768,        nan, 2.16752937,        nan],
#       [       nan,        nan,        nan,        nan, 4.29411891],
#       [4.28014936,        nan,        nan,        nan,        nan]])

Comparing the matrix ratings with the reconstructed matrix pred on the entries that were already present shows that they are pretty close (low RMSE).
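
One caveat: the RMSE above is computed on the same observed entries the model was fit on. A stricter check is to hide some known ratings before imputing and fitting, and score only those held-out cells. A sketch reusing the variables defined above (the 20% split is an arbitrary choice):

# hold out ~20% of the known ratings for evaluation
rng = np.random.default_rng(42)
known = np.argwhere(~missing)
test = known[rng.choice(len(known), size=len(known) // 5, replace=False)]

train = ratings.copy()
train[missing] = np.nan                 # restore the original gaps
train[test[:, 0], test[:, 1]] = np.nan  # hide the held-out ratings too

# impute item means over the remaining known ratings and refit
# (assumes every item keeps at least one training rating)
item_means = np.nanmean(train, axis=0)
nan_inds = np.where(np.isnan(train))
train[nan_inds] = np.take(item_means, nan_inds[1])

W2 = nmf.fit_transform(train)
pred2 = W2 @ nmf.components_
held_out_rmse = np.sqrt(np.mean(
    (ratings[test[:, 0], test[:, 1]] - pred2[test[:, 0], test[:, 1]]) ** 2))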

Bloodandthunder answered 9/10, 2023 at 0:8 Comment(0)
