For some use cases, it may be useful to replace the missing ratings with some sort of average value (e.g., the average over nearest neighbors) before using sklearn.decomposition.NMF, which does not accept NaN values.
For example, consider item-item collaborative filtering as a low-rank matrix factorization problem, to be solved using non-negative matrix factorization (NMF).
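To make the factorization concrete, here is the objective written out. This is a sketch that assumes scikit-learn's default settings for NMF (Frobenius loss, no regularization). The symbol R for the imputed ratings matrix is introduced here for illustration; W and H correspond to W1 and H1 in the code, and k is the number of latent features.

```latex
\min_{W \ge 0,\; H \ge 0} \; \tfrac{1}{2}\,\lVert R - W H \rVert_F^2,
\qquad
R \in \mathbb{R}^{n_{\text{users}} \times n_{\text{items}}},\;
W \in \mathbb{R}^{n_{\text{users}} \times k},\;
H \in \mathbb{R}^{k \times n_{\text{items}}},
\qquad
\hat{R} = W H .
```

The reconstructed ratings \hat{R} are what the code calls pred.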
The next code snippet:
- first randomly generates a toy user-item ratings dataset with missing entries;
- next imputes the missing ratings for each item using the average rating for that item;
- then fits NMF on the imputed matrix and reconstructs it.
The reconstruction achieves a low RMSE on the ratings that were originally present.
import numpy as np
from sklearn import decomposition

nusers = 10
nitems = 5
# generate nusers x nitems matrix of integer ratings in 1..5
ratings = np.random.choice(range(1,6), nusers*nitems).reshape(nusers, nitems).astype(np.float64)
# mark each entry as missing with probability 0.6
missing = np.random.choice(range(0,2), nusers*nitems, p=[0.4,0.6]).reshape(nusers, nitems).astype(bool)
ratings[missing] = np.nan
ratings # user-item rating matrix with missing entries
#array([[nan, nan, nan, 2., nan],
# [ 1., nan, nan, 5., 5.],
# [ 2., 3., nan, 3., 5.],
# [nan, 5., nan, 5., nan],
# [nan, 2., nan, nan, nan],
# [ 2., nan, nan, nan, nan],
# [nan, nan, 1., 4., nan],
# [ 2., 1., nan, 2., nan],
# [nan, nan, nan, nan, 4.],
# [ 5., nan, nan, nan, nan]])
Now fill the missing values with the mean item ratings:
# fill NaN values with mean item ratings
mean_item_rating = np.nanmean(ratings, axis=0)
inds = np.where(np.isnan(ratings))
ratings[inds] = np.take(mean_item_rating, inds[1])
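The item-mean fill above can be checked by hand on a tiny matrix. The following is a minimal sketch with made-up values; the names toy, item_mean and toy_filled are hypothetical. It uses np.where with broadcasting instead of np.take, which gives the same result and leaves the input unmodified.

```python
import numpy as np

# Tiny 3x2 ratings matrix (rows: users, columns: items); values are made up.
toy = np.array([[1.0, np.nan],
                [3.0, 4.0],
                [np.nan, 2.0]])

# Per-item (column) mean, ignoring NaN entries.
item_mean = np.nanmean(toy, axis=0)

# Broadcast the column means into the NaN positions; `toy` itself is unchanged.
toy_filled = np.where(np.isnan(toy), item_mean, toy)

print(item_mean)    # [2. 3.]
print(toy_filled)   # rows: [1. 3.], [3. 4.], [2. 2.]
```

Note that an item with no ratings at all has a NaN mean, so its column would stay NaN and NMF would still reject the matrix.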
Fit the NMF model on the imputed ratings and reconstruct.
k = 2 # number of latent features
nmf = decomposition.NMF(
    n_components = k,
    init = "nndsvda",
)
W1 = nmf.fit_transform(ratings)
H1 = nmf.components_
# nmf.reconstruction_err_
pred = W1 @ H1
Here is the reconstructed matrix:
pred # reconstructed ratings matrix with nmf
#array([[2.7538703 , 2.21530206, 0.91599971, 2.74505676, 4.24230709],
# [1.25524759, 3.59251295, 0.97033684, 4.66032356, 4.61479129],
# [2.35280766, 2.78126364, 0.97279116, 3.51816793, 4.54689907],
# [1.84354306, 3.98096118, 1.14786757, 5.13481057, 5.4330296 ],
# [2.53186319, 2.57236077, 0.95680549, 3.23078732, 4.45635212],
# [2.15528539, 2.83261418, 0.95209144, 3.59878839, 4.46086756],
# [2.33351068, 2.97179146, 1.01047576, 3.77090889, 4.73106912],
# [2.68816785, 1.77452768, 0.81111479, 2.16752937, 3.73840089],
# [2.19156078, 2.64733105, 0.91825346, 3.3518664 , 4.29411891],
# [4.28014936, 2.45802613, 1.21283129, 2.96621938, 5.57095048]])
Finally, compare the reconstruction with the initial ratings matrix on the entries that were originally present.
pred[missing] = np.nan # mask the reconstructed missing ratings, for the ease of comparison
np.sqrt(np.sum((ratings[~missing] - pred[~missing])**2) / (np.sum(~missing))) # rmse
# 0.49721783850318846
pred # reconstructed ratings, masked at the originally missing entries
#array([[ nan, nan, nan, 2.74505676, nan],
# [1.25524759, nan, nan, 4.66032356, 4.61479129],
# [2.35280766, 2.78126364, nan, 3.51816793, 4.54689907],
# [ nan, 3.98096118, nan, 5.13481057, nan],
# [ nan, 2.57236077, nan, nan, nan],
# [2.15528539, nan, nan, nan, nan],
# [ nan, nan, 1.01047576, 3.77090889, nan],
# [2.68816785, 1.77452768, nan, 2.16752937, nan],
# [ nan, nan, nan, nan, 4.29411891],
# [4.28014936, nan, nan, nan, nan]])
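The RMSE expression used above can be wrapped in a small helper so the mask is explicit. This is a sketch; the function name masked_rmse and the toy arrays are hypothetical and not part of the original code.

```python
import numpy as np

def masked_rmse(actual, predicted, observed):
    """RMSE computed only over the entries where `observed` is True."""
    diff = actual[observed] - predicted[observed]
    return float(np.sqrt(np.mean(diff ** 2)))

# Made-up 2x2 example: the bottom-right entry is treated as unobserved.
actual = np.array([[1.0, 2.0], [3.0, 4.0]])
predicted = np.array([[1.0, 4.0], [3.0, 0.0]])
observed = np.array([[True, True], [True, False]])

# Errors on the observed entries are 0, -2 and 0, so the RMSE is sqrt(4/3).
print(masked_rmse(actual, predicted, observed))
```

With the variables from this answer, the call would be masked_rmse(ratings, pred, ~missing). Because the score only covers ratings the model was fitted on, it measures reconstruction quality, not accuracy on the missing ratings.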
Comparing the matrix ratings with the reconstructed matrix pred on the ratings that were already present shows that they are pretty close (with low RMSE).
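The nearest-neighbor averaging mentioned at the start can be tried in place of the item-mean fill. The following is a rough sketch that uses scikit-learn's KNNImputer. The 30% missing rate, the seed, n_neighbors=3, max_iter=1000 and all variable names are assumptions chosen for illustration, not values from the example above.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
nusers, nitems, k = 10, 5, 2

# Toy ratings in 1..5 with roughly 30% of the entries hidden (assumed rate).
full = rng.integers(1, 6, size=(nusers, nitems)).astype(np.float64)
missing_knn = rng.random((nusers, nitems)) < 0.3
missing_knn[0, :] = False  # keep one fully rated user so every item has a rating
ratings_knn = full.copy()
ratings_knn[missing_knn] = np.nan

# Each missing rating becomes the mean of that item's ratings among the
# most similar users (NaN-aware Euclidean distance between rows).
imputed = KNNImputer(n_neighbors=3).fit_transform(ratings_knn)

# Factorize the imputed matrix and reconstruct it.
model = NMF(n_components=k, init="nndsvda", max_iter=1000)
pred_knn = model.fit_transform(imputed) @ model.components_

# RMSE restricted to the ratings that were actually observed.
observed = ~missing_knn
rmse_knn = np.sqrt(np.mean((ratings_knn[observed] - pred_knn[observed]) ** 2))
print(rmse_knn)
```

Imputed values are averages of ratings between 1 and 5, so the matrix stays non-negative as NMF requires. Whether this beats the item-mean fill depends on the data, so it is worth comparing both on held-out ratings.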