ALS algorithm in Dask optimization

I am trying to implement the ALS algorithm in Dask, but I am having trouble figuring out how to compute the latent features in one step. I followed the formulas in this stackoverflow thread and came up with this code:

    Items = da.linalg.lstsq(da.add(da.dot(Users, Users.T), lambda_ * da.eye(n_factors)), 
                            da.dot(Users, X))[0].T.compute()
    Items = np.where(Items < 0, 0, Items)

    Users = da.linalg.lstsq(da.add(da.dot(Items.T, Items), lambda_ * da.eye(n_factors)), 
                            da.dot(Items.T, X.T))[0].compute()
    Users = np.where(Users < 0, 0, Users)

But I don't think this works correctly, because the MSE is not decreasing.

Example input:

n_factors = 2
lambda_ = 0.1
# We have 6 users and 4 items

The matrices X_train (6x4), R (4x6), Users (2x6) and Items (4x2) look like:

1  0  0  0  5  2        1 0 0 0    0.8  1.3     1.1  0.2  4.1  1.6
0  0  0  0  4  0        0 0 1 1    3.9  4.3     3.5  2.7  4.3  0.5
0  3  0  0  4  0        0 0 0 0    2.9  1.5
0  3  0  0  0  0        0 0 0 0    0.2  4.7
                        1 1 1 0    0.9  1.1
                        1 0 0 0    4.8  3.0

EDIT: I found the problem, but I don't know how to get around it. Before the iteration starts, I set every entry of the X_train matrix that has no rating to 0.

    X_train = da.nan_to_num(X_train)

The reason is that the dot product only works on numeric values. But because the matrix is very sparse, 90% of it now consists of zeros, and instead of fitting the real ratings in the matrix, the algorithm fits these zeros.

Any help would be highly appreciated. <3

Baines answered 22/5, 2021 at 14:56 Comment(3)
You might be able to overcome this issue by trying a RANSAC approach instead of vanilla least squares; however, I am not aware of how this modification might impact the overall result of the ALS method. - Syreetasyria
Carefully selecting RANSAC's parameters will help you treat these zero entries as outliers, thereby reducing their effect on your least squares fitting steps. - Syreetasyria
@Syreetasyria if I understand correctly, RANSAC tries to select a set of "inliers" to find the optimal fitting result. But the problem with the user-item matrix is not selecting which values to use, because I already have them specified. - Baines

One way to handle gaps or missing values in data sets is to use masked arrays. As of May 2017 Dask also supports them.

Defining a masked array in Dask is fairly simple and similar to NumPy's. All supported functions are listed in the docs; here are just some of the most commonly used approaches:

    import dask.array as da

    data_set = da.array([[1, 2], [3, 4]])

    masked_data_set_1 = da.ma.masked_array(data_set, mask=[[False, True], [True, False]])
    # returns [[1, --], [--, 4]]

    masked_data_set_2 = da.ma.masked_equal(data_set, 4)
    # returns [[1, 2], [3, --]]

    masked_data_set_3 = da.ma.masked_where(data_set < 3, data_set)
    # returns [[--, --], [3, 4]]
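
Since Dask arrays are lazy, the results noted in the comments above only appear once the array is computed. The following is a minimal sketch of that step, assuming a small in-memory array and a single chunk; the variable names `result`, `mask` and `filled` are only illustrative.

```python
import numpy as np
import dask.array as da

# Build a small Dask array from an in-memory NumPy array (one 2x2 chunk).
data_set = da.from_array(np.array([[1, 2], [3, 4]]), chunks=2)

# Mask every entry smaller than 3; nothing is evaluated yet.
masked = da.ma.masked_where(data_set < 3, data_set)

# compute() materialises the result as a numpy.ma.MaskedArray.
result = masked.compute()

# The boolean mask and a filled copy can be extracted lazily as well.
mask = da.ma.getmaskarray(masked).compute()   # [[True, True], [False, False]]
filled = da.ma.filled(masked, 0).compute()    # [[0, 0], [3, 4]]

print(result)
print(mask)
print(filled)
```

The boolean mask is often the more useful output for a ratings matrix, because it records which entries were actually observed.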

In your case, you are trying to compute the dot product da.dot(Users, X). Instead of setting all NaN values to 0, you can use a masked array:

    masked_X = da.ma.masked_where(X != X, X)

Now you can perform the dot product like this:

    da.ma.getdata(da.dot(Users, masked_X))
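
It is worth checking whether the masked dot product really keeps the missing entries out of the fit. If it does not, an alternative is to solve each user's and each item's regularised least-squares problem over the observed ratings only. Below is a minimal NumPy sketch of that idea, not a Dask implementation: the helper names `solve_factors` and `objective`, the small ratings matrix, and the layout (both factor matrices stored as rows x n_factors, unlike the question's `Users`) are assumptions made for illustration, and the non-negativity clipping from the question is left out.

```python
import numpy as np

def solve_factors(X, observed, fixed, lambda_):
    """Solve for one factor matrix while the other one (`fixed`) is held constant.

    X        : (n_rows, n_cols) ratings, NaN where no rating exists
    observed : (n_rows, n_cols) boolean mask of existing ratings
    fixed    : (n_cols, n_factors) factors of the other side
    """
    n_factors = fixed.shape[1]
    reg = lambda_ * np.eye(n_factors)
    solved = np.zeros((X.shape[0], n_factors))
    for i in range(X.shape[0]):
        cols = observed[i]        # entries that were actually rated in row i
        F = fixed[cols]           # factors of the rated columns only
        # Regularised normal equations restricted to the observed entries.
        solved[i] = np.linalg.solve(F.T @ F + reg, F.T @ X[i, cols])
    return solved

def objective(X, observed, users, items, lambda_):
    """Squared error on observed entries plus the L2 penalty."""
    resid = (X - users @ items.T)[observed]
    return np.sum(resid ** 2) + lambda_ * (np.sum(users ** 2) + np.sum(items ** 2))

# Hypothetical 6 users x 4 items ratings matrix; NaN marks a missing rating.
nan = np.nan
X = np.array([
    [1.0, nan, nan, nan],
    [nan, nan, 3.0, 3.0],
    [nan, nan, nan, nan],   # a user without any rating
    [nan, 2.0, nan, nan],
    [5.0, 4.0, 4.0, nan],
    [2.0, nan, nan, 4.0],
])
observed = ~np.isnan(X)

n_factors, lambda_ = 2, 0.1
rng = np.random.default_rng(0)
users = rng.random((X.shape[0], n_factors))
items = rng.random((X.shape[1], n_factors))

history = []
for _ in range(10):
    items = solve_factors(X.T, observed.T, users, lambda_)  # update items
    users = solve_factors(X, observed, items, lambda_)      # update users
    history.append(objective(X, observed, users, items, lambda_))

print(history)
```

Each half-step minimises the regularised objective exactly for one side, so the values in `history` should not increase between iterations. Clipping negative factors to zero after each step, as in the question, would no longer guarantee that.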
Mulholland answered 5/6, 2021 at 9:18 Comment(0)
