I have three DB models (from Django) that can be used as the input for building a recommendation system:
- Users List - with `userId`, `username`, `email`, etc.
- Movies List - with `movieId`, `movieTitle`, `Topics`, etc.
- Saves List - with `userId`, `movieId` and `timestamp` (the current recommendation system will be a little simpler than the usual approaches found online, in the sense that there is no rating score, just the fact that the user has saved a certain movie, and this model contains all the movie saves)
I should still be able to use matrix factorization (MF) for building a recommendation system, even though the rating of a certain item will just be in the form of `1` and `0` (saved or not saved).
In order to use all the MF algorithms found in either `scipy` or `surprise`, I have to create a `pandas` DataFrame and pivot it such that all userIds will be the rows (indexes) and all movieIds will be the columns.
A code snippet for doing this is:
    import numpy
    import pandas

    # usersSet and moviesSet contain only ids of users or movies
    zeros = numpy.zeros(shape=(len(usersSet), len(moviesSet)), dtype=numpy.int8)
    saves_df = pandas.DataFrame(zeros, index=list(usersSet), columns=list(moviesSet))

    # mark every (user, movie) pair that appears in the saves with a 1
    for save in savesFromDb.iterator(chunk_size=50000):
        userId = save['user__id']
        movieId = save['movie__id']
        saves_df.at[userId, movieId] = 1
Problems so far:
- using `DataFrame.loc` from `pandas` to assign values to multiple columns instead of `DataFrame.at` gives a MemoryError. This is why I went for the latter method.
- using `svds` from `scipy` for MF requires floats or doubles as the values of the DataFrame, and as soon as I do `DataFrame.asfptype()` I get a MemoryError
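To put rough numbers on these MemoryErrors, here is a back-of-the-envelope estimate of the dense matrix size. It assumes the approximate counts from my data (~100k users, ~120k movies, ~450k saves); the variable names are only for illustration.

```python
import numpy as np

# Approximate counts (assumed, rounded)
n_users, n_movies, n_saves = 100_000, 120_000, 450_000

n_cells = n_users * n_movies

# Size of the dense user x movie matrix for two dtypes, in GB (1e9 bytes)
dense_int8_gb = n_cells * np.dtype(np.int8).itemsize / 1e9
dense_float64_gb = n_cells * np.dtype(np.float64).itemsize / 1e9

# Fraction of cells that would actually hold a 1
density = n_saves / n_cells

print(f"int8 dense:    {dense_int8_gb:.0f} GB")
print(f"float64 dense: {dense_float64_gb:.0f} GB")
print(f"density:       {density:.6%}")
```

So the `int8` frame alone is already about 12 GB, and converting it to doubles would need roughly 96 GB, while only a tiny fraction of the cells are non-zero.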
Questions:
- Given that there are ~100k users, ~120k movies and ~450k saves, what's the best approach to model this in order to use recommendation algorithms but still not get a MemoryError?
- I also tried using `DataFrame.pivot()`, but is there a way to build it from 3 different DataFrames? i.e. `indexes` will be from `list(usersSet)`, `columns` from `list(moviesList)` and `values` by iterating over `savesFromDb` and seeing where there is a userId -> movieId relationship and adding `1` in the pivot.
- Aside from `surprise`'s `rating_scale` parameter where you can define the rating (in my case would be `(0, 1)`), is there any other way in terms of algorithm approach or data model structure to leverage the fact that the rating in my case is only `1` or `0` (saved or not saved)?
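One direction I am considering for the first two questions is to skip the dense DataFrame and build a `scipy.sparse` matrix straight from the (userId, movieId) pairs. Below is a minimal sketch on toy data, not something I have run on the real tables. The names `users_set`, `movies_set` and `saves` are hypothetical stand-ins for `usersSet`, `moviesSet` and the rows of `savesFromDb`, and it assumes each (user, movie) pair is saved at most once.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import svds

# Hypothetical toy data standing in for usersSet, moviesSet and savesFromDb
users_set = {10, 11, 12, 13}
movies_set = {500, 501, 502, 503, 504}
saves = [
    {"user__id": 10, "movie__id": 500},
    {"user__id": 10, "movie__id": 502},
    {"user__id": 11, "movie__id": 500},
    {"user__id": 12, "movie__id": 503},
    {"user__id": 13, "movie__id": 504},
    {"user__id": 13, "movie__id": 502},
]

# Map raw ids to contiguous row / column positions
user_pos = {uid: i for i, uid in enumerate(sorted(users_set))}
movie_pos = {mid: j for j, mid in enumerate(sorted(movies_set))}

rows = np.array([user_pos[s["user__id"]] for s in saves])
cols = np.array([movie_pos[s["movie__id"]] for s in saves])
vals = np.ones(len(saves), dtype=np.float64)  # svds needs a float dtype

# Only the non-zero entries are stored; duplicate pairs would be summed
saves_matrix = coo_matrix(
    (vals, (rows, cols)), shape=(len(user_pos), len(movie_pos))
).tocsr()

# Truncated SVD on the sparse matrix; k must be smaller than min(shape)
k = 2
u, s, vt = svds(saves_matrix, k=k)

# Predicted scores for a single user, without densifying the whole matrix
scores_user_10 = (u[user_pos[10]] * s) @ vt
```

With ~450k saves, the stored data should be a few megabytes instead of gigabytes, since memory grows with the number of saves rather than with users x movies. Whether plain SVD is a good fit for 0/1 data is exactly what the last question is about.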