I'm trying to cluster and classify users with Mahout. I'm still in the planning phase, my head is full of competing ideas, and since I'm relatively new to the area, I'm stuck on how to format the data.
Let's say we have two large data tables. The first table holds users and their actions. Every user has at least one action, and some users have a very large number of them. The table contains millions of records covering about 10000 distinct user_actions.
user - user_action
u1 - a
u2 - b
u3 - a
u1 - c
u2 - c
u2 - c
u1 - b
u4 - f
u4 - e
u1 - e
u1 - d
u5 - d
The other table holds action categories. An action may have no category, one category, or several. There are 60 categories.
user_action - category
a - cat1
b - cat2
c - cat1
d - NULL
e - cat1, cat3
f - cat4
I'm going to try to build a user classification model with Mahout but I've no idea what I should do. What type of user vectors should I create? Or do I really need user vectors?
I think I need to create something like this:
u1 (a, c, b, e, d)
u2 (b, c, c)
u3 (a)
u4 (f, e)
u5 (d)
The problem here is that some users have performed more than 100000 actions (some of them repeats of the same action).
So I think this is more useful:
u1 (cat1, cat1, cat2, cat1, cat3)
u2 (cat2, cat1, cat1)
u3 (cat1)
u4 (cat4, cat1, cat3)
u5 ()
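To make the category vectors concrete, here is a minimal sketch of how I imagine building them from the two tables. It is plain Python, not Mahout code. The names `user_actions`, `action_categories`, and `category_counts` are made up for illustration, and I'm assuming that repeated categories are collapsed into counts and that actions without a category contribute nothing.

```python
from collections import Counter, defaultdict

# Toy copies of the two tables above (hypothetical variable names).
user_actions = [
    ("u1", "a"), ("u2", "b"), ("u3", "a"), ("u1", "c"),
    ("u2", "c"), ("u2", "c"), ("u1", "b"), ("u4", "f"),
    ("u4", "e"), ("u1", "e"), ("u1", "d"), ("u5", "d"),
]
action_categories = {
    "a": ["cat1"], "b": ["cat2"], "c": ["cat1"],
    "d": [],  # NULL in the table: no category
    "e": ["cat1", "cat3"], "f": ["cat4"],
}

def category_counts(user_actions, action_categories):
    """Return {user: Counter({category: count})}, one sparse vector per user."""
    counts = defaultdict(Counter)
    for user, action in user_actions:
        counts[user]  # touch the key so users with only uncategorized actions still appear
        for category in action_categories.get(action, []):
            counts[user][category] += 1
    return dict(counts)

vectors = category_counts(user_actions, action_categories)
for user in sorted(vectors):
    print(user, dict(vectors[user]))
```

Stored this way, a user with 100000 actions still has at most 60 non-zero entries, one per category.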
I'm also worried about the following:
- How should I weight categories for users? For example, u1 has three actions related to cat1, while u3 has only one. Should these be treated differently?
- How can I reduce the gap between active and passive users? For example, u1 has many actions and therefore many categories, while u3 has only one.
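Two ideas I've come across for these questions are scaling each user's counts to proportions (so every vector sums to 1) and damping raw counts with a logarithm. Below is a minimal plain-Python sketch of both, not Mahout's API. The function names `to_proportions` and `log_damped` are hypothetical, and I don't know yet whether either choice is appropriate for my data.

```python
import math

def to_proportions(counts):
    """Scale category counts so they sum to 1 (an empty vector stays empty)."""
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}

def log_damped(counts):
    """Replace each count n with log(1 + n) so very active users grow slowly."""
    return {cat: math.log1p(n) for cat, n in counts.items()}

# Category counts for u1 and u3 from the example above.
u1 = {"cat1": 3, "cat2": 1, "cat3": 1}
u3 = {"cat1": 1}

print(to_proportions(u1))  # cat1 -> 0.6, cat2 -> 0.2, cat3 -> 0.2
print(to_proportions(u3))  # cat1 -> 1.0
print(log_damped(u1))
```

With proportions, u1 and u3 end up on the same scale, but u3's single action then counts as 100% cat1, which may overstate how much is known about a passive user. The log version keeps some of the difference in activity while shrinking it.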
Any guidance is welcome.