User profiling with Mahout from categorized user behavior

I'm trying to cluster and classify users with Mahout. I'm still at the planning phase, my head is full of competing ideas, and since I'm relatively new to the area I'm stuck on how to format the data.

Let's say we have two (fairly large) data tables. The first table holds users and their actions. Every user has at least one action, and some users have a great many. There are about 10000 distinct user_actions and millions of records in the table.

user        - user_action
u1          - a
u2          - b
u3          - a
u1          - c
u2          - c
u2          - c
u1          - b
u4          - f
u4          - e
u1          - e
u1          - d
u5          - d

The other table holds action categories. Every action may have zero, one, or several categories. There are 60 categories.

user_action - category
a           - cat1
b           - cat2
c           - cat1
d           - NULL
e           - cat1, cat3
f           - cat4

I'm going to try to build a user classification model with Mahout but I've no idea what I should do. What type of user vectors should I create? Or do I really need user vectors?

I think I need to create something like this:

u1 (a, c, b, e, d)
u2 (b, c, c)
u3 (a)
u4 (f, e)
u5 (d)

The problem here is that some users have performed more than 100000 actions (some of them repeats of the same action).

So I think this is more useful:

u1 (cat1, cat1, cat2, cat1, cat3)
u2 (cat2, cat1, cat1)
u3 (cat1)
u4 (cat4, cat1, cat3)
u5 ()

The things I'm also worried about are:

How should I weight categories for users? For example, u1 has at least three actions related to cat1, while u3 has only one. Should these two be treated differently?
How can I reduce the difference between active users and passive ones? For example, u1 has many actions and therefore many categories, while u3 has only one.

Any guidance is welcome.

I would create one row per user, as you are doing, with one column for each category; that gives 60 columns if I understand your example correctly. Each value would be the number of times that category was seen for the user, so it ranges from 0 up to the largest count observed. The result is 60 numbers per user, most of them 0.
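
As a rough illustration of that layout, here is a minimal sketch in plain Python (not Mahout code) that turns the question's two toy tables into one count vector per user. The function name `category_counts` and the four-category column list are assumptions made for the example; the real data would have 60 columns.

```python
from collections import Counter, defaultdict

# Toy data copied from the question's two tables.
user_actions = [
    ("u1", "a"), ("u2", "b"), ("u3", "a"), ("u1", "c"),
    ("u2", "c"), ("u2", "c"), ("u1", "b"), ("u4", "f"),
    ("u4", "e"), ("u1", "e"), ("u1", "d"), ("u5", "d"),
]
action_categories = {
    "a": ["cat1"], "b": ["cat2"], "c": ["cat1"],
    "d": [],  # NULL in the table: no category
    "e": ["cat1", "cat3"], "f": ["cat4"],
}
categories = ["cat1", "cat2", "cat3", "cat4"]  # 60 in the real data


def category_counts(user_actions, action_categories, categories):
    """Return {user: [count per category]}, one column per category."""
    counts = defaultdict(Counter)
    for user, action in user_actions:
        # Touching counts[user] also registers users whose actions
        # have no category, so they still get an all-zero row.
        counts[user].update(action_categories.get(action, []))
    return {u: [c[cat] for cat in categories] for u, c in counts.items()}


vectors = category_counts(user_actions, action_categories, categories)
```

With this toy data u1 ends up with three hits on cat1 and one each on cat2 and cat3, while u5, whose only action has no category, gets a row of zeros.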

It might be necessary to perform some sort of normalisation on the rows. By analogy with what is done to produce document vectors in text mining, something like term frequency normalisation could be applied to the row. Each column might also require normalising.
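
One way to read the two normalisation steps, sketched in plain Python under stated assumptions: rows are scaled to sum to 1 (term frequency), and columns are weighted with an IDF-style factor borrowed from the text-mining analogy. The IDF weighting is only one possible choice for the column step, and the helper names are made up for the example. The counts are the ones that follow from the question's toy tables.

```python
import math

# Raw per-user category counts, in the order (cat1, cat2, cat3, cat4),
# as derived from the question's toy tables.
counts = {
    "u1": [3, 1, 1, 0],
    "u2": [2, 1, 0, 0],
    "u3": [1, 0, 0, 0],
    "u4": [1, 0, 1, 1],
    "u5": [0, 0, 0, 0],
}


def term_frequency(row):
    """Scale a row so its entries sum to 1; all-zero rows stay all-zero."""
    total = sum(row)
    return [v / total if total else 0.0 for v in row]


def inverse_user_frequency(rows):
    """IDF-style column weights: log(N / number of users with the category)."""
    n_users = len(rows)
    n_cols = len(next(iter(rows.values())))
    weights = []
    for j in range(n_cols):
        seen = sum(1 for row in rows.values() if row[j] > 0)
        weights.append(math.log(n_users / seen) if seen else 0.0)
    return weights


tf = {user: term_frequency(row) for user, row in counts.items()}
idf = inverse_user_frequency(counts)
tf_idf = {user: [v * w for v, w in zip(row, idf)] for user, row in tf.items()}
```

Row normalisation speaks to the active-versus-passive concern: u1 and u3 are both described by shares that sum to 1, regardless of how many actions each performed. The column weights then play down cat1, which four of the five users have, relative to cat4, which only one user has.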

From here, clustering could be performed using your algorithm of choice with clustering validity measures to help guide your choice of the most interesting clusterings.
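
To make the "cluster, then score the clustering" loop concrete, here is a small self-contained sketch in plain Python. It is a stand-in, not Mahout's API: Mahout ships its own clustering implementations, and on real data you would use those. The sketch assumes k-means as the algorithm and the mean silhouette coefficient as the validity measure; both are choices made for illustration, and the input is the row-normalised form of the question's toy data.

```python
import math
import random

# Row-normalised category shares for u1..u5, in the order
# (cat1, cat2, cat3, cat4), derived from the question's toy tables.
users = ["u1", "u2", "u3", "u4", "u5"]
points = [
    [0.6, 0.2, 0.2, 0.0],
    [2 / 3, 1 / 3, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1 / 3, 0.0, 1 / 3, 1 / 3],
    [0.0, 0.0, 0.0, 0.0],
]


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centroids = random.Random(seed).sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid wins.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient in [-1, 1]; higher is better."""
    clusters = set(labels)
    if len(clusters) < 2:
        return 0.0  # undefined for a single cluster; report 0
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if j != i and labels[j] == labels[i]]
        if not own:  # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Compare a few candidate values of k by their silhouette score.
scores = {k: silhouette(points, kmeans(points, k)) for k in (2, 3)}
```

Five users are far too few for the scores to mean much; the point is only the shape of the loop: run the algorithm for several settings, score each result, and look more closely at the settings that score well.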

By its nature this is an iterative process: you would probably have to repeat it several times, perhaps representing the input data in new ways.
