Computing user similarity using Mahout MapReduce

I am using Mahout clustering and I have large clusters, each with around 100k users and each user having 5 features. In the next step I need to compute the Pearson correlation to find the similarity between the users of a cluster.

Currently I have a Python script which does this for me, but as expected it takes way too long for the computation and is no longer a feasible option.

I looked at Mahout as it provides functionality to compute UserSimilarity using Pearson, Tanimoto and log-likelihood measures. What I am not able to find is a way to develop a MapReduce version of these similarity measures.

Is there any resource that walks through an example and explains how to develop a MapReduce version of UserSimilarity, or would it make sense to use Hadoop streaming with the same Java class?

EDIT

Even though I have 100k users in a cluster, I rarely need to compute a 100k * 100k matrix. Most of the time it will be a 10k * 100k matrix. However, I have around 500 such clusters, so the time consumed for computing 500 clusters of 10k * 100k is pretty long, and that is the reason I was looking for better approaches and wanted to trigger a discussion.

Noway answered 7/9, 2012 at 13:49 Comment(4)
Please share some more information on what you plan to do. Do you really need all pairwise similarities (which is an O(n^2) data set!)? – Lias
@Anony-Mousse: Edited the question and provided more details. – Noway
Well, what should be in the 10k*100k matrix that you are interested in? Computing only that part should be 10x faster... – Lias
In fact, the pairwise comparison with a large number of users can be addressed with the MinHash algorithm; the time complexity can be reduced to O(n). – Utile

Limitations of MapReduce

Pairwise user similarity is not representable in MapReduce by definition.

Recall what map does: it reads one key-value pair. It does not read two key-value pairs at once.

By definition of map, you cannot compute pairwise comparisons.

This should be obvious if you realize that map and reduce are linear processes. And pairwise similarity is a quadratic task. It cannot work.

That is, unless you abuse MapReduce. You can explode your data space to quadratic size. If you first generate all n*n pairs, then you can process these pairs with MapReduce again. But that is a bad abuse of it (although some people seem to do exactly that).

Alternatives to (abusing) MapReduce

Well, first of all, sometimes you can rewrite the problem to be actually linear, for example by inverting your data, or by doing some clever pruning so that while in theory it is quadratic, on real data it usually remains linear (because you can drop a lot of the theoretically quadratic data during or before generation).
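
As a toy illustration of the "inverting" idea (a single-process sketch with made-up sparse rating data, not Mahout's actual job): index by item instead of by user, so that partial products are only ever generated for users that actually share an item. With sparse data this stays far below the n*n pairs.

    from collections import defaultdict
    from itertools import combinations

    # user -> {item: rating}; sparse, so most user pairs share nothing
    ratings = {
        "u1": {"a": 4.0, "b": 2.0},
        "u2": {"a": 5.0, "c": 1.0},
        "u3": {"c": 3.0, "d": 4.0},
    }

    # "Invert" the data: item -> [(user, rating)]
    by_item = defaultdict(list)
    for user, items in ratings.items():
        for item, rating in items.items():
            by_item[item].append((user, rating))

    # Emit partial dot products only for users that co-rated an item (the map
    # side), then sum them per user pair (the reduce side). Pairs that share
    # nothing are never generated at all -- that is where the pruning comes from.
    dot = defaultdict(float)
    for item, users in by_item.items():
        for (u, ru), (v, rv) in combinations(sorted(users), 2):
            dot[(u, v)] += ru * rv

    print(dict(dot))  # {('u1', 'u2'): 20.0, ('u2', 'u3'): 3.0}

The same partial-products-then-sum structure is what a map-by-item job would use; the reduce step just sums the per-pair contributions (and divides by the norms if you want a normalized similarity).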

Secondly, note that Mahout builds on the Hadoop platform, but doesn't solve everything with just MapReduce. A lot of Hadoop-based tools do not do just "map + reduce". Take TeraSort, for example: since sorting is not a linear problem (it also involves comparing elements!), it cannot be solved by MapReduce alone. But you can use Hadoop for TeraSort by writing a distributed sort that uses a generalized median-of-medians approach to estimate quantiles, distributes the data according to these quantiles, and then sorts the individual buckets (in O(n log n)) on the individual nodes. The complexity remains superlinear, but the run times are much lower than on a single node.
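
For illustration, a rough single-machine sketch of that partition-by-quantiles idea (this is not the actual TeraSort code; the sample size and bucket count are arbitrary):

    import random
    from bisect import bisect_right

    data = [random.random() for _ in range(100_000)]
    num_buckets = 8                                   # conceptually, one bucket per node

    # Estimate bucket boundaries (quantiles) from a small sample
    sample = sorted(random.sample(data, 1_000))
    cutoffs = [sample[(i * len(sample)) // num_buckets] for i in range(1, num_buckets)]

    # Partition by quantile (the "shuffle"), then sort each bucket independently --
    # on Hadoop, each bucket would be sorted on a different node.
    buckets = [[] for _ in range(num_buckets)]
    for x in data:
        buckets[bisect_right(cutoffs, x)].append(x)

    result = [x for b in buckets for x in sorted(b)]
    assert result == sorted(data)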

I'm pretty sure that Hadoop gives you a couple of options beyond MapReduce. You may want to look more closely at what Mahout does to solve such non-linear problems; it doesn't try or pretend to MapReduce everything. There is also Apache Hama, which goes beyond MapReduce using a generalization known as "Bulk Synchronous Parallel" processing. Hadoop (and YARN) in general do allow such things, but the API is obviously much harder than the classic Mapper and Reducer APIs.

The naive abuse of MapReduce

Or you go the abuse route (this is probably not what Mahout does). It is actually fairly simple to abuse MapReduce, because there is no constraint on the output size. So if you do not care about memory, you can do the following:

Assume that you know all keys, e.g. because they are numbered 1 to n. Then you can use

map(K, V) -> [ (1, (K, V)), (2, (K, V)),  (3, (K, V)), ..., (n, (K, V)) ]

(which obviously creates a data set of quadratic size!) and in the reducer, every key now has a full copy of the data set and one ID to care about.

Each reducer then reads the complete data set again for its object, computes n similarities, and outputs them.

Boom, you map-reduced the problem. It's stupid and inefficient, but it works and it is being done.
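
To make this concrete, here is a hedged sketch of what such a job could look like with Hadoop Streaming (the tab-separated record layout and the assumption that user IDs are literally the strings "1".."N" are mine, purely for illustration; this is not Mahout code):

    import sys
    from itertools import groupby

    N = 100_000  # total number of user IDs; assumed known and numbered 1..N

    def mapper(stdin=sys.stdin, stdout=sys.stdout):
        # Input: one record per user, "user_id<TAB>comma-separated features".
        # Every record is replicated under every target key 1..N -- this is the
        # quadratic blow-up described above.
        for line in stdin:
            user_id, features = line.rstrip("\n").split("\t", 1)
            for target in range(1, N + 1):
                stdout.write(f"{target}\t{user_id}\t{features}\n")

    def reducer(stdin=sys.stdin, stdout=sys.stdout):
        # Hadoop delivers reducer input sorted by key, so all records for one
        # target arrive together. Each target key sees a full copy of the data
        # set, picks out "its own" record (the one whose ID equals the target),
        # and compares it against all the others.
        def parse(line):
            target, user_id, features = line.rstrip("\n").split("\t", 2)
            return target, user_id, [float(x) for x in features.split(",")]

        for target, group in groupby(map(parse, stdin), key=lambda rec: rec[0]):
            records = [(uid, vec) for _, uid, vec in group]
            own = dict(records)[target]
            for uid, vec in records:
                if uid != target:
                    stdout.write(f"{target}\t{uid}\t{pearson(own, vec)}\n")

    def pearson(a, b):
        # plain Pearson correlation of two equal-length feature vectors
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        sa = sum((x - ma) ** 2 for x in a) ** 0.5
        sb = sum((y - mb) ** 2 for y in b) ** 0.5
        return cov / (sa * sb) if sa and sb else 0.0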

A smarter abuse with auxiliary data

A smarter version of this directly loads the data set as so-called "auxiliary data" into the system. So it does:

map (K,V) -> [ (K_1, sim(V, aux_obj1)), (K_2, sim(V, aux_obj2)),
             (K_3, sim(V, aux_obj3)), ... (K_n, sim(V, aux_objn)) ]

This is not as bad an abuse of MapReduce (it is just the canonical way of computing your quadratic result matrix in parallel), since it is at least honest about what it does: using massive auxiliary data. It will also work, as long as the auxiliary data fits into your worker memory. But it is no longer a proper MapReduce, because it uses auxiliary data that can't really be seen as a parameterization of your function.

Given that you apparently can load your data into memory (even in Python), this last option is probably the easiest way for you to go. You could even break the auxiliary data into chunks and compute these slices of the database as separate jobs.
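
A hedged sketch of this variant, again in Hadoop Streaming style (the file name users.tsv, the record format, and the use of NumPy are assumptions for illustration; in a real job the file would be shipped to each worker, e.g. via the distributed cache):

    import sys
    import numpy as np

    def load_aux(path="users.tsv"):
        # Hypothetical layout: "user_id<TAB>comma-separated features".
        # The whole file (or the chunk assigned to this job) must fit in memory.
        aux = {}
        with open(path) as f:
            for line in f:
                user_id, features = line.rstrip("\n").split("\t", 1)
                aux[user_id] = np.array([float(x) for x in features.split(",")])
        return aux

    def mapper(stdin=sys.stdin, stdout=sys.stdout):
        aux = load_aux()  # the auxiliary copy of the data set, held in worker memory
        for line in stdin:
            user_id, features = line.rstrip("\n").split("\t", 1)
            vec = np.array([float(x) for x in features.split(",")])
            for other_id, other_vec in aux.items():
                if other_id != user_id:
                    sim = np.corrcoef(vec, other_vec)[0, 1]  # Pearson correlation
                    stdout.write(f"{user_id}\t{other_id}\t{sim}\n")

Note that no real reduce step is needed here; an identity reducer (or a map-only job) is enough, which is exactly why this is not really MapReduce anymore.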

Still, it isn't MapReduce. It is a quadratic problem, which by definition cannot be MapReduce. These problems can be solved on Hadoop (see Mahout), but you need more than MapReduce for that.

A final remark on your actual task

First of all, please share more on what you really plan on doing. The key idea for becoming faster is to do less. If you can save computations, you are always faster.

100k users with 5 attributes (double valued?) is not very much. So maybe your Python implementation is just too inefficient; a compiled and optimized language can probably get you 1-2 orders of magnitude of speedup. I've done pairwise similarities on 110k objects with 8 attributes in 10 minutes, so your problem should be solvable in about this time, too. 100k users with 5 attributes is not really "Hadoop-sized big data" yet. You may end up paying more in Hadoop overhead than you gain, compared to a fast low-level implementation.

Optimizing the Pearson correlation: there are a couple of things you can do here. Note how you end up recomputing things like the standard deviations? If you preprocess your data and standardize every record to have mean 0 and standard deviation 1, the Pearson correlation simplifies to the covariance (because the stddev is 1). And because the mean is 0, the covariance becomes just E(x*y) = 1/5 \sum x_i * y_i. This can probably be accelerated further by using a spatial index, if you are e.g. only interested in the top-10 most similar objects for each user. I figure you can easily add this into ELKI and use the spatial index structures there. This usually shaves off another order of magnitude in running time, which means you should be on the order of 1 minute of processing time on a single CPU.
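
A small NumPy sketch of that simplification (X is random stand-in data shaped like one of your clusters, not your actual features): after standardizing each row, the whole correlation matrix is a single matrix product.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1_000, 5))      # stand-in for one cluster: n_users x 5 features

    # Standardize each row to mean 0 and stddev 1 ...
    Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

    # ... then Pearson correlation is just E(x*y): one matrix product, divided by 5.
    corr = (Z @ Z.T) / X.shape[1]        # corr[i, j] == Pearson(X[i], X[j])

    # sanity check against NumPy's own implementation for one pair
    assert np.isclose(corr[0, 1], np.corrcoef(X[0], X[1])[0, 1])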

Lias answered 7/9, 2012 at 15:5 Comment(5)
You've suggested one way to compute user-user similarity, and argued that it is not feasible in M/R. But the approach you suggest is not even how Mahout does it. Computing user-user similarity involves an operation over each item and its list of user associations. At the individual item level you output and aggregate information regarding user-user interactions, which can be huge. Smart pruning fixes this in practice. Naive adaptations to M/R may not work, but Mahout is an existence proof of (somewhat) smarter possibilities, of the form the OP is looking for. – Sundries
He wrote that users have 5 features and he uses Pearson correlation, so I assumed he computes user similarity by computing the pairwise Pearson correlation on these vectors. It definitely does not sound as if he is using graph data (= he does not have associations?). I never said that Mahout does it the way I sketched it; I said this is how MapReduce is abused to do non-linear problems. I explicitly suggest looking into Mahout instead of trying to do it with MapReduce manually. – Lias
A user-item association means a rating or preference -- no graphs. You don't have to literally collect two users' data to compute all-pairs similarity, and Mahout doesn't. You actually need to map by item, not by user. Your post suggests this is not representable in M/R, but Mahout is an existence proof. You suggest Mahout may not be using M/R for this, but it is. You are suggesting Mahout explodes a full Cartesian join of the data, but it doesn't. Giving reasons that one approach doesn't work well (yes, that one doesn't) doesn't mean there isn't a feasible approach, and that is what the OP is getting at. – Sundries
Let me just reiterate: Pearson correlation on 5 features... This is not a bipartite graph like user-item associations. You can't just split it on these 5 features and process them independently (apart from that only giving you 5 partitions). – Lias
You can certainly look at it as an n x 5 "user-item" matrix and run an all-pairs user similarity job. Look at org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.PearsonCorrelationSimilarity then. (It's cheating by centering the data, which means this reduces to cosine similarity.) You only need dot products and norms then. Norms are easy -- nothing pairwise. Dot products you can get by examining one feature at a time and outputting pairwise products of elements, then summing afterwards. This does much better than the Cartesian join upfront. – Sundries
