Computing user similarity using Mahout MapReduce

I am using Mahout clustering and I have large clusters, each with around 100k users and each user having 5 features. In the next step I need to compute the Pearson correlation to find the similarity between the users of a cluster.

Currently I have a Python script which does this for me, but as expected it takes way too long for the computation and is no longer a feasible option.

I looked at Mahout as it provides functionality to compute UserSimilarity using Pearson, Tanimoto and log-likelihood measures. What I am not able to find is a way to develop a MapReduce version of these similarity measures.

Is there any resource that walks through an example and explains how to develop a MapReduce version of UserSimilarity, or would it make sense to use Hadoop streaming with the same Java class?

EDIT

Even though I have 100k users in a cluster, I rarely need to compute a 100k * 100k matrix. Most of the time it will be a 10k * 100k matrix. However, I have around 500 such clusters, so the time consumed for computing 500 clusters of 10k * 100k is pretty long, and that is the reason I was looking for better approaches and wanted to trigger a discussion.

Noway answered 7/9, 2012 at 13:49 Comment(4)
Please share some more information on what you plan to do. Do you really need all pairwise similarities (which is an O(n^2) data set!)? – Lias
@Anony-Mousse: Edited the question and provided more details. – Noway
Well, what should be in the 10k*100k matrix that you are interested in? Computing only that part should be 10x faster... – Lias
In fact, the pairwise comparison with a large number of users can be addressed with the MinHash algorithm; the time complexity can be reduced to O(n). – Utile

Limitations of MapReduce

Pairwise user similarity is not representable in MapReduce by definition.

Recall what map does: it reads one key-value pair. It does not read two key-value pairs at once.

By definition of map, you cannot compute pairwise comparisons.

This should be obvious if you realize that map and reduce are linear processes. And pairwise similarity is a quadratic task. It cannot work.

That is, unless you abuse MapReduce. You can explode your data space to quadratic size. If you first generate all n*n pairs, then you can process these pairs with MapReduce again. But that is a bad abuse of it (although some people seem to do exactly that).

Alternatives to (abusing) MapReduce

Well, first of all, sometimes you can rewrite the problem to be actually linear, for example by inverting your data, or by doing some clever pruning so that while in theory it is quadratic, on real data it usually remains linear (because you can drop a lot of the theoretically quadratic data during or before generation).
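
As a toy illustration of the "inverting" idea (a single-process sketch with made-up sparse rating data, not Mahout's actual job): index by item instead of by user, so that partial products are only ever generated for users that actually share an item. With sparse data this stays far below the n*n pairs.

    from collections import defaultdict
    from itertools import combinations

    # user -> {item: rating}; sparse, so most user pairs share nothing
    ratings = {
        "u1": {"a": 4.0, "b": 2.0},
        "u2": {"a": 5.0, "c": 1.0},
        "u3": {"c": 3.0, "d": 4.0},
    }

    # "Invert" the data: item -> [(user, rating)]
    by_item = defaultdict(list)
    for user, items in ratings.items():
        for item, rating in items.items():
            by_item[item].append((user, rating))

    # Emit partial dot products only for users that co-rated an item (the map
    # side), then sum them per user pair (the reduce side). Pairs that share
    # nothing are never generated at all -- that is where the pruning comes from.
    dot = defaultdict(float)
    for item, users in by_item.items():
        for (u, ru), (v, rv) in combinations(sorted(users), 2):
            dot[(u, v)] += ru * rv

    print(dict(dot))  # {('u1', 'u2'): 20.0, ('u2', 'u3'): 3.0}

The same partial-products-then-sum structure is what a map-by-item job would use; the reduce step just sums the per-pair contributions (and divides by the norms if you want a normalized similarity).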

Secondly, note that Mahout builds on the Hadoop platform, but doesn't solve everything with just MapReduce. A lot of Hadoop-based tools do not do just "map + reduce". Take TeraSort, for example: since sorting is not a linear problem (it also involves comparing elements!), it cannot be solved by MapReduce alone. But you can use Hadoop for TeraSort by writing a distributed sort that uses a generalized median-of-medians approach to estimate quantiles, distributes the data according to these quantiles, and then sorts the individual buckets (in O(n log n)) on the individual nodes. The complexity remains superlinear, but the run times are much lower than on a single node.
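
For illustration, a rough single-machine sketch of that partition-by-quantiles idea (this is not the actual TeraSort code; the sample size and bucket count are arbitrary):

    import random
    from bisect import bisect_right

    data = [random.random() for _ in range(100_000)]
    num_buckets = 8                                   # conceptually, one bucket per node

    # Estimate bucket boundaries (quantiles) from a small sample
    sample = sorted(random.sample(data, 1_000))
    cutoffs = [sample[(i * len(sample)) // num_buckets] for i in range(1, num_buckets)]

    # Partition by quantile (the "shuffle"), then sort each bucket independently --
    # on Hadoop, each bucket would be sorted on a different node.
    buckets = [[] for _ in range(num_buckets)]
    for x in data:
        buckets[bisect_right(cutoffs, x)].append(x)

    result = [x for b in buckets for x in sorted(b)]
    assert result == sorted(data)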

I'm pretty sure that Hadoop gives you a couple of options beyond MapReduce. You may want to look more closely at what Mahout does to solve such non-linear problems; it doesn't try or pretend to MapReduce everything. There is also Apache Hama, which goes beyond MapReduce using a generalization known as "Bulk Synchronous Parallel" processing. Hadoop (and YARN) in general do allow such things, but the API is obviously much harder than the classic Mapper and Reducer APIs.

The naive abuse of MapReduce

Or you go the abuse route (this is probably not what Mahout does). It is actually fairly simple to abuse MapReduce, because there is no constraint on the output size. So if you do not care about memory, you can do the following:

Assume that you know all keys, e.g. because they are numbered 1 to n. Then you can use

map(K, V) -> [ (1, (K, V)), (2, (K, V)),  (3, (K, V)), ..., (n, (K, V)) ]

(which obviously creates a data set of quadratic size!) and in the reducer, every key now has a full copy of the data set and one ID to care about.

Each reducer then reads the complete data set again for its object, computes n similarities, and outputs them.

Boom, you map-reduced the problem. It's stupid and inefficient, but it works and it is being done.
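
To make this concrete, here is a hedged sketch of what such a job could look like with Hadoop Streaming (the tab-separated record layout and the assumption that user IDs are literally the strings "1".."N" are mine, purely for illustration; this is not Mahout code):

    import sys
    from itertools import groupby

    N = 100_000  # total number of user IDs; assumed known and numbered 1..N

    def mapper(stdin=sys.stdin, stdout=sys.stdout):
        # Input: one record per user, "user_id<TAB>comma-separated features".
        # Every record is replicated under every target key 1..N -- this is the
        # quadratic blow-up described above.
        for line in stdin:
            user_id, features = line.rstrip("\n").split("\t", 1)
            for target in range(1, N + 1):
                stdout.write(f"{target}\t{user_id}\t{features}\n")

    def reducer(stdin=sys.stdin, stdout=sys.stdout):
        # Hadoop delivers reducer input sorted by key, so all records for one
        # target arrive together. Each target key sees a full copy of the data
        # set, picks out "its own" record (the one whose ID equals the target),
        # and compares it against all the others.
        def parse(line):
            target, user_id, features = line.rstrip("\n").split("\t", 2)
            return target, user_id, [float(x) for x in features.split(",")]

        for target, group in groupby(map(parse, stdin), key=lambda rec: rec[0]):
            records = [(uid, vec) for _, uid, vec in group]
            own = dict(records)[target]
            for uid, vec in records:
                if uid != target:
                    stdout.write(f"{target}\t{uid}\t{pearson(own, vec)}\n")

    def pearson(a, b):
        # plain Pearson correlation of two equal-length feature vectors
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        sa = sum((x - ma) ** 2 for x in a) ** 0.5
        sb = sum((y - mb) ** 2 for y in b) ** 0.5
        return cov / (sa * sb) if sa and sb else 0.0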

A smarter abuse with auxiliary data

A smarter version of this directly loads the data set as so-called "auxiliary data" into the system. So it does:

map (K,V) -> [ (K_1, sim(V, aux_obj1)), (K_2, sim(V, aux_obj2)),
             (K_3, sim(V, aux_obj3)), ... (K_n, sim(V, aux_objn)) ]

This is not as bad an abuse of MapReduce (it is just the canonical way of computing your quadratic result matrix in parallel), since it is at least honest about what it does: using massive auxiliary data. It will also work, as long as the auxiliary data fits into your worker memory. But it is no longer a proper MapReduce, because it uses auxiliary data that can't really be seen as a parameterization of your function.

Given that you apparently can load your data into memory (even in Python), this last option is probably the easiest way for you to go. You could even break the auxiliary data into chunks and compute these slices of the database as separate jobs.
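
A hedged sketch of this variant, again in Hadoop Streaming style (the file name users.tsv, the record format, and the use of NumPy are assumptions for illustration; in a real job the file would be shipped to each worker, e.g. via the distributed cache):

    import sys
    import numpy as np

    def load_aux(path="users.tsv"):
        # Hypothetical layout: "user_id<TAB>comma-separated features".
        # The whole file (or the chunk assigned to this job) must fit in memory.
        aux = {}
        with open(path) as f:
            for line in f:
                user_id, features = line.rstrip("\n").split("\t", 1)
                aux[user_id] = np.array([float(x) for x in features.split(",")])
        return aux

    def mapper(stdin=sys.stdin, stdout=sys.stdout):
        aux = load_aux()  # the auxiliary copy of the data set, held in worker memory
        for line in stdin:
            user_id, features = line.rstrip("\n").split("\t", 1)
            vec = np.array([float(x) for x in features.split(",")])
            for other_id, other_vec in aux.items():
                if other_id != user_id:
                    sim = np.corrcoef(vec, other_vec)[0, 1]  # Pearson correlation
                    stdout.write(f"{user_id}\t{other_id}\t{sim}\n")

Note that no real reduce step is needed here; an identity reducer (or a map-only job) is enough, which is exactly why this is not really MapReduce anymore.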

Still, it isn't MapReduce. It is a quadratic problem, which by definition cannot be MapReduce. These problems can be solved on Hadoop (see Mahout), but you need more than MapReduce for that.

A final remark on your actual task

First of all, please share more on what you really plan on doing. The key idea for becoming faster is to do less. If you can save computations, you are always faster.

100k users with 5 attributes (double valued?) is not very much. So maybe your Python implementation is just too inefficient; a compiled and optimized language can probably get you 1-2 orders of magnitude of speedup. I've done pairwise similarities on 110k objects with 8 attributes in 10 minutes, so your problem should be solvable in about this time, too. 100k users with 5 attributes is not really "Hadoop-sized big data" yet. You may end up paying more in Hadoop overhead than you gain, compared to a fast low-level implementation.

Optimizing the Pearson correlation: there are a couple of things you can do here. Note how you end up recomputing things like the standard deviations? If you preprocess your data and standardize every record to have mean 0 and standard deviation 1, the Pearson correlation simplifies to the covariance (because the stddev is 1). And because the mean is 0, the covariance becomes just E(x*y) = 1/5 \sum x_i * y_i. This can probably be accelerated further by using a spatial index, if you are e.g. only interested in the top-10 most similar objects for each user. I figure you can easily add this into ELKI and use the spatial index structures there. This usually shaves off another order of magnitude in running time, which means you should be on the order of 1 minute of processing time on a single CPU.
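
A small NumPy sketch of that simplification (X is random stand-in data shaped like one of your clusters, not your actual features): after standardizing each row, the whole correlation matrix is a single matrix product.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1_000, 5))      # stand-in for one cluster: n_users x 5 features

    # Standardize each row to mean 0 and stddev 1 ...
    Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

    # ... then Pearson correlation is just E(x*y): one matrix product, divided by 5.
    corr = (Z @ Z.T) / X.shape[1]        # corr[i, j] == Pearson(X[i], X[j])

    # sanity check against NumPy's own implementation for one pair
    assert np.isclose(corr[0, 1], np.corrcoef(X[0], X[1])[0, 1])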

Lias answered 7/9, 2012 at 15:5 Comment(5)
You've suggested one way to compute user-user similarity, and argued that it is not feasible in M/R. But the approach you suggest is not even how Mahout does it. Computing user-user similarity involves an operation over each item and its list of user associations. At the individual item level you output and aggregate information regarding user-user interactions, which can be huge. Smart pruning fixes this in practice. Naive adaptations to M/R may not work, but Mahout is an existence proof of (somewhat) smarter possibilities, of the form the OP is looking for. – Sundries
He wrote that users have 5 features and he uses Pearson correlation, so I assumed he computes user similarity by computing the pairwise Pearson correlation on these vectors. It definitely does not sound as if he is using graph data (= he does not have associations?). I never said that Mahout does it the way I sketched it; I said this is how MapReduce is abused to do non-linear problems. I explicitly suggest looking into Mahout instead of trying to do it with MapReduce manually. – Lias
A user-item association means a rating or preference -- no graphs. You don't have to literally collect two users' data to compute all-pairs similarity, and Mahout doesn't. You actually need to map by item, not by user. Your post suggests this is not representable in M/R, but Mahout is an existence proof. You suggest Mahout may not be using M/R for this, but it is. You are suggesting Mahout explodes a full Cartesian join of the data, but it doesn't. Giving reasons that one approach doesn't work well (yes, that one doesn't) doesn't mean there isn't a feasible approach, and that is what the OP is getting at. – Sundries
Let me just reiterate: Pearson correlation on 5 features... This is not a bipartite graph like user-item associations. You can't just split it on these 5 features and process them independently (apart from that only giving you 5 partitions). – Lias
You can certainly look at it as an n x 5 "user-item" matrix and run an all-pairs user similarity job. Look at org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.PearsonCorrelationSimilarity then. (It's cheating by centering the data, which means this reduces to cosine similarity.) You only need dot products and norms then. Norms are easy -- nothing pairwise. Dot products you can get by examining one feature at a time and outputting pairwise products of elements, then summing afterwards. This does much better than the Cartesian join upfront. – Sundries
