Finding Connected Components using Hadoop/MapReduce

I need to find connected components for a huge dataset. (Graph being Undirected)

One obvious choice is MapReduce, but I'm a newbie to MapReduce and am quite short of time to pick it up and code it myself.

I was just wondering whether there is any existing API for this, since it is a very common problem in social network analysis.

Or, at least, is anyone aware of a reliable (tried and tested) source I can use to get started with the implementation myself?

Thanks

Maclaine answered 20/5, 2012 at 21:30 Comment(0)

I blogged about it for myself:

http://codingwiththomas.blogspot.de/2011/04/graph-exploration-with-hadoop-mapreduce.html

But MapReduce isn't a good fit for these graph analysis tasks. It's better to use BSP (Bulk Synchronous Parallel) for that; Apache Hama provides a good graph API on top of Hadoop HDFS.

I've written a connected-components algorithm (MinDist search) with MapReduce here:

https://github.com/thomasjungblut/tjungblut-graph/tree/master/src/de/jungblut/graph/mapreduce
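
If you just want to see the shape of the MinDist idea before reading the repository, here is a minimal, hypothetical sketch of one iteration as a Hadoop Streaming mapper and reducer in Python. The record format (vertex, current component label, comma-separated adjacency list) and all names are assumptions of this sketch, not the format the linked Java code uses:

#!/usr/bin/env python
# One MinDist iteration: every vertex adopts the smallest component label
# seen among itself and its neighbours. A driver re-runs the job until no
# label changes (in real Hadoop, tracked with a counter).
# Assumed record format: vertex_id \t label \t neighbour1,neighbour2,...
import sys

def run_mapper():
    for line in sys.stdin:
        vertex, label, adj = line.rstrip("\n").split("\t")
        # Re-emit the vertex itself so the reducer can rebuild its adjacency list.
        print("%s\t%s\t%s" % (vertex, label, adj))
        # Offer this vertex's label to every neighbour.
        for n in adj.split(","):
            if n:
                print("%s\t%s\t" % (n, label))

def run_reducer():
    current, best, adj = None, None, ""
    for line in sys.stdin:
        vertex, label, neighbours = line.rstrip("\n").split("\t")
        if vertex != current:
            if current is not None:
                print("%s\t%s\t%s" % (current, best, adj))
            current, best, adj = vertex, label, neighbours
        else:
            best = min(best, label, key=int)   # keep the smallest label seen
            adj = adj or neighbours            # keep the structural record's adjacency
    if current is not None:
        print("%s\t%s\t%s" % (current, best, adj))

if __name__ == "__main__":
    run_mapper() if sys.argv[1] == "map" else run_reducer()

With Hadoop Streaming you would pass the same file as both the mapper ("cc.py map") and the reducer ("cc.py reduce"), then repeat the job until an iteration changes no labels.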

Also a BSP version for Apache Hama can be found here:

https://github.com/thomasjungblut/tjungblut-graph/blob/master/src/de/jungblut/graph/bsp/MindistSearch.java

The implementation isn't as difficult as in MapReduce, and it is at least 10 times faster. If you're interested, check out the latest version in TRUNK and visit our mailing list.

http://hama.apache.org/

http://apache.org/hama/mail-lists.html
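
To get a feel for why the vertex-centric BSP formulation ends up simpler than chained MapReduce jobs, here is a small plain-Python simulation of the same MinDist idea as a loop of supersteps with message passing. It only illustrates the programming model; it is not the Hama API:

def mindist_bsp(adjacency):
    """adjacency: {vertex: iterable_of_neighbours} describing an undirected graph."""
    label = {v: v for v in adjacency}        # every vertex starts as its own component
    active = set(adjacency)                  # vertices whose label changed last superstep
    while active:                            # one loop iteration == one BSP superstep
        inbox = {}
        for v in active:                     # active vertices offer their label to neighbours
            for n in adjacency[v]:
                inbox.setdefault(n, []).append(label[v])
        active = set()
        for v, messages in inbox.items():
            smallest = min(messages)
            if smallest < label[v]:          # adopt the smaller component label
                label[v] = smallest
                active.add(v)
    return label                             # vertices with equal labels are connected

# Example: two components, {0, 1, 2} and {3, 4}.
print(mindist_bsp({0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}))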

Flung answered 21/5, 2012 at 7:57 Comment(2)
For now, I'm not at all concerned about the complexity. I am doing a proof-of-concept, so running time won't matter. I am actually short of time, so instead of going to normal Java/C programming to achieve it, I was just hoping to get an existing implementation, however dirty. It won't be possible for me to look at any way other than Hadoop/MapReduce for now. Thanks – Maclaine
So you are prototyping in MapReduce? Interesting. My solution in the blog works as it stands, and it is production-tested by many other people I know. Don't hesitate to take it. – Flung

I don't really know of an API that has methods to find strongly connected components. But I implemented the BFS algorithm to find the distance from a source node to all other nodes in the graph (the graph was a directed graph with as many as 65 million nodes).

The idea was to explore the neighbors (at distance 1) of each node in one iteration and feed the output of reduce back to map, until the distances converge. The map emits the shortest distance possible from each node, and the reduce updates the node with the shortest distance from the list.
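
As an illustration of that loop, here is a minimal plain-Python simulation of one map and one reduce phase for the BFS distances, with a driver that feeds reduce output back to map until nothing changes. The record layout and names are assumptions of this sketch, not taken from a particular Hadoop job:

INF = float("inf")

def bfs_map(node, value):
    distance, adjacency = value
    yield node, (distance, adjacency)            # pass the graph structure through
    for neighbour in adjacency:
        yield neighbour, (distance + 1, [])      # tentative distance via this node

def bfs_reduce(node, values):
    best, adjacency = INF, []
    for distance, adj in values:
        best = min(best, distance)               # keep the shortest distance offered
        adjacency = adjacency or adj             # recover the structural record
    return node, (best, adjacency)

def bfs(graph, source):
    """graph: {node: [neighbours]}; returns hop counts from source."""
    records = {n: (0 if n == source else INF, adj) for n, adj in graph.items()}
    while True:
        shuffled = {}
        for node, value in records.items():                                  # "map" phase
            for key, val in bfs_map(node, value):
                shuffled.setdefault(key, []).append(val)
        new_records = dict(bfs_reduce(n, vs) for n, vs in shuffled.items())  # "reduce" phase
        if new_records == records:                                           # distances converged
            return {n: d for n, (d, _) in records.items()}
        records = new_records

print(bfs({'a': ['b'], 'b': ['c'], 'c': []}, 'a'))   # {'a': 0, 'b': 1, 'c': 2}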

I would suggest checking this out. Also, this could help. These two links give you the basic idea of graph algorithms in the MapReduce paradigm (if you are not already familiar with them). Essentially, you need to tweak the algorithm to use DFS instead of BFS.

Itinerary answered 21/5, 2012 at 0:15 Comment(0)

You may want to look at the Pegasus project from Carnegie Mellon University. They provide an efficient, and elegant, implementation using MapReduce. They also provide binaries, samples, and very detailed documentation.

The implementation itself is based on the Generalized Iterative Matrix-Vector multiplication (GIM-V) proposed by U Kang in 2009.

U Kang, Charalampos E. Tsourakakis, Christos Faloutsos. "PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations." In IEEE International Conference on Data Mining (ICDM 2009).
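
In GIM-V terms, connected components is the generalized product v' = M ×_G v iterated to a fixed point, where combine2 passes a neighbour's current label along an edge, combineAll takes the minimum, and assign keeps the smaller of the old and new labels. A tiny, hypothetical Python sketch of that formulation (not the Pegasus code itself, which runs these stages as distributed matrix-vector jobs over HDFS):

def gimv_connected_components(adjacency):
    """adjacency: {i: iterable_of_j} for an undirected graph.
    Iterates v' = M x_G v with combine2 = pass the neighbour's label,
    combineAll = min, assign = min(old, new), as described in the PEGASUS paper."""
    v = {i: i for i in adjacency}                                       # initial vector: own id as label
    while True:
        combined = {i: min((v[j] for j in neighbours), default=v[i])    # combine2 + combineAll
                    for i, neighbours in adjacency.items()}
        new_v = {i: min(v[i], combined[i]) for i in v}                  # assign
        if new_v == v:                                                  # fixed point: labels stable
            return new_v
        v = new_v

print(gimv_connected_components({0: [1], 1: [0], 2: [3], 3: [2]}))      # {0: 0, 1: 0, 2: 2, 3: 2}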

EDIT: The official implementation is actually limited to 2.1 billion nodes (node ids are stored as integers). I'm creating a fork on GitHub (https://github.com/placeiq/pegasus) to share my patch and other enhancements (e.g. Snappy compression).

Kaiserslautern answered 8/4, 2014 at 21:25 Comment(0)

It is a little old question, but here is something you may want to check out. We implemented connected components using map-reduce on the Spark platform.

https://github.com/kwartile/connected-component
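
If Spark is on the table anyway, the same idea is short to express directly on RDDs. The repository above documents its own (more scalable) algorithm; the following is only a minimal, unoptimized PySpark sketch of iterative minimum-label propagation, with made-up example edges:

from pyspark import SparkContext

sc = SparkContext(appName="connected-components-sketch")

# Undirected edges; duplicate each edge so labels can flow in both directions.
edges = sc.parallelize([(1, 2), (2, 3), (4, 5)])
edges = edges.flatMap(lambda e: [e, (e[1], e[0])]).cache()

# Start with every vertex in its own component.
labels = edges.flatMap(lambda e: e).distinct().map(lambda v: (v, v))

while True:
    # Offer each vertex's current label to its neighbours, keep the minimum seen.
    candidates = (edges.join(labels)                      # (src, (dst, label_of_src))
                       .map(lambda kv: (kv[1][0], kv[1][1])))
    new_labels = labels.union(candidates).reduceByKey(min).cache()
    # Stop when no label changed in this round.
    changed = new_labels.join(labels).filter(lambda kv: kv[1][0] != kv[1][1]).count()
    labels = new_labels
    if changed == 0:
        break

print(labels.collect())   # e.g. [(1, 1), (2, 1), (3, 1), (4, 4), (5, 4)]
sc.stop()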

Blakeslee answered 4/8, 2017 at 19:53 Comment(0)

Try the built-in WCC (weakly connected components) algorithm in GraphScope, which exactly suits your need.

GraphScope is a distributed graph platform which has a number of algorithms ready to use.

Back to your question, it could be done in 3 steps:

  1. Launch GraphScope cluster
  2. Load the Graph
  3. Run WCC over it and retrieve the result.

Here's the code:

import graphscope

# Launch a GraphScope session with 4 workers.
sess = graphscope.session(num_workers=4)

# Build the graph from your edge list.
graph = sess.g()
graph = graph.add_edges('/path/to/dataset')

# Run weakly connected components and fetch the result.
result = graphscope.wcc(graph)
print(result.to_dataframe(selector={'id': 'v.id', 'component': 'r'}))

sess.close()

You could refer to this article to learn how to deploy GraphScope, and this article to learn how to deploy it with Helm.

Castigate answered 5/7, 2023 at 2:41 Comment(0)
