How to use Cassandra's MapReduce with or without Pig?

Can someone explain how MapReduce works with Cassandra 0.6? I've read through the word count example, but I don't quite follow what's happening on the Cassandra end vs. the "client" end.

https://svn.apache.org/repos/asf/cassandra/trunk/contrib/word_count/

For instance, let's say I'm using Python and Pycassa: how would I load in a new MapReduce function, and then call it? Does my MapReduce function have to be Java that's installed on the Cassandra server? If so, how do I call it from Pycassa?

There's also mention of Pig making this all easier, but I'm a complete Hadoop noob, so that didn't really help.

Your answer can use Thrift or whatever, I just mentioned Pycassa to denote the client side. I'm just trying to understand the difference between what runs in the Cassandra cluster vs. the actual server making the requests.

Face answered 29/4, 2010 at 0:17 Comment(0)

From what I've heard (and from here), the way a developer writes a MapReduce program that uses Cassandra as the data source is as follows. You write a regular Hadoop MapReduce program (the example you linked to is the pure-Java version), and the jars that are now available provide a custom InputFormat that allows the input source to be Cassandra (instead of the default, which is HDFS).
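
Roughly, the driver side looks like the sketch below, loosely modeled on the word_count contrib. Treat it as illustrative only: the TokenizerMapper/IntSumReducer classes are placeholders (use the ones from the contrib or your own), "Keyspace1"/"Standard1"/"text" are made-up names, and the exact ConfigHelper methods and how you point the job at a Cassandra node vary by release, so check the contrib for your version.

```java
import java.util.Arrays;

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CassandraWordCount
{
    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount-from-cassandra");
        job.setJarByClass(CassandraWordCount.class);

        // Ordinary Hadoop mapper/reducer classes -- placeholders here,
        // nothing Cassandra-specific about them.
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The Cassandra-specific part: read input rows straight out of a
        // column family instead of from HDFS files.
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        ConfigHelper.setColumnFamily(job.getConfiguration(), "Keyspace1", "Standard1");

        // Which columns of each row get handed to the mapper.
        SlicePredicate predicate = new SlicePredicate()
            .setColumn_names(Arrays.asList("text".getBytes()));
        ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);

        // Output goes wherever you like -- HDFS here, or back into Cassandra
        // from the reducer via Thrift.
        FileOutputFormat.setOutputPath(job, new Path("/tmp/word_count_output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You then bundle that into a jar and submit it with `hadoop jar`, just like any other Hadoop job; the only Cassandra-specific pieces are the InputFormat and the ConfigHelper calls.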

If you're using Pycassa I'd say you're out of luck until either (1) the maintainer of that project adds support for MapReduce or (2) you throw some Python functions together that write up a Java MapReduce program and run it. The latter is definitely a bit of a hack but would get you up and going.

Hofer answered 29/4, 2010 at 0:52 Comment(4)
So the Cassandra nodes aren't doing the MapReduce; it runs wherever your Java is running? – Face
Yes, the Hadoop jobtrackers run the M/R jobs. – Stich
So isn't the point of MapReduce that it's distributed? If it's not run on the Cassandra nodes, what's the point? – Face
It's still distributed, just distributed over the Hadoop nodes in your system. The main point of the Cassandra interface is that the way some people were doing it before was to dump a subset of their Cassandra database, read it in, run an MR job, and then write the results back to Cassandra. This removes a bit of that boilerplate code (mainly). – Hofer

It does know about locality; the Cassandra InputFormat overrides getLocations() to preserve data locality.
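
For the curious, the hook is Hadoop's InputSplit.getLocations(). Here is a minimal illustration of the idea, not Cassandra's actual split class (a real split would also implement Writable so it can be shipped to the tasktrackers):

```java
import java.io.IOException;

import org.apache.hadoop.mapreduce.InputSplit;

// Illustration only: how an InputFormat advertises locality to Hadoop.
// A Cassandra-aware split does the equivalent for each token range.
public class TokenRangeSplit extends InputSplit
{
    private final String[] replicaHosts;   // nodes that own this token range (assumed field)

    public TokenRangeSplit(String[] replicaHosts)
    {
        this.replicaHosts = replicaHosts;
    }

    @Override
    public String[] getLocations() throws IOException, InterruptedException
    {
        // The jobtracker treats these as preferred hosts for the map task,
        // so the task usually runs next to a replica that holds the data.
        return replicaHosts;
    }

    @Override
    public long getLength() throws IOException, InterruptedException
    {
        // The exact size of a token range isn't known up front; an estimate
        // is fine for scheduling purposes.
        return 0;
    }
}
```

The hostnames are only preferences, so locality helps only if tasktrackers actually run on the Cassandra nodes.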

Eleonoraeleonore answered 21/8, 2010 at 1:40 Comment(0)

The big win of using Cassandra's InputFormat directly is that it streams the data efficiently. Each input split covers a range of tokens and rolls off the disk at full bandwidth: no seeking, no complex querying. I don't think it knows about locality, though, i.e. having each tasktracker prefer input splits from a Cassandra process on the same node.
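
From the mapper's point of view, that streaming looks roughly like the sketch below. The generic types here are the 0.6-era ones from memory (string row keys and a SortedMap of columns; later releases switched to ByteBuffer), and the "text" column name is made up, so check the word_count contrib for your version.

```java
import java.io.IOException;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each input split is one token range; the record reader walks that range and
// feeds the mapper one row at a time: the row key plus the columns selected
// by the job's SlicePredicate.
public class TokenizerMapper
        extends Mapper<String, SortedMap<byte[], IColumn>, Text, IntWritable>
{
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(String rowKey, SortedMap<byte[], IColumn> columns, Context context)
            throws IOException, InterruptedException
    {
        // Look up the column we asked for in the SlicePredicate ("text" is assumed).
        IColumn column = columns.get("text".getBytes());
        if (column == null)
            return;

        // IColumn.value() returned byte[] in 0.6; decode and count words as in
        // the stock Hadoop word count.
        String value = new String(column.value());
        for (String token : value.split("\\s+"))
        {
            word.set(token);
            context.write(word, ONE);
        }
    }
}
```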

You can try using Pig with the STREAM operator as a hack until more direct Hadoop streaming support is in place.

Embargo answered 13/6, 2010 at 19:53 Comment(0)
