Cassandra's MapReduce Support

I recently ran into a case where Cassandra fits perfectly for storing time-based events with a custom TTL per event type (the alternatives would be to store the events in Hadoop and do the bookkeeping, TTLs and so on, by hand, which is IMHO overly complex, or to switch to HBase). The question is how well Cassandra's MapReduce support works out of the box, without the DataStax Enterprise edition.
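
For context, here is a minimal sketch of what a per-event-type TTL write looks like against the pre-CQL Thrift API; the "events" column family, the event types, and the TTL values are invented for illustration, and the exact classes vary a bit between Cassandra versions.

```java
import java.nio.ByteBuffer;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.utils.ByteBufferUtil;

public class EventWriter {

    // Hypothetical per-event-type TTLs, in seconds (not from the question).
    private static int ttlFor(String eventType) {
        return "click".equals(eventType) ? 7 * 24 * 3600 : 24 * 3600;
    }

    // Store one event: row key = event type, column name = event timestamp,
    // column value = serialized payload, TTL chosen per event type.
    static void storeEvent(Cassandra.Client client, String eventType,
                           long timestampMicros, byte[] payload) throws Exception {
        Column col = new Column();
        col.setName(ByteBufferUtil.bytes(timestampMicros));
        col.setValue(ByteBuffer.wrap(payload));
        col.setTimestamp(timestampMicros);
        col.setTtl(ttlFor(eventType)); // Cassandra expires the column automatically

        client.insert(ByteBufferUtil.bytes(eventType),
                      new ColumnParent("events"), col, ConsistencyLevel.ONE);
    }
}
```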

It seems that a lot has been invested in CassandraFS, but I wonder whether the plain Pig CassandraLoader is actively maintained and actually scales (it appears to do nothing more than iterate over the rows in slices). Does this work for hundreds of millions of rows?
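
For what it's worth, the Pig loader sits on top of the same Hadoop integration, so a plain MapReduce job over a column family looks roughly like the sketch below (modelled on the word_count example shipped with the Cassandra 1.1 source; the keyspace, column family, seed host, and output path are placeholders, and helper names can differ between versions):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class EventCount {

    // Each input record is one Cassandra row: the row key plus the sliced columns.
    public static class EventMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable> {
        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context ctx)
                throws IOException, InterruptedException {
            // Emit "row key -> number of columns" just to show the plumbing.
            ctx.write(new Text(ByteBufferUtil.string(key.duplicate())),
                      new LongWritable(columns.size()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "event-count");
        job.setJarByClass(EventCount.class);
        job.setMapperClass(EventMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Wire the job to Cassandra instead of HDFS input.
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        Configuration conf = job.getConfiguration();
        ConfigHelper.setInputInitialAddress(conf, "127.0.0.1"); // any seed node
        ConfigHelper.setInputRpcPort(conf, "9160");
        ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(conf, "EventKeyspace", "events");

        // Slice predicate: fetch every column of every row; splits are scanned in parallel.
        SlicePredicate predicate = new SlicePredicate().setSlice_range(
                new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
                               ByteBufferUtil.EMPTY_BYTE_BUFFER, false, Integer.MAX_VALUE));
        ConfigHelper.setInputSlicePredicate(conf, predicate);

        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/event-count"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```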

Bonsai asked 1/11, 2012 at 9:45 Comment(0)

You can map/reduce using the random partitioner, but of course the keys you get back are in random order. You probably want to use CL = ONE in Cassandra so you don't have to read from two nodes each time while doing map/reduce; that way it should read the local data. I have not used Pig, though.
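
A small, hedged illustration of that consistency-level point: later ConfigHelper versions expose a read consistency setter for the Hadoop input format, so you can request CL ONE explicitly; whether the setter exists depends on your Cassandra version, so treat it as something to verify.

```java
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;

public class LocalReadConfig {
    // Ask the Cassandra input format to read at CL ONE so each map task can
    // be served by its local replica instead of waiting on additional nodes.
    // Note: setReadConsistencyLevel appears in later ConfigHelper versions;
    // check the Cassandra jar you build against before relying on it.
    static void configureLocalReads(Configuration conf) {
        ConfigHelper.setReadConsistencyLevel(conf, "ONE");
    }
}
```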

Canyon answered 1/11, 2012 at 20:27 Comment(6)
The Pig support for Cassandra uses the ColumnFamilyInputFormat and -OutputFormat, so whatever you can or can't do in Hadoop maps fairly well to what you can and can't do with Cassandra and Pig. – Aleris
And is it actually fast using the random partitioner? I guess it just does something like this: #8418948 – I tried to iterate over a 100-million-row CF manually once, and it never actually got going after it sent the first range slice query. – Bonsai
That link doesn't look like map/reduce, as map/reduce implements a Mapper and a Reducer or something... I need to set it up again soon and don't have the code from my previous project. Yes, it is fast, since all of the tasks run in parallel; the start is slow, just like Hadoop, as it delivers code to each task tracker. – Canyon
"Hadoop" and "fast" don't really go together. That's the nature of sequential scans. But C* scans are faster than HBase, if that makes you feel better: vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdfNailbiting
I understand the nature of Hadoop and batch processing. I just tried to iterate over all rows (100,000,000 rows) in a Cassandra CF (random partitioner), which took ages, so I aborted it. I was just asking myself whether MapReduce through Hadoop uses the same mechanisms. – Bonsai
How many servers are you using to do 100,000,000 rows? The more servers, the faster; one server would take a while. – Canyon

Why not HBase? HBase is more suitable for time-series data. You can easily put billions of rows on a very small cluster and get up to 500k rows per second (up to 50 MB/s) on a small 3-node cluster with the WAL enabled. Cassandra has several flaws:

  1. In Cassandra you are actually restricted by the number of row keys (imagine that with billions of rows your repair would run forever). So you will design a schema that 'shards' your time into, say, 1-hour buckets, with the actual timestamps placed as columns (see the sketch after this list). But such a scheme doesn't scale well because of the high risk of huge (very wide) rows.
  2. The other problem: you can't map/reduce over a range of data in Cassandra unless you use the ordered partitioner, which is not an option at all because of its inability to balance well.
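
A rough sketch of the hour-bucket scheme described in point 1; the key and column conventions here are invented purely to make the idea concrete:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class TimeBucketing {
    // Not thread-safe; fine for a sketch.
    private static final SimpleDateFormat HOUR_FORMAT = new SimpleDateFormat("yyyyMMddHH");
    static {
        HOUR_FORMAT.setTimeZone(TimeZone.getTimeZone("UTC"));
    }

    /** Row key: one row per event type and hour, e.g. "click:2012110109". */
    static String rowKey(String eventType, long eventTimeMillis) {
        return eventType + ":" + HOUR_FORMAT.format(new Date(eventTimeMillis));
    }

    /** Column name: the raw event timestamp, so columns within a bucket stay time-ordered. */
    static long columnName(long eventTimeMillis) {
        return eventTimeMillis;
    }

    // Caveat from the answer: a hot event type can still produce very wide
    // rows within a single hour bucket.
}
```
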
Deming answered 1/11, 2012 at 12:40 Comment(4)
It's because I am already using Cassandra in the project and don't really want to introduce a new technology. – Bonsai
Good point. If it is okay to process all of the data all the time, this should work; but if the data keeps growing, I recommend reconsidering and using a store better adapted to MapReduce workloads. – Deming
What nonsense is this? Many (most?) Cassandra clusters support billions of rows quite well. You mention repair, but that is of course distributed as well. – Nailbiting
It is true that Cassandra discourages relying on global ordering in your data model, but this is not much of a downside, particularly with Cassandra's built-in support for column indexes (which are supported in map/reduce as well). – Nailbiting
