Hadoop, Mahout real-time processing alternative

I intended to use Hadoop as a "computation cluster" in my project. However, I then read that Hadoop is not intended for real-time systems because of the overhead connected with starting a job. I'm looking for a solution that can be used this way: jobs that can easily be scaled across multiple machines, but that do not require much input data. What's more, I want to run machine learning jobs in real time, e.g. using a previously trained neural network.

What libraries/technologies can I use for these purposes?

Ballinger answered 1/10, 2011 at 10:21 Comment(3)
Do you need real time in the model learning stage, or in the model usage stage? – Thumping
@David Gruzman Model usage stage – Ballinger
How fast is your real-time requirement? Seconds? Minutes? 15 minutes? ... – Trauma

You are right, Hadoop is designed for batch-type processing.

Reading the question, I thought about the Storm framework, very recently open-sourced by Twitter, which can be considered "Hadoop for real-time processing".

Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it's fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language.

(from: InfoQ post)

However, I have not worked with it yet, so I really cannot say much about it in practice.

Twitter Engineering Blog Post: http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
Github: https://github.com/nathanmarz/storm
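
For a feel of the programming model, a minimal topology might look roughly like this. This is only a sketch, assuming the early backtype.storm API as seen in the storm-starter examples; the feature spout and the scoring logic are made-up placeholders, not code from the Storm documentation:

    // Sketch of a Storm topology that scores events with a pre-trained model.
    // FeatureSpout and the scoring math are hypothetical placeholders; a real
    // spout would read feature vectors from a queue such as Kestrel.
    import java.util.Map;
    import java.util.Random;

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;
    import backtype.storm.utils.Utils;

    public class ScoringTopology {

        // Emits random feature vectors; stands in for a real event source.
        public static class FeatureSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            private Random rand;

            public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
                this.collector = collector;
                this.rand = new Random();
            }

            public void nextTuple() {
                Utils.sleep(100); // throttle the demo source
                collector.emit(new Values(new double[] { rand.nextDouble(), rand.nextDouble() }));
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("features"));
            }
        }

        // Applies the pre-trained model to every incoming feature vector.
        public static class ScoringBolt extends BaseBasicBolt {
            public void execute(Tuple input, BasicOutputCollector collector) {
                double[] features = (double[]) input.getValue(0);
                // placeholder for the real neural-network forward pass
                double score = 0;
                for (double f : features) score += f;
                collector.emit(new Values(score));
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("score"));
            }
        }

        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("features", new FeatureSpout(), 2);
            builder.setBolt("scorer", new ScoringBolt(), 4).shuffleGrouping("features");

            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("scoring-demo", new Config(), builder.createTopology());
            Utils.sleep(10000);
            cluster.shutdown();
        }
    }

The shuffleGrouping call spreads tuples randomly across the four scorer instances, which is where the horizontal scaling would come from.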

Xenophobe answered 1/10, 2011 at 10:50 Comment(0)

Given the fact that you want a real-time response in the "seconds" range, I recommend something like this:

  1. Set up a batch processing model for pre-computing as much as possible. Essentially, try to do everything that does not depend on the "last second" data. Here you can use a regular Hadoop/Mahout setup and run these batches daily or (if needed) every hour or even every 15 minutes.

  2. Use a real-time system to do the last few things that cannot be precomputed. For this you should look at either S4 (mentioned in another answer here) or the recently announced Twitter Storm.

Sometimes it pays to go really simple: store all the precomputed values in memory and simply do the last aggregation/filtering/sorting/... steps in memory. If you can do that, you can really scale, because each node can run completely independently of all the others.
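
To make that concrete, here is a minimal sketch of such an in-memory last step; all class and method names here are made up for illustration:

    // Serve precomputed batch results from memory; filter/sort at request time.
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class InMemoryServing {

        public static class Item {
            public final String id;
            public final double precomputedScore; // produced by the nightly batch job

            public Item(String id, double precomputedScore) {
                this.id = id;
                this.precomputedScore = precomputedScore;
            }
        }

        // Loaded once at startup from the batch job's output.
        private final Map<String, Item> itemsById = new HashMap<String, Item>();

        public void load(List<Item> batchOutput) {
            for (Item item : batchOutput) {
                itemsById.put(item.id, item);
            }
        }

        // The "last second" work: filter and sort entirely in memory,
        // so each serving node runs independently of all the others.
        public List<Item> topItems(double minScore, int limit) {
            List<Item> candidates = new ArrayList<Item>();
            for (Item item : itemsById.values()) {
                if (item.precomputedScore >= minScore) {
                    candidates.add(item);
                }
            }
            Collections.sort(candidates, new Comparator<Item>() {
                public int compare(Item a, Item b) {
                    return Double.compare(b.precomputedScore, a.precomputedScore);
                }
            });
            return candidates.subList(0, Math.min(limit, candidates.size()));
        }
    }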

Perhaps having a NoSQL backend for your realtime component would help. There are lots of them available: MongoDB, Redis, Riak, Cassandra, HBase, CouchDB, ...
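
For example, with Redis the batch job could write precomputed scores into a sorted set that the realtime component then queries. A sketch assuming the Jedis client; the key and item names are hypothetical:

    // Batch side writes scores to a sorted set; realtime side reads the top N.
    import java.util.Set;
    import redis.clients.jedis.Jedis;

    public class RedisServing {
        public static void main(String[] args) {
            Jedis jedis = new Jedis("localhost");

            // The batch job writes precomputed scores into a sorted set...
            jedis.zadd("scores", 0.93, "item-1");
            jedis.zadd("scores", 0.71, "item-2");
            jedis.zadd("scores", 0.88, "item-3");

            // ...and the realtime component reads the top 2 by score.
            Set<String> top = jedis.zrevrange("scores", 0, 1);
            System.out.println(top); // prints [item-1, item-3]

            jedis.disconnect();
        }
    }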

It all depends on your real application.

Trauma answered 2/10, 2011 at 12:40 Comment(0)

Also try S4, initially released by Yahoo! and now an Apache Incubator project. It has been around for a while, and I found it to be good for some basic stuff when I did a proof of concept. I haven't used it extensively, though.

Saturniid answered 1/10, 2011 at 18:29 Comment(0)

What you're trying to do would be a better fit for HPCC, as it has both the back-end data processing engine (equivalent to Hadoop) and a front-end real-time data delivery engine, eliminating the need to increase complexity through third-party components. A nice thing about HPCC is that both components are programmed in the exact same language and with the same programming paradigms. Check them out at: http://hpccsystems.com

Abortive answered 3/10, 2011 at 16:43 Comment(0)
