Production architecture for big data real time machine learning application?

Asked 6/12, 2012 at 16:7 Answered 31/10, 2013 at 15:22

Solved machine-learning real-time weka mahout pentaho

I'm starting to learn some stuff about big data with a big focus on predictive analysis and for that I have a case study I would like to implement:

I have a dataset of server health information that is polled every 5 seconds. I want to show the data that is retrieved but, more importantly, I want to run a previously built machine learning model and show the results (alerts about servers that are going to crash).

The machine learning model will be built by a machine learning specialist so that's completely out of scope. My job would be to integrate the machine learning model in a platform that runs the model and shows the results in a nice dashboard.

My problem is the "big picture" architecture of this system: I see that all the pieces already exist (cloudera+mahout) but I'm missing a simple integrated solution for all my needs, and I don't believe the state of the art is writing custom software...

So, can anyone shed some light on production systems like this (showing data with predictive analysis)? Reference architecture for this? Tutorials/documentation?

Notes:

- I've investigated some related technologies: Cloudera/Hadoop, Pentaho, Mahout and Weka. I know that Pentaho, for example, is able to store big data and run ad-hoc Weka analysis on that data. Using Cloudera and Impala, a data specialist can also run ad-hoc queries and analyse the data, but that's not my goal. I want my system to run the ML model and show the results in a nice dashboard alongside the retrieved data, and I'm looking for a platform that already allows this usage instead of custom building.
- I'm focusing on Pentaho as it seems to have a nice integration of machine learning, but every tutorial I read was more about "ad-hoc" ML analysis than real-time. Any tutorial on that subject would be welcome.
- I don't mind open-source or commercial solutions (with a trial).
- Depending on the specifics, maybe this isn't big data: more "traditional" solutions are also welcome.
- Also, "real time" here is a broad term: if the ML model has good performance, running it every 5 seconds is good enough.
- The ML model is static (it isn't updated in real time and doesn't change its behavior).
- I'm not looking for a customized application for my example, as my focus is on the big picture: generic platforms for big data with predictive analysis.

Disposure answered 6/12, 2012 at 16:7 Comment(1)

Another possible solution for my use case that I'm exploring: Datameer datameer.com/product/data-analytics.html – Disposure 7/12, 2012 at 19:4

(I'm an author of Mahout, and am commercializing a productization of some of the ML in Mahout, with a focus on both real-time and scale: Myrrix. I don't know that it's exactly what you are looking for, but seems to address some of the issues you pose here. It might be useful as another reference point.)

You have highlighted the tension between real-time and large-scale. These aren't the same thing. Hadoop, as a computation environment, scales well but can do nothing in real-time. Part of Mahout is built on Hadoop, so it is also ML of that form. Weka, and the other parts of Mahout, are disposed to be more or less real-time, but then are challenged to scale.

An ML system that does both well necessarily has two layers: scalable offline model-building, with real-time online serving and updates. This is how it should look, IMHO, for recommenders for example: http://myrrix.com/design/

But you don't have any issue with model building, right? Someone's going to build a static model? If so, that makes it much easier. Updating your model in real-time is useful, but it complicates things. If you don't have to, you're just generating predictions out of a static model, which is usually fast.

I don't think Pentaho is relevant if you are interested in ML, or in running something based on your own ML model.

1 query every 5 seconds is not challenging -- is this 1 query per 5 seconds per machine or something?

My advice is to simply create a server that can answer queries against the model. Just reuse any old HTTP server container like Tomcat. It can load the latest model as it is published from some backing store like HDFS or a NoSQL DB. You can create N instances of the server effortlessly as they don't seem to need to communicate.

The only custom code there is whatever you need to wrap your ML model. This is quite a simple problem if you truly don't need to build your own models or update them dynamically. If you do -- harder question but still possible to architect for.
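As a rough illustration of the "server that answers queries against the model" idea, here is a minimal sketch. It uses Python's standard library rather than Tomcat purely for brevity, and everything named in it is an assumption made for the example: the `StaticModel` class, its weights and threshold, the `/predict` path and the JSON payload shape. The real model would come from the specialist and be loaded from whatever backing store it is published to.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class StaticModel:
    """Hypothetical stand-in for the specialist's model: a fixed linear score."""

    def __init__(self, weights, threshold):
        self.weights = weights
        self.threshold = threshold

    def predict(self, metrics):
        # Weighted sum of the metrics the model knows about; missing ones count as 0.
        score = sum(w * metrics.get(name, 0.0) for name, w in self.weights.items())
        return {"score": score, "alert": score >= self.threshold}


# Assumption: a real deployment would load this from HDFS, a database, etc. at startup.
MODEL = StaticModel({"cpu": 0.6, "mem": 0.4}, threshold=0.8)


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            metrics = json.loads(self.rfile.read(length))
            body = json.dumps(MODEL.predict(metrics)).encode("utf-8")
        except (ValueError, AttributeError, TypeError):
            self.send_error(400, "expected a JSON object of numeric metrics")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet


def start_server(port=0):
    """Start the server on a background thread; port 0 lets the OS pick a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), PredictHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Because the handler keeps no state beyond the read-only model, several copies can run side by side behind a load balancer without talking to each other.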

Commix answered 6/12, 2012 at 17:1 Comment(4)

Thanks Sean. The ML will be static and provided. What I want is a "platform" where I can deploy the ML (custom Java program or maybe PMML model) easily, which for me means loading the data, loading the ML, scheduling it to run every X seconds and presenting the results in a nice dashboard with both real-time data and ML results. That's why Pentaho: I guess it does everything except running the ML and presenting its results. Running custom code is my backup plan, but it seems weird to me that there isn't a full-stack platform that already does this, or at least a common "architecture" for doing these things. – Disposure 6/12, 2012 at 19:40

Also, my problem with custom code is just the "dashboarding" part of it, where I can see that Pentaho has some nice dashboards built in. – Disposure 6/12, 2012 at 19:41

I think the "run my custom Java code and display whatever it outputs in a graph" part is not something any platform can provide as it's far too specific to your use case. But the rest is easy. Run a cron job that executes your program periodically and outputs the result to some kind of data store. Then use whatever reporting tool you want to chart it. – Commix 6/12, 2012 at 20:16
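A minimal sketch of that "run periodically, write to a data store, chart it with any reporting tool" flow, assuming SQLite as the store. The `score` and `poll_metrics` functions, the server names, the table layout and the 0.8 threshold are all hypothetical placeholders, not anything from the actual model or monitoring setup.

```python
import sqlite3
import time


def score(metrics):
    """Hypothetical placeholder for the provided static model."""
    return 0.6 * metrics["cpu"] + 0.4 * metrics["mem"]


def poll_metrics():
    """Hypothetical placeholder: return the latest health reading per server."""
    return {
        "web-1": {"cpu": 0.95, "mem": 0.90},
        "web-2": {"cpu": 0.20, "mem": 0.35},
    }


def run_once(conn, now=None, threshold=0.8):
    """Score every server once and append the results to the predictions table."""
    now = time.time() if now is None else now
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions "
        "(ts REAL, server TEXT, score REAL, alert INTEGER)"
    )
    rows = []
    for server, metrics in poll_metrics().items():
        s = score(metrics)
        rows.append((now, server, s, int(s >= threshold)))
    conn.executemany("INSERT INTO predictions VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def run_forever(conn, interval_seconds=5):
    """Standard cron fires at most once a minute, so a 5-second cadence needs a loop."""
    while True:
        run_once(conn)
        time.sleep(interval_seconds)
```

A dashboard or reporting tool can then read the `predictions` table with plain SQL, which keeps the charting side independent of the model code.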

Thanks again @Sean Owen, I'll take that into consideration, but I'll still wait to see if someone can suggest an integrated solution. Pentaho seems to have something like this (link ) but I haven't found a tutorial on how to do the predictive analysis in "real-time" and use it in a dashboard (or in my case run it every X seconds), only tutorials about using Weka for "ad-hoc" analysis by a data scientist. – Disposure 7/12, 2012 at 4:59

You can build your own using a combination of Apache Samza, S4 or Storm for real-time data stream analysis, plugging in a parallelized and distributed version of the machine learning algorithms of your choice. But building large-scale parallel machine learning algorithms is challenging and an area of active research. Lately there have been some advances: you may want to check out Yahoo! Labs SAMOA and Vowpal Wabbit.

Librium answered 31/10, 2013 at 15:22 Comment(0)

-3

Something like NewRelic?

The Stats

New Relic is Application Performance Management (APM) as a Service
175,000+ app processes monitored globally
10,000+ customers
20+ Billion application metrics collected every day
1.7+ Billion web page metrics collected every week
Each "timeslice" metric is about 250 bytes
100k timeslice records inserted every second
7 Billion new rows of data every day
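For a sense of scale, the figures above can be combined in a quick back-of-the-envelope calculation. This assumes the "rows" are the roughly 250-byte timeslice records and ignores indexes, replication and compression, so it is only a rough estimate of raw volume.

```python
BYTES_PER_TIMESLICE = 250               # "about 250 bytes" per timeslice metric
ROWS_PER_DAY = 7_000_000_000            # "7 Billion new rows of data every day"
QUOTED_INSERTS_PER_SECOND = 100_000     # "100k timeslice records inserted every second"
SECONDS_PER_DAY = 24 * 60 * 60

# Average insert rate implied by the daily row count.
avg_rows_per_second = ROWS_PER_DAY / SECONDS_PER_DAY
# Raw payload written per day, before any storage overhead.
raw_bytes_per_day = ROWS_PER_DAY * BYTES_PER_TIMESLICE
# Rows per day if the quoted 100k/s rate were sustained around the clock.
rows_at_quoted_rate = QUOTED_INSERTS_PER_SECOND * SECONDS_PER_DAY

print(f"average insert rate: {avg_rows_per_second:,.0f} rows/s")
print(f"raw volume: {raw_bytes_per_day / 1e12:.2f} TB/day")
print(f"rows/day at a sustained 100k/s: {rows_at_quoted_rate:,}")
```

The daily row count works out to roughly 81,000 rows per second on average and about 1.75 TB of raw data per day, so the quoted 100k inserts per second is presumably a rounded or peak figure rather than a sustained average.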

Architecture

Platform
- Web UI
  - Ruby on Rails
  - nginx
  - Linux
  - 2 @ 12 core Intel Nehalem CPUs w/ 48GB RAM
- Data Collector and Web Beacon Services
  - Java
  - Servlets on Jetty
  - App metrics collector: 180k+ requests per minute, responding in 3ms
  - Web metrics beacon service: 200k+ requests per minute, responding in 0.15ms
  - Sharded MySQL using the Percona build
  - Linux
  - 9 @ 24 core Intel Nehalem w/ 48GB RAM, SAS attached RAID 5
  - Bare metal (no virtualization)

BONUS

More info: http://highscalability.com/blog/2011/7/18/new-relic-architecture-collecting-20-billion-metrics-a-day.html

Irving answered 6/12, 2012 at 16:19 Comment(3)

Thanks for the tip, but that solution doesn't seem to provide any machine learning capabilities. It's a product for monitoring servers, whereas I'm looking more for ML/big data platforms where I can integrate my own ML algorithm, customize dashboards, etc. – Disposure 6/12, 2012 at 16:24

I think they do something similar: "The "collector" service digests app metrics and persists them in the right MySQL shard" – Irving 6/12, 2012 at 17:45

Oh, I got it. You want an existing solution where you can include your own algorithm? – Irving 6/12, 2012 at 17:46
