Are there any distributed machine learning libraries for using Python with Hadoop? [closed]
I have set myself up with Amazon Elastic MapReduce in order to perform various standard machine learning tasks. I have used Python extensively for local machine learning in the past, and I do not know Java.

As far as I can tell, there are no well-developed Python libraries for distributed machine learning. Java, on the other hand, has Apache Mahout and the more recent Oryx from Cloudera.

Essentially, it seems I have to choose between two options: slog through parallelising my own algorithms to use with Hadoop Streaming or one of the Python wrappers for Hadoop until decent libraries exist, or jump ship to Java so that I can use Mahout/Oryx. There is a world of difference between writing your own MapReduce word-count code and writing your own MapReduce SVM, even with the help of great tutorials like this one.
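To make that gap concrete: the word-count end of the spectrum is just a pair of stdin/stdout scripts run with the hadoop-streaming jar, something like the sketch below, while anything SVM-like means hand-building the distributed optimisation on top of the same plumbing.

# mapper.py - emits "word<TAB>1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

# reducer.py - Hadoop Streaming sorts mapper output by key,
# so all counts for a given word arrive on consecutive lines
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))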

I don't know which is the wiser choice, so my question is:

A) Is there some Python library I have missed which would be useful? If not, do you know if there are any in development which will be useful in the near future?

B) If the answer to the above is no then would my time be better spent jumping ship to Java?

Purvey answered 9/1, 2013 at 11:3 Comment(4)
Check out: #4819937 - Mortgage
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow, as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it. - Fain
Head over to Software Recommendations once it opens. - Phlyctena
Spark + PySpark + MLlib is the way to go now. - Overcloud

I do not know of any library that could be used natively in Python for machine learning on Hadoop, but an easy solution would be to use the jpype module, which basically allows you to interact with Java from within your Python code.

You can for example start a JVM like this:

import jpype

jvm_started = False

def start_jpype(classpath):
    # Start the JVM exactly once, pointing it at the jars you need (e.g. Mahout's)
    global jvm_started
    if not jvm_started:
        cpopt = "-Djava.class.path={cp}".format(cp=classpath)
        jpype.startJVM(jpype.getDefaultJVMPath(), "-ea", cpopt)
        jvm_started = True

There is a very good tutorial on the topic here, which explains how to use k-means clustering from your Python code via Mahout.

Pontic answered 9/1, 2013 at 19:10 Comment(1)
Link to tutorial doesn't work for me, brings me to a sign-in page. - Madiemadigan

Answers to the questions:

  1. To my knowledge, no: Python has an extensive collection of machine-learning modules and of MapReduce modules, but not ML+MR combined.

  2. I would say yes. Since you are a heavy programmer, you should be able to pick up Java fairly fast, as long as you don't get involved with those nasty (sorry, no offense) J2EE frameworks.

Yeomanly answered 21/1, 2014 at 19:27 Comment(0)

I would recommend using Java when you are using EMR.

First, and simply: it's the way EMR was designed to work. If you're going to play in Windows, you write in C#; if you're making a web service on Apache, you use PHP; when you're running MapReduce Hadoop jobs on EMR, you use Java.

Second, all the tools are there for you in Java, like the AWS SDK. I regularly develop MapReduce jobs for EMR quickly with the help of NetBeans, Cygwin (when on Windows), and s3cmd (in Cygwin). I use NetBeans to build my MR jar, and Cygwin + s3cmd to copy it to my S3 directory to be run by EMR. I then also write a program using the AWS SDK to launch my EMR cluster with my config and to run my jar.
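(For the Python-inclined, the same launch step can also be scripted with boto, the AWS SDK for Python of that era; a rough sketch, with the bucket names and paths made up:)

from boto.emr.connection import EmrConnection
from boto.emr.step import JarStep

conn = EmrConnection("<aws-access-key>", "<aws-secret-key>")

# Run a custom jar previously copied to S3 (e.g. with s3cmd)
step = JarStep(name="my-mr-job",
               jar="s3n://my-bucket/jars/my-mr-job.jar",
               step_args=["s3n://my-bucket/input/", "s3n://my-bucket/output/"])

jobid = conn.run_jobflow(name="my-emr-cluster",
                         log_uri="s3n://my-bucket/logs/",
                         steps=[step],
                         num_instances=3,
                         master_instance_type="m1.small",
                         slave_instance_type="m1.small")
print(jobid)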

Third, there are many Hadoop debugging tools for Java (though they usually need a Mac or Linux OS to work).

Please see here for creating a new NetBeans project with Maven for Hadoop.

Overcloud answered 23/1, 2014 at 0:22 Comment(0)

This blog post provides a fairly comprehensive review of the Python frameworks for working with Hadoop:

http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/

including:

Hadoop Streaming

mrjob

dumbo

hadoopy

pydoop

and this post provides a working example of parallelized ML with Python and Hadoop:

http://atbrox.com/2010/02/08/parallel-machine-learning-for-hadoopmapreduce-a-python-example/
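Of those, mrjob is probably the lowest-friction starting point; a minimal sketch of its canonical word count, runnable locally or on EMR with -r emr:

from mrjob.job import MRJob

class MRWordCount(MRJob):
    # One map call per input line; emit (word, 1) pairs
    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    # mrjob groups values by key before calling the reducer
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()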

Blancablanch answered 23/1, 2014 at 7:58 Comment(0)

A) No

B) No

What you actually want to do is jump ship to Scala, and if you want to do any hardcore ML then you also want to forget about Hadoop and jump ship to Spark. Hadoop is a MapReduce framework, but ML algorithms do not necessarily map onto this dataflow structure, as they are often iterative. This means many ML algorithms result in a large number of MapReduce stages, and each stage carries the huge overhead of reading from and writing to disk.

Spark is an in-memory distributed framework that allows data to stay in memory, increasing speed by orders of magnitude.
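(Even without leaving Python, as a comment above notes, PySpark exposes Spark's MLlib directly; a minimal sketch of training a classifier, assuming a whitespace-delimited training file at a made-up HDFS path:)

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="LogisticRegressionSketch")

# Each line: "<label> <feature1> <feature2> ...", e.g. "1 0.5 1.2 0.3"
def parse_point(line):
    values = [float(x) for x in line.split()]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("hdfs:///path/to/training_data.txt").map(parse_point).cache()

# The iterative optimisation reuses the cached RDD instead of rereading disk
model = LogisticRegressionWithSGD.train(data, iterations=100)
print(model.predict([0.5, 1.2, 0.3]))
sc.stop()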

Now, Scala is a best-of-all-worlds language, especially for Big Data and ML. It's not dynamically typed, yet it has type inference and implicit conversions, and it's significantly more concise than Java and Python. This means you can write code very fast in Scala, and moreover that code is readable and maintainable.

Finally, Scala is functional and naturally lends itself to mathematics and parallelization. This is why all the serious cutting-edge work for Big Data and ML is being done in Scala; e.g. Scalding, Scoobi, Scrunch, and Spark. Crufty Python & R code will be a thing of the past.

Unfriendly answered 23/1, 2014 at 10:25 Comment(0)
