Hadoop and Stata

Asked 3/10, 2013 at 17:41 Answered 10/9, 2015 at 9:7

Does anyone have any experience using Stata and Hadoop? Stata 13 now has a Java Plugin API, so I think it should be straightforward to get them to play nice.

I am particularly interested in being able to parse weblog data to get it into a form suitable for statistical analysis.

This question came up on Statalist recently, but there was no response, so I thought I would try it here where the audience is more likely to have experience with this technology.

Tieback answered 3/10, 2013 at 17:41 Comment(3)

As a long-time Statalist stalwart, I find the comparison, although well meant, a little invidious. My own guess is that you are getting no answer because the answer is "No". – Impresario 4/10, 2013 at 16:36

@Nick Cox I meant no insult. I have a great amount of respect for Statalist and its members. I will change my awkward phrasing. – Tieback 4/10, 2013 at 16:51

Any recent success @DimitriyV.Masterov ? Be sure to let us know how this works out – Arana 21/12, 2014 at 23:25

Dimitry,

I think it would be easier to do something like this using the ELK Stack (http://www.elastic.co). Logstash (the middle layer) has several parsers/tokenizers/analyzers built on the Apache Lucene engine for cleaning and formatting log data, and it can push the resulting data into Elasticsearch. Elasticsearch exposes an HTTP API that you can curl fairly easily to get results (e.g., use insheetjson, pass the HTTP GET request as the URL, and the data should be imported into Stata without much problem).
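If a flat file is easier to work with than JSON on the Stata side, the search response can also be flattened first. Here is a rough sketch in Python, not a tested pipeline. It relies on the fact that an Elasticsearch search response nests each document under `hits.hits[]._source`. The sample response and the field names (`clientip`, `verb`, `request`, `response`, `bytes`) are made up for illustration; the real ones depend on how Logstash is configured.

```python
import csv
import io
import json

# Hypothetical sample of an Elasticsearch search response. The field names
# inside _source are assumptions; yours depend on the Logstash configuration.
SAMPLE_RESPONSE = """
{
  "hits": {
    "hits": [
      {"_source": {"clientip": "10.0.0.1", "verb": "GET",
                   "request": "/index.html", "response": 200, "bytes": 5120}},
      {"_source": {"clientip": "10.0.0.2", "verb": "POST",
                   "request": "/login", "response": 302}}
    ]
  }
}
"""


def hits_to_csv(response_text, columns):
    """Write one CSV row per hit, taking values from each hit's _source."""
    hits = json.loads(response_text)["hits"]["hits"]
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=columns,
        extrasaction="ignore",  # drop fields that are not in `columns`
        lineterminator="\n",
    )
    writer.writeheader()
    for hit in hits:
        # Fields missing from a document are written as empty cells.
        writer.writerow(hit.get("_source", {}))
    return out.getvalue()


csv_text = hits_to_csv(
    SAMPLE_RESPONSE, ["clientip", "verb", "request", "response", "bytes"]
)
print(csv_text)
```

Saved to disk, text like this can be read with `import delimited` in Stata 13. Empty cells come in as missing values.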

I've been trying to cobble together a program to use the Jackson JSON library to build out more robust JSON I/O capabilities from within Stata and would definitely not mind trying to work with others to get it done.

Hope this helps, Billy

Frei answered 10/9, 2015 at 9:7 Comment(0)

I'll take an (un?)educated stab at this. From the looks of the Java API, the caller seems to treat Stata as essentially a datastore. If that's the case, then I would imagine Stata would fit into the Hadoop world as a database and would be accessed through its own InputFormat and OutputFormat. In your specific case, I'd imagine you'd write a StataOutputFormat, which your reducer would use to write the parsed data. The only drawback seems to be your referenced comments that Stata apps tend to be I/O bound, so I don't know that using Hadoop is really going to help you, since

1. you'll have to write all that data anyway, and
2. that write will be I/O bound, whether you use Hadoop or not.
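
Whichever layer ends up doing the write, the per-line parsing step looks much the same. Below is a minimal sketch in Python of turning one weblog line into a tab-delimited record. It assumes the logs are in Apache Common Log Format, which the question does not actually specify. The names `CLF`, `FIELDS`, and `parse_line` are hypothetical. This is only the parsing step, not a StataOutputFormat.

```python
import re

# Assumption: input lines follow Apache Common Log Format, e.g.
# host ident user [time] "method path protocol" status size
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# Columns kept in the output record, in order.
FIELDS = ["host", "time", "method", "path", "status", "size"]


def parse_line(line):
    """Return a tab-delimited record, or None if the line does not match."""
    match = CLF.match(line)
    if match is None:
        return None
    row = match.groupdict()
    if row["size"] == "-":
        # CLF uses "-" when no bytes were sent; emit an empty field instead.
        row["size"] = ""
    return "\t".join(row[name] for name in FIELDS)


sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
record = parse_line(sample)
print(record)
```

Lines that do not match come back as `None`, so the caller can count or skip them. The I/O point above still stands: a function like this only changes what gets written, not how much.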

Zamarripa answered 20/4, 2014 at 1:34 Comment(0)
