Processing large amounts of data in Python

I have been trying to process a good chunk of data (a few GBs), but my personal computer struggles to do it in a reasonable time span, so I was wondering what options I have. I was using Python's csv.reader, but it was painfully slow even to fetch 200,000 lines. I then migrated the data to an SQLite database, which retrieved results a bit faster and without using so much memory, but slowness was still a major issue.

So, again... what options do I have to process this data? I was wondering about using Amazon's spot instances, which seem useful for this kind of purpose, but maybe there are other solutions to explore.

Supposing that spot instances are a good option, and considering I have never used them before, what can I expect from them? Does anyone have experience using them for this kind of thing? If so, what is your workflow? I thought I could find a few blog posts detailing workflows for scientific computing, image processing or that kind of thing, but I didn't find anything, so if you can explain a bit of that or point me to some links, I'd appreciate it.

Thanks in advance.

Pomelo answered 22/9, 2012 at 18:45 Comment(11)
In my opinion, it'll almost certainly be faster in terms of wall/calendar time to process a few GB locally than to learn, code, deploy, and process it elsewhere. Slow code is slow code, whether it's on your machine or not, and splitting work across machines adds a lot more complexity. EMR eases much of the pain, but still.Suggestible
Well, in my situation even the simpler manipulations are extremely slow to the point of being unusable. I suggested Amazon's instances because I have been reading about them for a few weeks. They look appropriate for this use but I wanted to know if there are alternatives because AWS services definitely have a steep learning curve.Pomelo
What kind of processing are you doing? Streaming reads tend to be very fast, even on rotating hard drives -- that's why distributed computing tools (map/reduce et al) usually involve phrasing your problem in a way that each step can be addressed by sequential scans.Suggestible
Actually, it's basically string manipulations, running some machine learning algorithms, that kind of stuff.Pomelo
Where is the bottleneck -- is it reading, or is it somewhere in the algorithmic part of the code?Senescent
Python is not supposed to be fast. Using it to process large amounts of data just sounds like a bad idea.Overwork
Reading is linear in the number of records, provided you're doing online processing rather than trying to fit the whole of your data into memory. Unless you have to run this processing repeatedly, you should be able to do that on your local machine, even if it takes a while (see the streaming sketch after these comments).Relevance
It might be a good idea to start by profiling your code to make sure there're no memory leaks, by the way.Relevance
No memory leaks. The issue is speed rather than memory. Just to exemplify the problem, it can take more than 20 minutes just to output the number of rows I want to process. Absurdly slow.Pomelo
Do you intend to use the data more than once? It may be beneficial to read the data and save it in a binary format, which can be later read and manipulated faster. The conversion can consume quite some time, which is why it is only useful if the data as a whole is accessed more than once.Biconcave
@MickeyDiamant Good tip. I don't know how much it will help but definitely something to try.Pomelo
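To illustrate the streaming approach suggested in the comments above, here is a minimal sketch that counts rows while holding only one row in memory at a time. The file name is a placeholder, and the real per-row work would go inside the loop.

import csv

row_count = 0
with open('data.csv', newline='') as f:   # placeholder path to the CSV file
    for row in csv.reader(f):             # streams one row at a time
        row_count += 1                    # replace with real per-row processing
print(row_count)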

I would try to use numpy to work with your large datasets locally. Numpy arrays should use less memory compared to csv.reader, and computation times should be much faster when using vectorised numpy functions.
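As a rough illustration of the vectorisation point, here is a sketch that assumes the file contains purely numeric columns (the path, the header handling and the filtering threshold are all made up):

import numpy as np

# load a purely numeric CSV (path and header row are assumptions)
data = np.genfromtxt('data.csv', delimiter=',', skip_header=1)

# vectorised operations run in compiled code instead of a Python-level loop
col_means = data.mean(axis=0)          # per-column means in one call
filtered = data[data[:, 0] > 100.0]    # boolean-mask filter on the first column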

However, there may be a memory problem when reading the file: numpy.loadtxt and numpy.genfromtxt also consume a lot of memory when reading files. If this is a problem, some (brand new) alternative parser engines are compared here. According to this post, the parser in the new pandas (a library built on top of numpy) seems to be an option.
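For example, if the file does not fit comfortably in memory, pandas can also read the CSV in chunks. A sketch follows; the path and chunk size are placeholders:

import pandas as pd

total = 0
# chunksize makes read_csv return an iterator of DataFrames
for chunk in pd.read_csv('data.csv', chunksize=100000):
    total += len(chunk)   # replace with real per-chunk processing
print(total)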

As mentioned in the comments, I would also suggest storing your data in a binary format like HDF5 once you have read your files. Loading the data from an HDF5 file is really fast in my experience (it would be interesting to know how fast it is compared to sqlite in your case). The simplest way I know to save your array as HDF5 is with pandas:

import pandas as pd

# read the CSV once (pass whatever parsing options you need)
data = pd.read_csv(filename)

# write the DataFrame to an HDF5 store on disk
store = pd.HDFStore('data.h5')
store['mydata'] = data
store.close()

Loading your data back is then as simple as:

import pandas as pd

store = pd.HDFStore('data.h5')
data = store['mydata']   # reads the DataFrame back from disk
store.close()

 
Cinquecento answered 10/10, 2012 at 9:18 Comment(1)
Wonderful answer. I have no prior experience using pandas but I'm looking forward to learning about it. I will wait a couple of days before accepting this answer but this looks very satisfactory. Thanks a lot!Pomelo

If you have to use Python, you can try dumbo, which allows you to run Hadoop programs in Python. It's very easy to start with. Then you can write your own Hadoop streaming code to process your big data. Do check its short tutorial: https://github.com/klbostee/dumbo/wiki/Short-tutorial

A similar tool from Yelp is mrjob: https://github.com/Yelp/mrjob
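To give a feel for what an mrjob job looks like, here is a minimal sketch following the classic counting pattern; nothing here is specific to the question's data, and the input file name is hypothetical:

from mrjob.job import MRJob

class MRLineCount(MRJob):
    # the mapper is called once per input line
    def mapper(self, _, line):
        yield 'lines', 1

    # the reducer sums the counts emitted for each key
    def reducer(self, key, counts):
        yield key, sum(counts)

if __name__ == '__main__':
    MRLineCount.run()

Run it locally with python mr_line_count.py data.csv, or point it at a cluster with mrjob's -r hadoop or -r emr runners.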

Maidservant answered 19/10, 2012 at 7:10 Comment(1)
Interesting. I haven't delved into MapReduce tasks but I will look into this. Thanks!Pomelo
