Python out of memory on large CSV file (numpy)

I have a 3GB CSV file that I'm trying to read with Python; I need the column-wise medians.

from numpy import * 
def data():
    return genfromtxt('All.csv',delimiter=',')

data = data() # This is where it fails already.

med = zeros(len(data[0]))
data = data.T
for i in xrange(len(data)):
    m = median(data[i])
    med[i] = 1.0/float(m)
print med

The error that I get is this:

Python(1545) malloc: *** mmap(size=16777216) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "Normalize.py", line 40, in <module>
    data = data()
  File "Normalize.py", line 39, in data
    return genfromtxt('All.csv',delimiter=',')
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/numpy/lib/npyio.py", line 1495, in genfromtxt
    for (i, line) in enumerate(itertools.chain([first_line, ], fhd)):
MemoryError

I think it's just an out-of-memory error. I am running 64-bit Mac OS X with 4GB of RAM, and both numpy and Python are compiled in 64-bit mode.

How do I fix this? Should I try a distributed approach, just for the memory management?

Thanks

EDIT: I also tried the following, but no luck:

genfromtxt('All.csv',delimiter=',', dtype=float16)
Calle answered 21/1, 2012 at 21:26 Comment(1)
Use pandas.read_csv; it's significantly faster. – Flay

As other folks have mentioned, for a really large file, you're better off iterating.

However, you do commonly want the entire thing in memory for various reasons.

genfromtxt is much less efficient than loadtxt (though it handles missing data, whereas loadtxt is more "lean and mean", which is why the two functions co-exist).

If your data is very regular (e.g. just simple delimited rows of all the same type), you can also improve on either by using numpy.fromiter.

If you have enough RAM, consider using np.loadtxt('yourfile.txt', delimiter=','). (You may also need to specify skiprows if the file has a header line.)
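For the use case in the question, a minimal sketch of that approach (assuming the whole file fits in RAM; the skiprows=1 is only needed if there's a header line) would be:

import numpy as np

# Load everything into one 2D array, then take the median down each column.
data = np.loadtxt('All.csv', delimiter=',', skiprows=1)
med = np.median(data, axis=0)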

As a quick comparison, loading a ~500MB text file with loadtxt uses ~900MB of RAM at peak, while loading the same file with genfromtxt uses ~2.5GB.

[Figure: Memory and CPU usage of numpy.loadtxt while loading a ~500MB ASCII file]

[Figure: Memory and CPU usage of numpy.genfromtxt while loading the same ~500MB ASCII file]


Alternatively, consider something like the following. It will only work for very simple, regular data, but it's quite fast. (loadtxt and genfromtxt do a lot of guessing and error-checking; if your data is very simple and regular, you can improve on them greatly.)

import numpy as np

def generate_text_file(length=1e6, ncols=20):
    # Write a large CSV of random floats to test with.
    data = np.random.random((int(length), ncols))
    np.savetxt('large_text_file.csv', data, delimiter=',')

def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        # Remember how many columns each row had so we can reshape below.
        iter_loadtxt.rowlength = len(line)

    # np.fromiter builds a flat 1D array directly from the generator,
    # avoiding the intermediate lists that loadtxt/genfromtxt create.
    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data

#generate_text_file()
data = iter_loadtxt('large_text_file.csv')

[Figure: Memory and CPU usage when using fromiter (iter_loadtxt) to load the same ~500MB data file]

Blatant answered 22/1, 2012 at 21:16 Comment(5)
Basically, brute force. :) Here's my shell script, if you're interested: gist.github.com/2447356 It's far from elegant, but it's close enough. – Blatant
Ah, nice! (Although I'll admit I was hoping for import memoryprofile or something, drat!) – Unexperienced
Well, there's heapy (part of guppy: guppy-pe.sourceforge.net), but it doesn't work well for numpy arrays, unfortunately. A shame, though; import memoryprofile would be damned nice! – Blatant
Dear @JoeKington, can you please use a single scale for the Y axes of your graphs, so that the comparison is easier to make visually? – Vapory
IMO, it would be better to compare memory usage against the output array size instead of the file size. For example, to load an 8192x8192 double-precision matrix, an optimal function would only need 512MB (8 * 8192 * 8192 bytes), regardless of how large the text file is. – Sissy

The problem with using genfromtxt() is that it attempts to load the whole file into memory, i.e. into a numpy array. This is great for small files but BAD for 3GB inputs like yours. Since you are just calculating column medians, there's no need to read the whole file at once. A simple, though not the most efficient, way to do it is to read the file line by line multiple times, processing one column per pass, as in the sketch below.
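Here is a minimal sketch of that one-column-per-pass idea (not from the original answer; it assumes a plain numeric CSV with no header, and the column_medians name is just for illustration):

import csv
import numpy as np

def column_medians(filename, delimiter=','):
    # Count the columns from the first row.
    with open(filename) as f:
        ncols = len(next(csv.reader(f, delimiter=delimiter)))
    medians = np.zeros(ncols)
    # One full pass over the file per column: slow, but memory use stays
    # proportional to a single column rather than the whole table.
    for col in range(ncols):
        with open(filename) as f:
            values = np.fromiter((float(row[col]) for row in csv.reader(f, delimiter=delimiter) if row),
                                 dtype=float)
        medians[col] = np.median(values)
    return medians

med = column_medians('All.csv')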

Magenmagena answered 21/1, 2012 at 21:33 Comment(3)
Well, okay. But is there a more sustainable solution to this? In a Java program, for example, you can choose to start it up with, say, 5GB of memory. Is there an equivalent for Python? I mean, next time I might just have a CSV file with a single line of 4GB... – Calle
Python doesn't limit how much memory you can allocate. If you get a MemoryError in 64-bit Python, you really are out of memory. – Entablement
Unfortunately, not all Python modules support 64-bit architectures. – Gorblimey

Why are you not using the Python csv module?

>>> import csv
>>> reader = csv.reader(open('All.csv'))
>>> for row in reader:
...     print row
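For instance (a hypothetical sketch, not part of the original answer), each row the reader yields can be handed straight to numpy:

import csv
import numpy as np

reader = csv.reader(open('All.csv'))
for row in reader:
    values = np.array(row, dtype=float)  # one row at a time as a numpy array
    # ...accumulate whatever per-row or per-column statistics you need here...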
Leopold answered 21/1, 2012 at 21:40 Comment(2)
Because my whole program uses numpy and basic linear algebra; with the csv reader alone I can't do all of that. – Calle
Combined with the answer of kz26, this actually gives a workable workaround. Also funny: after one iteration the file is cached and the process jumps from 60% to 99% CPU. – Calle
