Python out of memory on large CSV file (numpy)

I have a 3GB CSV file that I'm trying to read with Python; I need the column-wise medians.

from numpy import * 
def data():
    return genfromtxt('All.csv',delimiter=',')

data = data() # This is where it fails already.

med = zeros(len(data[0]))
data = data.T
for i in xrange(len(data)):
    m = median(data[i])
    med[i] = 1.0/float(m)
print med

The error that I get is this:

Python(1545) malloc: *** mmap(size=16777216) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "Normalize.py", line 40, in <module>
    data = data()
  File "Normalize.py", line 39, in data
    return genfromtxt('All.csv',delimiter=',')
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/numpy/lib/npyio.py", line 1495, in genfromtxt
    for (i, line) in enumerate(itertools.chain([first_line, ], fhd)):
MemoryError

I think it's just an out-of-memory error. I am running 64-bit Mac OS X with 4GB of RAM, and both numpy and Python are compiled in 64-bit mode.

How do I fix this? Should I try a distributed approach, just for the memory management?

Thanks

EDIT: I also tried the following, but no luck:

genfromtxt('All.csv',delimiter=',', dtype=float16)
Calle answered 21/1, 2012 at 21:26 Comment(1)
Use pandas.read_csv; it's significantly faster. – Flay

As other folks have mentioned, for a really large file, you're better off iterating.

However, you do commonly want the entire thing in memory for various reasons.

genfromtxt is much less efficient than loadtxt (though it handles missing data, whereas loadtxt is more "lean and mean", which is why the two functions co-exist).

If your data is very regular (e.g. just simple delimited rows of all the same type), you can also improve on either by using numpy.fromiter.

If you have enough RAM, consider using np.loadtxt('yourfile.txt', delimiter=','). (You may also need to specify skiprows if the file has a header line.)
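For the use case in the question, a minimal sketch of that approach (assuming the whole file fits in RAM; the skiprows=1 is only needed if there's a header line) would be:

import numpy as np

# Load everything into one 2D array, then take the median down each column.
data = np.loadtxt('All.csv', delimiter=',', skiprows=1)
med = np.median(data, axis=0)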

As a quick comparison, loading a ~500MB text file with loadtxt uses ~900MB of RAM at peak, while loading the same file with genfromtxt uses ~2.5GB.

[Figure: Memory and CPU usage of numpy.loadtxt while loading a ~500MB ASCII file]

[Figure: Memory and CPU usage of numpy.genfromtxt while loading the same ~500MB ASCII file]


Alternatively, consider something like the following. It will only work for very simple, regular data, but it's quite fast. (loadtxt and genfromtxt do a lot of guessing and error-checking; if your data is very simple and regular, you can improve on them greatly.)

import numpy as np

def generate_text_file(length=1e6, ncols=20):
    # Write a large CSV of random floats to test with.
    data = np.random.random((int(length), ncols))
    np.savetxt('large_text_file.csv', data, delimiter=',')

def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        # Remember how many columns each row had so we can reshape below.
        iter_loadtxt.rowlength = len(line)

    # np.fromiter builds a flat 1D array directly from the generator,
    # avoiding the intermediate lists that loadtxt/genfromtxt create.
    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data

#generate_text_file()
data = iter_loadtxt('large_text_file.csv')

[Figure: Memory and CPU usage when using fromiter (iter_loadtxt) to load the same ~500MB data file]

Blatant answered 22/1, 2012 at 21:16 Comment(5)
Basically, brute force. :) Here's my shell script, if you're interested: gist.github.com/2447356 It's far from elegant, but it's close enough. – Blatant
Ah, nice! (Although I'll admit I was hoping for import memoryprofile or something, drat!) – Unexperienced
Well, there's heapy (part of guppy: guppy-pe.sourceforge.net), but it doesn't work well for numpy arrays, unfortunately. A shame, though; import memoryprofile would be damned nice! – Blatant
Dear @JoeKington, can you please use a single scale for the Y axes of your graphs, so that the comparison is easier to make visually? – Vapory
IMO, it would be better to compare memory usage against the output array size instead of the file size. For example, to load an 8192x8192 double-precision matrix, an optimal function would only need 512MB (8 * 8192 * 8192 bytes), regardless of how large the text file is. – Sissy

The problem with using genfromtxt() is that it attempts to load the whole file into memory, i.e. into a numpy array. This is great for small files but BAD for 3GB inputs like yours. Since you are just calculating column medians, there's no need to read the whole file at once. A simple, though not the most efficient, way to do it is to read the file line by line multiple times, processing one column per pass, as in the sketch below.
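Here is a minimal sketch of that one-column-per-pass idea (not from the original answer; it assumes a plain numeric CSV with no header, and the column_medians name is just for illustration):

import csv
import numpy as np

def column_medians(filename, delimiter=','):
    # Count the columns from the first row.
    with open(filename) as f:
        ncols = len(next(csv.reader(f, delimiter=delimiter)))
    medians = np.zeros(ncols)
    # One full pass over the file per column: slow, but memory use stays
    # proportional to a single column rather than the whole table.
    for col in range(ncols):
        with open(filename) as f:
            values = np.fromiter((float(row[col]) for row in csv.reader(f, delimiter=delimiter) if row),
                                 dtype=float)
        medians[col] = np.median(values)
    return medians

med = column_medians('All.csv')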

Magenmagena answered 21/1, 2012 at 21:33 Comment(3)
Well, okay. But is there a more sustainable solution to this? In a Java program, for example, you can choose to start it up with, say, 5GB of memory. Is there an equivalent for Python? I mean, next time I might just have a CSV file with a single line of 4GB... – Calle
Python doesn't limit how much memory you can allocate. If you get a MemoryError in 64-bit Python, you really are out of memory. – Entablement
Unfortunately, not all Python modules support 64-bit architectures. – Gorblimey

Why are you not using the Python csv module?

>>> import csv
>>> reader = csv.reader(open('All.csv'))
>>> for row in reader:
...     print row
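For instance (a hypothetical sketch, not part of the original answer), each row the reader yields can be handed straight to numpy:

import csv
import numpy as np

reader = csv.reader(open('All.csv'))
for row in reader:
    values = np.array(row, dtype=float)  # one row at a time as a numpy array
    # ...accumulate whatever per-row or per-column statistics you need here...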
Leopold answered 21/1, 2012 at 21:40 Comment(2)
Because my whole program uses numpy and basic linear algebra; with the csv reader alone I can't do all of that. – Calle
Combined with the answer of kz26, this actually gives a workable workaround. Also funny: after one iteration the file is cached and the process jumps from 60% to 99% CPU. – Calle
