I'm using pandas to manage a large array of 8-bit integers. These integers appear as space-delimited elements of a single column in a comma-delimited CSV file, and the array size is about 10000x10000.
Pandas can quickly read the comma-delimited data from the first few columns into a DataFrame, and can just as quickly store the space-delimited strings in another DataFrame with minimal hassle. The trouble comes when I try to transform the table from a single column of space-delimited strings into a DataFrame of 8-bit integers.
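For reference, the reading step itself is painless; it looks roughly like this (the file name and column name are placeholders, not my real ones):
import pandas as pd

# the first few fields are ordinary comma-delimited columns, and 'columnname'
# holds the space-delimited integers as one long string per row
df = pd.read_csv('data.csv')
strdata = df[['columnname']]   # still just a column of strings at this point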
I have tried the following:
intdata = pd.DataFrame(strdata.columnname.str.split().tolist(), dtype='uint8')
But the memory usage is unbearable: roughly 100 MB worth of integers ends up consuming about 2 GB of memory. I'm told this is a limitation of how Python represents the intermediate lists of strings, and that there's nothing I can do about it in this case.
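To make the problem reproducible, here is a synthetic version of the same conversion (much smaller than my real table so it runs quickly, with the same placeholder column name):
import numpy as np
import pandas as pd

# fake the input: each row is 1000 integers joined into one space-delimited string
rows = [' '.join(map(str, np.random.randint(0, 256, 1000))) for _ in range(1000)]
strdata = pd.DataFrame({'columnname': rows})

# the conversion from above: the intermediate list of lists of Python strings
# is where the memory goes
intdata = pd.DataFrame(strdata.columnname.str.split().tolist(), dtype='uint8')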
As a possible workaround, I was advised to save the string data to a CSV file and then reload the CSV file as a DataFrame of space-delimited integers. This works well, but to avoid the slowdown that comes from writing to disk, I tried writing to a StringIO object.
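Concretely, the disk-based version looks like this (continuing from the synthetic strdata above; 'tmp.csv' is just an example path):
# dump the string column to a real file, then parse it back as
# whitespace-delimited uint8 data
strdata.to_csv('tmp.csv', header=False, index=False)
intdata = pd.read_csv('tmp.csv', delim_whitespace=True, header=None, dtype='uint8')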
Here's a minimal non-working example of the StringIO version:
import numpy as np
import pandas as pd
from cStringIO import StringIO
a = np.random.randint(0,256,(10000,10000)).astype('uint8')
b = pd.DataFrame(a)
c = StringIO()
# write the frame into the in-memory buffer instead of a file on disk
b.to_csv(c, delimiter=' ', header=False, index=False)
# try to read it straight back out of the buffer as uint8 data
d = pd.io.parsers.read_csv(c, delimiter=' ', header=None, dtype='uint8')
Which yields the following error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 443, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 228, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 533, in __init__
self._make_engine(self.engine)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 670, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1032, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "parser.pyx", line 486, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4494)
ValueError: No columns to parse from file
Which is puzzling, because if I run the exact same code with 'c.csv' instead of c, the code works perfectly. Also, if I use the following snippet:
with open('c.csv', 'w') as f:
    f.write(c.getvalue())
The CSV file gets saved without any problems, so writing to the StringIO object is not the issue.
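Another quick check (not part of my original script) confirms the buffer really holds the data:
print(len(c.getvalue()))                   # non-zero, so the CSV text was written
print(c.getvalue().splitlines()[0][:60])   # the first row looks as expected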
It is possible that I need to replace c with c.getvalue() in the read_csv line, but when I do that, the interpreter tries to print the contents of c in the terminal! Surely there is a way to work around this.
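To be concrete, the variant I mean is a call like this (just spelling out the substitution):
# passing the buffer's contents instead of the buffer itself; presumably read_csv
# treats the giant string as a file name, which is why it ends up echoed back
d = pd.io.parsers.read_csv(c.getvalue(), delimiter=' ', header=None, dtype='uint8')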
Thanks in advance for the help.