I'm handling a large dataset with about 20,000,000 rows and 4 columns. Unfortunately, the available memory on my machine (~16GB) is not sufficient.
Example (Time is seconds since midnight):
Date Time Price Vol
0 20010102 34222 51.750 227900
1 20010102 34234 51.750 5600
2 20010102 34236 51.875 14400
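To see where the memory actually goes, the per-column footprint can be inspected with memory_usage (a minimal sketch on a toy frame shaped like the example above; the real frame has ~20M rows):

import pandas as pd

# Toy frame shaped like the example above; values are illustrative.
data = pd.DataFrame({
    'Date':  [20010102, 20010102, 20010102],
    'Time':  [34222, 34234, 34236],
    'Price': [51.750, 51.750, 51.875],
    'Vol':   [227900, 5600, 14400],
})

# deep=True also counts the payload of object (string) columns,
# which is where hidden memory usage usually lives.
print(data.memory_usage(deep=True))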
Then I transform the dataset into a proper time-series object:
Date Time Price Vol
2001-01-02 09:30:22 20010102 34222 51.750 227900
2001-01-02 09:30:34 20010102 34234 51.750 5600
2001-01-02 09:30:36 20010102 34236 51.875 14400
2001-01-02 09:31:03 20010102 34263 51.750 2200
To release memory I want to drop the redundant Date and Time columns. I do this with the .drop() method, but the memory is not released. I also tried calling gc.collect() afterwards, but that did not help either. This is the code I use to handle the described actions. The del part releases memory, but the drop part does not.
import numpy as np
import pandas as pd

# Split Time (seconds since midnight) into hour/minute/second components
m, s = divmod(data.Time.values, 60)
h, m = divmod(m, 60)
# Zero-pad each component to two digits as string Series
s, m, h = (pd.Series(np.char.mod('%02d', s)),
           pd.Series(np.char.mod('%02d', m)),
           pd.Series(np.char.mod('%02d', h)))

# Set the time-series index by parsing the concatenated date/time strings
data = data.set_index(pd.to_datetime(data.Date.reset_index(drop=True).apply(str) + h + m + s,
                                     format='%Y%m%d%H%M%S'))

# Remove redundant information
del s, m, h                              # this releases memory...
data.drop('Date', axis=1, inplace=True)  # ...but these two do not
data.drop('Time', axis=1, inplace=True)
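For what it's worth, here is a sketch of how the same index could be built arithmetically instead, avoiding the three intermediate string Series entirely (an untested idea, assuming Date is an integer like 20010102 and Time is seconds since midnight, as in the example):

# Midnight of each date plus the seconds-since-midnight offset;
# no per-row string formatting, concatenation, or parsing.
idx = (pd.to_datetime(data.Date.astype(str), format='%Y%m%d')
       + pd.to_timedelta(data.Time, unit='s'))

# Selecting only the wanted columns yields a new, smaller frame;
# rebinding the name drops the last reference to the old one.
data = data[['Price', 'Vol']].set_index(idx)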
How can I release the memory held by the pandas DataFrame?
Have you tried, instead of data.drop(), data = data[cols], where 'cols' are the columns you want to keep? Assuming you're using IPython, you may also want to explore %reset, %reset_selective, and %xdel. Also, and more generally, you might want to think about doing more of this in numpy arrays and not putting it into a pandas dataframe until the end (as numpy gives you finer control over views and copies, plus it can be much faster in some cases). – Hinton

You could also save memory by downcasting the numeric columns, e.g. data.Vol.astype(np.int32) and data.Price.astype(np.float32). – Hinton

You might also look into multiprocessing. – Afar
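To illustrate the suggestions in the comments, a minimal sketch of keeping only the wanted columns and downcasting the dtypes (toy data; the exact savings depend on the real dtypes):

import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Date':  [20010102, 20010102, 20010102],
    'Time':  [34222, 34234, 34236],
    'Price': [51.750, 51.750, 51.875],
    'Vol':   [227900, 5600, 14400],
})

before = data.memory_usage(deep=True).sum()

# Keep only the wanted columns; rebinding the name releases the old frame.
data = data[['Price', 'Vol']]

# Downcast to narrower dtypes, as suggested in the comments.
data = data.assign(Price=data.Price.astype(np.float32),
                   Vol=data.Vol.astype(np.int32))

after = data.memory_usage(deep=True).sum()
print(before, after)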