Pandas: garbage-collect drop'ped columns to release memory
I'm handling a large dataset with about 20,000,000 rows and 4 columns. Unfortunately, the available memory on my machine (~16GB) is not sufficient.

Example (Time is seconds since midnight):

           Date   Time   Price     Vol
0      20010102  34222  51.750  227900
1      20010102  34234  51.750    5600
2      20010102  34236  51.875   14400

Then I transform the dataset into a proper time-series object:

                         Date   Time   Price     Vol
2001-01-02 09:30:22  20010102  34222  51.750  227900
2001-01-02 09:30:34  20010102  34234  51.750    5600
2001-01-02 09:30:36  20010102  34236  51.875   14400
2001-01-02 09:31:03  20010102  34263  51.750    2200

To release memory, I want to drop the redundant Date and Time columns. I do this with the .drop() method, but the memory is not released. I also tried calling gc.collect() afterwards, but that did not help either.

This is the code I use for the steps described above. The del part releases memory, but the drop part does not.

import numpy as np
import pandas as pd

# Split the seconds-since-midnight Time column into zero-padded
# hour/minute/second strings
m, s = divmod(data.Time.values, 60)
h, m = divmod(m, 60)
s, m, h = (pd.Series(np.char.mod('%02d', s)),
           pd.Series(np.char.mod('%02d', m)),
           pd.Series(np.char.mod('%02d', h)))

# Set the time-series index from the concatenated date and time strings
data = data.set_index(pd.to_datetime(
    data.Date.reset_index(drop=True).apply(str) + h + m + s,
    format='%Y%m%d%H%M%S'))

# Remove redundant information
del s, m, h
data.drop('Date', axis=1, inplace=True)
data.drop('Time', axis=1, inplace=True)
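
For reference, one quick way to see where the memory actually goes is DataFrame.memory_usage (just a sketch; deep=True also counts object-dtype contents such as strings):

print(data.memory_usage(deep=True))        # bytes per column, plus the index
print(data.memory_usage(deep=True).sum())  # total bytes held by the frame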

How can I release the memory from the pandas data frame?

Henton asked 18/7, 2015 at 13:28 Comment(8)
Have you thought of breaking your large dataset into several smaller ones, e.g., along the lines of the MapReduce programming model?Midrib
Is your dataset weekly tick data? If so, try breaking it down into separate day-by-day sets.Fadden
Actually, we store the tick data on a server running MonetDB. This is the code I use to retrieve data from the server and align it to a regular time grid (e.g. one price every 15 seconds). In this particular case a colleague wants to have data sampled every second, which exceeds the available memory.Henton
@Henton Is the timespan of the data just one day? I've seen tick data of cash equities for the top 34% liquid symbols; it's only about 5-6 million rows per day.Fadden
@JianxunLi No, this is the full dataset: 14 years of data for one asset. If releasing memory is not possible I will consider retrieving the years separately, but I would much prefer to alter the existing code.Henton
What if, instead of data.drop(), you did data = data[cols], where cols are the columns you want to keep? Assuming you're using IPython, you may want to explore %reset, %reset_selective, and %xdel. Also, more generally, you might want to think about doing more of this in numpy arrays and not putting things into a pandas dataframe until the end (numpy gives you finer control over views and copies, and it can be much faster in some cases).Hinton
You could also consider storing values with 32 bits instead of 64 (the pandas default) if your values are small enough, e.g. data.Vol.astype(np.int32) and data.Price.astype(np.float32); see the sketch after these comments.Hinton
My advice on preventing memory leaks in pandas from intermediate columns is to run the reader and conversion code under Python multiprocessing.Afar
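
Putting the last two suggestions together (keeping only the needed columns, and 32-bit dtypes), a minimal sketch; the column names come from the example above, and the 32-bit sizes assume the actual values fit those ranges:

import numpy as np

# Keep only the columns that are still needed; rebinding `data` lets the old
# frame be garbage-collected once nothing else references it
data = data[['Price', 'Vol']]

# Downcast from the 64-bit pandas defaults
data = data.astype({'Price': np.float32, 'Vol': np.int32})
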
del data['Date']
del data['Time']

This releases the memory.
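
If you prefer a single statement, a sketch of the equivalent rebinding approach (the old frame can only be reclaimed once nothing else still references it):

# drop() without inplace=True returns a new DataFrame; rebinding `data`
# releases the reference to the old frame so the garbage collector can
# reclaim it
data = data.drop(['Date', 'Time'], axis=1)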

Dulciana answered 26/9, 2019 at 9:55 Comment(0)
