Pandas: garbage-collect drop'ped columns to release memory
I'm handling a large dataset with about 20,000,000 rows and 4 columns. Unfortunately, the available memory on my machine (~16GB) is not sufficient.

Example (Time is seconds since midnight):

           Date   Time   Price     Vol
0      20010102  34222  51.750  227900
1      20010102  34234  51.750    5600
2      20010102  34236  51.875   14400

Then I transform the dataset into a proper time-series object:

                         Date   Time   Price     Vol
2001-01-02 09:30:22  20010102  34222  51.750  227900
2001-01-02 09:30:34  20010102  34234  51.750    5600
2001-01-02 09:30:36  20010102  34236  51.875   14400
2001-01-02 09:31:03  20010102  34263  51.750    2200

To release memory, I want to drop the redundant Date and Time columns. I do this with the .drop() method, but the memory is not released. I also tried calling gc.collect() afterwards, but that did not help either.

This is the code I use for the steps described above. The del part releases memory, but the drop part does not.

import numpy as np
import pandas as pd

# Split the seconds-since-midnight Time column into zero-padded
# hour/minute/second strings
m, s = divmod(data.Time.values, 60)
h, m = divmod(m, 60)
s, m, h = (pd.Series(np.char.mod('%02d', s)),
           pd.Series(np.char.mod('%02d', m)),
           pd.Series(np.char.mod('%02d', h)))

# Set the time-series index from the concatenated date and time strings
data = data.set_index(pd.to_datetime(
    data.Date.reset_index(drop=True).apply(str) + h + m + s,
    format='%Y%m%d%H%M%S'))

# Remove redundant information
del s, m, h
data.drop('Date', axis=1, inplace=True)
data.drop('Time', axis=1, inplace=True)
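
For reference, one quick way to see where the memory actually goes is DataFrame.memory_usage (just a sketch; deep=True also counts object-dtype contents such as strings):

print(data.memory_usage(deep=True))        # bytes per column, plus the index
print(data.memory_usage(deep=True).sum())  # total bytes held by the frame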

How can I release the memory from the pandas data frame?

Henton asked 18/7, 2015 at 13:28 Comment(8)
Have you thought of breaking your large dataset into several smaller ones, e.g., along the lines of the MapReduce programming model?Midrib
Is your dataset weekly tick data? If so, try breaking it down into separate day-by-day sets.Fadden
Actually, we store the tick data on a server running MonetDB. This is the code I use to retrieve data from the server and align it to a regular time grid (e.g. one price every 15 seconds). In this particular case a colleague wants to have data sampled every second, which exceeds the available memory.Henton
@Henton Is the timespan of the data just one day? I've seen tick data of cash equities for the top 34% liquid symbols; it's only about 5-6 million rows per day.Fadden
@JianxunLi No, this is the full dataset: 14 years of data for one asset. If releasing memory is not possible I will consider retrieving the years separately, but I would much prefer to alter the existing code.Henton
What if, instead of data.drop(), you did data = data[cols], where cols are the columns you want to keep? Assuming you're using IPython, you may want to explore %reset, %reset_selective, and %xdel. Also, more generally, you might want to think about doing more of this in numpy arrays and not putting things into a pandas dataframe until the end (numpy gives you finer control over views and copies, and it can be much faster in some cases).Hinton
You could also consider storing values with 32 bits instead of 64 (the pandas default) if your values are small enough, e.g. data.Vol.astype(np.int32) and data.Price.astype(np.float32); see the sketch after these comments.Hinton
My advice on preventing memory leaks in pandas from intermediate columns is to run the reader and conversion code under Python multiprocessing.Afar
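
Putting the last two suggestions together (keeping only the needed columns, and 32-bit dtypes), a minimal sketch; the column names come from the example above, and the 32-bit sizes assume the actual values fit those ranges:

import numpy as np

# Keep only the columns that are still needed; rebinding `data` lets the old
# frame be garbage-collected once nothing else references it
data = data[['Price', 'Vol']]

# Downcast from the 64-bit pandas defaults
data = data.astype({'Price': np.float32, 'Vol': np.int32})
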
del data['Date']
del data['Time']

This releases the memory.
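
If you prefer a single statement, a sketch of the equivalent rebinding approach (the old frame can only be reclaimed once nothing else still references it):

# drop() without inplace=True returns a new DataFrame; rebinding `data`
# releases the reference to the old frame so the garbage collector can
# reclaim it
data = data.drop(['Date', 'Time'], axis=1)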

Dulciana answered 26/9, 2019 at 9:55 Comment(0)
