Why does concatenation of DataFrames get exponentially slower?

Asked 8/4, 2016 at 0:34 Answered 8/4, 2016 at 0:53

Solved python performance pandas concatenation processing-efficiency

I have a function which processes a DataFrame, largely to process data into buckets create a binary matrix of features in a particular column using pd.get_dummies(df[col]).

To avoid processing all of my data using this function at once (which goes out of memory and causes iPython to crash), I have broken the large DataFrame into chunks using:

chunks = (len(df) / 10000) + 1
df_list = np.array_split(df, chunks)

pd.get_dummies(df) will automatically create new columns based on the contents of df[col] and these are likely to differ for each df in df_list.

After processing, I am concatenating the DataFrames back together using:

for i, df_chunk in enumerate(df_list):
    print "chunk", i
    [x, y] = preprocess_data(df_chunk)
    super_x = pd.concat([super_x, x], axis=0)
    super_y = pd.concat([super_y, y], axis=0)
    print datetime.datetime.utcnow()

The processing time of the first chunk is perfectly acceptable, however, it grows per chunk! This is not to do with the preprocess_data(df_chunk) as there is no reason for it to increase. Is this increase in time occurring as a result of the call to pd.concat()?

Please see log below:

chunks 6
chunk 0
2016-04-08 00:22:17.728849
chunk 1
2016-04-08 00:22:42.387693 
chunk 2
2016-04-08 00:23:43.124381
chunk 3
2016-04-08 00:25:30.249369
chunk 4
2016-04-08 00:28:11.922305
chunk 5
2016-04-08 00:32:00.357365

Is there a workaround to speed this up? I have 2900 chunks to process so any help is appreciated!

Open to any other suggestions in Python!

Disinfectant answered 8/4, 2016 at 0:34 Comment(0)

141

Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.

pd.concat returns a new DataFrame. Space has to be allocated for the new DataFrame, and data from the old DataFrames have to be copied into the new DataFrame. Consider the amount of copying required by this line inside the for-loop (assuming each x has size 1):

super_x = pd.concat([super_x, x], axis=0)

| iteration | size of old super_x | size of x | copying required |
|         0 |                   0 |         1 |                1 |
|         1 |                   1 |         1 |                2 |
|         2 |                   2 |         1 |                3 |
|       ... |                     |           |                  |
|       N-1 |                 N-1 |         1 |                N |

1 + 2 + 3 + ... + N = N(N+1)/2. So there is O(N**2) copies required to complete the loop.

Now consider

super_x = []
for i, df_chunk in enumerate(df_list):
    [x, y] = preprocess_data(df_chunk)
    super_x.append(x)
super_x = pd.concat(super_x, axis=0)

Appending to a list is an O(1) operation and does not require copying. Now there is a single call to pd.concat after the loop is done. This call to pd.concat requires N copies to be made, since super_x contains N DataFrames of size 1. So when constructed this way, super_x requires O(N) copies.

Miamiami answered 8/4, 2016 at 0:53 Comment(9)

Hi @unutbu, thanks for the detailed explanation, this really explained the theory in detail! – Disinfectant 8/4, 2016 at 1:0

Is it feasible to concatenate 2900 blocks of this shape, this way (43717, 3261)? The processing step now only takes 10 seconds. – Disinfectant 8/4, 2016 at 1:5

if using concat in a loop, wouldn't deleting the old dataframes in the loop resolve the issue? – Brittabrittain 7/6, 2019 at 2:21

@SantoshGupta7: The issue is about speed, not memory. The peak memory usage is about the same either way. Copying can be a slow operation when the dataframe is large and/or the loop is performed many times. Making O(n^2) copies is unnecessarily slow, since there is an O(n) alternative -- append to a list, concat once after the loop. – Miamiami 7/6, 2019 at 11:59

Minor comment: 1 + 2 + 3 + ... + N = N(N-1)/2 should it not be 1 + 2 + 3 + ... + N = N(N+1)/2 ? – Unbend 22/8, 2019 at 14:39

I think copying isn't the issue here, or copy=False should help. Every concat does quite some postprocessing, even though verify_integrity=False and sort=False. This takes longer as data grows and could be done at once at the end like this answer shows! – Runabout 7/1, 2020 at 14:20

Applying your solution to my program with more than 1.5 M data records resulted in execution time going from 60+ hours to under 1h! And I even understand why...! :-) Thanks! – Reinert 5/2, 2020 at 10:41

Applying this to a Kaggle notebook crunching 1.4m very wide records reduced the execution time from something over 9 hours (timeout) to 25 minutes - thanks! – Medication 28/7, 2020 at 22:48

Thank you for the excellent explanation. Similarly, if there are multiple columns in the final dataset, then we can create a dictionary first, and then convert it to a data frame finally. – Trematode 2/6, 2023 at 0:13

Every time you concatenate, you are returning a copy of the data.

You want to keep a list of your chunks, and then concatenate everything as the final step.

df_x = []
df_y = []
for i, df_chunk in enumerate(df_list):
    print "chunk", i
    [x, y] = preprocess_data(df_chunk)
    df_x.append(x)
    df_y.append(y)

super_x = pd.concat(df_x, axis=0)
del df_x  # Free-up memory.
super_y = pd.concat(df_y, axis=0)
del df_y  # Free-up memory.

Zavras answered 8/4, 2016 at 0:53 Comment(2)

Trying to manage memory manually in such high-level languages like Python is a bad practice because actually you cannot manage memory like in C for example. What happens when you del a variable is that you [remove a binding] (docs.python.org/3.10/reference/…) (third paragraph). Later garbage collector may release memory, but when and what amount depends on the GC algorithm (which is quite complex). – Intercross 14/2, 2022 at 10:38

@Intercross I would only delete if the data had a large memory footprint or available memory was limited. Also, it is easier to reassign, e.g. df_x = pd.concat(df_x, axis=0). – Zavras 15/2, 2022 at 7:5

Recommended topics

Hot tags