Improve Pandas Merge performance

I don't specifically have a performance issue with Pandas merge, as other posts suggest, but I have a class with a lot of methods, and those methods do a lot of merges on datasets.

The class has around 10 groupby calls and around 15 merges. While groupby is pretty fast, out of the class's total execution time of 1.5 seconds, around 0.7 seconds goes into those 15 merge calls.

I want to speed up those merge calls. Since I will have around 4,000 iterations, saving 0.5 seconds per iteration would reduce the overall runtime by around 30 minutes, which would be great.

Any suggestions on what I should try? I have tried Cython and Numba, and Numba was slower.

Thanks

Edit 1: Adding sample code snippets. My merge statements:

tmpDf = pd.merge(self.data, t1, on='APPT_NBR', how='left')
tmpDf = pd.merge(tmpDf, t2, on='APPT_NBR', how='left')
tmpDf = pd.merge(tmpDf, t3, on='APPT_NBR', how='left')
tmpDf = pd.merge(tmpDf, t4, on='APPT_NBR', how='left')
tmpDf = pd.merge(tmpDf, t5, on='APPT_NBR', how='left')
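
The same chain can be written compactly with functools.reduce; this is just a sketch equivalent to the statements above (run inside the same method, where self.data and t1..t5 are in scope), not a different algorithm:

from functools import reduce
import pandas as pd

# Fold the five lookup frames into self.data with successive left merges.
frames = [t1, t2, t3, t4, t5]
tmpDf = reduce(
    lambda left, right: pd.merge(left, right, on='APPT_NBR', how='left'),
    frames,
    self.data,
)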

And by implementing joins, I use the following statements:

dat = self.data.set_index('APPT_NBR')

t1.set_index('APPT_NBR', inplace=True)
t2.set_index('APPT_NBR', inplace=True)
t3.set_index('APPT_NBR', inplace=True)
t4.set_index('APPT_NBR', inplace=True)
t5.set_index('APPT_NBR', inplace=True)

tmpDf = dat.join(t1, how='left')
tmpDf = tmpDf.join(t2, how='left')
tmpDf = tmpDf.join(t3, how='left')
tmpDf = tmpDf.join(t4, how='left')
tmpDf = tmpDf.join(t5, how='left')

tmpDf.reset_index(inplace=True)
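
Since DataFrame.join also accepts a list of frames, the five joins can presumably be collapsed into a single call; a sketch, assuming t1..t5 are indexed on APPT_NBR as above and have non-overlapping column names:

# One call instead of five chained joins; pandas handles the list internally.
tmpDf = dat.join([t1, t2, t3, t4, t5], how='left')
tmpDf.reset_index(inplace=True)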

Note: all of these are part of a method named merge_earlier_created_values.

I timed the method using timedcall from profilehooks:

@timedcall(immediate=True)
def merge_earlier_created_values(self):

and profiled it using:

@profile(immediate=True)
def merge_earlier_created_values(self):

The profile of the method when using merge is as follows:

*** PROFILER RESULTS ***
merge_earlier_created_values (E:\Projects\Predictive Inbound Cartoon     Estimation-MLO\Python\CodeToSubmit\helpers\get_prev_data_by_date.py:122)
function called 1 times

     71665 function calls (70588 primitive calls) in 0.524 seconds

Ordered by: cumulative time, internal time, call count
List reduced from 563 to 40 due to restriction <40>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.012    0.012    0.524    0.524 get_prev_data_by_date.py:122(merge_earlier_created_values)
   14    0.000    0.000    0.285    0.020 generic.py:1901(_update_inplace)
   14    0.000    0.000    0.285    0.020 generic.py:1402(_maybe_update_cacher)
   19    0.000    0.000    0.284    0.015 generic.py:1492(_check_setitem_copy)
    7    0.283    0.040    0.283    0.040 {built-in method gc.collect}
   15    0.000    0.000    0.181    0.012 generic.py:1842(drop)
   10    0.000    0.000    0.153    0.015 merge.py:26(merge)
   10    0.000    0.000    0.140    0.014 merge.py:201(get_result)
  8/4    0.000    0.000    0.126    0.031 decorators.py:65(wrapper)
    4    0.000    0.000    0.126    0.031 frame.py:3028(drop_duplicates)
    1    0.000    0.000    0.102    0.102 get_prev_data_by_date.py:264(recreate_previous_cartons)
    1    0.000    0.000    0.101    0.101 get_prev_data_by_date.py:231(recreate_previous_appt_scheduled_date)
    1    0.000    0.000    0.098    0.098 get_prev_data_by_date.py:360(recreate_previous_freight_type)
   10    0.000    0.000    0.092    0.009 internals.py:4455(concatenate_block_managers)
   10    0.001    0.000    0.088    0.009 internals.py:4471(<listcomp>)
  120    0.001    0.000    0.084    0.001 internals.py:4559(concatenate_join_units)
  266    0.004    0.000    0.067    0.000 common.py:733(take_nd)
  120    0.000    0.000    0.061    0.001 internals.py:4569(<listcomp>)
  120    0.003    0.000    0.061    0.001 internals.py:4814(get_reindexed_values)
    1    0.000    0.000    0.059    0.059 get_prev_data_by_date.py:295(recreate_previous_appt_status)
   10    0.000    0.000    0.038    0.004 merge.py:322(_get_join_info)
   10    0.001    0.000    0.036    0.004 merge.py:516(_get_join_indexers)
   25    0.001    0.000    0.024    0.001 merge.py:687(_factorize_keys)
   74    0.023    0.000    0.023    0.000 {pandas.algos.take_2d_axis1_object_object}
   50    0.022    0.000    0.022    0.000 {method 'factorize' of 'pandas.hashtable.Int64Factorizer' objects}
  120    0.003    0.000    0.022    0.000 internals.py:4479(get_empty_dtype_and_na)
   88    0.000    0.000    0.021    0.000 frame.py:1969(__getitem__)
    1    0.000    0.000    0.019    0.019 get_prev_data_by_date.py:328(recreate_previous_location_numbers)
   39    0.000    0.000    0.018    0.000 internals.py:3495(reindex_indexer)
  537    0.017    0.000    0.017    0.000 {built-in method numpy.core.multiarray.empty}
   15    0.000    0.000    0.017    0.001 ops.py:725(wrapper)
   15    0.000    0.000    0.015    0.001 frame.py:2011(_getitem_array)
   24    0.000    0.000    0.014    0.001 internals.py:3625(take)
   10    0.000    0.000    0.014    0.001 merge.py:157(__init__)
   10    0.000    0.000    0.014    0.001 merge.py:382(_get_merge_keys)
   15    0.008    0.001    0.013    0.001 ops.py:662(na_op)
  234    0.000    0.000    0.013    0.000 common.py:158(isnull)
  234    0.001    0.000    0.013    0.000 common.py:179(_isnull_new)
   15    0.000    0.000    0.012    0.001 generic.py:1609(take)
   20    0.000    0.000    0.012    0.001 generic.py:2191(reindex)

The profile when using join is as follows:

65079 function calls (63990 primitive calls) in 0.550 seconds

Ordered by: cumulative time, internal time, call count
List reduced from 592 to 40 due to restriction <40>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.016    0.016    0.550    0.550 get_prev_data_by_date.py:122(merge_earlier_created_values)
   14    0.000    0.000    0.295    0.021 generic.py:1901(_update_inplace)
   14    0.000    0.000    0.295    0.021 generic.py:1402(_maybe_update_cacher)
   19    0.000    0.000    0.294    0.015 generic.py:1492(_check_setitem_copy)
    7    0.293    0.042    0.293    0.042 {built-in method gc.collect}
   10    0.000    0.000    0.173    0.017 generic.py:1842(drop)
   10    0.000    0.000    0.139    0.014 merge.py:26(merge)
  8/4    0.000    0.000    0.138    0.034 decorators.py:65(wrapper)
    4    0.000    0.000    0.138    0.034 frame.py:3028(drop_duplicates)
   10    0.000    0.000    0.132    0.013 merge.py:201(get_result)
    5    0.000    0.000    0.122    0.024 frame.py:4324(join)
    5    0.000    0.000    0.122    0.024 frame.py:4371(_join_compat)
    1    0.000    0.000    0.111    0.111 get_prev_data_by_date.py:264(recreate_previous_cartons)
    1    0.000    0.000    0.103    0.103 get_prev_data_by_date.py:231(recreate_previous_appt_scheduled_date)
    1    0.000    0.000    0.099    0.099 get_prev_data_by_date.py:360(recreate_previous_freight_type)
   10    0.000    0.000    0.093    0.009 internals.py:4455(concatenate_block_managers)
   10    0.001    0.000    0.089    0.009 internals.py:4471(<listcomp>)
  100    0.001    0.000    0.085    0.001 internals.py:4559(concatenate_join_units)
  205    0.003    0.000    0.068    0.000 common.py:733(take_nd)
  100    0.000    0.000    0.060    0.001 internals.py:4569(<listcomp>)
  100    0.001    0.000    0.060    0.001 internals.py:4814(get_reindexed_values)
    1    0.000    0.000    0.056    0.056 get_prev_data_by_date.py:295(recreate_previous_appt_status)
   10    0.000    0.000    0.033    0.003 merge.py:322(_get_join_info)
   52    0.031    0.001    0.031    0.001 {pandas.algos.take_2d_axis1_object_object}
    5    0.000    0.000    0.030    0.006 base.py:2329(join)
   37    0.001    0.000    0.027    0.001 internals.py:2754(apply)
    6    0.000    0.000    0.024    0.004 frame.py:2763(set_index)
    7    0.000    0.000    0.023    0.003 merge.py:516(_get_join_indexers)
    2    0.000    0.000    0.022    0.011 base.py:2483(_join_non_unique)
    7    0.000    0.000    0.021    0.003 generic.py:2950(copy)
    7    0.000    0.000    0.021    0.003 internals.py:3046(copy)
   84    0.000    0.000    0.020    0.000 frame.py:1969(__getitem__)
   19    0.001    0.000    0.019    0.001 merge.py:687(_factorize_keys)
  100    0.002    0.000    0.019    0.000 internals.py:4479(get_empty_dtype_and_na)
    1    0.000    0.000    0.018    0.018 get_prev_data_by_date.py:328(recreate_previous_location_numbers)
   15    0.000    0.000    0.017    0.001 ops.py:725(wrapper)
   34    0.001    0.000    0.017    0.000 internals.py:3495(reindex_indexer)
   83    0.004    0.000    0.016    0.000 internals.py:3211(_consolidate_inplace)
   68    0.015    0.000    0.015    0.000 {method 'copy' of 'numpy.ndarray' objects}
   15    0.000    0.000    0.015    0.001 frame.py:2011(_getitem_array)

As you can see, merge is faster than join here. The difference is small, but over 4,000 iterations that small difference adds up to minutes.

Thanks

Howells answered 29/11, 2016 at 7:52 Comment(1)
Set your merge columns as the index, and use df1.join(df2) instead. – Jeramyjerba

Using set_index on the merge column does indeed speed this up. Below is a slightly more realistic version of julien-marrec's answer.

import pandas as pd
import numpy as np
myids=np.random.choice(np.arange(10000000), size=1000000, replace=False)
df1 = pd.DataFrame(myids, columns=['A'])
df1['B'] = np.random.randint(0,1000,(1000000))
df2 = pd.DataFrame(np.random.permutation(myids), columns=['A2'])
df2['B2'] = np.random.randint(0,1000,(1000000))

%%timeit
x = df1.merge(df2, how='left', left_on='A', right_on='A2')
# 1 loop, best of 3: 664 ms per loop

%%timeit
x = df1.set_index('A').join(df2.set_index('A2'), how='left')
# 1 loop, best of 3: 354 ms per loop

%%time
df1.set_index('A', inplace=True)
df2.set_index('A2', inplace=True)
# Wall time: 16 ms

%%timeit
x = df1.join(df2, how='left')
# 10 loops, best of 3: 80.4 ms per loop

Even when the join column holds integers that are not in the same order in both tables, you can still expect a speed-up of about 8x.
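
If the index is not already sorted, checking and sorting it once up front may buy a further speed-up; a sketch, using the standard pandas Index.is_monotonic_increasing check:

# Sort each index once if it isn't already ordered; a sorted index
# lets pandas take faster join paths.
if not df1.index.is_monotonic_increasing:
    df1.sort_index(inplace=True)
if not df2.index.is_monotonic_increasing:
    df2.sort_index(inplace=True)

x = df1.join(df2, how='left')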

Exaggerative answered 24/3, 2017 at 15:40 Comment(7)
A short explanation of why it is faster to merge by index instead of by a "normal" column: indexes have a hash table, so you can look values up in amortized O(1). For a normal column you need O(n) in the worst case, meaning that merging two dataframes of length n takes O(n^2) in the worst case. – Krumm
In my case DataFrame.merge() is significantly faster (5x). I am working on a left join with a 3M+ row dataframe on the left and a 900+ row dataframe on the right. My index is a string; that's pretty much the only explanation I can see. – Display
A word of caution: speed gains will depend on whether your index is unique. It may even take longer to merge two dataframes on the index if it is not unique. – Cellini
Would that still work for a multiindex? x = df1.set_index(['A','B']).join(df2.set_index(['A','B']), how='left') – Lilywhite
@Lilywhite yes, it does work for a multiindex. Check the official docs: pandas.pydata.org/docs/reference/api/… – Jamille
Sorting the indexes beforehand improves the speed. – Tribromoethanol
If set_index before join/merge generally improves performance, why doesn't join/merge do this by default? – Stringboard

I suggest that you set your merge columns as the index and use df1.join(df2) instead of merge; it's much faster.

Here's an example, including profiling:

In [1]:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.arange(1000000), columns=['A'])
df1['B'] = np.random.randint(0,1000,(1000000))
df2 = pd.DataFrame(np.arange(1000000), columns=['A2'])
df2['B2'] = np.random.randint(0,1000,(1000000))

Here's a regular left merge on A and A2:

In [2]: %%timeit
        x = df1.merge(df2, how='left', left_on='A', right_on='A2')

1 loop, best of 3: 441 ms per loop

Here's the same, using join:

In [3]: %%timeit
        x = df1.set_index('A').join(df2.set_index('A2'), how='left')

1 loop, best of 3: 184 ms per loop

Now, obviously, if you can set the index before looping, the time savings will be much greater:

# Do this before looping
In [4]: %%time
df1.set_index('A', inplace=True)
df2.set_index('A2', inplace=True)

CPU times: user 9.78 ms, sys: 9.31 ms, total: 19.1 ms
Wall time: 16.8 ms

Then in the loop, you'll get something that in this case is 30 times faster:

In [5]: %%timeit
        x = df1.join(df2, how='left')
100 loops, best of 3: 14.3 ms per loop
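
Applied to a setting like the question's, the point is to pay the set_index cost once and reuse the indexed frames on every iteration. A sketch with hypothetical names (EarlierValuesMerger and the lookups list are illustrative, not from the question's code):

import pandas as pd

class EarlierValuesMerger:
    """Hypothetical wrapper: index once, join many times."""

    def __init__(self, data, lookups):
        # Pay the set_index cost a single time, up front.
        self.data = data.set_index('APPT_NBR')
        self.lookups = [t.set_index('APPT_NBR') for t in lookups]

    def merge_earlier_created_values(self):
        # Each of the ~4000 iterations now uses cheap index-based joins.
        out = self.data.join(self.lookups, how='left')
        return out.reset_index()
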
Jeramyjerba answered 29/11, 2016 at 7:57 Comment(7)
It is a left merge/join; the how param in merge is "left". Will that work with join? – Howells
Somehow I don't see much improvement in performance on my dataset. If I convert all merges to joins, the time actually increases by around 0.1-0.3 seconds. I converted some merges to joins and could reduce the time by ~0.2 seconds. Am I missing anything? Is there anything I should post, like code? – Howells
If you would include an MCVE in your question, that would help in testing your specific application. That is also part of why this question was flagged as 'off-topic'. – Jeramyjerba
Please see the updated question; I hope that makes it relevant. If so, it would also be great if you could remove the off-topic flag, as people doing lots of merges might find this useful; there are not many posts similar to this question. – Howells
Great solution, but make sure to preserve the key column(s) in your dataframes, because set_index drops them by default (e.g. use df1.set_index('A', inplace=True, drop=False)). – Brawny
Another problem is that the original index may still be needed, but after joining it changes to df2.index. So it may be prudent to reset the index using .reset_index(inplace=True, drop=True) after joining. – Brawny
And finally... :) data is re-shuffled row-wise by default by all join operations, so you have to keep a unique key and re-sort the data if ordering is important (e.g. for visual inspection or if variables have a time component). – Brawny

I don't know if this deserves a new answer, but personally the following tricks helped me further improve the joins I had to do on big DataFrames (millions of rows and hundreds of columns):

  1. Besides using set_index(index, inplace=True), you may want to sort the index using sort_index(inplace=True). This speeds up the join a lot if your index is not ordered. For example, creating the DataFrames with
import random
import pandas as pd
import numpy as np

nbre_items = 100000

ids = np.arange(nbre_items)
random.shuffle(ids)

df1 = pd.DataFrame({"id": ids})
df1['value'] = 1
df1.set_index("id", inplace=True)

random.shuffle(ids)

df2 = pd.DataFrame({"id": ids})
df2['value2'] = 2
df2.set_index("id", inplace=True)

I got the following results:

%timeit df1.join(df2)
13.2 ms ± 349 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And after sorting the index (which takes a limited amount of time):

df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
%timeit df1.join(df2)
764 µs ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  2. You can split one of your DataFrames into multiple frames with fewer columns. This trick gave me mixed results, so be cautious when using it. For example:
# Join df2 in chunks of 100 columns at a time instead of all at once.
for i in range(0, df2.shape[1], 100):
    df1 = df1.join(df2.iloc[:, i:min(df2.shape[1], (i + 100))], how='outer')
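
Note that how='outer' unions the indexes on each pass, so if df1 and df2 do not cover exactly the same index, the row count can grow as chunks are joined; the chunk size of 100 is a tunable, not a magic number.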
Appendectomy answered 28/1, 2021 at 11:23 Comment(4)
In order for the comparison to be fair, you should include the two sort_index operations in the timing. You can do a multiline timing using %%timeit, placing your code on the lines below it. – Shibboleth
Thank you for the tip! I tested with both sort_index calls taken into account in the %%timeit and still get a full process that is 3 times faster. So, in the case of an unordered index, this still seems to help. – Appendectomy
While sorting can take as long as a normal join itself... it does improve join times for abnormally long joins (typically the first of multiple joins performed sequentially). – Brawny
sort_index really did the trick for me! A pd.concat() went from 10+ seconds to a fraction of a second! – Elapse
