Using replace efficiently in pandas
I am looking to use the replace function efficiently in python3. The code I have achieves the task, but it is much too slow because I am working with a large dataset, so my priority is efficiency over elegance whenever there is a trade-off. Here is a toy example of what I would like to do:

import pandas as pd
df = pd.DataFrame([[1,2],[3,4],[5,6]], columns = ['1st', '2nd'])

       1st  2nd
   0    1    2
   1    3    4
   2    5    6


idxDict= dict()
idxDict[1] = 'a'
idxDict[3] = 'b'
idxDict[5] = 'c'

for k, v in idxDict.items():
    df['1st'] = df['1st'].replace(k, v)

Which gives

     1st  2nd
   0   a    2
   1   b    4
   2   c    6

as I desire, but it takes way too long. What would be the fastest way?

Edit: this is a more focused and cleaner question than this one, for which the solution is similar.

Unblushing answered 2/2, 2017 at 21:9 Comment(1)
Possible duplicate of Pandas replace/dictionary slowness – Ammadas
Use map to perform a lookup:

In [46]:
df['1st'] = df['1st'].map(idxDict)
df
Out[46]:
  1st  2nd
0   a    2
1   b    4
2   c    6

Note that keys missing from the dict will map to NaN; passing na_action='ignore' only tells map not to apply the mapping to NaN input values, it does not preserve unmatched keys.
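As a small sketch of that behavior, using the question's toy frame with a deliberately incomplete dict (the missing entry for 5 is my own choice for illustration): unmatched keys come back as NaN, and fillna can restore the original values if you want to keep them:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['1st', '2nd'])
partial = {1: 'a', 3: 'b'}          # no entry for 5

mapped = df['1st'].map(partial)     # the unmatched 5 becomes NaN
kept = mapped.fillna(df['1st'])     # fall back to the original values
print(kept.tolist())                # ['a', 'b', 5]
```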

You can also use df['1st'].replace(idxDict), but to answer your question about efficiency:

timings

In [69]:
%timeit df['1st'].replace(idxDict)
%timeit df['1st'].map(idxDict)

1000 loops, best of 3: 1.57 ms per loop
1000 loops, best of 3: 1.08 ms per loop

In [70]:    
%%timeit
for k,v in idxDict.items():
    df ['1st'] = df ['1st'].replace(k, v)

100 loops, best of 3: 3.25 ms per loop

So using map is about 3x faster here.

on a larger dataset:

In [3]:
df = pd.concat([df]*10000, ignore_index=True)
df.shape

Out[3]:
(30000, 2)

In [4]:    
%timeit df['1st'].replace(idxDict)
%timeit df['1st'].map(idxDict)

100 loops, best of 3: 18 ms per loop
100 loops, best of 3: 4.31 ms per loop

In [5]:    
%%timeit
for k,v in idxDict.items():
    df ['1st'] = df ['1st'].replace(k, v)

100 loops, best of 3: 18.2 ms per loop

For the 30K-row df, map is ~4x faster, so it scales better than replace or looping.

Aliquot answered 2/2, 2017 at 21:11 Comment(6)
df['1st'].map(idxDict.get) is actually faster than just passing the dictionary itself. – Ammadas
@Ammadas I don't observe that: timing df['1st'].map(idxDict) against df['1st'].map(idxDict.get) gives 4.22 ms vs 8.38 ms per loop (100 loops, best of 3); passing just the dict is 2x faster – Aliquot
Hmmm... I guess it depends on the size of the dictionary and the values to be looked up. See the question I listed as a duplicate. – Ammadas
@Ammadas Yes, I can imagine that here, where only 3 values are looked up repeatedly, it would be quicker than with a significantly larger dict – Aliquot
Thanks!! Why is that the case? – Zendavesta
While this answer is clearly correct and interesting, it is a little odd to focus on timings after eliminating the loop, which was clearly the problem for speed. The other differences between map and replace will almost certainly be more important than the speed difference, which will likely be tiny even on a large dataset. – Millrace
While map is indeed faster, replace was updated in version 0.19.2 (details here) to improve its speed, making the difference significantly smaller:

In [1]:
import pandas as pd


df = pd.DataFrame([[1,2],[3,4],[5,6]], columns = ['1st', '2nd'])
df = pd.concat([df]*10000, ignore_index=True)
df.shape

Out [1]:
(30000, 2)

In [2]:
idxDict = {1:'a', 3:"b", 5:"c"}
%timeit df['1st'].replace(idxDict, inplace=True)
%timeit df['1st'].update(df['1st'].map(idxDict))

Out [2]:
100 loops, best of 3: 12.8 ms per loop
100 loops, best of 3: 7.95 ms per loop

Additionally, I modified EdChum's map code to include update, which, while slower, prevents values not covered by an incomplete mapping from being changed to NaN.
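A minimal sketch of that update pattern, with a deliberately incomplete dict (the missing entry for 5 is an assumption for illustration): the NaN produced by map for the unmatched value is skipped by update, so the original value survives.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['1st', '2nd'])
partial = {1: 'a', 3: 'b'}           # no entry for 5

# Cast to object first so the column can hold both strings and ints.
s = df['1st'].astype(object)
s.update(s.map(partial))             # update skips the NaN produced for 5
print(s.tolist())                    # ['a', 'b', 5]
```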

Granite answered 10/2, 2018 at 4:49 Comment(1)
This is a nice solution with the update; I've seen it done with masks also – Unsparing
If NaN propagation is not wanted, that is, you want to replace matched values but keep those not found in the dict, there are two other options:

import numpy as np
import pandas as pd

def numpy_series_replace(series: pd.Series, mapping: dict) -> pd.Series:
    """Replace values in a series according to a mapping, leaving unmatched values unchanged."""
    result = series.copy().values
    for k, v in mapping.items():
        result[series.values == k] = v
    return pd.Series(result, index=series.index)

or

def apply_series_replace(series: pd.Series, mapping: dict) -> pd.Series:
    return series.apply(lambda y: mapping.get(y, y))

The numpy implementation feels a little hacky but is somewhat faster.

v = pd.Series(np.random.randint(0, 10, 1000000))
mapper = {0: 1, 3: 2}

%timeit numpy_series_replace(v, mapper)
60.1 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit apply_series_replace(v, mapper)
311 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Jamison answered 18/11, 2020 at 21:55 Comment(3)
Brilliant! For some reason, your numpy_series_replace is faster than the pd.map function – Baruch
In my case, and with a dict/mapper with more items to substitute, the apply_series_replace function is faster (~3x speedup with the dataframe I'm using, 2006 rows and 3044 cols). – Merimerida
In my case, apply_series_replace() is by far the fastest (5 ms vs 36 s with .replace()). I'm replacing in a column with 30k rows and a map with 42k entries. Thanks Nolan. – Bentlee
You don't have to loop through your dict! Pandas will apply an entire dict in one call of replace or map. These methods can also take a Series as the mapping.
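A small sketch of both forms: map accepts a dict in a single call, and it also accepts a Series, whose index plays the role of the dict keys:

```python
import pandas as pd

s = pd.Series([1, 3, 5])

# Mapping via a dict, applied in one call.
print(s.map({1: 'a', 3: 'b', 5: 'c'}).tolist())   # ['a', 'b', 'c']

# Mapping via a Series: the index acts as the lookup key.
mapper = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
print(s.map(mapper).tolist())                      # ['a', 'b', 'c']
```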

You can also construct your dict with less effort, as you probably know.

d = {
    1: 'a',
    3: 'b',
    5: 'c',
}

df['1st'] = df['1st'].replace(d)

Others have noted small speed differences between map and replace, but the loop was clearly your issue. There are also behavioral differences between these methods that will probably dictate which is better to use.
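One such difference, sketched with an intentionally incomplete dict: replace leaves unmatched values untouched, while map turns them into NaN.

```python
import pandas as pd

s = pd.Series([1, 3, 5])
partial = {1: 'a', 3: 'b'}              # 5 intentionally unmapped

print(s.replace(partial).tolist())      # ['a', 'b', 5]: unmatched value kept
print(s.map(partial).isna().tolist())   # [False, False, True]: unmatched is NaN
```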

Millrace answered 15/8, 2023 at 2:12 Comment(0)
