Using replace efficiently in pandas
I am looking to use the replace function efficiently in python3. The code I have achieves the task, but it is much too slow because I am working with a large dataset, so my priority is efficiency over elegance whenever there is a trade-off. Here is a toy example of what I would like to do:

import pandas as pd
df = pd.DataFrame([[1,2],[3,4],[5,6]], columns = ['1st', '2nd'])

       1st  2nd
   0    1    2
   1    3    4
   2    5    6


idxDict= dict()
idxDict[1] = 'a'
idxDict[3] = 'b'
idxDict[5] = 'c'

for k, v in idxDict.items():
    df['1st'] = df['1st'].replace(k, v)

Which gives

     1st  2nd
   0   a    2
   1   b    4
   2   c    6

as I desire, but it takes way too long. What would be the fastest way?

Edit: this is a more focused and cleaner question than this one, for which the solution is similar.

Unblushing answered 2/2, 2017 at 21:9 Comment(1)
Possible duplicate of Pandas replace/dictionary slowness – Ammadas
Use map to perform a lookup:

In [46]:
df['1st'] = df['1st'].map(idxDict)
df
Out[46]:
  1st  2nd
0   a    2
1   b    4
2   c    6

Note that keys missing from the dict will map to NaN; passing na_action='ignore' only tells map not to apply the mapping to NaN input values, it does not preserve unmatched keys.
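As a small sketch of that behavior, using the question's toy frame with a deliberately incomplete dict (the missing entry for 5 is my own choice for illustration): unmatched keys come back as NaN, and fillna can restore the original values if you want to keep them:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['1st', '2nd'])
partial = {1: 'a', 3: 'b'}          # no entry for 5

mapped = df['1st'].map(partial)     # the unmatched 5 becomes NaN
kept = mapped.fillna(df['1st'])     # fall back to the original values
print(kept.tolist())                # ['a', 'b', 5]
```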

You can also use df['1st'].replace(idxDict), but to answer your question about efficiency:

timings

In [69]:
%timeit df['1st'].replace(idxDict)
%timeit df['1st'].map(idxDict)

1000 loops, best of 3: 1.57 ms per loop
1000 loops, best of 3: 1.08 ms per loop

In [70]:    
%%timeit
for k,v in idxDict.items():
    df ['1st'] = df ['1st'].replace(k, v)

100 loops, best of 3: 3.25 ms per loop

So using map is about 3x faster here.

on a larger dataset:

In [3]:
df = pd.concat([df]*10000, ignore_index=True)
df.shape

Out[3]:
(30000, 2)

In [4]:    
%timeit df['1st'].replace(idxDict)
%timeit df['1st'].map(idxDict)

100 loops, best of 3: 18 ms per loop
100 loops, best of 3: 4.31 ms per loop

In [5]:    
%%timeit
for k,v in idxDict.items():
    df ['1st'] = df ['1st'].replace(k, v)

100 loops, best of 3: 18.2 ms per loop

For the 30K-row df, map is ~4x faster, so it scales better than replace or looping.

Aliquot answered 2/2, 2017 at 21:11 Comment(6)
df['1st'].map(idxDict.get) is actually faster than just passing the dictionary itself. – Ammadas
@Ammadas I don't observe that: timing df['1st'].map(idxDict) against df['1st'].map(idxDict.get) gives 4.22 ms vs 8.38 ms per loop (100 loops, best of 3); passing just the dict is 2x faster – Aliquot
Hmmm... I guess it depends on the size of the dictionary and the values to be looked up. See the question I listed as a duplicate. – Ammadas
@Ammadas Yes, I can imagine that here, where only 3 values are looked up repeatedly, it would be quicker than with a significantly larger dict – Aliquot
Thanks!! Why is that the case? – Zendavesta
While this answer is clearly correct and interesting, it is a little odd to focus on timings after eliminating the loop, which was clearly the problem for speed. The other differences between map and replace will almost certainly be more important than the speed difference, which will likely be tiny even on a large dataset. – Millrace
While map is indeed faster, replace was updated in version 0.19.2 (details here) to improve its speed, making the difference significantly smaller:

In [1]:
import pandas as pd


df = pd.DataFrame([[1,2],[3,4],[5,6]], columns = ['1st', '2nd'])
df = pd.concat([df]*10000, ignore_index=True)
df.shape

Out [1]:
(30000, 2)

In [2]:
idxDict = {1:'a', 3:"b", 5:"c"}
%timeit df['1st'].replace(idxDict, inplace=True)
%timeit df['1st'].update(df['1st'].map(idxDict))

Out [2]:
100 loops, best of 3: 12.8 ms per loop
100 loops, best of 3: 7.95 ms per loop

Additionally, I modified EdChum's map code to include update, which, while slower, prevents values not covered by an incomplete mapping from being changed to NaN.
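A minimal sketch of that update pattern, with a deliberately incomplete dict (the missing entry for 5 is an assumption for illustration): the NaN produced by map for the unmatched value is skipped by update, so the original value survives.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['1st', '2nd'])
partial = {1: 'a', 3: 'b'}           # no entry for 5

# Cast to object first so the column can hold both strings and ints.
s = df['1st'].astype(object)
s.update(s.map(partial))             # update skips the NaN produced for 5
print(s.tolist())                    # ['a', 'b', 5]
```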

Granite answered 10/2, 2018 at 4:49 Comment(1)
This is a nice solution with the update; I've seen it done with masks also – Unsparing
If NaN propagation is not wanted, that is, you want to replace matched values but keep those not found in the dict, there are two other options:

import numpy as np
import pandas as pd

def numpy_series_replace(series: pd.Series, mapping: dict) -> pd.Series:
    """Replace values in a series according to a mapping, leaving unmatched values unchanged."""
    result = series.copy().values
    for k, v in mapping.items():
        result[series.values == k] = v
    return pd.Series(result, index=series.index)

or

def apply_series_replace(series: pd.Series, mapping: dict) -> pd.Series:
    return series.apply(lambda y: mapping.get(y, y))

The numpy implementation feels a little hacky but is somewhat faster.

v = pd.Series(np.random.randint(0, 10, 1000000))
mapper = {0: 1, 3: 2}

%timeit numpy_series_replace(v, mapper)
60.1 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit apply_series_replace(v, mapper)
311 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Jamison answered 18/11, 2020 at 21:55 Comment(3)
Brilliant! For some reason, your numpy_series_replace is faster than the pd.map function – Baruch
In my case, and with a dict/mapper with more items to substitute, the apply_series_replace function is faster (~3x speedup with the dataframe I'm using, 2006 rows and 3044 cols). – Merimerida
In my case, apply_series_replace() is by far the fastest (5 ms vs 36 s with .replace()). I'm replacing in a column with 30k rows and a map with 42k entries. Thanks Nolan. – Bentlee
You don't have to loop through your dict! Pandas will apply an entire dict in one call of replace or map. These methods can also take a Series as the mapping.
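A small sketch of both forms: map accepts a dict in a single call, and it also accepts a Series, whose index plays the role of the dict keys:

```python
import pandas as pd

s = pd.Series([1, 3, 5])

# Mapping via a dict, applied in one call.
print(s.map({1: 'a', 3: 'b', 5: 'c'}).tolist())   # ['a', 'b', 'c']

# Mapping via a Series: the index acts as the lookup key.
mapper = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
print(s.map(mapper).tolist())                      # ['a', 'b', 'c']
```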

You can also construct your dict with less effort, as you probably know.

d = {
    1: 'a',
    3: 'b',
    5: 'c',
}

df['1st'] = df['1st'].replace(d)

Others have noted small speed differences between map and replace, but the loop was clearly your issue. There are also behavioral differences between these methods that will probably dictate which is better to use.
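One such difference, sketched with an intentionally incomplete dict: replace leaves unmatched values untouched, while map turns them into NaN.

```python
import pandas as pd

s = pd.Series([1, 3, 5])
partial = {1: 'a', 3: 'b'}              # 5 intentionally unmapped

print(s.replace(partial).tolist())      # ['a', 'b', 5]: unmatched value kept
print(s.map(partial).isna().tolist())   # [False, False, True]: unmatched is NaN
```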

Millrace answered 15/8, 2023 at 2:12 Comment(0)
