Pandas replace/dictionary slowness

Please help me understand why this "replace from dictionary" operation is slow in Python/Pandas:

# Series has 200 rows
# Dictionary has 11269 key-value pairs
series.replace(dictionary, inplace=True)

Dictionary lookups should be O(1), and replacing a value in a column should be O(1). Isn't this a vectorized operation? Even if it isn't, iterating over 200 rows is only 200 iterations, so how can it be slow?

Here is a SSCCE demonstrating the issue:

import pandas as pd
import random

# Initialize dummy data
dictionary = {}
orig = []
for x in range(11270):
    dictionary[x] = 'Some string ' + str(x)
for x in range(200):
    orig.append(random.randint(1, 11269))
series = pd.Series(orig)

# The actual operation we care about
print('Starting...')
series.replace(dictionary, inplace=True)
print('Done.')

Running that command takes more than 1 second on my machine, which is thousands of times longer than expected to perform fewer than 1,000 operations.

Gorged answered 1/2, 2017 at 17:7 Comment(3)
Please provide a reproducible example, and define what you mean by "slow". I have no performance issues when I try to replicate your setup, with the replace taking ~200ms.Unclassical
Edited OP with SSCCE. Is ~1ms per operation really the expected performance when working with Python?Gorged
Relevant post which touches on why and when there's a difference in performance: https://mcmap.net/q/64305/-remap-values-in-pandas-column-with-a-dict-preserve-nansBrophy

It looks like replace has a bit of overhead, and explicitly telling the Series what to do via map yields the best performance:

series = series.map(lambda x: dictionary.get(x,x))

If you're sure that all keys are in your dictionary, you can get a very slight performance boost by not creating a lambda and directly supplying the dictionary.get function. Note that any keys that are not present will map to NaN with this method, so beware:

series = series.map(dictionary.get)
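
For illustration, here is a minimal sketch of that missing-key behavior (the toy names d and s are made up for this example):

import pandas as pd

d = {1: 'a', 2: 'b'}      # toy mapping; key 3 is deliberately absent
s = pd.Series([1, 2, 3])
print(s.map(d.get).tolist())                  # ['a', 'b', None] -- the missing key is lost
print(s.map(lambda x: d.get(x, x)).tolist())  # ['a', 'b', 3] -- the original value is kept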

You can also supply just the dictionary itself, but this appears to introduce a bit of overhead:

series = series.map(dictionary)

Timings

Some timing comparisons using your example data:

%timeit series.map(dictionary.get)
10000 loops, best of 3: 124 µs per loop

%timeit series.map(lambda x: dictionary.get(x,x))
10000 loops, best of 3: 150 µs per loop

%timeit series.map(dictionary)
100 loops, best of 3: 5.45 ms per loop

%timeit series.replace(dictionary)
1 loop, best of 3: 1.23 s per loop
Unclassical answered 1/2, 2017 at 18:35 Comment(4)
Any idea what causes so much overhead? Using .replace(dictionary) on my dataframe was causing my notebook to crash after a decent wait but doing .map(dictionary.get) takes under a second. It's very strange to me that there could be orders of magnitude of overhead in a built-in function in the dataframe; I'd have expected .map to be worse than .replace.P
.replace can do incomplete substring matches, while .map requires complete values to be supplied in the dictionarySalado
Will also add that for DataFrame, use applymap instead: df = df.applymap(dictionary.get).Crafty
I'm curious why your .map(dictionary) results are so off from @Shaurya Uppal's results.Coneflower

.replace can do incomplete substring matches, while .map requires complete values to be supplied in the dictionary (otherwise it returns NaN). A fast but still generic solution (one that can handle substrings) is to first run .replace on a dict of all unique values in the column (obtained e.g. with .value_counts().index), and then map every row of the Series through the resulting dict with .map. This combo can handle, for instance, special national-character replacements (full substrings) on 1m-row columns in a quarter of a second, where .replace alone would take 15 seconds.
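
A minimal sketch of that combo (the subst mapping and the accented sample strings here are hypothetical):

import pandas as pd

series = pd.Series(['café', 'naïve', 'café'] * 333_334)  # ~1m rows of dummy data
subst = {'é': 'e', 'ï': 'i'}                              # hypothetical substring replacements

# Run the slow, substring-aware .replace only once per unique value...
uniques = pd.Series(series.value_counts().index)
replaced = uniques.replace(subst, regex=True)

# ...then map every row through the resulting small dict, which is fast.
series = series.map(dict(zip(uniques, replaced)))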

Salado answered 28/2, 2020 at 19:30 Comment(0)

Thanks to @root: I ran the benchmark again and found different results on pandas v1.1.4.

series.map(dictionary) was the fastest; note that it also returns NaN if a key is not present.

[Benchmark screenshot]
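
A minimal way to re-run the comparison yourself (in IPython/Jupyter, using the series and dictionary from the question):

%timeit series.map(dictionary)                      # reported fastest on pandas 1.1.4
%timeit series.map(dictionary.get)
%timeit series.map(lambda x: dictionary.get(x, x))
%timeit series.replace(dictionary)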

Warnerwarning answered 17/2, 2022 at 6:34 Comment(3)
I also get faster results with map(dictionary) rather than map(dictionary.get) on Python 3.10.5 and Pandas 1.4.2. In my case it's 6 times fasterScrew
@Screw 6 times, wow. In the above screenshot, I got a gain of about 1.9x.Warnerwarning
map(dictionary) returns NaN if a key is not found in the dictionary, so be cautiousBunnybunow

I haven't done any benchmarking, but I had to do a mass replace on a column of a fairly large file (about 120 MB with some 15 columns), and that one operation was taking some 10-15 minutes. Instead, I used map to create a temp column, used NumPy to fall back to the original value wherever the mapping had no match, and then dropped the temp column. That only took a few seconds.

import numpy as np

df['temp'] = df['original'].map(mapping)       # mapping: the replacement dict ('dict' shadows the builtin)
df['original'] = np.where(df['temp'].isna(),   # where no mapping matched...
                          df['original'],      # ...keep the original value
                          df['temp'])          # ...otherwise use the mapped value
df.drop(columns=['temp'], inplace=True)
Luxuriance answered 7/6 at 1:17 Comment(0)
