Comparing pandas map and merge
Asked Answered
S

1

6

I have the following df:

df = pd.DataFrame({'key': {0: 'EFG_DS_321',
    1: 'EFG_DS_900',
      2: 'EFG_DS_900',
      3: 'EFG_Q_900',
      4: 'EFG_DS_1000',
      5: 'EFG_DS_1000',
      6: 'EFG_DS_1000',
      7: 'ABC_DS_444',
      8: 'EFG_DS_900',
      9: 'EFG_DS_900',
      10: 'EFG_DS_321',
      11: 'EFG_DS_900',
      12: 'EFG_DS_1000',
      13: 'EFG_DS_900',
      14: 'EFG_DS_321',
      15: 'EFG_DS_321',
      16: 'EFG_DS_1000',
      17: 'EFG_DS_1000',
      18: 'EFG_DS_1000',
      19: 'EFG_DS_1000',
      20: 'ABC_DS_444',
      21: 'EFG_DS_900',
      22: 'EFG_DAS_12345',
      23: 'EFG_DAS_12345',
      24: 'EFG_DAS_321',
      25: 'EFG_DS_321',
      26: 'EFG_DS_12345',
      27: 'EFG_Q_1000',
      28: 'EFG_DS_900',
      29: 'EFG_DS_321'}})

and I have the following dict:

d = {'ABC_AS_1000': 123,
  'ABC_AS_444': 321,
  'ABC_AS_231341': 421,
  'ABC_AS_888': 412,
  'ABC_AS_087': 4215,
  'ABC_DAS_1000': 3415,
  'ABC_DAS_444': 4215,
  'ABC_DAS_231341': 3214,
  'ABC_DAS_888': 321,
  'ABC_DAS_087': 111,
  'ABC_Q_1000': 222,
  'ABC_Q_444': 3214,
  'ABC_Q_231341': 421,
  'ABC_Q_888': 321,
  'ABC_Q_087': 41,
  'ABC_DS_1000': 421,
  'ABC_DS_444': 421,
  'ABC_DS_231341': 321,
  'ABC_DS_888': 41,
  'ABC_DS_087': 41,
  'EFG_AS_1000': 213,
  'EFG_AS_900': 32,
  'EFG_AS_12345': 1,
  'EFG_AS_321': 3,
  'EFG_DAS_1000': 421,
  'EFG_DAS_900': 321,
  'EFG_DAS_12345': 123,
  'EFG_DAS_321': 31,
  'EFG_Q_1000': 41,
  'EFG_Q_900': 51,
  'EFG_Q_12345': 321,
  'EFG_Q_321': 321,
  'EFG_DS_1000': 41,
  'EFG_DS_900': 51,
  'EFG_DS_12345': 321,
  'EFG_DS_321': 1}

I want to map d into df, but given that the real data is very large and complicated, i'm trying to understand if map or merge is better in terms of efficiency (running time). first option: a simple map

res = df['key'].map(d)

second option: convert d into a dataframe and preform a merge

d1 = pd.DataFrame.from_dict(d,orient='index',columns=['res'])
res = df.merge(d1,left_on='key',right_index=True)['res']

Any help will be much appreciated (or any better solutions of course:))

Shutt answered 12/1, 2022 at 13:45 Comment(0)
R
5

map will be faster than a merge

If your goal is simply to assign a numerical category to each unique value in df['AB'], you could use pandas.factorize that should be a bit faster than map:

res = df['AB'].factorize()[0]+1

output: array([1, 1, 1, 2, 2, 3, 3, 3])

test on 800k rows:

factorize   28.6 ms ± 153 µs 
map         32.1 ms ± 110 µs
merge       68.6 ms ± 1.33 ms
Radial answered 12/1, 2022 at 13:56 Comment(3)
I actually need to assign a specific value and not a numerical category. the thing is that my real data dictionary key is something like: 'first_us_1000', and i think that makes the map pretty slow.Shutt
Can you provide a real-life example? Optimization without the real data is impossible ;)Radial
Just edited the post to contain a sample of my dataShutt

© 2022 - 2024 — McMap. All rights reserved.