How to merge pandas table by regex

Asked 27/3, 2018 at 21:37 Answered 9/9, 2022 at 20:22

I am wondering if there a fast way to merge two pandas tables by the regular expression in python .

For example: table A

col1 col2             
1    apple_3dollars_5        
2    apple_2dollar_4
1    orange_5dollar_3
1    apple_1dollar_3

table B

col1 col2
good (apple|oragne)_\dollars_5
bad  .*_1dollar_.*
ok   oragne_\ddollar_\d

Output:

col1 col2              col3
1    apple_3dollars_5  good
1    orange_5dollar_3  ok
1    apple_1dollar_3   bad

this is just an example, what I want is instead of merging by one col that exactly match, I want to join by some regular expression. Thank you!

Buttercup answered 27/3, 2018 at 21:37 Comment(0)

First of all fix RegEx'es in the B DataFrame:

In [222]: B
Out[222]:
   col1                        col2
0  good  (apple|oragne)_\ddollars_5
1   bad               .*_1dollar_.*
2    ok          orange_\ddollar_\d

Now we can prepare the following variables:

In [223]: to_repl = B.col2.values.tolist()

In [224]: vals = B.col1.values.tolist()

In [225]: to_repl
Out[225]: ['(apple|oragne)_\\ddollars_5', '.*_1dollar_.*', 'orange_\\ddollar_\\d']

In [226]: vals
Out[226]: ['good', 'bad', 'ok']

Finally we can use them in the replace function:

In [227]: A['col3'] = A['col2'].replace(to_repl, vals, regex=True)

In [228]: A
Out[228]:
   col1              col2             col3
0     1  apple_3dollars_5             good
1     2   apple_2dollar_4  apple_2dollar_4
2     1  orange_5dollar_3               ok
3     1   apple_1dollar_3              bad

Wellknown answered 27/3, 2018 at 22:59 Comment(2)

Thank you so much for the answer. Sorry for the bad example. In my real case, I was trying to use a list of regular expression (table A) to match up a list of URLs(table B). Your answer works perfectly for my example. I am just wondering what if the two tables have multiple columns(not just replace one column). Or we want to do a left/inner join. Do you have any suggestions on it? To make it more simple is there something like: table1.merge(table2, how = 'inner', left_on = 'col1', right_on='col2', regex = True) Thank you so much! – Buttercup 27/3, 2018 at 23:33

the ".values" is not needed when you do a .tolist(). – Luker 11/2, 2019 at 11:36

I took the idea from https://python.tutorialink.com/can-i-perform-a-left-join-merge-between-two-dataframes-using-regular-expressions-with-pandas/ and improved it a little so that the original data can have more than one column and now we can make a real left join (merge) with regex!

import pandas as pd
d = {'extra_colum1': ['x', 'y', 'z', 'w'],'field': ['ab', 'a', 'cd', 'e'], 'extra_colum2': ['x', 'y', 'z', 'w']}
df = pd.DataFrame(data=d)
df_dict = pd.DataFrame(['a', 'b', 'c', 'd'], columns = 
['destination'])
df_dict['field'] = '.*' + df_dict['destination'] + '.*'
df_dict.columns=['destination','field']

dataframe and dict

def merge_regex(df, df_dict, how, field):
    import re
    df_dict = df_dict.drop_duplicates()
    idx = [(i,j) for i,r in enumerate(df_dict[f'{field}']) for j,v in enumerate(df[f'{field}']) if re.match(r,v)]
    df_dict_idx, df_idx = zip(*idx)
    t = df_dict.iloc[list(df_dict_idx),0].reset_index(drop=True)
    t1 = df.iloc[list(df_idx),df.columns.get_loc(f'{field}')].reset_index(drop=True)
    df_dict_translated = pd.concat([t,t1], axis=1)
    data = pd.merge(
                df,
                df_dict_translated,
                how=f'{how}',
                left_on=f'{field}',
                right_on=f'{field}'
            )
    data = data.drop_duplicates()
    return data

Ellipsis answered 28/10, 2021 at 17:6 Comment(0)

Similar to @MaxU, I use .replace, but I replace the column of values that you want to merge on with the regex strings that they match on. Small warning that this can cause some issues like non-unique index if your normal text matches more than one regex pattern. So using your dataframe A and @MaxU's fixed regexes for dataframe B:

A['joinCol'] = A.col2.replace(B.col2, B.col2, regex=True)
B.rename({'col2': 'joinCol'})  # the join columns should have the same name
C = A.join(B, on='joinCol')

If you want, you can then drop that join column:

C = C.drop('joinCol', axis=1)

Timekeeper answered 9/9, 2022 at 20:22 Comment(0)

Recommended topics

Hot tags