Join dataframes based on partial string-match between columns

Asked 10/9, 2018 at 21:0 Answered 12/7, 2019 at 20:57

Solved python pandas join string-matching partial

I have a dataframe which I want to compare if they are present in another df.

after_h.sample(10, random_state=1)

             movie           year   ratings
108 Mechanic: Resurrection   2016     4.0
206 Warcraft                 2016     4.0
106 Max Steel                2016     3.5
107 Me Before You            2016     4.5

I want to compare if the above movies are present in another df.

              FILM                   Votes
0   Avengers: Age of Ultron (2015)   4170
1   Cinderella (2015)                 950
2   Ant-Man (2015)                   3000 
3   Do You Believe? (2015)            350
4   Max Steel (2016)                  560

I want something like this as my final output:

    FILM              votes
0  Max Steel           560

Amick answered 10/9, 2018 at 21:0 Comment(1)

You can join with pd.merge() if you convert the compound string column df2['FILM'] into its two component columns movie_title (year) – Buttermilk 11/9, 2018 at 0:40

Given input dataframes df1 and df2, you can use Boolean indexing via pd.Series.isin. To align the format of the movie strings you need to first concatenate movie and year from df1:

s = df1['movie'] + ' (' + df1['year'].astype(str) + ')'

res = df2[df2['FILM'].isin(s)]

print(res)

               FILM  VOTES
4  Max Steel (2016)    560

Urdu answered 10/9, 2018 at 21:6 Comment(0)

There are two ways:

get the row-indices for partial-matches: FILM.startswith(title) or FILM.contains(title). Either of:

df1[ df1.movie.apply( lambda title: df2.FILM.str.startswith(title) ).any(1) ]

df1[ df1['movie'].apply(lambda title: df2['FILM'].str.contains(title)).any(1) ]

     movie      year      ratings
106  Max Steel  2016      3.5

Alternatively, you can use merge() if you convert the compound string column df2['FILM'] into its two component columns movie_title (year).

# see code at bottom to recreate your dataframes
df2[['movie','year']] = df2.FILM.str.extract('([^\(]*) \(([0-9]*)\)')
# reorder columns and drop 'FILM' now we have its subfields 'movie','year'
df2 = df2[['movie','year','Votes']]
df2['year'] = df2['year'].astype(int)

df2.merge(df1)
       movie  year  Votes  ratings
0  Max Steel  2016    560      3.5

(Acknowledging much help from @user3483203 here and in Python chat room)

Code to recreate dataframes:

import pandas as pd
from pandas.compat import StringIO

dat1 = """movie           year   ratings
108  Mechanic: Resurrection   2016     4.0
206  Warcraft                 2016     4.0
106  Max Steel                2016     3.5
107  Me Before You            2016     4.5"""

dat2 = """FILM                   Votes
0   Avengers: Age of Ultron (2015)   4170
1   Cinderella (2015)                 950
2   Ant-Man (2015)                   3000
3   Do You Believe? (2015)            350
4   Max Steel (2016)                  560"""

df1 = pd.read_csv(StringIO(dat1), sep='\s{2,}', engine='python', index_col=0)
df2 = pd.read_csv(StringIO(dat2), sep='\s{2,}', engine='python')

Buttermilk answered 10/9, 2018 at 22:7 Comment(3)

df2[df1['movie'].apply(lambda movie_title: df2['FILM'].str.contains(movie_title)).any(0)] – Biconcave 10/9, 2018 at 22:10

Partial matches may not be appropriate for sequels :) – Urdu 10/9, 2018 at 22:25

@jpp: Teehee. Complain to Tom Cruise or George Lucas... Yes strictly, df2['FILM'] is formatted and contains both the title and (year) in parentheses. I should show how if we fixed that up, we could do simple join on title. – Buttermilk 10/9, 2018 at 22:26

Given input dataframes df1 and df2, you can use Boolean indexing via pd.Series.isin. To align the format of the movie strings you need to first concatenate movie and year from df1:

s = df1['movie'] + ' (' + df1['year'].astype(str) + ')'

res = df2[df2['FILM'].isin(s)]

print(res)

               FILM  VOTES
4  Max Steel (2016)    560

Urdu answered 10/9, 2018 at 21:6 Comment(0)

smci's option 1 is nearly there, the following worked for me:

df1['Votes'] = ''
df1['Votes']=df1['movie'].apply(lambda title: df2[df2['FILM'].str.startswith(title)]['Votes'].any(0))

Explanation:

Create a Votes column in df1

Apply a lambda to every movie string in df1

The lambda looks up df2, selecting all rows in df2 where Film starts with the movie title

Select the Votes column of the resulting subset of df2

Take the first value in this column with any(0)

Stately answered 12/7, 2019 at 20:57 Comment(0)

Recommended topics

Hot tags