Creating a derived field based on df value comparison in python pandas

I have two dataframes: a data source dataframe and a reference dataframe. I want to add a column to df1 based on a comparison between the two.

df1 - data source

No | Name
213344 | Apple
242342 | Orange
234234 | Pineapple

df2 - reference table

RGE_FROM | RGE_TO | Value
2100 | 2190 | Sweet
2200 | 2322 | Bitter
2400 | 5000 | Neutral

final - if the first 4 characters of df1.No fall within the range df2.RGE_FROM to df2.RGE_TO, take df2.Value for the derived column df1.DESC; otherwise leave it blank

No | Name | DESC
213344 | Apple | Sweet
242342 | Orange | Neutral
234234 | Pineapple | 

Any help is appreciated! Thank you!

Methaemoglobin answered 4/5, 2021 at 13:50 Comment(0)

We can create an IntervalIndex from the columns RGE_FROM and RGE_TO and set it as the index of the Value column to build a mapping series. Then we slice the first four characters of the column No, convert them to integers, and use Series.map to substitute the values from the mapping series.

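import pandas as pd
# Closed intervals [RGE_FROM, RGE_TO]; the first four digits of No are mapped onto them
i = pd.IntervalIndex.from_arrays(df2['RGE_FROM'], df2['RGE_TO'], closed='both')
df1['Value'] = df1['No'].astype(str).str[:4].astype(int).map(df2.set_index(i)['Value'])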

       No       Name    Value
0  213344      Apple    Sweet
1  242342     Orange  Neutral
2  234234  Pineapple      NaN
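
For anyone who wants to run this end to end, here is a minimal self-contained sketch using the sample data from the question. The names intervals, lookup and prefix are only illustrative, and the result is written to DESC with a blank instead of NaN because that is what the question asks for; treat it as one possible arrangement, not the only one.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "No": [213344, 242342, 234234],
    "Name": ["Apple", "Orange", "Pineapple"],
})
df2 = pd.DataFrame({
    "RGE_FROM": [2100, 2200, 2400],
    "RGE_TO": [2190, 2322, 5000],
    "Value": ["Sweet", "Bitter", "Neutral"],
})

# Closed intervals [RGE_FROM, RGE_TO] become the index of the lookup series
intervals = pd.IntervalIndex.from_arrays(df2["RGE_FROM"], df2["RGE_TO"], closed="both")
lookup = pd.Series(df2["Value"].to_numpy(), index=intervals)

# First four digits of No, converted back to integers
prefix = df1["No"].astype(str).str[:4].astype(int)

# Values outside every interval map to NaN; fillna("") turns that into a blank
df1["DESC"] = prefix.map(lookup).fillna("")
print(df1)
```

Note that this relies on the ranges not overlapping; with overlapping intervals the lookup is ambiguous and pandas may raise an error instead.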
Sunglass answered 4/5, 2021 at 14:4 Comment(7)
Question: How does Series.map know to use the IntervalIndex as the key instead of an int when the series dtype is actually int? Presumably the map implementation looks at the index type, sees that it is an IntervalIndex, checks whether the int falls in a range, and maps it? - Larock
@Larock Exactly right! Normally, a mapping operation substitutes values by matching the values in the column against the index values of the mapping series. But if the index of the mapping series is an IntervalIndex, the map operation instead matches each value in the column to the interval it falls in. - Sunglass
Hi @ShubhamSharma! Thank you for your response. I am getting the error "Category, Object, and String Subtypes are not supported for IntervalIndex". - Methaemoglobin
@Methaemoglobin We need to change the data types of the columns RGE_FROM and RGE_TO to an integer type before creating an interval index from them. - Sunglass
@ShubhamSharma, I am having trouble converting it to an integer type. Is df2['RGE_FROM'].astype(int) the correct way? - Methaemoglobin
@ShubhamSharma I found the solution to my previous question, but got a new error: "left side of interval must be <= right side". - Methaemoglobin
Let us continue this discussion in chat. - Sunglass
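
The two errors mentioned in the comments can be reproduced and worked around roughly as follows. This is only a sketch: the table below is hypothetical (range columns stored as strings, one row with RGE_FROM greater than RGE_TO), since the asker's real data is not shown, and dropping the offending row is just one option; swapping the two bounds may be more appropriate depending on the data.

```python
import pandas as pd

# Hypothetical reference table: range columns read as strings,
# and one row where RGE_FROM is greater than RGE_TO
df2 = pd.DataFrame({
    "RGE_FROM": ["2100", "2200", "5000"],
    "RGE_TO": ["2190", "2322", "2400"],
    "Value": ["Sweet", "Bitter", "Neutral"],
})

# Conversion returns a new Series, so the result has to be assigned back
for col in ["RGE_FROM", "RGE_TO"]:
    df2[col] = pd.to_numeric(df2[col], errors="coerce")

# Rows that would trigger "left side of interval must be <= right side"
bad = df2[df2["RGE_FROM"] > df2["RGE_TO"]]
print(bad)

# Build the intervals from the valid rows only
good = df2[df2["RGE_FROM"] <= df2["RGE_TO"]]
intervals = pd.IntervalIndex.from_arrays(good["RGE_FROM"], good["RGE_TO"], closed="both")
```

Calling df2['RGE_FROM'].astype(int) on its own does not change df2; it needs to be assigned back in the same way.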