Pandas 1.1.0 apply function is altering the row in place

Asked 13/8, 2020 at 17:58 Answered 13/8, 2020 at 19:16

I have a small DataFrame (2 rows x 4 columns) and a function that adds an extra column, based on some logic, when it is applied. With Pandas 0.24.2 I've been doing this as df.apply(func, axis=1) and I would get my extra column. So far, so good.

Now with Pandas 1.1.0 something weird happens: when I apply, the first row is processed twice, and the second row is not even considered.

I will show the original DF, the expected one, and the function. I added a print(row) so you can see how the first row of the DF is repeated in the process.

In [82]: df_attr_list
Out[82]:
      name attrName string_value dict_value
0  FW12611  HW type         None       ALU1
1  FW12612  HW type         None       ALU1

Now, the function, and its output ...

def setFinalValue(row):
    rtrName      = row['name']
    attrName     = row['attrName'].replace(" ","")
    dict_value   = row['dict_value']
    string_value = row['string_value']
    finalValue   = 'N/A'

    if attrName in ['Val1','Val2','Val3']:
        finalValue = dict_value
    elif attrName in ['Val4','Val5',]:
        finalValue = string_value
    else:
        finalValue = "N/A"
    row['finalValue'] = finalValue

    print(row)
    
    return row

Now, the output after the apply ...

In [83]: df_attr_list.apply(setFinalValue, axis=1)
name                       FW12611
attrName                   HW type
string_value                  None
dict_value                    ALU1
finalValue                    ALU1
Name: 0, dtype: object
name                       FW12611
attrName                   HW type
string_value                  None
dict_value                    ALU1
finalValue                    ALU1
Name: 1, dtype: object
Out[83]: 
      name attrName string_value dict_value finalValue
0  FW12611  HW type         None       ALU1       ALU1
1  FW12611  HW type         None       ALU1       ALU1

As you can see, the extra column is added, but the first row of the original DF is processed twice, as if the second didn't exist ...

Why is this happening?

I'm already trying this out with pandas 1.1.0...

In [86]: print(pd.__version__)
1.1.0

thanks!

Counsellor answered 13/8, 2020 at 17:58 Comment(2)

Does this answer your question? Why does pandas apply calculate twice – Allcot 13/8, 2020 at 18:0

Thanks for the link. I have already seen it, and it does not solve my issue. Further, it suggests going to Pandas 1.1.0 and I'm already using it. Actually, as per your second link, I would expect at least the first row to be processed twice, but the second to be processed as well: that's not happening ... – Counsellor 13/8, 2020 at 18:5

As per the Pandas 1.1.0 What's New doc, apply and applymap on DataFrame now evaluate the first row/column only once, so .apply is not evaluating the first row twice.
The issue appears when the function alters row in place and then returns it.
- This seems to be a result of BUG: DataFrame.apply with func altering row in-place #35633
  - Also see Backport PR #35633 on branch 1.1.x (BUG: DataFrame.apply with func altering row in-place) #35666
- As a workaround, remove row['finalValue'] = finalValue and return finalValue instead of row.
- Then call the function with df['finalValue'] = df.apply(setFinalValue, axis=1).

import pandas as pd

data = {'name': ['FW12611', 'FW12612', 'FW12613'],
 'attrName': ['HW type', 'HW type', 'HW type'],
 'string_value': ['None', 'None', 'None'],
 'dict_value': ['ALU1', 'ALU1', 'ALU1']}

df = pd.DataFrame(data)


def setFinalValue(row):
    print(row)
    rtrName      = row['name']
    attrName     = row['attrName'].replace(" ","")
    dict_value   = row['dict_value']
    string_value = row['string_value']
    finalValue   = 'N/A'

    if attrName in ['Val1','Val2','Val3']:
        finalValue = dict_value
    elif attrName in ['Val4','Val5',]:
        finalValue = string_value
    else:
        finalValue = "N/A"

    print('\n')
    return finalValue


# apply the function
df['finalValue'] = df.apply(setFinalValue, axis=1)

[out]:
name            FW12611
attrName        HW type
string_value       None
dict_value         ALU1
Name: 0, dtype: object


name            FW12612
attrName        HW type
string_value       None
dict_value         ALU1
Name: 1, dtype: object


name            FW12613
attrName        HW type
string_value       None
dict_value         ALU1
Name: 2, dtype: object

# display(df)
      name attrName string_value dict_value finalValue
0  FW12611  HW type         None       ALU1        N/A
1  FW12612  HW type         None       ALU1        N/A
2  FW12613  HW type         None       ALU1        N/A
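If returning the whole row is preferred, as in the original function, one possible workaround is to modify a copy of the row instead of the object that apply passes in. This is only a sketch under the assumption that the in-place mutation is what triggers the problem; it is not a fix confirmed by the linked issue. The name set_final_value_copy is hypothetical, and the sample frame mirrors the one above.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["FW12611", "FW12612", "FW12613"],
    "attrName": ["HW type", "HW type", "HW type"],
    "string_value": [None, None, None],
    "dict_value": ["ALU1", "ALU1", "ALU1"],
})


def set_final_value_copy(row):
    # Hypothetical variant: copy first, so the Series handed over by
    # apply is never modified in place.
    row = row.copy()
    attr_name = row["attrName"].replace(" ", "")

    if attr_name in ["Val1", "Val2", "Val3"]:
        row["finalValue"] = row["dict_value"]
    elif attr_name in ["Val4", "Val5"]:
        row["finalValue"] = row["string_value"]
    else:
        row["finalValue"] = "N/A"
    return row


# apply returns a new DataFrame; df itself is left unchanged
result = df.apply(set_final_value_copy, axis=1)
print(result)
```

Checking that result["name"] still lists three distinct names is a quick way to confirm that every row was processed.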

Allcot answered 13/8, 2020 at 18:45 Comment(3)

Yes, I see your tweak. It's weird because with the previous version of Pandas I could return the complete row. Now you are only returning a column ... In any case, I tried your suggestion of returning only finalValue, but then again the first row gets repeated: I see this with the print(row) at the beginning of the function ... the first row still gets repeated in my case ... :-( – Counsellor 13/8, 2020 at 18:59

Great then! ... thanks for all your time ... :-) ... I appreciate it ... :-) ... let's wait for the correction ... To your second question: no, I didn't!! :-P ... Now that I've removed the row['finalValue'] = finalValue I get the row to be printed correctly ... – Counsellor 13/8, 2020 at 19:1

Yeah, sure, I'll do that ... I hope that when they fix this, I can return the row again: I have plenty of code everywhere where I return the row ... I wouldn't like to start fixing everything just as we saw ... – Counsellor 13/8, 2020 at 19:3

This requirement can also be implemented in a vectorized manner using np.select.

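import numpy as np
short_name = df["attrName"].str.replace(' ', '')
conditions = [short_name.isin(['Val1','Val2','Val3']), short_name.isin(['Val4','Val5'])]
# np.select expects one choice per condition, so pass the columns as a list
choices = [df["dict_value"], df["string_value"]]
df["finalValue"] = np.select(conditions, choices, "N/A")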

Output:

      name attrName string_value dict_value finalValue
0  FW12611  HW type         None       ALU1        N/A
1  FW12612  HW type         None       ALU1        N/A
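
Since every sample row has attrName 'HW type', all rows fall through to the default, so the output above does not show the other branches. The following sketch uses made-up data (the frame demo and its values are hypothetical, not taken from the question) so that each branch of np.select is hit once. The choices are passed as a list of columns, one per condition.

```python
import numpy as np
import pandas as pd

# Hypothetical data chosen so that each branch is taken exactly once.
demo = pd.DataFrame({
    "attrName": ["Val 1", "Val4", "HW type"],
    "string_value": ["s0", "s1", "s2"],
    "dict_value": ["d0", "d1", "d2"],
})

# Strip spaces literally (no regex), as the row-wise function does.
short_name = demo["attrName"].str.replace(" ", "", regex=False)

conditions = [
    short_name.isin(["Val1", "Val2", "Val3"]),
    short_name.isin(["Val4", "Val5"]),
]
# One choice per condition, in the same order as the conditions.
choices = [demo["dict_value"], demo["string_value"]]

demo["finalValue"] = np.select(conditions, choices, default="N/A")
print(demo["finalValue"].tolist())
```

This should print ['d0', 's1', 'N/A']: the first row takes dict_value, the second takes string_value, and the third falls back to the default.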

Ingratiating answered 13/8, 2020 at 19:16 Comment(0)
