pandas df.apply unexpectedly changes dataframe inplace

From my understanding, pandas.DataFrame.apply does not apply changes in place, and we should use its return value to persist any changes. However, I've found the following inconsistent behavior:

Let's apply a dummy function for the sake of ensuring that the original df remains untouched:

>>> def foo(row: pd.Series):
...     row['b'] = '42'

>>> df = pd.DataFrame([('a0','b0'),('a1','b1')], columns=['a', 'b'])
>>> df.apply(foo, axis=1)
>>> df
    a   b
0   a0  b0
1   a1  b1

This behaves as expected. However, foo will apply the changes inplace if we modify the way we initialize this df:

>>> df2 = pd.DataFrame(columns=['a', 'b'])
>>> df2['a'] = ['a0','a1']
>>> df2['b'] = ['b0','b1']
>>> df2.apply(foo, axis=1)
>>> df2
    a   b
0   a0  42
1   a1  42

I've also noticed that df2 is left unchanged when the column dtypes are not 'object'. Why does apply() behave differently in these two contexts?

Python: 3.6.5

Pandas: 0.23.1

Dispense answered 22/9, 2018 at 15:10 Comment(3)
You are inserting into the df2['a'] the values ['a0','b0']. But in your df2 output the data is different. Why? - Marmite
Edit: updated df2. Thanks @roganjosh and Arihant. - Dispense
Turns out that's nothing to do with the behaviour you're seeing. Nice question :) - Whet

Interesting question! I believe the behavior you're seeing is an artifact of the way you use apply.

As you correctly indicate, apply is not intended to be used to modify a dataframe. However, since apply takes an arbitrary function, it can't guarantee that the function is free of side effects and will not change the dataframe. Here, you've found a great example of that behavior, because your function foo attempts to modify the row that it is passed by apply.

Using apply to modify a row could lead to these side effects. This isn't the best practice.

Instead, consider this idiomatic approach for apply. The function apply is often used to create a new column. Here's an example of how apply is typically used, which I believe would steer you away from this potentially troublesome area:

import pandas as pd
# construct df2 just like you did
df2 = pd.DataFrame(columns=['a', 'b'])
df2['a'] = ['a0','b0']
df2['b'] = ['a1','b1']

df2['b_copy'] = df2.apply(lambda row: row['b'], axis=1) # apply to each row
df2['b_replace'] = df2.apply(lambda row: '42', axis=1) 
df2['b_reverse'] = df2['b'].apply(lambda val: val[::-1]) # apply to each value in b column

print(df2)

# output:
#     a   b b_copy b_replace b_reverse
# 0  a0  a1     a1        42        1a
# 1  b0  b1     b1        42        1b

Notice that pandas passes a row or a cell to the function you give as the first argument to apply, and you then store the function's output in a column of your choice.

If you'd like to modify a dataframe row-by-row, take a look at iterrows and loc for the most idiomatic route.
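
As a rough sketch of that route (the helper name set_b_rowwise and the '_42' suffix are my own, made up for illustration), iterrows hands you each index label and row for reading, and the write goes through loc on the dataframe itself rather than on the row:

```python
import pandas as pd


def set_b_rowwise(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: rewrite column 'b' one row at a time."""
    out = df.copy()  # work on a copy so the caller's frame is untouched
    for idx, row in out.iterrows():
        # Read from the row, but write through .loc on the frame
        out.loc[idx, 'b'] = row['a'] + '_42'
    return out


df2 = pd.DataFrame({'a': ['a0', 'a1'], 'b': ['b0', 'b1']})
result = set_b_rowwise(df2)
print(result)
```

If the new value doesn't need per-row logic, a plain column assignment such as df2['b'] = '42' is simpler and faster.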

Purposive answered 22/9, 2018 at 15:31 Comment(1)
Does this really answer why it doesn't modify the dataframe in the first instance?Ferri

This may be late, but I think it may help anyone who reaches this question.

When we use the foo like:

def foo(row: pd.Series):
    row['b'] = '42'

and then use it in:

df.apply(foo, axis=1)

we wouldn't expect any change in df, but one occurs. Why?

Let's review what happens under the hood:

apply calls foo and passes it one row at a time. Python passes every argument as a reference to an object, never as a copy. Immutable types (int, float, str, ...) cannot be changed in place, but a pandas.Series is mutable, so foo works on the very same Series object that apply created. If that Series points to the block of memory where the DataFrame's row data resides, any change foo makes to row immediately changes df as well.
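
Here's a small sketch of that rule, independent of apply (rebind is a hypothetical helper I added for contrast). Rebinding a name inside a function leaves the caller's object alone, while item assignment on a Series mutates the object the caller still holds:

```python
import pandas as pd


def rebind(value):
    # Rebinding the local name does not affect the caller's object
    value = '42'


def foo(row: pd.Series):
    # Item assignment mutates the Series object itself
    row['b'] = '42'


text = 'b0'
rebind(text)
print(text)  # still 'b0'

row = pd.Series({'a': 'a0', 'b': 'b0'})
foo(row)
print(row['b'])  # now '42'
```

Whether the same mutation reaches a dataframe depends on whether the row that apply hands over shares memory with the dataframe's data, which I'd treat as an implementation detail rather than something to rely on.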

We can rewrite foo (I name the new version bar) so that it changes nothing in place, by deep-copying row, that is, creating another row with the same values in a different block of memory. A function that only returns a value without mutating row, like the lambdas in the other answer, is safe for the same reason: nothing writes to the original row.

def bar(row: pd.Series):
    row_temp=row.copy(deep=True)
    row_temp['b'] = '42'
    return row_temp

Complete Code

import pandas as pd


# Mutates the row it receives, so df may change in place
def foo(row: pd.Series):
    row['b'] = '42'


# Works on a copy of the row, so df is not changed in place
def bar(row: pd.Series):
    row_temp = row.copy(deep=True)
    row_temp['b'] = '42'
    return row_temp


df2 = pd.DataFrame(columns=['a', 'b'])
df2['a'] = ['a0', 'a1']
df2['b'] = ['b0', 'b1']

print(df2)

# No change inplace
df_b = df2.apply(bar, axis=1)
print(df2)
# bar function works
print(df_b)

print(df2)
# Changes inplace
df2.apply(foo, axis=1)
print(df2)


Output

#df2 before any change
    a   b
0  a0  b0
1  a1  b1

#calling df2.apply(bar, axis=1) not changed df2 inplace
    a   b
0  a0  b0
1  a1  b1

#df_b = df2.apply(bar, axis=1) #bar is working as expected
    a   b
0  a0  42
1  a1  42

#print df2 again to assure it is not changed
    a   b
0  a0  b0
1  a1  b1

#call df2.apply(foo, axis=1) -- as we see foo changed df2 inplace ( to compare with bar)
    a   b
0  a0  42
1  a1  42
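
If you'd rather check this on your own pandas version than trust the printout above, a minimal sketch is to snapshot the dataframe before calling apply and compare afterwards. The helpers mutates_in_place and make_df2 are names I made up for this check; foo and bar are the functions from this answer.

```python
import pandas as pd


def foo(row: pd.Series):
    row['b'] = '42'


def bar(row: pd.Series):
    row_temp = row.copy(deep=True)
    row_temp['b'] = '42'
    return row_temp


def mutates_in_place(df: pd.DataFrame, func) -> bool:
    """Hypothetical helper: True if df.apply(func, axis=1) altered df itself."""
    before = df.copy(deep=True)  # snapshot taken before apply runs
    df.apply(func, axis=1)
    return not df.equals(before)


def make_df2() -> pd.DataFrame:
    # Build df2 column by column, as in the question
    df2 = pd.DataFrame(columns=['a', 'b'])
    df2['a'] = ['a0', 'a1']
    df2['b'] = ['b0', 'b1']
    return df2


print('bar mutates df2:', mutates_in_place(make_df2(), bar))
# The result for foo may differ between pandas versions and column dtypes
print('foo mutates df2:', mutates_in_place(make_df2(), foo))
```

bar should report False everywhere, since it only ever writes to its own copy. I deliberately don't state an expected result for foo, because that is exactly the part that varies.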
Makhachkala answered 28/12, 2020 at 23:1 Comment(0)
