DataFrame modified inside a function

I've run into a DataFrame being modified inside a function, which I had never noticed before. Is there a way to handle this so that the original DataFrame is not modified?

import numpy as np
import pandas as pd

def test(df):
    df['tt'] = np.nan
    return df

dff = pd.DataFrame(data=[])

Now, when I print dff, the output is

Empty DataFrame
Columns: []
Index: []

If I pass dff to test() defined above, dff is modified. In other words,

df = test(dff)
print(dff)

now prints

Empty DataFrame
Columns: [tt]
Index: []

How do I make sure dff is not modified after being passed to test()?

Maryellen answered 24/7, 2015 at 15:9 Comment(7)
Pass a copy of the dataframe? Or make one inside the function, and mutate and return that? It's bad form to mutate an argument and return anything other than None. - Joycejoycelin
That's a solution, but not a memory-efficient one. This is the first time I've run into this. Is it due to version 0.16.2? - Maryellen
You can call .copy() to take an explicit deep copy. - Gurley
Nope, nothing to do with changing versions. This behaviour is the same for all mutable objects passed to Python functions; it is unique neither to Pandas generally nor to v0.16.2 specifically. - Joycejoycelin
Can you tell us a bit more about your use case? If you want to return the df at the end of the function, I don't think you can avoid doing a .copy(). - Aurilia
OK, I understand the mutability of the dataframe now. I hadn't observed this before, perhaps because I never re-read the input dataframe afterwards. It's a little tedious to have to call .copy() explicitly at the start of every function, but if we have to... Thanks a lot for your fast answers and explanations! - Maryellen
@Gurley Can you please explain when .copy() is required and when it is not? I only see copy by reference happening in some special scenarios. It is not elegant to have df = df.copy() inside every function. - Casern
def test(df):
    df = df.copy(deep=True)
    df['tt'] = np.nan
    return df

If you pass a dataframe into a function, manipulate it, and return the same dataframe, you get back the same object, now modified. If you want to keep the old dataframe and also have a new one with your modifications, then by definition you need two dataframes: the one you pass in, which you don't want modified, and the new one, which is. So if you don't want to change the original, your best bet is to make a copy of it. In my example I rebind the name "df" inside the function to the copy. The copy method with the argument "deep=True" copies the dataframe together with its contents. You can read more here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html

Constant answered 24/7, 2015 at 23:35 Comment(2)
Is this also true for pyspark dataframes? - Romero
Thanks! I've been using pandas for a while and only came across this just now. I'm training a model on a dataframe, and inside the training function it makes some changes to df but does not return it. This still modifies the original dataframe. Is copying the only way? - Bordelon

As Skorpeo mentioned, since a dataframe can be modified in-place, it can be modified inside a function. One way to not modify the original is to make a new copy inside the function as in Skorpeo's answer.

If you don't want to change the function, passing a copy is also an option:

def test(df):
    df['tt'] = np.nan
    return df

df = test(dff.copy())            # <---- pass a copy of `dff`
Adhere answered 27/3, 2023 at 1:6 Comment(1)
I was wondering whether deep=True was a necessary argument for the copy, then I found out that deep=True is the default. - Brody
