DataFrame modified inside a function

Asked 24/7, 2015 at 15:9 Answered 27/3, 2023 at 1:6

Solved python python-3.x pandas dataframe copy

I have run into a problem I have never observed before: a DataFrame is modified inside a function. Is there a way to prevent the original DataFrame from being modified?

import numpy as np
import pandas as pd

def test(df):
    df['tt'] = np.nan  # adds a column to the caller's DataFrame in place
    return df

dff = pd.DataFrame(data=[])

Now, when I print dff, the output is

Empty DataFrame
Columns: []
Index: []

If I pass dff to test() defined above, dff is modified. In other words,

df = test(dff)
print(dff)

now prints

Empty DataFrame
Columns: [tt]
Index: []

How do I make sure dff is not modified after being passed to test()?

Maryellen answered 24/7, 2015 at 15:9 Comment(7)

Pass a copy of the dataframe? Or make one inside the function, and mutate and return that? It's bad form to mutate an argument and return anything other than None. – Joycejoycelin 24/7, 2015 at 15:9

That's a solution, but not a memory-efficient one. It's the first time I've run into this, though. Is it due to version 0.16.2? – Maryellen 24/7, 2015 at 15:10

you can call .copy() to take an explicit deep copy – Gurley 24/7, 2015 at 15:10

Nope, nothing to do with changing versions - this behaviour is the same for all mutable objects passed to Python functions, unique neither to Pandas generally nor v0.16.2 specifically. – Joycejoycelin 24/7, 2015 at 15:11
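To illustrate the point about mutable objects in general, here is a minimal sketch using a plain Python list instead of a DataFrame. The function names add_item and rebind are made up for this example.

```python
def add_item(items):
    # Mutates the object the caller passed in.
    items.append("tt")
    return items

def rebind(items):
    # Builds a new list and rebinds the local name only;
    # the caller's list is left untouched.
    items = items + ["tt"]
    return items

original = ["a"]

rebind(original)
print(original)   # ['a']

add_item(original)
print(original)   # ['a', 'tt']
```

Assigning a column with df['tt'] = ... behaves like add_item here: it changes the object itself, so every name that refers to that object sees the change.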

Can you tell us a bit more about your use case? If you want to return the df at the end of the function, I don't think you can avoid doing a .copy() – Aurilia 24/7, 2015 at 22:27

OK, I understand the mutability of the dataframe now... I hadn't observed this before, perhaps because I never re-read the input dataframe afterwards. It's a little tedious to have to call .copy() explicitly at the start of each function, but if we have to, so be it. Thanks a lot for your fast answers and explanations! – Maryellen 25/7, 2015 at 21:49

@Gurley Can you please explain when .copy() is required and when it is not? I see copy by reference happening only in some special scenarios. It is not elegant to have df = df.copy() inside every function. – Casern 29/10, 2017 at 17:32

def test(df):
    df = df.copy(deep=True)
    df['tt'] = np.nan
    return df

If you pass a dataframe into a function, modify it in place, and return it, you get back the same dataframe, now modified. If you want to keep your old dataframe and also have a new dataframe with your modifications, then by definition you need two dataframes: the one you pass in and don't want modified, and the new one that is modified. So if you don't want to change the original dataframe, your best bet is to make a copy of it.

In my example I rebound the variable "df" inside the function to the newly copied dataframe. I used the copy method, and the argument "deep=True" makes a copy of the dataframe and its contents. You can read more here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html

Constant answered 24/7, 2015 at 23:35 Comment(2)

Is this also true for pyspark dataframes? – Romero 13/5, 2021 at 14:53

Thanks! I've been using pandas for a while and only just came across this myself. I'm training a model on a dataframe, and inside the training function it makes some changes to df but does not return it. This still modifies the original dataframe. Is copying the only way? – Bordelon 1/11, 2022 at 15:13
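On the question in this last comment: one option that avoids an explicit .copy() call is DataFrame.assign, which returns a new dataframe instead of changing the one it is called on. This is not something the answers in this thread propose, and it still creates a second dataframe, so treat it as a sketch only. The function name train and the column 'a' are hypothetical.

```python
import numpy as np
import pandas as pd

def train(df):
    # assign() returns a new DataFrame with the extra column;
    # the caller's DataFrame is not changed.
    work = df.assign(tt=np.nan)
    return int(work['tt'].isna().sum())

dff = pd.DataFrame({'a': [1, 2, 3]})
train(dff)

print(list(dff.columns))  # ['a']
```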

As Skorpeo mentioned, since a dataframe can be modified in-place, it can be modified inside a function. One way to not modify the original is to make a new copy inside the function as in Skorpeo's answer.

If you don't want to change the function, passing a copy is also an option:

def test(df):
    df['tt'] = np.nan
    return df

df = test(dff.copy())            # <---- pass a copy of `dff`

Adhere answered 27/3, 2023 at 1:6 Comment(1)

I was wondering if deep=True was not a necessary argument for the copy, then I found out deep=True is the default. – Brody 12/2 at 9:41
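A small sketch of what that default means in practice: calling .copy() with no arguments gives a copy whose values can be changed without touching the original. The sample data and the name copied are made up for this example.

```python
import pandas as pd

dff = pd.DataFrame({'a': [1, 2, 3]})

copied = dff.copy()        # no deep argument, so the default (deep=True) applies
copied.loc[0, 'a'] = 100   # change a value in the copy

print(dff.loc[0, 'a'])     # 1
print(copied.loc[0, 'a'])  # 100
```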
