Why should I make a copy of a data frame in pandas?
B

8

365

When selecting a sub dataframe from a parent dataframe, I noticed that some programmers make a copy of the data frame using the .copy() method. For example,

X = my_dataframe[features_list].copy()

...instead of just

X = my_dataframe[features_list]

Why are they making a copy of the data frame? What will happen if I don't make a copy?

Barye answered 28/12, 2014 at 2:22 Comment(1)
My guess is they are taking extra precaution to not modify the source data frame. Probably unnecessary, but when you're throwing something together interactively, better safe than sorry. - Lieselotteliestal
M
371

This answer has been deprecated in newer versions of pandas. See docs


This expands on Paul's answer. In older versions of pandas, slicing a DataFrame could return a view on the initial DataFrame, so changing the subset would also change the initial DataFrame. Use a copy when you want to be sure the initial DataFrame stays unchanged. Consider the following code:

import pandas as pd

df = pd.DataFrame({'x': [1,2]})
df_sub = df[0:1]
df_sub.x = -1
print(df)

You'll get:

   x
0 -1
1  2

In contrast, the following leaves df unchanged:

df_sub_copy = df[0:1].copy()
df_sub_copy.x = -1
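
For the subset case from the question, here is a minimal sketch of the same idea. The column names and the contents of features_list are made up for illustration. An explicit copy should stay independent of the parent frame regardless of the pandas version:

```python
import pandas as pd

# Hypothetical data; only the pattern matters.
my_dataframe = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
features_list = ["a", "b"]

# Explicit copy: X owns its data, so edits to X never reach my_dataframe.
X = my_dataframe[features_list].copy()
X.loc[0, "a"] = -1
X["a_squared"] = X["a"] ** 2

print(my_dataframe.loc[0, "a"])    # the original value is untouched
print(list(my_dataframe.columns))  # no new column on the parent
```

Without the .copy(), whether the first assignment reaches my_dataframe (and whether a warning is shown) depends on the pandas version, which is the uncertainty the copy removes.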
Marbling answered 28/12, 2014 at 20:1 Comment(5)
Is this a deep copy? - Sentimentality
Yes. The default mode is "deep" copy! pandas.pydata.org/pandas-docs/stable/reference/api/… - Laboratory
I found this article on the issue of deep/shallow copies in pandas/NumPy to be quite clear and comprehensive: realpython.com/pandas-settingwithcopywarning - Digestif
If I change a cell inside a function, will that change also be reflected in the original dataframe? - Beatrix
This does not hold true anymore, right? - Running
P
90

Because if you don't make a copy, the data can still be modified elsewhere, even if you assign the DataFrame to a different name.

For example:

df2 = df
func1(df2)
func2(df)

func1 can modify df by modifying df2, so to avoid that:

df2 = df.copy()
func1(df2)
func2(df)
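
A runnable sketch of the same situation, where add_total is a hypothetical stand-in for func1 that modifies its argument in place:

```python
import pandas as pd

def add_total(frame):
    """Hypothetical helper that mutates its argument in place."""
    frame["total"] = frame["a"] + frame["b"]

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

df2 = df.copy()
add_total(df2)                  # only the copy gains the column
copy_changed_original = "total" in df.columns

alias = df                      # same object under a second name
add_total(alias)                # the caller's df gains the column too
alias_changed_original = "total" in df.columns

print(copy_changed_original, alias_changed_original)
```

Plain assignment never copies in Python, so this part does not depend on the pandas version.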
Pizza answered 22/9, 2016 at 1:27 Comment(3)
Wait wait wait, can you explain WHY this occurs? Doesn't make sense. - Tayyebeb
It is because in the first example, df2 = df, both variables reference the same DataFrame instance, so any change made through df or df2 is made to the same object. With df2 = df.copy(), a second object is created as a copy of the first, so df and df2 reference different instances and changes to one do not affect the other. - Await
A simple example is like below: - Historiographer
G
29

It's worth mentioning that whether you get a copy or a view depends on the kind of indexing.

The pandas documentation says:

Returning a view versus a copy

The rules about when a view on the data is returned are entirely dependent on NumPy. Whenever an array of labels or a boolean vector are involved in the indexing operation, the result will be a copy. With single label / scalar indexing and slicing, e.g. df.ix[3:6] or df.ix[:, 'A'], a view will be returned.
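
Since the quoted rule defers to NumPy, a small sketch on a plain NumPy array shows the distinction. This illustrates NumPy only; newer pandas versions with Copy-on-Write enabled add their own rules on top of it:

```python
import numpy as np

arr = np.arange(6)

sliced = arr[1:4]            # basic slicing returns a view
by_labels = arr[[1, 2, 3]]   # integer-array indexing returns a copy
by_mask = arr[arr > 2]       # boolean indexing returns a copy

print(np.shares_memory(arr, sliced))
print(np.shares_memory(arr, by_labels))
print(np.shares_memory(arr, by_mask))

sliced[0] = 100     # writes through to arr[1]
by_labels[0] = -1   # arr is not affected
print(arr)
```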

Grievous answered 20/1, 2017 at 13:22 Comment(1)
R
26

The primary purpose is to avoid chained indexing and eliminate the SettingWithCopyWarning.

Here chained indexing is something like dfc['A'][0] = 111

The documentation says, in the section "Returning a view versus a copy", that chained indexing should be avoided. Here is a slightly modified example from that section:

In [1]: import pandas as pd

In [2]: dfc = pd.DataFrame({'A':['aaa','bbb','ccc'],'B':[1,2,3]})

In [3]: dfc
Out[3]:
    A   B
0   aaa 1
1   bbb 2
2   ccc 3

In [4]: aColumn = dfc['A']

In [5]: aColumn[0] = 111
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

In [6]: dfc
Out[6]:
    A   B
0   111 1
1   bbb 2
2   ccc 3

Here aColumn is a view of the original DataFrame, not a copy, so modifying aColumn causes the original dfc to be modified too. Next, if we index the row first:

In [7]: zero_row = dfc.loc[0]

In [8]: zero_row['A'] = 222
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

In [9]: dfc
Out[9]:
    A   B
0   111 1
1   bbb 2
2   ccc 3

This time zero_row is a copy, so the original dfc is not modified.

From these two examples above, we see it's ambiguous whether or not you want to change the original DataFrame. This is especially dangerous if you write something like the following:

In [10]: dfc.loc[0]['A'] = 333
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

In [11]: dfc
Out[11]:
    A   B
0   111 1
1   bbb 2
2   ccc 3

This time it didn't work at all. Here we wanted to change dfc, but we actually modified an intermediate value dfc.loc[0] that is a copy and is discarded immediately. It’s very hard to predict whether the intermediate value like dfc.loc[0] or dfc['A'] is a view or a copy, so it's not guaranteed whether or not original DataFrame will be updated. That's why chained indexing should be avoided, and pandas generates the SettingWithCopyWarning for this kind of chained indexing update.

This is where .copy() comes in. To eliminate the warning, make a copy to express your intention explicitly:

In [12]: zero_row_copy = dfc.loc[0].copy()

In [13]: zero_row_copy['A'] = 444 # This time no warning

Since you are modifying a copy, you know the original dfc will never change and you are not expecting it to change. Your expectation matches the behavior, so the SettingWithCopyWarning disappears.

Note: if you do want to modify the original DataFrame, the documentation suggests using loc:

In [14]: dfc.loc[0,'A'] = 555

In [15]: dfc
Out[15]:
    A   B
0   555 1
1   bbb 2
2   ccc 3
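
The two intentions can be put side by side in a short script. This is a sketch rather than the documentation's own example, and it assigns strings instead of the integers used above so that column A keeps a single value type:

```python
import pandas as pd

dfc = pd.DataFrame({"A": ["aaa", "bbb", "ccc"], "B": [1, 2, 3]})

# Intent 1: work on a separate row; dfc must stay as it is.
zero_row_copy = dfc.loc[0].copy()
zero_row_copy["A"] = "xxx"
unchanged = dfc.loc[0, "A"]

# Intent 2: change dfc itself with one .loc call (no chained indexing).
dfc.loc[0, "A"] = "zzz"

print(unchanged)
print(dfc)
```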
Robeson answered 22/10, 2018 at 9:58 Comment(2)
Nice answer. I did not notice before that pandas gives that warning about "trying to be set on a copy of a slice" even when the object is a view, not a copy. First example with aColumn surprised me. - Teatime
And doesn't this mean the pandas warning is the opposite of the problem or at minimum highly confusing? It says to me, wait, you're trying to change a copy, don't do that. It should say, wait, you need to first make a copy() [new pointer] or something might go wrong later. And my response would be, yeah but I know what I'm doing so be quiet, I don't want to make a copy all the time first. - Dwelling
K
17

Assume you have a data frame like the one below

df1
     A    B    C    D
4 -1.0 -1.0 -1.0 -1.0
5 -1.0 -1.0 -1.0 -1.0
6 -1.0 -1.0 -1.0 -1.0
6 -1.0 -1.0 -1.0 -1.0

Suppose you create another data frame, df2, that is identical to df1, without using copy

df2=df1
df2
     A    B    C    D
4 -1.0 -1.0 -1.0 -1.0
5 -1.0 -1.0 -1.0 -1.0
6 -1.0 -1.0 -1.0 -1.0
6 -1.0 -1.0 -1.0 -1.0

and then try to modify a value in df2 only, as below

df2.iloc[0,0]='changed'

df2
         A    B    C    D
4  changed -1.0 -1.0 -1.0
5       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0

At the same time the df1 is changed as well

df1
         A    B    C    D
4  changed -1.0 -1.0 -1.0
5       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0

That is because the two names refer to the same object, which we can check using id

id(df1)
140367679979600
id(df2)
140367679979600

So they are the same object, and a change made through one name shows up through the other as well.


If we add the copy, df1 and df2 are different objects, and a change made to one of them does not affect the other.

df2=df1.copy()
id(df1)
140367679979600
id(df2)
140367674641232

df1.iloc[0,0]='changedback'
df2
         A    B    C    D
4  changed -1.0 -1.0 -1.0
5       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0
6       -1 -1.0 -1.0 -1.0
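
The id values above differ from run to run, so a reproducible version of the check can use the is operator instead. This is a sketch that rebuilds a frame of the same shape; df3 is a name introduced here for the copy:

```python
import pandas as pd

df1 = pd.DataFrame(-1.0, index=[4, 5, 6, 6], columns=list("ABCD"))
df2 = df1          # alias: a second name for the same object
df3 = df1.copy()   # independent object with equal values

same_object = df2 is df1
copy_is_same_object = df3 is df1
copy_has_equal_values = df3.equals(df1)

df1.iloc[0, 0] = 0.0   # visible through df2, not through df3

print(same_object, copy_is_same_object, copy_has_equal_values)
print(df2.iloc[0, 0], df3.iloc[0, 0])
```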

It is also worth mentioning that when you take a subset of the original dataframe, adding .copy() is a safe way to avoid the SettingWithCopyWarning.

Kianakiang answered 17/6, 2020 at 1:50 Comment(1)
Going through your answer and @Marbling's answer, I see that in his answer the id of df_sub is different from that of df, which seems logical. Does the object created by df_sub have a pointer or something to df? - Orb
S
2

In general it is safer to work on copies than on original data frames, except when you know that you won't be needing the original anymore and want to proceed with the manipulated version. Normally, you would still have some use for the original data frame to compare with the manipulated version, etc. Therefore, most people work on copies and merge at the end.

Snowman answered 28/3, 2018 at 23:31 Comment(0)
F
2

A pandas deep copy leaves the initial DataFrame unchanged.

This feature is particularly useful when you want to normalize a DataFrame and want to keep the initial df unchanged. For instance:

import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame(np.arange(20).reshape(2,10))

then you normalize the data:

# Using scikit-learn's MinMaxScaler
scaler = preprocessing.MinMaxScaler()

If you then build a new df from the first one and want the first one to stay unchanged, you have to use the .copy() method

new_df = df.copy() # Deep copy, so df itself stays unchanged
for i in range(10):
    # Scale each column of the copy; ravel() flattens the (n, 1) result back to 1-D
    new_df[i] = scaler.fit_transform(df[i].values.reshape(-1,1)).ravel()

or else your original df will change too.
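
A reproducible sketch of the same idea that avoids the scikit-learn dependency. It swaps MinMaxScaler for plain pandas arithmetic, which is an assumption made here for brevity and not the method used above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(2, 10))

new_df = df.copy()  # deep copy; df keeps the raw values
for col in new_df.columns:
    lo, hi = df[col].min(), df[col].max()
    # Min-max scaling written out by hand: (x - min) / (max - min)
    new_df[col] = (df[col] - lo) / (hi - lo)

print(df.iloc[:, :3])      # still the original integers
print(new_df.iloc[:, :3])  # scaled to the range 0..1
```

With new_df = df instead of new_df = df.copy(), the loop would overwrite the columns of df itself, because both names would refer to one object.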

Flounce answered 2/6, 2021 at 6:7 Comment(1)
You raise a good point but this would have been better with something reproducible. - Climatology
H
1

I was careless about using copy() until I ran the code below without it and found that changes to df_genel3 also affected df_genel

df_genel3 = df_genel
df_genel3.loc[(df_genel3['Hareket']=='İmha') , 'Hareket_Tutar'] = tutar 

copy() solved the problem

df_genel3 = df_genel.copy()
df_genel3.loc[(df_genel3['Hareket']=='İmha') , 'Hareket_Tutar'] = tutar
Hiro answered 11/4, 2022 at 6:45 Comment(0)
