Why is blindly using df.copy() a bad idea to fix the SettingWithCopyWarning?
There are countless questions about the dreaded SettingWithCopyWarning

I've got a good handle on how it comes about. (Notice I said good, not great)

It happens when a dataframe df is "attached" to another dataframe via an attribute stored in is_copy.

Here's an example

df = pd.DataFrame([[1]])

d1 = df[:]

d1.is_copy

<weakref at 0x1115a4188; to 'DataFrame' at 0x1119bb0f0>

We can either set that attribute to None or

d1 = d1.copy()

I've seen devs like @Jeff, and I can't remember who else, warn about doing that, citing that the SettingWithCopyWarning has a purpose.

Question
OK, so what is a concrete example that demonstrates why ignoring the warning by assigning a copy back to the original is a bad idea?

I'll define "bad idea" for clarification.

Bad Idea
It is a bad idea to place code into production that will lead to getting a phone call in the middle of a Saturday night saying your code is broken and needs to be fixed.

Now, how can using df = df.copy() in order to bypass the SettingWithCopyWarning lead to getting that kind of phone call? I want it spelled out because this is a source of confusion and I'm attempting to find clarity. I want to see the edge case that blows up!

Lachrymal answered 15/4, 2017 at 7:14 Comment(9)
This is a great question because I was under the impression using df_copy = df.copy() is the "safe" way of handling the original df (meaning, you are free to slice/alter the values without affecting the original df). I'm wondering what these edge cases might be.Entomo
@AndrewL if you want to work on a copy of data and strictly not modify the original dataframe, then it's perfectly correct to call .copy() explicitly. If you want to modify the data in the original dataframe, you need to respect the warning.Largehearted
I'm a bit confused, and reading through the answers it seems that others also don't know exactly what you're asking. Is it about an "example where ignoring the exception is a bad idea" or "when is using df = df.copy() to bypass the warning a bad idea"? One is about the "difference between views and (temporary) copies", the other is only about "when a possible way to avoid the problem goes haywire". These are loosely connected issues, but the answers to these questions will be completely different.Dinny
@Dinny you are correct. StevenG states copy is safe. That is an answer even if it's contrary to what I've been told. Your interpretation and confusion is spot on.Lachrymal
@Dinny I am also confused. Seems most people are talking about how to avoid modifying df. I think it depends on the purposes, if one wants to avoid modifying, then using .copy() is safe and the warning is redundant. If one wants to modify df, then .copy() means bug and the warning need to be respected.Largehearted
Can you provide more information about why you think that df = df.copy() is a bad idea? You mentioned others talking about this, maybe provide some links. I think this question may actually boil down to some general programming best-practice and not a pandas specific issue.Disarticulate
https://mcmap.net/q/534904/-a-faster-way-of-removing-unused-categories-in-pandas is the post I'm referring to.Lachrymal
I don't think that there is such an edge case you are asking for, when df = df.copy() blows up. As @thn pointed out, it completely depends on whether you want to work on a copy or not. However, consider original = df; df = df.copy(). This will yield two instances in memory. The original df is not cleaned up by the GC because there is still a reference (original) to it. In a production system this might eventually result in a MemoryError.Disarticulate
hey my man, have you just tried df = df.copy(deep = True)?Haiphong
C
18

Here are my 2 cents on this, with a very simple example of why the warning is important.

Assume that I create a df such as:

x = pd.DataFrame(list(zip(range(4), range(4))), columns=['a', 'b'])
print(x)
   a  b
0  0  0
1  1  1
2  2  2
3  3  3

Now I want to create a new object based on a subset of the original and modify it, such as:

q = x.loc[:, 'a']

Now this is a slice of the original, and whatever I do to it will affect x:

q += 2
print(x)  # checking x again, wow! it changed!
   a  b
0  2  0
1  3  1
2  4  2
3  5  3

This is what the warning is telling you: you are working on a slice, so everything you do to it will be reflected in the original DataFrame.

Now, using .copy(), q won't be a slice of the original, so an operation on q won't affect x:

x = pd.DataFrame(list(zip(range(4), range(4))), columns=['a', 'b'])
print(x)
   a  b
0  0  0
1  1  1
2  2  2
3  3  3

q = x.loc[:, 'a'].copy()
q += 2
print(x)  # oh, x did not change because q is a copy now
   a  b
0  0  0
1  1  1
2  2  2
3  3  3

By the way, a copy just means that q is a new object in memory, whereas a slice shares the same underlying data as the original object.

IMO, using .copy() is very safe. As an example, df.loc[:, 'a'] returns a slice but df.loc[df.index, 'a'] returns a copy. Jeff told me that this was unexpected behavior, and that : and df.index should behave the same as an indexer in .loc[]. Calling .copy() on either will return a copy, so better be safe: use .copy() if you don't want to affect the original dataframe.

By default, .copy() returns a deep copy of the DataFrame, which is a very safe way to avoid the phone call you are talking about.
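Rather than guessing whether an object is a view or a copy, you can check whether two arrays share memory. This is a minimal sketch: the view case uses plain NumPy arrays, because pandas' own view behaviour differs between versions, and only the explicit-copy case is checked for pandas. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# NumPy: basic slicing returns a view, .copy() returns independent data.
arr = np.arange(4)
view = arr[:]
dup = arr.copy()

view += 2    # writes through to arr
dup += 100   # does not touch arr

arr_after = arr.tolist()
view_shares = np.shares_memory(arr, view)
dup_shares = np.shares_memory(arr, dup)

# pandas: an explicit deep copy does not share its buffer with the original.
x = pd.DataFrame({'a': range(4), 'b': range(4)})
q = x['a'].copy()
q += 2
x_a_after = x['a'].tolist()
copy_shares = np.shares_memory(x['a'].to_numpy(), q.to_numpy())
```

Here arr changes because view points at the same buffer, while x is untouched because q owns its data.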

But using df.is_copy = None is just a trick that does not copy anything, which is a very bad idea: you will still be working on a slice of the original DataFrame.

One more thing that people tend not to know:

df[columns] may return a view.

df.loc[indexer, columns] may also return a view, but almost never does in practice. The emphasis is on "may" here.

Colum answered 22/4, 2017 at 12:57 Comment(1)
+1 for example of df.loc[]. However you seem to emphasize that copy is better than view, but view is the only ways to set values to df (via df.loc[] = new_values). One more thing, I think df.is_copy is a flag to raise the warning, not to make it a copy or view.Largehearted
D
8

While the other answers provide good information about why one shouldn't simply ignore the warning, I think your original question has not been answered, yet.

@thn points out that using copy() depends entirely on the scenario at hand. When you want the original data to be preserved, you use .copy(); otherwise you don't. If you are using copy() to circumvent the SettingWithCopyWarning, you are ignoring the fact that you may introduce a logical bug into your software. As long as you are absolutely certain that this is what you want to do, you are fine.

However, when using .copy() blindly you may run into another issue, which is no longer really pandas specific, but occurs every time you are copying data.

I slightly modified your example code to make the problem more apparent:

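import numpy as np
import pandas as pd

@profile  # injected by memory_profiler when run via "python -m memory_profiler"
def foo():
    df = pd.DataFrame(np.random.randn(2 * 10 ** 7))

    d1 = df[:]
    d1 = d1.copy()

if __name__ == '__main__':
    foo()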

When using memory_profiler, one can clearly see that .copy() doubles our memory consumption:

> python -m memory_profiler demo.py 
Filename: demo.py

Line #    Mem usage    Increment   Line Contents
================================================
     4   61.195 MiB    0.000 MiB   @profile
     5                             def foo():
     6  213.828 MiB  152.633 MiB    df = pd.DataFrame(np.random.randn(2 * 10 ** 7))
     7                             
     8  213.863 MiB    0.035 MiB    d1 = df[:]
     9  366.457 MiB  152.594 MiB    d1 = d1.copy()

This is because there is still a reference (df) that points to the original data frame. Thus, df is not cleaned up by the garbage collector and is kept in memory.

When you are using this code in a production system, you may or may not get a MemoryError depending on the size of the data you are dealing with and your available memory.

To conclude, it is not a wise idea to use .copy() blindly. Not just because you may introduce a logical bug in your software, but also because it may expose runtime dangers such as a MemoryError.


Edit: Even if you are doing df = df.copy(), and you can ensure that there are no other references to the original df, still copy() is evaluated before the assignment. Meaning that for a short time both data frames will be in memory.

Example (notice that you cannot see this behavior in the memory summary):

> mprof run -T 0.001 demo.py
Line #    Mem usage    Increment   Line Contents
================================================
     7     62.9 MiB      0.0 MiB   @profile
     8                             def foo():
     9    215.5 MiB    152.6 MiB    df = pd.DataFrame(np.random.randn(2 * 10 ** 7))
    10    215.5 MiB      0.0 MiB    df = df.copy()

But if you visualise memory consumption over time, at 1.6s both data frames are in memory:

[Figure: plot of memory consumption over time]
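The reference-lifetime point can also be checked without a profiler. This is a rough sketch, assuming CPython's reference counting; frame_bytes and backup are illustrative names and are not part of the script profiled above.

```python
import gc
import weakref

import numpy as np
import pandas as pd


def frame_bytes(frame: pd.DataFrame) -> int:
    """Bytes held by the frame's data and index, as reported by pandas."""
    return int(frame.memory_usage(index=True, deep=True).sum())


df = pd.DataFrame(np.zeros(1000))
original_ref = weakref.ref(df)   # lets us observe when the original is freed
original_bytes = frame_bytes(df)

backup = df                      # a second name keeps the original alive
df = df.copy()
both_alive = original_ref() is not None
combined_bytes = frame_bytes(backup) + frame_bytes(df)

del backup                       # drop the last reference to the original
gc.collect()
original_freed = original_ref() is None
```

While backup exists, both frames are held and the footprint is doubled; once the last reference is gone, the original can be reclaimed.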

Disarticulate answered 27/4, 2017 at 19:1 Comment(2)
And what happens to memory if I overwrite the name df.. as in df = df.copy(). I suspect that since there are now zero references to the object df used to point to, it is cleaned up.Lachrymal
@Lachrymal Yes, that is perfectly right. However, keep in mind that df.copy() is evaluated before the assignment to df. So for a short time, there will be two copies of the data frame in memory. I will update my answer.Disarticulate
B
2

EDIT:

After our comment exchange and from reading around a bit (I even found @Jeff's answer), I may be bringing owls to Athens, but the pandas docs contain this code example:

Sometimes a SettingWithCopy warning will arise at times when there’s no obvious chained indexing going on. These are the bugs that SettingWithCopy is designed to catch! Pandas is probably trying to warn you that you’ve done this:

def do_something(df):
    foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    foo['quux'] = value  # We don't know whether this will modify df or not!
    return foo

That may be an easily avoided problem for an experienced user/developer, but pandas is not only for the experienced...
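One way to remove the ambiguity in the docs example is to state the intent in code. The sketch below assumes the caller does not expect df to change; the value parameter and the sample frame are added here purely for illustration.

```python
import pandas as pd


def do_something(df: pd.DataFrame, value) -> pd.DataFrame:
    # Explicit copy: foo is an independent frame by construction.
    foo = df[['bar', 'baz']].copy()
    # ... many lines here ...
    foo['quux'] = value  # only ever changes foo
    return foo


df = pd.DataFrame({'bar': [1, 2], 'baz': [3, 4], 'other': [5, 6]})
result = do_something(df, value=0)
```

If the caller did expect df to gain the quux column, this version would be the silent bug the warning exists to flag, so the copy is only correct when it matches the intent.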

Still, you probably will not get a phone call in the middle of the night on a Sunday about this, but it may damage your data integrity in the long run if you don't catch it early.
Also, as Murphy's law states, the most time-consuming and complex data manipulation you ever do WILL be on a copy that gets discarded before it is used, and you will spend hours trying to debug it!

Note: All of this is hypothetical, because the very definition in the docs is a hypothesis based on the probability of (unfortunate) events... SettingWithCopy is a new-user-friendly warning which exists to warn new users of potentially random and unwanted behavior in their code.


There exists this issue from 2014.
The code that causes the warning in this case looks like this:
from pandas import DataFrame
# create example dataframe:
df = DataFrame ({'column1':['a', 'a', 'a'], 'column2': [4,8,9] })
df
# assign string to 'column1':
df['column1'] = df['column1'] + 'b'
df
# it works just fine - no warnings
#now remove one line from dataframe df:
df = df [df['column2']!=8]
df
# adding string to 'column1' gives warning:
df['column1'] = df['column1'] + 'c'
df

And jreback makes some comments on the matter:

You are in fact setting a copy.

You prob don't care; it is mainly to address situations like:

df['foo'][0] = 123... 

which sets the copy (and thus is not visible to the user)

This operation, make the df now point to a copy of the original

df = df [df['column2']!=8]

If you don't care about the 'original' frame, then its ok

If you are expecting that the

df['column1'] = df['columns'] + 'c'

would actually set the original frame (they are both called 'df' here which is confusing) then you would be suprised.

and

(this warning is mainly for new users to avoid setting the copy)

Finally he concludes:

Copies don't normally matter except when you are then trying to set them in a chained manner.
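For the filter-then-modify case from the 2014 issue, one option is to build the result as a new frame under its own name, so there is no doubt about which object is being set. This is a sketch only; filtered is an illustrative name and assign is just one of several ways to do this.

```python
import pandas as pd

df = pd.DataFrame({'column1': ['a', 'a', 'a'], 'column2': [4, 8, 9]})

# Filter and derive the new column in one expression; assign returns a new frame.
filtered = (
    df[df['column2'] != 8]
    .assign(column1=lambda d: d['column1'] + 'c')
)

original_column = df['column1'].tolist()
filtered_column = filtered['column1'].tolist()
```

The original df keeps its values, and filtered holds the modified rows.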

From the above we can draw these conclusions:

  1. SettingWithCopyWarning has a meaning, and there are (as presented by jreback) situations in which this warning matters and the complications can be avoided.
  2. The warning is mainly a "safety net" for newer users, to make them pay attention to what they are doing, since it may cause unexpected behavior in chained operations. Thus a more advanced user can turn off the warning (from jreback's answer):
pd.set_option('mode.chained_assignment', None)

or you could do:

df.is_copy = False
Benedetto answered 21/4, 2017 at 7:55 Comment(4)
But as this link clearly demonstrates (that is why I copy pasted) it is not bad! It may lead to unexpected behavior when the assignments are chained, especially if you are new to pandas and programming in general. So SettingWithCopyWarning only acts as a warning to newer users (which I summarize in the conclusion of my answer)Benedetto
If that was a possibility, then this wouldn't classify as just a protective warning, but as a serious warning or even an error! It is only a reminder of sorts: "Are you sure about what you are doing champ? Take another look and don't be a stranger!!" that's how I would classify this warning from what I read :)Benedetto
Well it says something about unexpected behavior when chaining assignments, but I cannot imagine how to screw it up that badly as to qualify for a "bad idea"...Benedetto
I have edited my answer with some new findings, but not with a 'crash & burn' behavior...Benedetto
L
2

Update:

TL;DR: I think how to treat the SettingWithCopyWarning depends on the purpose. If one wants to avoid modifying df, then working on df.copy() is safe and the warning is redundant. If one wants to modify df, then using .copy() is the wrong approach and the warning needs to be respected.

Disclaimer: I don't have private/personal communications with pandas experts like other answerers. So this answer is based on the official pandas docs, which is what a typical user would rely on, and on my own experience.


SettingWithCopyWarning is not the real problem; it warns about the real problem. Users need to understand and solve the real problem, not bypass the warning.

The real problem is that indexing a dataframe may return a copy, and modifying this copy will not change the original dataframe. The warning asks users to check for and avoid that logical bug. For example:

import pandas as pd, numpy as np
np.random.seed(7)  # reproducibility
df = pd.DataFrame(np.random.randint(1, 10, (3,3)), columns=['a', 'b', 'c'])
print(df)
   a  b  c
0  5  7  4
1  4  8  8
2  8  9  9
# Setting with chained indexing: does not work and raises a warning.
df[df.a>4]['b'] = 1
print(df)
   a  b  c
0  5  7  4
1  4  8  8
2  8  9  9
# Setting with chained indexing: *may* work in some cases without a warning, but don't rely on it; always avoid chained indexing.
df['b'][df.a>4] = 2
print(df)
   a  b  c
0  5  2  4
1  4  8  8
2  8  2  9
# Setting using .loc[]: guaranteed to work.
df.loc[df.a>4, 'b'] = 3
print(df)
   a  b  c
0  5  3  4
1  4  8  8
2  8  3  9

About the wrong ways to bypass the warning:

df1 = df[df.a>4]['b']
df1.is_copy = None
df1[0] = -1  # no warning because you tricked pandas, but the assignment does not reach df
print(df)
   a  b  c
0  5  7  4
1  4  8  8
2  8  9  9

df1 = df[df.a>4]['b']
df1 = df1.copy()
df1[0] = -1  # no warning because df1 is a separate object now, but the assignment does not reach df
print(df)
   a  b  c
0  5  7  4
1  4  8  8
2  8  9  9

So, setting df1.is_copy to False or None is just a way to bypass the warning, not a way to solve the real problem when assigning. Setting df1 = df1.copy() also bypasses the warning, in an even more wrong way, because df1 is then no longer linked to df by a weakref but is a totally independent object. So if users want to change values in df, they will receive no warning, only a logical bug. Inexperienced users will not understand why df does not change after being assigned new values. That is why it is advisable to avoid these approaches completely.
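The silent failure described here is easy to reproduce. A minimal sketch, assuming the author's intent is to set b to -1 in df wherever a > 4; the frame values and the name subset are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [5, 4, 8], 'b': [7, 8, 9], 'c': [4, 8, 9]})

# Intent: change b in df wherever a > 4.
subset = df[df['a'] > 4].copy()  # .copy() added only to silence the warning
subset['b'] = -1                 # succeeds quietly, but only on the copy

b_in_df = df['b'].tolist()
b_in_subset = subset['b'].tolist()
```

No warning and no error is raised, yet df still holds its old values. That is the kind of bug that surfaces much later, far from the line that caused it.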

If the users only want to work on the copy of the data, that is, strictly not modifying the original df, then it's perfectly correct to call .copy() explicitly. But if they want to modify the data in the original df, they need to respect the warning. The point is, users need to understand what they are doing.

When the warning is caused by chained indexing assignment, the correct solution is to avoid assigning values to the copy produced by df[cond1][cond2] and to assign through df.loc[cond1, cond2] instead.
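Both styles of modifying the original can be sketched as follows: a direct single-step .loc assignment, and an explicit copy whose result is written back with .loc. The frame values and the names mask and work are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({'a': [5, 4, 8], 'b': [7, 8, 9], 'c': [4, 8, 9]})
mask = df['a'] > 4

# Direct, single-step assignment on the original frame.
df.loc[mask, 'b'] = 3
b_after_loc = df['b'].tolist()

# Alternative: compute on an explicit copy, then write the result back.
work = df.loc[mask].copy()
work['c'] = work['a'] * 10
df.loc[work.index, 'c'] = work['c']  # aligned on the index labels
c_after_writeback = df['c'].tolist()
```

In both cases the write to df is a single indexing operation on df itself, so there is no intermediate object whose view-or-copy status matters.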

More examples of setting with copy warning/error and solutions are shown in the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Largehearted answered 21/4, 2017 at 11:26 Comment(5)
There is nothing concrete about this. I want to see an example of how it would cause someone a problem. The warning is intended to warn you. Warn you of what? Show me as clearly as you can, what we are being warned of. I upvoted you to encourage you to try and improve the answer.Lachrymal
I hope it's clearer now. Actually you should distinguish between 2 use cases: you want to work on a copy of the data, or you want to modify the data. If you want to modify the data, you need to respect the warning.Largehearted
To keep the coding style consistent, I always use df.loc[] instead of df[], although sometimes it works as expected.Largehearted
@thn it is not certain that .loc[] will return a view; it almost never does in practice. Same thing with df[]: it may or may not return a view. Please be careful.Colum
@StevenG I usually use df.loc[] to set values to df, such as df.loc[] = new_values, I always see it works as a view. I agree that df[] may return a view, but usually a copy, hence the warning and unit test are needed. I think relying on the view to modify data is bad, as one should avoid mutable in general. One should explicitly set values back to df using df.loc[]Largehearted
