Unpredictable pandas slice assignment behavior with no SettingWithCopyWarning
It's well known (and understandable) that pandas behavior is essentially unpredictable when assigning to a slice. But I'm used to being warned about it by the SettingWithCopyWarning.

Why is the warning not generated in either of the following two code snippets, and what techniques could reduce the chance of writing such code unintentionally?

# pandas 0.18.1, python 3.5.1
import pandas as pd
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data[['a', 'b']]
data = data['a']
new_data.loc[0, 'a'] = 100 # no warning, doesn't propagate to data

data[0] == 1
True


data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data['a']
data = data['a']
new_data.loc[0] = 100 # no warning, propagates to data

data[0] == 100
True

I thought the explanation was that pandas only produces the warning when the parent DataFrame is still reachable from the current context. (This would be a weakness of the detection algorithm, as my previous examples show.)

In the next snippet, AFAIK the original two-column DataFrame is no longer reachable, and yet pandas warning mechanism manages to trigger (luckily):

data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data['a']
data = data[['a']]
new_data.loc[0] = 100 # warning, so we're safe

Edit:

While investigating this, I found another case of a missing warning:

data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
data = data.groupby('a')
new_data = data.filter(lambda g: len(g)==1)
new_data.loc[0, 'a'] = 100 # no warning, does not propagate to data
assert data.filter(lambda g: True).loc[0, 'a'] == 1

Even though an almost identical example does trigger a warning:

data = pd.DataFrame({'a': [1, 2, 2], 'b': ['a', 'b', 'c']})
data = data.groupby('a')
new_data = data.filter(lambda g: len(g)==1)
new_data.loc[0, 'a'] = 100 # warning, does not propagate to data
assert data.filter(lambda g: True).loc[0, 'a'] == 1

Update: I'm responding to the answer by @firelynx here because it's hard to fit it in a comment.

In the answer, @firelynx says that the first code snippet results in no warning because I'm taking the entire DataFrame. But even when I take only part of it, I still don't get a warning:

# pandas 0.18.1, python 3.5.1
import pandas as pd
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c'], 'c': range(3)})
new_data = data[['a', 'b']]
data = data['a']
new_data.loc[0, 'a'] = 100 # no warning, doesn't propagate to data

data[0] == 1
True
Bonaventure answered 4/9, 2016 at 22:7 Comment(8)
With regard to your second example, there is no copy. new_data and data are the same object. The assignment doesn't "propagate"; it just occurs in one object that has two names pointing at it. More generally, I don't think there is any guarantee that SettingWithCopyWarning will or won't be raised in any particular situation (especially more complicated situations like the groupby examples you added). It's just a rough safeguard to prevent the most easily-catchable errors.Fionnula
@Fionnula can you clarify what you mean by new_data and data are the same object? new_data has type DataFrame, and data has type DataFrameGroupBy.Bonaventure
I'm referring to your second example (the second code snippet in your first block of code), in which both data and new_data are set to data['a'].Fionnula
@Fionnula I see, yes. Given that pandas kinda assumes the programmer tries to modify the parent DataFrame when assigning to a slice, this example is the least problematic - after all, the parent is properly modified and no warning is issued. The real issue is that there's no warning when the parent is not modified.Bonaventure
Just a practical note -- I'd suggest using copy() anytime you want to be sure a "copy" is really a copy. E.g. new_data = data['a'].copy()Ivied
@Ivied I would, but unfortunately if it's already a copy, calling .copy on it would be very inefficient: it would result in a second copy being made.Bonaventure
True, though only temporarily, but of course if you are near memory limits that could be a problem. In other cases, some inefficiency that increases safety is a good tradeoff.Ivied
And in the case of datasets that aren't taxing your PC's memory, the loss of CPU/memory efficiency is completely trivial compared to spending 15 minutes of programmer time trying to figure out if it is a copy or not. ;-)Ivied
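
A minimal sketch of the defensive copy() pattern suggested in the comments above, reusing the question's toy data:

import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data['a'].copy()  # explicitly independent of data

new_data.loc[0] = 100        # only the copy changes, no ambiguity to worry about
data.loc[0, 'a'] == 1
True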
Explaining what you're doing, step by step

The DataFrame you create is not a view

data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
data._is_view
False

new_data is also not a view, because you are taking all columns

new_data = data[['a', 'b']]
new_data._is_view
False

Now you are assigning data to be the Series 'a'

data = data['a']
type(data)
pandas.core.series.Series

Which is a view

data._is_view
True

Now you update a value in the non-copy new_data

new_data.loc[0, 'a'] = 100 # no warning, doesn't propagate to data

This should not give a warning: new_data is a whole DataFrame of its own (a copy, not a view of anything).

The Series you've created flags itself as a view, but it's not a DataFrame and does not behave as a DataFrame view.
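
Putting those steps into one snippet (note that _is_view is a pandas-internal attribute, so this is only a diagnostic sketch and its output may differ between pandas versions):

import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
data._is_view      # False: a freshly constructed DataFrame owns its data
new_data = data[['a', 'b']]
new_data._is_view  # False: selecting a list of columns produces a copy
data = data['a']
data._is_view      # True: selecting a single column returns a Series view
new_data.loc[0, 'a'] = 100  # modifies the copy only, so no SettingWithCopyWarning
data[0]            # 1: the original column is untouched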

Avoiding writing code like this

The Series vs. DataFrame problem is a very common one in pandas [citation not needed if you've worked with pandas for a while]

The problem is really that you should always be writing

data[['a']] not data['a']

The left creates a DataFrame, the right creates a Series.
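
A quick sketch of the difference, using the same toy data as the question (df is just an illustrative name):

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
type(df[['a']])   # list of labels -> DataFrame
pandas.core.frame.DataFrame
type(df['a'])     # single label -> Series
pandas.core.series.Series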

Some people may argue that you should never write data['a'] and should use data.a instead, so that you can add warnings to your environment for data['a'] code.

This does not work. First of all, using the data.a syntax causes cognitive dissonance.

A DataFrame is a collection of columns. In Python we access members of collections with the [] operator, and attributes with the . operator. Switching these around causes cognitive dissonance for any Python programmer, especially when you start doing things like del data.a and notice that it does not work. See this answer for a more extensive explanation
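
For example (a small sketch; the exact exception message can vary between pandas versions):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
df.a              # attribute-style access works for reading
# del df.a        # raises AttributeError: columns are items, not real attributes
del df['a']       # item-style deletion removes the column as expected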

Clean code to the rescue

It is hard to see the difference between data[['a']] and data['a'].

This is a smell. We should be doing neither.

The proper way, following clean-code principles and the Zen of Python ("Explicit is better than implicit"), is this:

columns = ['a']
data[columns]

This may not be so mind-boggling, but take a look at the following example:

data[['ad', 'cpc', 'roi']]

What does this mean? What are these columns? What data are you getting here?

These are the first questions that come to mind when reading this line of code.

How do you solve it? Not with a comment.

ad_performance_columns = ['ad', 'cpc', 'roi']
data[ad_performance_columns]

More explicit is always better.

For more, please consider buying a book on clean code. Maybe this one.

Philipp answered 29/5, 2017 at 8:38 Comment(1)
See my update at the bottom. I don't understand your explanation of why no warning should be raised in the code snippet I just added. And I don't see how your clean code approach would prevent someone from writing the code in that last snippet.Bonaventure
