pandas views vs. copies: the docs say "nobody knows"?

3

15

There are lots of questions on Stack Overflow about chained indexing and whether a particular operation makes a view or a copy (for instance, here or here). I still don't fully get it, but the amazing part is that the official docs say "nobody knows". (!?!??) Here's an example from the docs; can you tell me if they really meant that, or if they're just being flippant?

From https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#why-does-assignment-fail-when-using-chained-indexing

def do_something(df):
   foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
   # ... many lines here ...
   foo['quux'] = value       # We don't know whether this will modify df or not!
   return foo

Seriously? For that specific example, is it really true that "nobody knows" and this is non-deterministic? Will that really behave differently on two different dataframes? The rules are really that complex? Or did the guy mean there is a definite answer but just that most people aren't aware of it?

Conaway answered 23/8, 2016 at 12:58 Comment(1)
Yes, this is frustrating. To add to the pain, that same page later says: "This can work at times, but it is not guaranteed to, and therefore should be avoided:" dfc = dfc.copy(). So, how are we supposed to ensure that a DataFrame which is passed to a function is not just a copy or slice of another DataFrame?? - Ore
P
6

I think I can demonstrate something that clarifies your situation. In your example, the subset is initially a view, but once you try to modify it by adding a column it turns into a copy. You can test this by looking at the (private) attribute ._is_view:

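In [29]:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
def doSomething(df):
    a = df[['b','c']]
    print('before ', a._is_view)  # check the flag right after subsetting
    a['d'] = 0                    # add a column to the subset only
    print('after ', a._is_view)   # check the flag again after the change

doSomething(df)
df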

before  True
after  False
Out[29]:
          a         b         c
0  0.108790  0.580745  1.820328
1  1.066503 -0.238707 -0.655881
2 -1.320731  2.038194 -0.894984
3 -0.962753 -3.961181  0.109476
4 -1.887774  0.909539  1.318677

So here we can see that a is initially a view on a subsection of the original df, but once you add a column to it, that is no longer true. The output also shows that the original df is not modified: it still has only columns a, b and c.
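If the intention is for the function to work on its own data, a minimal sketch of the explicit version is below. It assumes you want the subset to be independent of the input frame; the column names follow the example above, and .copy() removes the view-or-copy question by construction rather than relying on ._is_view.

```python
import numpy as np
import pandas as pd


def do_something(df):
    # Explicit copy: foo owns its data, whatever df[['b', 'c']] returned
    foo = df[["b", "c"]].copy()
    foo["d"] = 0  # adds the column to foo only
    return foo


df = pd.DataFrame(np.random.randn(5, 3), columns=list("abc"))
result = do_something(df)

print(list(df.columns))      # ['a', 'b', 'c']
print(list(result.columns))  # ['b', 'c', 'd']
```

The cost is one extra copy of the selected columns, which is usually acceptable unless the frame is very large.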

Pooi answered 23/8, 2016 at 14:29 Comment(2)
But in your example, it's working a particular way, with rules that you can understand and explain. If that were the case for all dataframes, why would the official docs say "no one knows"? Are they implying the behavior may be different for other data frames? Because if it always works the way you said, then the docs could just offer a rule of "if you want X, then always do Y". - Conaway
I don't think the example in the docs is a good one. In the case of chained indexing a warning will be raised, but here it's ambiguous: when you take a reference to a view of the original df and add a new column to it, should that column be added to the original df or not? In this case it isn't. - Pooi
B
4

Here's the core bit of documentation that I think you may have missed:

Outside of simple cases, it’s very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees)

So there's an underlying numpy array that has some sort of memory layout. pandas is not concerned with having any sort of knowledge about that. I didn't read the docs too thoroughly besides that, but I assume they have some kind of approach that you should be taking instead, if you're actually wanting to set values.
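For what it's worth, here is a minimal sketch of that kind of approach, assuming the goal is to set values in the original frame: do the row selection and the assignment in a single .loc call instead of two chained indexing steps. The frame contents and the condition are made up for illustration; only the bar and baz column names come from the question.

```python
import pandas as pd

df = pd.DataFrame({"bar": [1, 2, 3], "baz": [4, 5, 6]})

# Chained form, shown only as a comment: two indexing calls, and the
# assignment may land on a temporary object instead of df.
#     df[df["bar"] > 1]["baz"] = 0

# Single .loc call: one setitem operation applied to df itself
df.loc[df["bar"] > 1, "baz"] = 0

print(df["baz"].tolist())  # [4, 0, 0]
```

If you want an independent subset instead, calling .copy() on the selection states that intention explicitly.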

Belgae answered 23/8, 2016 at 13:5 Comment(10)
Yes, I saw that line, but that's not helpful. What is this alternative that I should be taking instead? The above example looks like a very reasonable thing to do, so if that's not allowed, then what should we do instead? Call .copy() after every single method just in case?!? - Conaway
But in your code example you take a subset of your original df and then you try to add a new column, so what's the intention here? A new column on the original df or on a copy of the df? I don't think that this should be regarded as unambiguous. - Pooi
@Pooi The code is also from the docs. I am surprised actually, because I thought df[['bar', 'baz']] always returns a copy (based on this). - Pompidou
@ayhan I guess it depends on the underlying np array and memory layout, but code-wise the semantics of the snippet are unclear to me, and I'd always explicitly call copy() on the subset to ensure I'm working on a copy without relying on any assumptions. - Pooi
@ayhan Additionally, you could use the attribute ._is_view, which returns True or False. On my system it returns True when you take a subselection of the df: In [25]: a = df[['b','c']]; a._is_view Out[25]: True versus: In [26]: a = df[['b','c']].copy(); a._is_view Out[26]: False. See related: #26879573 - Pooi
@Pooi When I tried, a._is_view returned False (before calling copy()), so it seems that is also uncertain. I guess the best thing to do is to be explicit like you said. - Pompidou
@ayhan I don't see that, but it depends on whether you try to modify the view; see my answer. - Pooi
@ayhan Yes, the code is from the docs, but it's an illustration of something that you probably shouldn't do, or at least an ambiguous case. - Belgae
@WayneWerner Why do you think that example is something you shouldn't do? If you're not supposed to do the thing in the example, then what should you do instead? Call .copy() after every single method just in case? The docs don't offer anything better, other than "nobody knows". - Conaway
Because it says (at the top): "Whether a copy or a reference is returned for a setting operation, may depend on the context. This is sometimes called chained assignment and should be avoided." But you're right that it's not explicit about telling you what you should do instead. There may be some other questions on SO about that; at least glancing at the related questions, there's some reading there. If you can't find anything, though, I'd highly recommend actually asking that question: "here's what the docs say, but they don't tell me what to do. What should I do?" (obviously with a minimal reproducible example). - Belgae
L
3

Here's an example I thought did a good job of illustrating the inconsistency.

I subset the dataframe, which returns a view. I can then overwrite the values in an entire column, but depending on how I do that syntactically, I get different results.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 100))
x = df[(df > 2).any(axis=1)]
print(x._is_view)
# Output: True

# Prove that below we are referring to the exact same slice of the dataframe
assert (x.iloc[:len(x), 1] == x.iloc[:, 1]).all()

# Assign using equivalent notation to below
x.iloc[:len(x), 1] = 1
print(x._is_view)
# Output: True

# Assign using slightly different syntax
x.iloc[:, 1] = 1
print(x._is_view)
# Output: False
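Since ._is_view is a private flag, a rough alternative is to check the effect directly: snapshot the original, mutate the subset, and compare. This is only a sketch on a small made-up frame; the names snapshot and original_changed are illustrative, and the warning filter is there only because some pandas versions emit a SettingWithCopyWarning for this pattern.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=list("abc"))
snapshot = df.copy(deep=True)  # independent reference copy of the original

x = df[(df > 4).any(axis=1)]   # boolean-mask subset, as in the example above

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence copy-related warnings, if any
    x.iloc[:, 1] = 1                 # overwrite a whole column of the subset

# Compare behavior instead of inspecting the private flag
original_changed = not df.equals(snapshot)
print(original_changed)  # False
```

The comparison tells you what actually happened to df, which is the thing you care about, whatever the flag reports.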
Ladino answered 25/5, 2017 at 22:18 Comment(1)
Nice example. Have you found consistent reasoning as to why this is? - Uraeus
