What is the point of views in pandas if it is undefined whether an indexing operation returns a view or a copy?

I have switched from R to pandas. I routinely get a SettingWithCopyWarning when I do something like:

import pandas as pd

df_a = pd.DataFrame({'col1': [1, 2, 3, 4]})

# Filtering step, which may or may not return a view
df_b = df_a[df_a['col1'] > 1]

# Add a new column to df_b
df_b['new_col'] = 2 * df_b['col1']

# SettingWithCopyWarning!!

I think I understand the problem, though I'll gladly learn what I got wrong. In the given example, it is undefined whether df_b is a view on df_a or not. Thus, the effect of assigning to df_b is unclear: does it affect df_a? The problem can be solved by explicitly making a copy when filtering:

df_a = pd.DataFrame({'col1': [1,2,3,4]})    

# Filtering step, definitely a copy now
df_b = df_a[df_a['col1'] > 1].copy()

# Add a new column to df_b
df_b['new_col'] = 2 * df_b['col1']

# No Warning now

I think there is something that I am missing: if we can never really be sure whether we create a view or not, what are views good for? From the pandas documentation (http://pandas-docs.github.io/pandas-docs-travis/indexing.html?highlight=view#indexing-view-versus-copy):

Outside of simple cases, it’s very hard to predict whether it [getitem] will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees)

Similar warnings can be found for different indexing methods.

I find it very cumbersome and error-prone to sprinkle .copy() calls throughout my code. Am I using the wrong style for manipulating my DataFrames? Or is the performance gain so high that it justifies the apparent awkwardness?

Chlorite answered 19/1, 2016 at 18:40 Comment(3)
You can safely disable this new warning with the following assignment: pd.options.mode.chained_assignment = None - Jada
Hmmm, maybe resetting the index helps: df_b = df_a[df_a['col1'] > 1].reset_index(drop=True) - Rectitude
@GeorgePetrov I would strongly suggest against disabling that! The warning comes up for a good reason -- if anything, I would actually suggest you promote it to an exception instead of a warning. - Hanging

Great question!

The short answer is: this is a flaw in pandas that's being remedied.

You can find a longer discussion of the nature of the problem here, but the main take-away is that we're now moving to a "copy-on-write" behavior in which any time you slice, you get a new copy, and you never have to think about views. The fix will soon come through this refactoring project. I actually tried to fix it directly (see here), but it just wasn't feasible in the current architecture.

In truth, we'll keep views in the background -- they make pandas SUPER memory efficient and fast when they can be provided -- but we'll end up hiding them from users so, from the user perspective, if you slice, index, or cut a DataFrame, what you get back will effectively be a new copy.

(This is accomplished by creating views when the user is only reading data, but whenever an assignment operation is used, the view will be converted to a copy before the assignment takes place.)
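As a rough illustration of that behavior, here is a minimal sketch that opts in to Copy-on-Write and then repeats the example from the question without .copy(). The option name "mode.copy_on_write" is an assumption about the installed pandas version (it is an opt-in setting in some releases and may be missing or deprecated in others), so the sketch simply skips it if it cannot be set.

```python
import warnings

import pandas as pd

# Assumption: the installed pandas exposes the "mode.copy_on_write" option.
# If it does not (or the option is deprecated because Copy-on-Write is already
# the default), the try/except and the warning filter let the script continue.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    try:
        pd.set_option("mode.copy_on_write", True)
    except Exception:
        pass

df_a = pd.DataFrame({"col1": [1, 2, 3, 4]})

# Filtering step; no explicit .copy() here.
df_b = df_a[df_a["col1"] > 1]

# Add a new column to df_b. With Copy-on-Write active, this is treated as a
# write to df_b alone and is not expected to raise SettingWithCopyWarning.
df_b["new_col"] = 2 * df_b["col1"]

print(df_b)
print(df_a.columns.tolist())  # df_a keeps only its original column
```

On a pandas version where the option is unavailable, the same script still runs, but the assignment may emit the usual warning.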

Best guess is the fix will land within a year. In the meantime, I'm afraid some .copy() calls may be necessary, sorry!

Hanging answered 20/1, 2016 at 19:24 Comment(1)
Update from 2024: Copy-on-Write mode is available, and will become the default in pandas 3.0. - Agraphia

I agree this is a bit funny. My current practice is to look for a "functional" method for whatever I want to do (in my experience these almost always exist with the exception of renaming columns and series). Sometimes it makes the code more elegant, sometimes it makes it worse (I don't like assign with lambda), but at least I don't have to worry about mutability.

So for indexing, instead of using the slice notation, you can use query which will return a copy by default:

In [5]: df_a.query('col1 > 1')
Out[5]:
   col1
1     2
2     3
3     4

I expand on it a little in this blog post.

Edit: As raised in the comments, it looks like I'm wrong about query returning a copy by default. However, if you use the assign style, assign will make a copy before returning your result, and you're all good:

df_b = (df_a.query('col1 > 1')
            .assign(newcol = 2*df_a['col1']))
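The chain above relies on index alignment between df_a['col1'] and the filtered rows. If that feels fragile, one variant (a sketch only, and it does use the lambda style I said I'm not fond of) is to pass assign a callable, which pandas evaluates against the frame it is called on. The column name new_col is just borrowed from the question.

```python
import pandas as pd

df_a = pd.DataFrame({"col1": [1, 2, 3, 4]})

# query() filters the rows; assign() returns a new DataFrame, so df_a is untouched.
# The lambda receives the already-filtered frame, so no reference to df_a is needed.
df_b = (df_a.query("col1 > 1")
            .assign(new_col=lambda d: 2 * d["col1"]))

print(df_b)
```

Because nothing is assigned in place, there is no df_b['...'] = ... step that could trigger the warning.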
Hileman answered 19/1, 2016 at 22:29 Comment(3)
Why does the sequence df_b = df_a.query('col1 > 1') followed by df_b['new_col'] = 2 * df_b['col1'] still give the SettingWithCopyWarning? - Dichromate
@maxymoo: This answers the second part of my question: a programming style to avoid SettingWithCopy issues. Thanks! I really liked your blog post! Can you answer the question from screenpaver? I think most of the suggestions in your blog post work very well, but .query() does not seem to return a copy in all cases! So how can I do filtering in a method chain? - Chlorite
Yes, thanks. Or this will also work (without query): df_b = df_a[df_a.col1>1].assign(newcol = 2*df_a['col1']) - Dichromate
