I love pandas and have been using it for years and feel pretty confident I have a good handle on how to subset dataframes and deal with views vs copies appropriately (though I use a lot of assertions to be sure). I also know that there have been tons of questions about SettingWithCopyWarning, e.g. How to deal with SettingWithCopyWarning in Pandas? and some great recent guides on wrapping your head around when it happens, e.g. Understanding SettingWithCopyWarning in pandas.
But I also know specific things like the quote from this answer are no longer in the most recent docs (0.22.0) and that many things have been deprecated over the years (leading to some inappropriate old SO answers), and that things are continuing to change.
Recently, after teaching pandas to complete newcomers with very basic general Python knowledge about things like avoiding chained indexing (and using .iloc/.loc), I've still struggled to provide general rules of thumb for knowing when it's important to pay attention to the SettingWithCopyWarning (e.g. when it's safe to ignore it).
I've personally found that the specific pattern of subsetting a dataframe according to some rule (e.g. slicing or a boolean operation) and then modifying that subset, independent of the original dataframe, is a much more common operation than the docs suggest. In this situation we want to modify the copy, not the original, and the warning is confusing/scary to newcomers.
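To make that pattern concrete, here is a minimal sketch of what I mean, assuming the subset is meant to be fully independent. The names df and subset are just illustrative; the explicit .copy() is what makes the intent unambiguous.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 4, 6, 8, 10], 'B': [1, 3, 5, 7, 9]})

# Subset by a boolean rule, then explicitly ask for an independent copy.
subset = df[df['A'] < 10].copy()

# Writing to the explicit copy leaves df untouched.
subset.iloc[0, 1] = 100

print(df.loc[0, 'B'])     # still the original value in df
print(subset.iloc[0, 1])  # the modified value in the copy
```

Because the copy is explicit, pandas has no reason to suspect the write was meant for the original.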
I know it's not trivial to know ahead of time when a view vs a copy is returned, e.g.
What rules does Pandas use to generate a view vs a copy?
Checking whether data frame is copy or view in Pandas
So instead I'm looking for the answer to a more general (beginner-friendly) question: when does performing an operation on a subsetted dataframe affect the original dataframe from which it was created, and when are they independent?
I've created some cases below that I think seem reasonable, but I'm not sure if there's a "gotcha" I'm missing or if there's any easier way to think about or check this. I was hoping someone could confirm that my intuitions about the following use cases are correct as they pertain to my question above.
import pandas as pd
df1 = pd.DataFrame({'A':[2,4,6,8,10],'B':[1,3,5,7,9],'C':[10,20,30,40,50]})
1) Warning: No
Original changed: No
# df1 will be unaffected because we use .copy() method explicitly
df2 = df1.copy()
#
# Reference: docs
df2.iloc[0,1] = 100
2) Warning: Yes (I don't really understand why)
Original changed: No
# df1 will be unaffected because .query() always returns a copy
#
# Reference:
# https://mcmap.net/q/65033/-what-rules-does-pandas-use-to-generate-a-view-vs-a-copy
df2 = df1.query('A < 10')
df2.iloc[0,1] = 100
3) Warning: Yes
Original changed: No
# df1 will be unaffected because boolean indexing with .loc
# always returns a copy
#
# Reference:
# https://mcmap.net/q/66157/-pandas-subindexing-dataframes-copies-vs-views
df2 = df1.loc[df1['A'] < 10,:]
df2.iloc[0,1] = 100
4) Warning: No
Original changed: No
# df1 will be unaffected because list indexing with .loc (or .iloc)
# always returns a copy
#
# Reference:
# Same as 3)
df2 = df1.loc[[0,3,4],:]
df2.iloc[0,1] = 100
5) Warning: No
Original changed: Yes (confusing to newcomers but makes sense)
# df1 will be affected because scalar/slice indexing with .iloc/.loc
# always references the original dataframe, but may sometimes
# provide a view and sometimes provide a copy
#
# Reference: docs
df2 = df1.loc[:10,:]
df2.iloc[0,1] = 100
tl;dr
When creating a new dataframe from the original, changing the new dataframe:
Will change the original when scalar/slice indexing with .loc/.iloc is used to create the new dataframe.
Will not change the original when boolean indexing with .loc, .query(), or .copy() is used to create the new dataframe.
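One way to check these intuitions empirically, rather than reasoning about views and copies, is to snapshot the dataframe, write to the subset, and compare. The sketch below does that; modifies_original is a hypothetical helper name (not part of pandas), and it hard-codes the same df2.iloc[0,1] = 100 write used in the cases above.

```python
import warnings
import pandas as pd

def modifies_original(df, make_subset):
    """Return True if writing to make_subset(df) also changes df.

    Hypothetical helper for experimentation; not a pandas API.
    """
    work = df.copy()      # work on a fresh frame so the caller's df is never touched
    before = work.copy()  # snapshot to compare against afterwards
    subset = make_subset(work)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # silence any SettingWithCopyWarning during the probe
        subset.iloc[0, 1] = 100
    return not work.equals(before)

df1 = pd.DataFrame({'A': [2, 4, 6, 8, 10],
                    'B': [1, 3, 5, 7, 9],
                    'C': [10, 20, 30, 40, 50]})

print(modifies_original(df1, lambda d: d.copy()))               # False
print(modifies_original(df1, lambda d: d.loc[d['A'] < 10, :]))  # False
# Slice indexing: the result depends on the pandas version and on
# whether Copy-on-Write is enabled, so no expected value is shown.
print(modifies_original(df1, lambda d: d.loc[:10, :]))
```

This only tests one specific write on one specific dataframe, so it is a sanity check rather than a guarantee about view-vs-copy behaviour in general.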