Confusion about pandas copy of slice of dataframe warning

I've looked through a bunch of questions and answers related to this issue, but I'm still finding that I'm getting this copy of slice warning in places where I don't expect it. Also, it's cropping up in code that was running fine for me previously, leading me to wonder if some sort of update may be the culprit.

For example, in the code below all I'm doing is reading an Excel file into a pandas DataFrame and cutting down the set of columns with the df[[]] syntax.

import pandas as pd

df = pd.read_excel(filepath)
df1 = df[['Gender','Age','Date to Delivery','Date to insert']]

Now, any further changes I make to df1 raise the copy-of-slice warning. For example, the following code

df1['Age'] = df1.Age.fillna(0)
df1['Age'] = df1.Age.astype(int)

raises the following warning

/Users/samlilienfeld/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning:   
A value is trying to be set on a copy of a slice from a DataFrame.   
Try using .loc[row_indexer,col_indexer] = value instead

I'm confused because I thought the df[[]] column subsetting returned a copy by default. The only way I've found to suppress the warning is by explicitly adding df[[]].copy(). I could have sworn that in the past I did not have to do that and did not get the copy-of-slice warning.

Similarly, I have some other code that runs a function on a dataframe to filter it in certain ways:

def lim(df):
    if (geography == "All"):
        df1 = df
    else:
        df1 = df[df.center_JO == geography]
    df_date = df1[(df1.date >= start) & (df1.date <= end)]
    return df_date

df_lim = lim(df)

From this point forward, any changes I make to any of the values of df_lim raise the copy-of-slice warning. The only way around it that I've found is to change the function call to:

df_lim = lim(df).copy()
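
A minimal sketch of the same filter, assuming the globals geography, start, and end are passed in as arguments instead. The parameter names and the sample data are hypothetical, added only for illustration. Returning an explicit copy from inside the function means callers don't have to remember to add .copy() themselves:

```python
import pandas as pd

def lim(df, geography, start, end):
    """Filter by center and date range; return an independent DataFrame."""
    mask = (df["date"] >= start) & (df["date"] <= end)
    if geography != "All":
        mask &= df["center_JO"] == geography
    # One boolean selection, then an explicit copy so later edits are unambiguous.
    return df.loc[mask].copy()

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "center_JO": ["Amman", "Irbid", "Amman"],
    "date": pd.to_datetime(["2016-01-05", "2016-02-10", "2016-03-15"]),
    "value": [1, 2, 3],
})
df_lim = lim(df, "Amman", pd.Timestamp("2016-01-01"), pd.Timestamp("2016-12-31"))
df_lim["value"] = df_lim["value"] * 10  # modifies df_lim only
```

Combining the conditions into one mask also avoids building an intermediate filtered frame before the date filter is applied.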

This just seems wrong to me. What am I missing? It seems like these use cases should return copies by default, and I could have sworn that the last time I ran these scripts I was not running into these errors.
Do I just need to start adding .copy() all over the place? Seems like there should be a cleaner way to do this.

Petroleum answered 8/8, 2016 at 17:45 Comment(1)
That warning acts like a reminder that izmir_lim is a copy. The changes you do in izmir_lim will not be reflected in izmir. You are doing nothing wrong. You can set izmir_lim.is_copy = None to get rid of the warning. - Salas
Q
40
izmir = pd.read_excel(filepath)
izmir_lim = izmir[['Gender','Age','MC_OLD_M>=60','MC_OLD_F>=60',
                   'MC_OLD_M>18','MC_OLD_F>18','MC_OLD_18>M>5',
                   'MC_OLD_18>F>5','MC_OLD_M_Child<5','MC_OLD_F_Child<5',
                   'MC_OLD_M>0<=1','MC_OLD_F>0<=1','Date to Delivery',
                   'Date to insert','Date of Entery']]

izmir_lim is a view/copy of izmir. You subsequently attempt to assign to it, and that assignment is what triggers the warning. Use this instead:

izmir_lim = izmir[['Gender','Age','MC_OLD_M>=60','MC_OLD_F>=60',
                   'MC_OLD_M>18','MC_OLD_F>18','MC_OLD_18>M>5',
                   'MC_OLD_18>F>5','MC_OLD_M_Child<5','MC_OLD_F_Child<5',
                   'MC_OLD_M>0<=1','MC_OLD_F>0<=1','Date to Delivery',
                   'Date to insert','Date of Entery']].copy()

Whenever you 'create' a new dataframe from another in the following fashion:

new_df = old_df[list_of_columns_names]

new_df will have a truthy value in its is_copy attribute. When you attempt to assign to it, pandas raises the SettingWithCopyWarning.

new_df.iloc[0, 0] = 1  # Triggers SettingWithCopyWarning (a warning, not an error)

You can overcome this in several ways.

Option #1

new_df = old_df[list_of_columns_names].copy()

Option #2 (as @ayhan suggested in comments)

new_df = old_df[list_of_columns_names]
new_df.is_copy = None

Option #3

new_df = old_df.loc[:, list_of_columns_names]
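
As a rough end-to-end illustration of Option #1, here is a small sketch using made-up data in place of the Excel file (the column values are hypothetical). It applies the same fillna/astype steps as in the question. Option #2 is left out because, as far as I know, the is_copy attribute was deprecated in later pandas releases.

```python
import pandas as pd

# Hypothetical data standing in for the Excel file.
old_df = pd.DataFrame({"Gender": ["F", "M"], "Age": [31.0, None], "Other": [1, 2]})

# Option #1: take the column subset and copy it explicitly.
new_df = old_df[["Gender", "Age"]].copy()

# Assignments now target an independent frame, so nothing is ambiguous.
new_df["Age"] = new_df["Age"].fillna(0).astype(int)
```

After this runs, old_df still holds the missing Age value, while new_df holds the filled integer column.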
Quezada answered 8/8, 2016 at 17:48 Comment(6)
Can you help me understand the logic of necessitating this? Is izmir_lim a separate dataframe altogether, or a view of a subset of izmir? And if it's just a view, why is pandas set up to work this way? In my workflows I would always want to make a completely separate dataframe once I subset and be able to manipulate the filtered dataframe as I please. It just seems like requiring .copy() all over the place should not be necessary, but maybe I'm not understanding the other use cases. - Petroleum
@SamLilienfeld pandas is set up to work this way to be memory efficient where it can be. If you always want a new independent dataframe, then use one of the options and you'll have that. I only notice it happening when I create a subset via old_df[list_O_cols]. Often I use old_df.loc[:, list_O_cols] and I have no issue. That is now option 3. - Quezada
Nice, the .loc approach works perfectly, as do the others. I'm still finding my way around the .loc, .ix indexers etc.; it hasn't always been clear which is the right one in which context, partly I think because the names are a bit cryptic. Thanks a bunch!! - Petroleum
I'm still running into this warning in confusing ways, even when using .loc to index into dataframes. For example, I am creating a filtered data frame that drops rows with null values as follows: df_no_none = df_trans.loc[df_trans.value.notnull()]. I continue to get copy-of-slice warnings any time I then manipulate df_no_none. Any ideas? - Petroleum
It would be nice if pandas reworded that message to "A value was set on a copy of a slice from a DataFrame", since it doesn't just try, it succeeds. :> - Depressive
Options #1 (df[cols].copy()) and #3 (df.loc[:, cols]) may both work but they do different things. While the former makes a copy, the latter provides a direct slice to the original dataframe. So using #3 to change a value will also change the original dataframe, whereas using #1 won't. - Ardussi
A
3

Since pandas 1.5.0, there is a copy-on-write mode, which removes a lot of these uncertainties by ensuring that any dataframe or Series derived from another always behaves like a copy. It is disabled by default for now but is slated to become the default in pandas 3.0.

A direct consequence is that if you turn it on, you won't see SettingWithCopyWarning.

pd.options.mode.copy_on_write    # False by default for now (will be True by pandas 3.0)

df = pd.DataFrame({'A': [1, 2], 'B': ['a', pd.NA]})
df1 = df[df['A'] > 1]
df1['B'] = df1['B'].fillna('')          # <---- SettingWithCopyWarning

Now, with copy-on-write, you no longer see the warning because every dataframe or Series derived from another behaves as a copy (the actual copying is deferred until one of them is modified).

pd.options.mode.copy_on_write = True    # enable copy-on-write
df = pd.DataFrame({'A': [1, 2], 'B': ['a', pd.NA]})
df1 = df[df['A'] > 1]
df1['B'] = df1['B'].fillna('')          # <---- no warning

Note that pd.options.mode.copy_on_write = True enables copy-on-write everywhere. You can also use a context manager to enable it only in certain contexts.

df = pd.DataFrame({'A': [1, 2], 'B': ['a', pd.NA]})

# copy-on-write enabled only in the context below
with pd.option_context('mode.copy_on_write', True):
    df1 = df[df['A'] > 1]
    df1['B'] = df1['B'].fillna('')          # <---- no warning
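
To make the "behaves like a copy" claim concrete, here is a small sketch, assuming pandas 1.5 or later so that the mode.copy_on_write option exists. It modifies a Series derived from a dataframe and checks that the parent is left alone. The blanket warnings filter is an assumption on my part, there only to silence any version-specific notices about the option itself.

```python
import warnings
import pandas as pd

with warnings.catch_warnings():
    # Ignore version-specific notices about the option itself (assumption).
    warnings.simplefilter("ignore")
    with pd.option_context("mode.copy_on_write", True):
        df = pd.DataFrame({"A": [1, 2], "B": ["a", "b"]})
        col = df["A"]       # derived from df
        col.iloc[0] = 99    # under copy-on-write this modifies only col
        original_value = int(df.loc[0, "A"])
        derived_value = int(col.iloc[0])
```

With copy-on-write active, original_value stays at 1 while derived_value becomes 99, so the change to the derived Series never reaches df.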
Alarmist answered 8/5, 2023 at 0:14 Comment(0)
