Checking whether data frame is copy or view in Pandas
E

3

76

Is there an easy way to check whether two data frames are different copies or views of the same underlying data that doesn't involve manipulations? I'm trying to get a grip on when each is generated, and given how idiosyncratic the rules seem to be, I'd like an easy way to test.

For example, I thought "id(df.values)" would be stable across views, but it doesn't seem to be:

import pandas as pd

# Make two data frames that are views of the same data.
df = pd.DataFrame([[1,2,3,4],[5,6,7,8]], index = ['row1','row2'],
       columns = ['a','b','c','d'])
df2 = df.iloc[0:2,:]

# Demonstrate they are views:
df.iloc[0,0] = 99
df2.iloc[0,0]
Out[70]: 99

# Now try and compare the id on values attribute
# Different despite being views! 

id(df.values)
Out[71]: 4753564496

id(df2.values)
Out[72]: 4753603728

# And we can of course compare df and df2
df is df2
Out[73]: False

Other answers I've looked up try to give rules, but they don't seem consistent, and they also don't answer this question of how to test:

And of course: - http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy

UPDATE: Comments below seem to answer the question -- comparing the df.values.base attribute rather than the df.values attribute does it, as does checking the df._is_view attribute (though the latter is probably very bad form since it's an internal).

Evin answered 12/11, 2014 at 4:5 Comment(5)
Hmmm, df2._is_view returns True, but given that it's marked as private/internal, there may be a better way to go about it. -- Jens
For your case, you can use: df2.values.base is df.values.base -- Snatchy
In general doing df.values will create a copy, unless it's a single dtype (as from being computationally expensive). Why do you care if it's a view and what are you actually trying to do? -- Trepang
Great! Thanks both HYRY and Marius! Those definitely do it -- I had not discovered values.base, and also did not know about the _is_view attribute (though as you say, probably best to avoid using it given it's an internal). -- Evin
@Snatchy And what about the ids? Why are they different if there is only one object? Or is the view another object? -- Redpencil
E
54

Answers from HYRY and Marius in comments!

One can check either by:

  • testing equivalence of the values.base attribute rather than the values attribute, as in:

    df.values.base is df2.values.base instead of df.values is df2.values.

  • or using the (admittedly internal) _is_view attribute (df2._is_view is True).
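
As a hedged sketch of the first check, and of why id() differs even for views: every call that hands back an array creates a new Python wrapper object, so the ids differ while the underlying buffer is shared. The helper name shares_memory below is hypothetical, and np.shares_memory is swapped in for the values.base comparison because it looks at the memory itself. It assumes single-dtype frames, since (as noted in the comments) mixed-dtype frames get a fresh array on conversion and would always report False.

```python
import numpy as np
import pandas as pd


def shares_memory(left: pd.DataFrame, right: pd.DataFrame) -> bool:
    """Return True if the NumPy buffers behind two frames overlap.

    Hypothetical helper; only meaningful for single-dtype frames,
    because mixed-dtype frames are converted into a new array.
    """
    return bool(np.shares_memory(left.to_numpy(), right.to_numpy()))


# Plain NumPy first: two views of one buffer are distinct Python objects
# (so their ids differ), but both report the same owning array as .base.
owner = np.arange(8)
view_a = owner[:4]
view_b = owner[:4]
distinct_objects = view_a is not view_b
same_base = view_a.base is view_b.base

# The frames from the question, plus an explicit copy for contrast.
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                  index=["row1", "row2"], columns=["a", "b", "c", "d"])
df2 = df.iloc[0:2, :]
df_copy = df.copy()

slice_shares = shares_memory(df, df2)
copy_shares = shares_memory(df, df_copy)
print(distinct_objects, same_base, slice_shares, copy_shares)
```

Whether a given indexing operation shares memory can depend on the pandas version and settings, so treat the result for df2 as something to measure rather than assume.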

Evin answered 12/11, 2014 at 17:38 Comment(2)
What if the relationship is nested? Does this still work correctly? That is, if DF1 is maybe a view, maybe a copy of DF2, and DF2 is maybe a view, maybe a copy of DF3? -- Thermosiphon
What about the ids? -- Redpencil
C
28

I've elaborated on this example with pandas 1.0.1. There's not only a boolean _is_view attribute, but also _is_copy which can be None or a reference to the original DataFrame:

import pandas as pd

df = pd.DataFrame([[1,2,3,4],[5,6,7,8]], index = ['row1','row2'],
        columns = ['a','b','c','d'])
df2 = df.iloc[0:2, :]
df3 = df.loc[df['a'] == 1, :]

# df is neither copy nor view
df._is_view, df._is_copy
Out[1]: (False, None)

# df2 is a view AND a copy
df2._is_view, df2._is_copy
Out[2]: (True, <weakref at 0x00000236635C2228; to 'DataFrame' at 0x00000236635DAA58>)

# df3 is not a view, but a copy
df3._is_view, df3._is_copy
Out[3]: (False, <weakref at 0x00000236635C2228; to 'DataFrame' at 0x00000236635DAA58>)

So checking these two attributes should tell you not only if you're dealing with a view or not, but also if you have a copy or an "original" DataFrame.
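
A minimal sketch of reading both flags defensively follows. The function name view_copy_flags is hypothetical, and because _is_view and _is_copy are private attributes that may be missing or behave differently in other pandas versions, they are read with getattr and a default rather than accessed directly.

```python
import pandas as pd


def view_copy_flags(obj) -> dict:
    """Collect the two private flags, tolerating their absence.

    Hypothetical helper: _is_view and _is_copy are pandas internals,
    so a missing attribute is reported as None / False, not an error.
    """
    return {
        "is_view": getattr(obj, "_is_view", None),
        # _is_copy is None or a weak reference to the parent frame.
        "has_parent_ref": getattr(obj, "_is_copy", None) is not None,
    }


df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                  index=["row1", "row2"], columns=["a", "b", "c", "d"])
df2 = df.iloc[0:2, :]

flags_df = view_copy_flags(df)
flags_df2 = view_copy_flags(df2)
print(flags_df)
print(flags_df2)
```

The printed values are deliberately not stated here, since they are version-dependent; the output above was produced with pandas 1.0.1.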

See also this thread for a discussion explaining why you can't always predict whether your code will return a view or not.

Chiller answered 15/4, 2020 at 23:29 Comment(1)
Note that the docs for _is_view say "return a boolean if I am possibly a view", so I believe it may return false negatives. Kind of like when I use it on this example: foo = pd.Series(['a', 'b', 'c', 'd'], dtype='string'); bar = foo.iloc[:2]; bar._is_view; bar.iloc[0] = 'z'; print(foo) -- Blackbird
M
0

You might trace the memory your pandas/python environment is consuming, and, on the assumption that a copy will utilise more memory than a view, be able to decide one way or another.

I believe there are libraries out there that will present the memory usage within the python environment itself - e.g. Heapy/Guppy.

There ought to be a metric you can apply that takes a baseline snapshot of memory usage prior to creating the object under inspection, then another snapshot afterwards. Comparing the two (assuming nothing else has been created, so that the change can be attributed to the new object) should provide an idea of whether a view or copy has been produced.

We'd need to get an idea of the different memory profiles of each type of implementation, but some experimentation should yield results.
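
One way this experiment might look is sketched below. It uses the standard-library tracemalloc module rather than Heapy/Guppy, and it assumes NumPy reports its buffer allocations to tracemalloc (which recent NumPy releases do). The helper name traced_growth is hypothetical, and the frame is made large so the data buffer dominates the bookkeeping overhead.

```python
import tracemalloc

import numpy as np
import pandas as pd


def traced_growth(make_object):
    """Return (object, bytes of traced memory added while building it).

    Hypothetical helper: compares tracemalloc's current total before
    and after the call, keeping the new object alive via the return.
    """
    before, _ = tracemalloc.get_traced_memory()
    obj = make_object()
    after, _ = tracemalloc.get_traced_memory()
    return obj, after - before


tracemalloc.start()

df = pd.DataFrame(np.ones((1000, 1000)))  # roughly 8 MB of float64 data
data_bytes = df.to_numpy().nbytes

# A row slice versus an explicit copy of the same frame.
sliced, slice_growth = traced_growth(lambda: df.iloc[:500, :])
copied, copy_growth = traced_growth(lambda: df.copy())

tracemalloc.stop()

print(f"data: {data_bytes} bytes")
print(f"slice grew traced memory by {slice_growth} bytes")
print(f"copy grew traced memory by {copy_growth} bytes")
```

A growth figure close to the size of the data suggests a copy, while a figure of a few kilobytes suggests a view. This is only a heuristic: small objects, or anything else allocated in between, can blur the difference.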

Muller answered 12/11, 2014 at 16:51 Comment(0)
