Understanding pandas dataframe indexing

Summary: This doesn't work:

df[df.key==1]['D'] = 1

but this does:

df.D[df.key==1] = 1

Why?

Reproduction:

In [1]: import pandas as pd

In [2]: from numpy.random import randn

In [4]: df = pd.DataFrame(randn(6,3),columns=list('ABC'))

In [5]: df
Out[5]: 
          A         B         C
0  1.438161 -0.210454 -1.983704
1 -0.283780 -0.371773  0.017580
2  0.552564 -0.610548  0.257276
3  1.931332  0.649179 -1.349062
4  1.656010 -1.373263  1.333079
5  0.944862 -0.657849  1.526811

In [6]: df['D']=0.0

In [7]: df['key']=3*[1]+3*[2]

In [8]: df
Out[8]: 
          A         B         C  D  key
0  1.438161 -0.210454 -1.983704  0    1
1 -0.283780 -0.371773  0.017580  0    1
2  0.552564 -0.610548  0.257276  0    1
3  1.931332  0.649179 -1.349062  0    2
4  1.656010 -1.373263  1.333079  0    2
5  0.944862 -0.657849  1.526811  0    2

This doesn't work:

In [9]: df[df.key==1]['D'] = 1

In [10]: df
Out[10]: 
          A         B         C  D  key
0  1.438161 -0.210454 -1.983704  0    1
1 -0.283780 -0.371773  0.017580  0    1
2  0.552564 -0.610548  0.257276  0    1
3  1.931332  0.649179 -1.349062  0    2
4  1.656010 -1.373263  1.333079  0    2
5  0.944862 -0.657849  1.526811  0    2

but this does:

In [11]: df.D[df.key==1] = 3.4

In [12]: df
Out[12]: 
          A         B         C    D  key
0  1.438161 -0.210454 -1.983704  3.4    1
1 -0.283780 -0.371773  0.017580  3.4    1
2  0.552564 -0.610548  0.257276  3.4    1
3  1.931332  0.649179 -1.349062  0.0    2
4  1.656010 -1.373263  1.333079  0.0    2
5  0.944862 -0.657849  1.526811  0.0    2

My question is:

Why does only the 2nd way work? I can't seem to see a difference in selection/indexing logic.

The pandas version is 0.10.0.

Edit: This should no longer be done this way. Since version 0.11, pandas provides .loc for this kind of assignment. See: http://pandas.pydata.org/pandas-docs/stable/indexing.html
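A minimal sketch of the .loc form mentioned in the edit, assuming pandas 0.11 or later. The frame is rebuilt with the same columns as in the reproduction; the random values are arbitrary and np.random.randn is just one way to fill it.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one in the reproduction (values are random).
df = pd.DataFrame(np.random.randn(6, 3), columns=list("ABC"))
df["D"] = 0.0
df["key"] = 3 * [1] + 3 * [2]

# Row mask and column label are passed together in one indexing call,
# so the assignment is applied to df itself, not to an intermediate object.
df.loc[df.key == 1, "D"] = 3.4

print(df["D"].tolist())  # [3.4, 3.4, 3.4, 0.0, 0.0, 0.0]
```

Because the selection and the assignment happen in a single call, the result does not depend on whether an intermediate selection would have been a view or a copy.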

Hellespont answered 7/1, 2013 at 9:3 Comment(2)
As the answers say, this seems to be a NumPy issue: have a look at this question for a similar problem. I'm not sure whether it is a matter of view vs. copy. - Callison
I understand now that it is clearly (and actually simply) the difference between a view and a copy. The first method only produces a copy, which is then garbage-collected. The second method produces a view, so the assignment is applied to the original dataframe (see Dougal's comments below). - Hellespont

The pandas documentation says:

Returning a view versus a copy

The rules about when a view on the data is returned are entirely dependent on NumPy. Whenever an array of labels or a boolean vector are involved in the indexing operation, the result will be a copy. With single label / scalar indexing and slicing, e.g. df.ix[3:6] or df.ix[:, 'A'], a view will be returned.
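The NumPy rule that the quoted passage relies on can be checked on a plain array. This is only a sketch of the NumPy side; the array a is made up for illustration, and it does not show how every pandas version behaves.

```python
import numpy as np

a = np.arange(6)

sliced = a[1:4]      # basic slicing: a view onto a's buffer
masked = a[a > 2]    # boolean indexing: a new array (a copy)

print(np.shares_memory(a, sliced))  # True
print(np.shares_memory(a, masked))  # False

# Writing through the view changes a; writing to the copy does not.
sliced[0] = 100
masked[0] = -1
print(a.tolist())  # [0, 100, 2, 3, 4, 5]
```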

In df[df.key==1]['D'] you first do boolean slicing (which produces a copy of the DataFrame), then you select the column 'D' from that copy.

In df.D[df.key==1] = 3.4, you first choose a column, then do boolean slicing on the resulting Series.

This seems to make the difference, although I must admit that it is a little counterintuitive.

Edit: Dougal identified the difference, see his comment. With version 1, __getitem__ is called for the boolean slicing and returns a copy, so the subsequent __setitem__ only modifies that copy. With version 2, the column is selected first, which returns a view, and __setitem__ is then called on that view, so the assignment reaches the original data.
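To see which special methods a chained assignment triggers compared with a direct one, here is a small pure-Python sketch. Traced is a hypothetical stand-in class, not a pandas object; its __getitem__ always returns a fresh copy, as boolean indexing does, and every call is recorded in a list.

```python
class Traced:
    """Hypothetical container that logs indexing calls (not a pandas class)."""

    def __init__(self, data, log, name="obj"):
        self.data = dict(data)
        self.log = log
        self.name = name

    def __getitem__(self, key):
        self.log.append(f"{self.name}.__getitem__({key!r})")
        # Return a new object, mimicking an indexing operation that copies.
        return Traced(self.data, self.log, name="copy")

    def __setitem__(self, key, value):
        self.log.append(f"{self.name}.__setitem__({key!r}, {value!r})")
        self.data[key] = value


log = []
obj = Traced({"D": 0}, log)

obj["rows"]["D"] = 1   # chained: __getitem__ on obj, then __setitem__ on the copy
obj["D"] = 2           # direct: a single __setitem__ on obj itself

for entry in log:
    print(entry)
# obj.__getitem__('rows')
# copy.__setitem__('D', 1)
# obj.__setitem__('D', 2)
print(obj.data)  # {'D': 2}
```

The value 1 never reaches obj because it was written to the temporary copy, which is discarded as soon as the statement finishes.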

Sultana answered 7/1, 2013 at 9:32 Comment(4)
That's what I thought at first too, but there must be something else going on. df[df.key==1] = 1000 will actually assign 1000 to all of the values in the slice, so it can't be a copy. I guess there is some magic happening in the __setattr__ or __setitem__ methods. - Viscera
But as I do a boolean slicing on the resulting Series, that should be a copy as well, shouldn't it? So why does the assignment work that way? - Hellespont
Look at Dougal's comment above. With version 1, the copy is made as the __getitem__ method is called for the boolean slicing. For version 2, only the __setitem__ method is accessed, thus not returning a copy but just assigning. - Sultana
@K.-MichaelAye In the first way, you first construct a copy with __getitem__ and then call __setitem__ on that copy, which is then immediately garbage-collected. In the second way, you construct a view with __getitem__ and then call __setitem__ on the view. - Boylan

I am pretty sure that your 1st way is returning a copy, instead of a view, and so assigning to it does not change the original data. I am not sure why this is happening though.

It seems to be related to the order in which you select rows and columns, NOT the syntax for getting columns. These both work:

df.D[df.key == 1] = 1
df['D'][df.key == 1] = 1

And neither of these works:

df[df.key == 1]['D'] = 1
df[df.key == 1].D = 1

From this evidence, I would assume that the slice df[df.key == 1] is returning a copy. But this is not the case! df[df.key == 1] = 0 will actually change the original data, as if it were a view.

So, I'm not sure. My sense is that this behavior has changed with the version of pandas. I seem to remember that df.D used to return a copy and df['D'] used to return a view, but this doesn't appear to be true anymore (pandas 0.10.0).

If you want a more complete answer, you should post in the pystatsmodels forum: https://groups.google.com/forum/?fromgroups#!forum/pystatsmodels

Viscera answered 7/1, 2013 at 9:27 Comment(1)
df[df.key == 1] does actually return a copy (as Thorsten's answer points out). The reason df[df.key == 1] = 0 modifies the original is that, although the syntax is a bit misleading, that's not actually doing the same thing at all; the non-assignment version calls __getitem__ and the assignment version __setitem__. It's like how if we have l = [0, 1, 2], then l[1] returns the int 1 but l[1] = 5 modifies the original. - Boylan
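The list analogy from this comment, written out as a short sketch. The extra l[:] line is an addition for illustration: slicing a list makes a copy, so assigning into the slice result behaves like the failing chained assignment.

```python
l = [0, 1, 2]

value = l[1]     # __getitem__: returns the int 1, l is untouched
l[1] = 5         # __setitem__: modifies l in place
l[:][0] = 99     # l[:] is a copy, so this assignment lands on the copy

print(value)  # 1
print(l)      # [0, 5, 2]
```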
