pandas pivot_table aggfunc/values parameters behaving oddly
Asked Answered
B

1

6

I have this data set:

np.random.seed(0)

test = pd.DataFrame({
    'a' : np.random.randint(0, 10, size=(10,)),
    'b' : np.random.randint(0, 10, size=(10,)),
    'c' : np.random.randint(0, 10, size=(10,)),
    'd' : np.random.randint(0, 10, size=(10,)),
})

print(test)

   a  b  c  d
0  5  7  5  2
1  0  6  9  3
2  3  8  8  8
3  3  8  9  1
4  7  1  4  3
5  9  6  3  3
6  3  7  0  3
7  5  7  3  7
8  2  8  5  0
9  4  1  0  1

When I run the following code I get far more columns than I thought I should.

tp = test.pivot_table(index=[
    'a',
], columns=[
    'b',
], values=[
    'c',
], aggfunc=[
    'nunique'
])

print(tp)

  nunique                                                       
        a                   b                   c               
b       1    6    7    8    1    6    7    8    1    6    7    8
a                                                               
0     NaN  1.0  NaN  NaN  NaN  1.0  NaN  NaN  NaN  1.0  NaN  NaN
2     NaN  NaN  NaN  1.0  NaN  NaN  NaN  1.0  NaN  NaN  NaN  1.0
3     NaN  NaN  1.0  1.0  NaN  NaN  1.0  1.0  NaN  NaN  1.0  2.0
4     1.0  NaN  NaN  NaN  1.0  NaN  NaN  NaN  1.0  NaN  NaN  NaN
5     NaN  NaN  1.0  NaN  NaN  NaN  1.0  NaN  NaN  NaN  2.0  NaN
7     1.0  NaN  NaN  NaN  1.0  NaN  NaN  NaN  1.0  NaN  NaN  NaN
9     NaN  1.0  NaN  NaN  NaN  1.0  NaN  NaN  NaN  1.0  NaN  NaN

I would expect to only get the subset of c columns, not the a and b columns as well. If I run this next code:

tp1 = test.pivot_table(index=[
    'a',
], columns=[
    'b',
], values='c', aggfunc=[
    'nunique'
])

print(tp1)

  nunique               
b       1    6    7    8
a                       
0     NaN  1.0  NaN  NaN
2     NaN  NaN  NaN  1.0
3     NaN  NaN  1.0  2.0
4     1.0  NaN  NaN  NaN
5     NaN  NaN  2.0  NaN
7     1.0  NaN  NaN  NaN
9     NaN  1.0  NaN  NaN

I get what I would've expected with the previous code. I can also get the expected output if I modify 'nunique' to pd.Series.nunique:

tp2 = test.pivot_table(index=[
    'a',
], columns=[
    'b',
], values=[
    'c',
], aggfunc=[
    pd.Series.nunique
])

print(tp2)

  nunique               
        c               
b       1    6    7    8
a                       
0     NaN  1.0  NaN  NaN
2     NaN  NaN  NaN  1.0
3     NaN  NaN  1.0  2.0
4     1.0  NaN  NaN  NaN
5     NaN  NaN  2.0  NaN
7     1.0  NaN  NaN  NaN
9     NaN  1.0  NaN  NaN

Question

Is this a bug? Or is there some underlying code that causes this? Shouldn't all three versions of the code produce the same (aside from column levels) output?

Another example with different aggfunc

When I run similar code but use count instead of nunique I get the expected results every time:

cp = test.pivot_table(index=[
    'a',
], columns=[
    'b',
], values=[
    'c',
], aggfunc=[
    'count'
])

cp2 = test.pivot_table(index=[
    'a',
], columns=[
    'b',
], values=[
    'c',
], aggfunc=[
    pd.Series.count
])

# both return the same thing
  count               
      c               
b     1    6    7    8
a                     
0   NaN  1.0  NaN  NaN
2   NaN  NaN  NaN  1.0
3   NaN  NaN  1.0  2.0
4   1.0  NaN  NaN  NaN
5   NaN  NaN  2.0  NaN
7   1.0  NaN  NaN  NaN
9   NaN  1.0  NaN  NaN
Borate answered 18/10, 2019 at 19:36 Comment(5)
values='c' gives the expected output.Make
I know. But not values=['c']. My question is why doesn't values=['c'] produce the same output.Borate
That's a good question. I would open an issue in pandas github and let pivot_table developers and maintainers explain the difference in behaviorLynnell
Looks very much like a bug, since a,b should not be there. Digging into the code, it shows that ['c'] is treated as multi-value, while 'c' is not.Make
I have logged an issue hereBorate
B
1

This was in a fact a bug that was last present in pandas version 1.0.5 and was fixed in version 1.1.0.

Solution

pip install -U pandas if pandas.__version__ <= "1.1.0"

Borate answered 14/8, 2022 at 20:52 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.