Pandas GroupBy - Show only groups with more than one unique feature-value

I have a DataFrame df_things that looks like this, and I want to predict the quality of the classification before training:

A    B     C      CLASS
-----------------------
al1  bal1  cal1   Ship
al1  bal1  cal1   Ship
al1  bal2  cal2   Ship
al2  bal2  cal2   Cow
al3  bal3  cal3   Car
al1  bal2  cal3   Car
al3  bal3  cal3   Car

I want to group the rows by class to get an idea of the distribution of the features. I do this with (for example, on column "B"):

df_B = df_things.groupby('CLASS').B.value_counts()

which gives me the results

CLASS  B 
-------------
ship   bal1  2 
       bal2  1
cow    bal2  2
car    bal2  1
       bal3  2

What I want to do is show only the groups that have more than one value, so that it looks like this:

CLASS  B 
-------------
ship   bal1  2 
       bal2  1
car    bal2  1
       bal3  2

I'm a little bit stuck, so any ideas?

Cleanthes answered 30/12, 2018 at 16:5 Comment(0)

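You can group the counts by the first index level and keep only the classes that have more than one entry. Use 'size' for this rather than 'nunique': 'nunique' would count the distinct count values and drop a class whose B values all occur equally often.

v = df_things.groupby('CLASS').B.value_counts()
v[v.groupby(level=0).transform('size').gt(1)]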

CLASS  B   
Car    bal3    2
       bal2    1
Ship   bal1    2
       bal2    1
Name: B, dtype: int64
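
One caveat, sketched on hypothetical data (the `tied` frame below is not from the question): calling `transform('nunique')` on the value counts counts distinct count values, not distinct `B` values, so a class whose `B` values all occur equally often would be dropped. Counting the index entries per class with `transform('size')` appears to avoid that, assuming a pandas version in which `transform` accepts `'size'`.

```python
import pandas as pd

# Hypothetical data: class 'Ship' has two B values that each occur twice.
tied = pd.DataFrame({
    'B': ['bal1', 'bal1', 'bal2', 'bal2', 'bal3'],
    'CLASS': ['Ship', 'Ship', 'Ship', 'Ship', 'Cow'],
})
v = tied.groupby('CLASS')['B'].value_counts()

# 'nunique' looks at the counts themselves: Ship has counts [2, 2] -> 1 distinct value
by_nunique = v[v.groupby(level=0).transform('nunique').gt(1)]

# 'size' counts the (CLASS, B) entries per class: Ship has 2 entries
by_size = v[v.groupby(level=0).transform('size').gt(1)]

print(len(by_nunique))  # number of rows that survive the 'nunique' filter
print(by_size)          # rows that survive the 'size' filter
```

On the data from the question both filters give the same result, because no class has tied counts there.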
Pontone answered 30/12, 2018 at 16:9 Comment(3)
Thank you so much - exactly what I was looking for! - Cleanthes
And thank you for editing my post. It helps me ask my questions in a more precise way! - Cleanthes
@FelTry2 Just being a good citizen of the site. Keep the questions rolling! - Pontone

A solution using crosstab:

import numpy as np
import pandas as pd
s = pd.crosstab(df_things.CLASS, df_things.B)
s[s.ne(0).sum(axis=1) > 1].replace(0, np.nan).stack()
CLASS  B   
Car    bal2    1.0
       bal3    2.0
Ship   bal1    2.0
       bal2    1.0
dtype: float64
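
The NaN round trip above turns the counts into floats. A possible variant, as a sketch only, stacks the filtered table first and then drops the zero cells, which keeps the counts as integers and does not depend on `stack()` dropping NaN cells. The names `kept` and `result` are introduced here for illustration.

```python
import pandas as pd

df_things = pd.DataFrame({
    'B': ['bal1', 'bal1', 'bal2', 'bal2', 'bal3', 'bal2', 'bal3'],
    'CLASS': ['Ship', 'Ship', 'Ship', 'Cow', 'Car', 'Car', 'Car'],
})

# Contingency table: one row per CLASS, one column per B value
s = pd.crosstab(df_things['CLASS'], df_things['B'])

# Keep only classes with more than one non-zero column
kept = s[s.ne(0).sum(axis=1) > 1]

# Go back to long form and drop the (CLASS, B) pairs that never occur
result = kept.stack()
result = result[result > 0]
print(result)
```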
Striate answered 30/12, 2018 at 16:19 Comment(0)

Here is another approach.

Set up the input data:

In [1]:
import pandas as pd
df_things = pd.DataFrame({
    'A': ['al1', 'al1', 'al1', 'al2', 'al3', 'al1', 'al3'],
    'B': ['bal1', 'bal1', 'bal2', 'bal2', 'bal3', 'bal2', 'bal3'],
    'C': ['cal1', 'cal1', 'cal2', 'cal2', 'cal3', 'cal3', 'cal3'],
    'CLASS': ['Ship', 'Ship', 'Ship', 'Cow', 'Car', 'Car', 'Car']
})
print(df_things)
     A     B     C CLASS
0  al1  bal1  cal1  Ship
1  al1  bal1  cal1  Ship
2  al1  bal2  cal2  Ship
3  al2  bal2  cal2   Cow
4  al3  bal3  cal3   Car
5  al1  bal2  cal3   Car
6  al3  bal3  cal3   Car

Reduce it to the classes that have more than one unique value in B:

In [2]:
df_reduced = df_things.groupby(['CLASS']).filter(lambda grp: grp['B'].nunique() > 1)
print(df_reduced)
     A     B     C CLASS
0  al1  bal1  cal1  Ship
1  al1  bal1  cal1  Ship
2  al1  bal2  cal2  Ship
4  al3  bal3  cal3   Car
5  al1  bal2  cal3   Car
6  al3  bal3  cal3   Car

Apply groupby to get the desired output:

In [3]:
df_reduced.groupby(['CLASS'])['B'].value_counts()
Out[3]:
CLASS  B
Car    bal3    2
       bal2    1
Ship   bal1    2
       bal2    1
Name: B, dtype: int64
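
Since the question asks about the distribution of the features in general, the two steps above could be wrapped in a small helper and run for every feature column. This is only a sketch: `multi_valued_counts` is a hypothetical name, and it uses a boolean mask from `transform('nunique')` on the raw column instead of `filter` with a lambda.

```python
import pandas as pd

def multi_valued_counts(df, feature, label='CLASS'):
    """Value counts of `feature` per `label`, keeping only labels
    that have more than one distinct value of `feature`."""
    # Number of distinct feature values per label, broadcast back to each row
    keep = df.groupby(label)[feature].transform('nunique') > 1
    return df[keep].groupby(label)[feature].value_counts()

df_things = pd.DataFrame({
    'A': ['al1', 'al1', 'al1', 'al2', 'al3', 'al1', 'al3'],
    'B': ['bal1', 'bal1', 'bal2', 'bal2', 'bal3', 'bal2', 'bal3'],
    'C': ['cal1', 'cal1', 'cal2', 'cal2', 'cal3', 'cal3', 'cal3'],
    'CLASS': ['Ship', 'Ship', 'Ship', 'Cow', 'Car', 'Car', 'Car'],
})

# One filtered table per feature column
for col in ['A', 'B', 'C']:
    print(multi_valued_counts(df_things, col))
```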

BTW, you have a typo in df_B in your question. It should be

In [4]:
df_B = df_things.groupby('CLASS').B.value_counts()
print(df_B)
CLASS  B
Car    bal3    2
       bal2    1
Cow    bal2    1
Ship   bal1    2
       bal2    1
Name: B, dtype: int64
Rattler answered 29/7 at 23:54 Comment(0)
