When is it appropriate to use df.value_counts() vs df.groupby('...').count()?

I've heard that in pandas there are often multiple ways to do the same thing, but I was wondering:

If I'm trying to group data by a value within a specific column and count the number of items with that value, when does it make sense to use df.groupby('colA').count() and when does it make sense to use df['colA'].value_counts() ?

Betake answered 25/11, 2017 at 15:49
33

There is a difference. According to the documentation, value_counts returns:

The resulting object will be in descending order so that the first element is the most frequently-occurring element.

count does not: its output is sorted by the index (created from the column passed to groupby('col')).


df.groupby('colA').count() 

aggregates every other column of df with the function count, so it counts the values in each column, excluding NaNs.

So if you need to count only one column, use:

df.groupby('colA')['colA'].count() 

Sample:

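import numpy as np
import pandas as pd

# colA is the grouping column; it and colC/colD contain NaNs
df = pd.DataFrame({'colB':list('abcdefg'),
                   'colC':[1,3,5,7,np.nan,np.nan,4],
                   'colD':[np.nan,3,6,9,2,4,np.nan],
                   'colA':['c','c','b','a',np.nan,'b','b']})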

print (df)
  colA colB  colC  colD
0    c    a   1.0   NaN
1    c    b   3.0   3.0
2    b    c   5.0   6.0
3    a    d   7.0   9.0
4  NaN    e   NaN   2.0
5    b    f   NaN   4.0
6    b    g   4.0   NaN

print (df['colA'].value_counts())
b    3
c    2
a    1
Name: colA, dtype: int64

print (df.groupby('colA').count())
      colB  colC  colD
colA                  
a        1     1     1
b        3     2     2
c        2     2     1

print (df.groupby('colA')['colA'].count())
colA
a    1
b    3
c    2
Name: colA, dtype: int64
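
To check the two differences described above (ordering and NaN handling), here is a minimal sketch that reuses the sample frame from this answer. The variable names vc, gc, same_counts and per_column are only illustrative, and the exact printed layout may vary between pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'colB': list('abcdefg'),
                   'colC': [1, 3, 5, 7, np.nan, np.nan, 4],
                   'colD': [np.nan, 3, 6, 9, 2, 4, np.nan],
                   'colA': ['c', 'c', 'b', 'a', np.nan, 'b', 'b']})

vc = df['colA'].value_counts()           # ordered by frequency, most common first
gc = df.groupby('colA')['colA'].count()  # ordered by the group key

print(list(vc.index))  # frequency order
print(list(gc.index))  # key order

# Once both are put in the same order, the numbers agree
same_counts = vc.sort_index().tolist() == gc.tolist()

# count() on the whole grouped frame skips NaN cells column by column,
# so one group can have different counts in different columns
per_column = df.groupby('colA').count()
print(per_column)
```

Neither result contains an entry for the row whose colA is NaN, because both calls drop missing keys by default.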
Radborne answered 25/11, 2017 at 15:55
23

groupby and value_counts are quite different functions. When this answer was written, you could not call value_counts on a DataFrame (DataFrame.value_counts was added later, in pandas 1.1).

value_counts operates on a single column or Series, and its sole purpose is to return a Series holding the frequency of each value.

groupby returns a GroupBy object on which you can perform statistical computations. So df.groupby(col).count() returns the number of non-null values in each of the other columns, for each group defined by col.

When should value_counts be used, and when should groupby.count be used? Let's take an example:

df = pd.DataFrame({'id': [1, 2, 3, 4, 2, 2, 4], 'color': ["r","r","b","b","g","g","r"], 'size': [1,2,1,2,1,3,4]})

Groupby count:

df.groupby('color').count()
       id  size
color          
b       2     2
g       2     2
r       3     3

groupby count is generally used to get the number of valid (non-null) values in all the other columns, grouped by one or more specified columns. NaN values are therefore excluded.

To find the frequency using groupby, you need to aggregate against the grouping column itself, as @jez did. (Perhaps value_counts was implemented to avoid this and make developers' lives easier.)

Value Counts:

df['color'].value_counts()

r    3
g    2
b    2
Name: color, dtype: int64

value_counts is generally used for finding the frequency of the values present in one particular column.

In conclusion:

.groupby(col).count() should be used when you want the number of non-null values in each of the other columns, for each group defined by col.

.value_counts() should be used to find the frequency of each value in a Series.
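
The distinction is easier to see when a column has a missing value. The sketch below reuses the example frame from this answer, with one size entry replaced by NaN purely for illustration (that change is an assumption, not part of the original data).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 2, 2, 4],
                   'color': ['r', 'r', 'b', 'b', 'g', 'g', 'r'],
                   # hypothetical change: the third size is missing
                   'size': [1, 2, np.nan, 2, 1, 3, 4]})

# DataFrame with one column per non-key column; NaN cells are not counted
grouped = df.groupby('color').count()
print(grouped)

# Series with the frequency of each color, regardless of other columns
freq = df['color'].value_counts()
print(freq)
```

For color 'b', grouped reports 2 under id but only 1 under size, while freq still reports 2, since value_counts looks at the color column alone.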

Redhead answered 26/11, 2017 at 11:43
3

There are a lot of good answers here, but I just wanted to add a more concise one:

df.value_counts('col')  # and its syntactic twin df['col'].value_counts()

gives the same counts in the same descending order (apart from how ties are ordered and how the result is named) as

df.groupby('col')['col'].count().sort_values(ascending=False)

Both approaches have some additional keyword parameters, but as I see it, the gist is that the former is pretty much just syntactic sugar for the latter, when you want to return a Series of the counts of each distinct item in df[col] in descending order.

Reach for groupby(...).count() when you want to count across multiple columns, or when the count is part of a more complex aggregation.
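
As a rough check of that equivalence, and of the "more complex aggregation" case, here is a small sketch. The frame and the names col, other, n_rows and total are made up for illustration. The counts are compared as plain dicts so that differences in the Series name do not matter.

```python
import pandas as pd

# Hypothetical data with no tied frequencies, so the ordering is unambiguous
df = pd.DataFrame({'col': ['x', 'y', 'x', 'z', 'x', 'y'],
                   'other': [1, 1, 2, 2, 1, 1]})

via_value_counts = df['col'].value_counts()
via_groupby = df.groupby('col')['col'].count().sort_values(ascending=False)

counts_match = via_value_counts.to_dict() == via_groupby.to_dict()
order_match = list(via_value_counts.index) == list(via_groupby.index)
print(counts_match, order_match)

# groupby earns its keep when the count is one of several aggregates
summary = df.groupby('col').agg(n_rows=('other', 'count'),
                                total=('other', 'sum'))
print(summary)
```

With tied frequencies the two approaches may order the tied values differently, so order_match is not guaranteed in general.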

Rambling answered 15/3, 2023 at 0:54
1

In simple words: DataFrame.value_counts() returns a Series containing counts of unique rows in the DataFrame. That is, it counts how many times each distinct combination of column values occurs. Imagine we have a DataFrame like:

df = pd.DataFrame({'first_name': ['John', 'Anne', 'John', 'Beth'],
                   'middle_name': ['Smith', pd.NA, pd.NA, 'Louise']})

first_name  middle_name
0   John    Smith
1   Anne    <NA>
2   John    <NA>
3   Beth    Louise

Then we apply value_counts to it:

df.value_counts()

first_name  middle_name
Beth        Louise         1
John        Smith          1
dtype: int64

As you can see, it didn't count the rows containing NA values. count(), however, counts the non-NA cells for each column or row. In our example:

df.count()

first_name     4
middle_name    2
dtype: int64
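
A minimal sketch of the same comparison follows. It also shows the dropna=False option, which is not used in this answer and which I am assuming is available (DataFrame.value_counts accepts dropna from pandas 1.3 onward). The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'first_name': ['John', 'Anne', 'John', 'Beth'],
                   'middle_name': ['Smith', pd.NA, pd.NA, 'Louise']})

# Default: rows containing any NA are left out of the tally
default_counts = df.value_counts()
print(default_counts)

# Assumption: pandas >= 1.3, where dropna=False keeps the NA rows
all_counts = df.value_counts(dropna=False)
print(all_counts)

# count() works per column and simply skips NA cells
non_na_cells = df.count()
print(non_na_cells)
```

With the default, only the two complete rows are tallied. With dropna=False all four rows appear, each exactly once.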
Elston answered 11/2, 2022 at 7:25
