How to count nan values in a pandas DataFrame?
Asked Answered
R

6

28

What is the best way to count NaN (not a number) values in a pandas DataFrame?

The following code:

import numpy as np
import pandas as pd
dfd = pd.DataFrame([1, np.nan, 3, 3, 3, np.nan], columns=['a'])
dfv = dfd.a.value_counts().sort_index()
print("nan: %d" % dfv[np.nan].sum())
print("1: %d" % dfv[1].sum())
print("3: %d" % dfv[3].sum())
print("total: %d" % dfv[:].sum())

Outputs:

nan: 0
1: 1
3: 3
total: 4

While the desired output is:

nan: 2
1: 1
3: 3
total: 6

I am using pandas 0.17 with Python 3.5.0 with Anaconda 2.4.0.

Rush answered 30/12, 2015 at 20:50 Comment(0)
D
25

If you want to count only NaN values in column 'a' of a DataFrame df, use:

len(df) - df['a'].count()

Here count() tells us the number of non-NaN values, and this is subtracted from the total number of values (given by len(df)).

To count NaN values in every column of df, use:

len(df) - df.count()
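
Applied to the dfd from the question, both forms look like this (a minimal sketch; dfd has only the single column 'a'):

import numpy as np
import pandas as pd

dfd = pd.DataFrame([1, np.nan, 3, 3, 3, np.nan], columns=['a'])

# NaNs in column 'a' only: total rows minus non-NaN rows
print(len(dfd) - dfd['a'].count())   # 2

# NaNs per column, as a Series indexed by column name
print(len(dfd) - dfd.count())        # a    2
                                     # dtype: int64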

If you want to use value_counts, tell it not to drop NaN values by setting dropna=False (added in 0.14.1):

dfv = dfd['a'].value_counts(dropna=False)

This allows the missing values in the column to be counted too:

 3     3
NaN    2
 1     1
Name: a, dtype: int64

The rest of your code should then work as you expect (note that it's not necessary to call sum; just print("nan: %d" % dfv[np.nan]) suffices).
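
Putting that together with the snippet from the question, a corrected version might look like this (a sketch of the approach above, not the only way to do it):

import numpy as np
import pandas as pd

dfd = pd.DataFrame([1, np.nan, 3, 3, 3, np.nan], columns=['a'])

# Keep NaN as its own bucket instead of silently dropping it
dfv = dfd['a'].value_counts(dropna=False)

print("nan: %d" % dfv[np.nan])   # nan: 2
print("1: %d" % dfv[1])          # 1: 1
print("3: %d" % dfv[3])          # 3: 3
print("total: %d" % dfv.sum())   # total: 6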

Dubonnet answered 30/12, 2015 at 20:53 Comment(2)
And after using the method above, dfv.values.sum() counts all the values, i.e. 6. Thanks. ;) – Rush
No problem! Yep, that works. In fact, you could just write dfv.sum() to count all the values. Or even more efficiently, just check len(dfd). – Dubonnet
A
33

To count just null values, you can use isnull():

In [11]:
dfd.isnull().sum()

Out[11]:
a    2
dtype: int64

Here a is the column name, and there are 2 occurrences of the null value in the column.
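
Under the hood, isnull() returns a boolean mask of the same shape and sum() counts the True values per column; a quick sketch using the dfd from the question:

import numpy as np
import pandas as pd

dfd = pd.DataFrame([1, np.nan, 3, 3, 3, np.nan], columns=['a'])

mask = dfd.isnull()        # boolean DataFrame: True where the value is NaN
print(mask['a'].tolist())  # [False, True, False, False, False, True]

print(mask.sum())          # a    2   (each True counts as 1 when summed)
                           # dtype: int64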

Aardvark answered 30/12, 2015 at 20:56 Comment(1)
This is the easier approach. – Polik
U
4

A good, clean way to count all NaNs in all columns of your DataFrame:

import pandas as pd 
import numpy as np


df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})
print(df.isna().sum().sum())

The first sum gives the count of NaNs for each column; the second sum adds those per-column counts into a grand total.
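
For the DataFrame defined above, the two steps look like this (a minimal sketch):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [np.nan, 1, np.nan]})

per_column = df.isna().sum()   # first sum: NaNs per column
print(per_column)              # a    1
                               # b    2
                               # dtype: int64

print(per_column.sum())        # second sum: grand total, here 3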

Undervalue answered 20/10, 2018 at 20:36 Comment(0)
J
3

This one worked best for me!

If you want a simple summary (great in data science for counting missing values and their types), use:

df.info(verbose=True, null_counts=True)

Or another cool one is:

df['<column_name>'].value_counts(dropna=False)

Example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2, np.nan],
                   'b': [2, 2, np.nan, 1, np.nan],
                   'c': [np.nan, 3, np.nan, 3, np.nan]})

This is the df:

    a    b    c
0  1.0  2.0  NaN
1  2.0  2.0  3.0
2  1.0  NaN  NaN
3  2.0  1.0  3.0
4  NaN  NaN  NaN

Run Info:

df.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
a    4 non-null float64
b    3 non-null float64
c    2 non-null float64
dtypes: float64(3)

So you can see that column c has 2 non-null values out of 5 rows, because rows 0, 2, and 4 are null.

And this is what you get using value_counts for each column:

In [17]: df['a'].value_counts(dropna=False)
Out[17]:
 2.0    2
 1.0    2
NaN     1
Name: a, dtype: int64

In [18]: df['b'].value_counts(dropna=False)
Out[18]:
NaN     2
 2.0    2
 1.0    1
Name: b, dtype: int64

In [19]: df['c'].value_counts(dropna=False)
Out[19]:
NaN     3
 3.0    2
Name: c, dtype: int64
Jarad answered 2/9, 2020 at 15:59 Comment(1)
null_counts is deprecated; use show_counts instead. – Amateurish
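
As the comment notes, on newer pandas versions the same summary is requested with show_counts rather than the deprecated null_counts; a minimal sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2, np.nan],
                   'b': [2, 2, np.nan, 1, np.nan],
                   'c': [np.nan, 3, np.nan, 3, np.nan]})

# show_counts replaces null_counts on newer pandas releases
df.info(verbose=True, show_counts=True)
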
A
2

If you only want a summary of null values for each column, use df.isnull().sum(). If you want to know how many null values there are in the whole DataFrame, use df.isnull().sum().sum() to calculate the total.

Ablepsia answered 7/11, 2018 at 10:58 Comment(0)
A
2

Yet another way to count all the NaNs in a DataFrame:

num_nans = df.size - df.count().sum()
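
Here df.size is the total number of cells (rows × columns) and df.count().sum() is the total number of non-NaN cells, so the difference is the overall NaN count. A quick check against the dfd from the question:

import numpy as np
import pandas as pd

dfd = pd.DataFrame([1, np.nan, 3, 3, 3, np.nan], columns=['a'])

print(dfd.size)                       # 6  (6 rows x 1 column)
print(dfd.count().sum())              # 4  (non-NaN cells)
print(dfd.size - dfd.count().sum())   # 2  (NaN cells)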

Timings:

import timeit

import numpy as np
import pandas as pd

df_scale = 100000
df = pd.DataFrame(
    [[1, np.nan, 100, 63], [2, np.nan, 101, 63], [2, 12, 102, 63],
     [2, 14, 102, 63], [2, 14, 102, 64], [1, np.nan, 200, 63]] * df_scale,
    columns=['group', 'value', 'value2', 'dummy'])

repeat = 3
numbers = 100

setup = """import pandas as pd
from __main__ import df
"""

def timer(statement, _setup=None):
    print (min(
        timeit.Timer(statement, setup=_setup or setup).repeat(
            repeat, numbers)))

timer('df.size - df.count().sum()')
timer('df.isna().sum().sum()')
timer('df.isnull().sum().sum()')

prints:

3.998805362999999
3.7503365439999996
3.689461442999999

So all three are pretty much equivalent.

Autobahn answered 5/12, 2018 at 4:27 Comment(0)
