How do I get a summary count of missing/NaN data by column in 'pandas'?

In R I can quickly see a count of missing data using the summary command, but the equivalent pandas DataFrame method, describe, does not report these values.

I gather I can do something like

len(mydata.index) - mydata.count()

to compute the number of missing values for each column, but I wonder if there's a better idiom (or if my approach is even right).
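
For concreteness, on a small made-up frame this returns one missing count per column:

import numpy as np
import pandas as pd

# toy frame standing in for mydata
mydata = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, np.nan, 6]})
print(len(mydata.index) - mydata.count())
# a    1
# b    2
# dtype: int64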

Applaud answered 7/3, 2014 at 18:8 Comment(1)
Similar question for rows: Python/Pandas: counting the number of missing/NaN in each row (Sweatshop)

Both describe and info report the count of non-missing values.

In [1]: import numpy as np; from pandas import DataFrame; df = DataFrame(np.random.randn(10, 2))

In [2]: df.iloc[3:6,0] = np.nan

In [3]: df
Out[3]: 
          0         1
0 -0.560342  1.862640
1 -1.237742  0.596384
2  0.603539 -1.561594
3       NaN  3.018954
4       NaN -0.046759
5       NaN  0.480158
6  0.113200 -0.911159
7  0.990895  0.612990
8  0.668534 -0.701769
9 -0.607247 -0.489427

[10 rows x 2 columns]

In [4]: df.describe()
Out[4]: 
              0          1
count  7.000000  10.000000
mean  -0.004166   0.286042
std    0.818586   1.363422
min   -1.237742  -1.561594
25%   -0.583795  -0.648684
50%    0.113200   0.216699
75%    0.636036   0.608839
max    0.990895   3.018954

[8 rows x 2 columns]


In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 2 columns):
0    7 non-null float64
1    10 non-null float64
dtypes: float64(2)

To get a count of missing values, your solution is correct:

In [20]: len(df.index)-df.count()
Out[20]: 
0    3
1    0
dtype: int64

You could also do this:

In [23]: df.isnull().sum()
Out[23]: 
0    3
1    0
dtype: int64
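
As a quick sanity check, the two expressions agree (df.isna(), added in pandas 0.21, is an alias for df.isnull()):

# same per-column Series of missing counts either way
assert (len(df.index) - df.count()).equals(df.isna().sum())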
Barbbarba answered 7/3, 2014 at 18:13 Comment(0)

As a tiny addition, to get the percentage missing by DataFrame column, combining @Jeff's and @userS's answers above gets you:

100*(df.isnull().sum())/len(df)
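
Applied to the 10-row df from the accepted answer (three missing values in column 0, none in column 1), this gives:

100*(df.isnull().sum())/len(df)
# 0    30.0
# 1     0.0
# dtype: float64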
Lightner answered 24/3, 2017 at 11:22 Comment(0)

The following will do the trick, returning the count of nulls for every column:

df.isnull().sum(axis=0)

df.isnull() returns a DataFrame of True/False values, and .sum(axis=0) sums those booleans down the rows of each column (axis=0 is the default, so a bare .sum() behaves the same).
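
If you instead want the per-row counts mentioned in the comments on the question, the same idea works with axis=1:

# count of nulls in each row rather than each column
df.isnull().sum(axis=1)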

Playtime answered 31/8, 2019 at 2:13 Comment(0)

This isn't quite a full summary, but it will give you a quick sense of your column-level data:

def getPctMissing(series):
    # number of missing values in the column
    num = series.isnull().sum()
    # divide by the full length; series.count() would exclude the NaNs
    # and overstate the percentage
    den = len(series)
    return 100*(num/den)
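
To run it over every column at once you can apply it column-wise; a minimal sketch, assuming a DataFrame df like the ones above:

# df.apply passes each column in as a Series,
# so this returns one percentage per column
df.apply(getPctMissing)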
Thalassic answered 28/1, 2016 at 20:0 Comment(0)

If you want to see a non-null summary of each column, just use df.info(null_counts=True):

Example 1:

df = pd.DataFrame(np.random.randn(10,5), columns=list('abcde'))
df.iloc[:4,0] = np.nan
df.iloc[:3,1] = np.nan
df.iloc[:2,2] = np.nan
df.iloc[:1,3] = np.nan

df.info(null_counts=True)

output:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       6 non-null      float64
 1   b       7 non-null      float64
 2   c       8 non-null      float64
 3   d       9 non-null      float64
 4   e       10 non-null     float64
dtypes: float64(5)
memory usage: 528.0 bytes

In addition, if you want to customize the result, such as adding a nan_rate, I wrote a method:


def describe_nan(df):
    return pd.DataFrame(
        [(i, df[df[i].isna()].shape[0], df[df[i].isna()].shape[0] / df.shape[0])
         for i in df.columns],
        columns=['column', 'nan_counts', 'nan_rate'])

describe_nan(df)

  column  nan_counts  nan_rate
0      a           4       0.4
1      b           3       0.3
2      c           2       0.2
3      d           1       0.1
4      e           0       0.0

Curable answered 6/5, 2021 at 2:52 Comment(1)
null_counts is deprecated now; use show_counts instead (Stomatology)

If you don't care which columns have NaNs and just want to check the DataFrame overall, add a second .sum() to get a single value:

result = df.isnull().sum().sum()
result > 0

A Series would only need one .sum(), and a Panel (since removed from pandas) would need three.
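
A variant often used for this overall yes/no check goes through the underlying NumPy array:

# True if any value anywhere in the frame is missing
df.isnull().values.any()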

Octangular answered 26/4, 2017 at 15:52 Comment(0)

This code makes your life easy:

import sidetable

df.stb.missing()

Check this out : https://github.com/chris1610/sidetable
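
For context, a minimal self-contained sketch (the toy frame here is made up; sidetable must be installed, e.g. via pip):

import numpy as np
import pandas as pd
import sidetable  # registers the .stb accessor on DataFrames

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, np.nan, 6]})
df.stb.missing()  # per-column missing count, total and percent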


Islean answered 3/3, 2023 at 23:14 Comment(0)

I had to process numerous large datasets to get NaN information (counts and portions per column), and timing was an issue. So I timed various methods for getting summary counts of NaNs per column into a separate dataframe with column names, NaN counts and NaN portions as columns:

import numpy as np
import pandas as pd

# create random dataframe
dfa = pd.DataFrame(np.random.randn(100000, 300))
# mask roughly 30% of the values with random NaNs
dfa = dfa.mask(np.random.random(dfa.shape) < 0.3)

With pandas methods only:

%%timeit
nans_dfa = dfa.isna().sum().rename_axis('Columns').reset_index(name='Counts')
nans_dfa["NaNportions"] = nans_dfa["Counts"] / dfa.shape[0]

# Output:
# 10 loops, best of 5: 57.8 ms per loop

Using list comprehension, based on the fine answer from @Mithril:

%%timeit
nan_dfa_loop2 = pd.DataFrame(
    [(col, dfa[dfa[col].isna()].shape[0], dfa[dfa[col].isna()].shape[0] / dfa.shape[0])
     for col in dfa.columns],
    columns=('Columns', 'Counts', 'NaNportions'))

# Output:
# 1 loop, best of 5: 13.9 s per loop

Using a list comprehension with a second for clause that binds the result of the isna().sum() call to a name, avoiding repeated method calls:

%%timeit
# the single-element tuple binds n once per column; note that the
# 'if n' clause drops columns with zero NaNs (none occur in this test data)
nan_dfa_loop1 = pd.DataFrame(
    [(col, n, n / dfa.shape[0])
     for col in dfa.columns
     for n in (dfa[col].isna().sum(),) if n],
    columns=('Columns', 'Counts', 'NaNportions'))

# Output:
# 1 loop, best of 5: 373 ms per loop

All the above will produce the same dataframe:

    Columns Counts  NaNportions
0   0   29902   0.29902
1   1   30101   0.30101
2   2   30008   0.30008
3   3   30194   0.30194
4   4   29856   0.29856
... ... ... ...
295 295 29823   0.29823
296 296 29818   0.29818
297 297 29979   0.29979
298 298 30050   0.30050
299 299 30192   0.30192

('Columns' is redundant with this test dataframe. It is just used as placeholder where in a real life dataset it would probably represent the names of the attributes in the initial dataframe.)

Fibril answered 13/10, 2021 at 12:37 Comment(0)

A more verbose option, which prints the True/False value counts for each column that contains missing values:

missed_values = df.isnull()

for col in missed_values.columns:
    if missed_values[col].any():
        print(col, missed_values[col].value_counts())
O answered 22/12, 2022 at 10:41 Comment(0)

If you want to check which of the columns have missing values, you can go for:

mydata.isna().any()

which will print True for each column that has any missing value:

col1  False
col2  False
col3  True

If you want the count of missing values, then you can type:

mydata.isna().sum()

which will print the number of missing values in each column:

col1  0
col2  0
col3  5
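
Combining the two, a small sketch to list only the columns that actually have missing values:

counts = mydata.isna().sum()
counts[counts > 0]   # here: just col3    5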
Khudari answered 30/10, 2023 at 13:2 Comment(0)

I'd recommend using the missingno package (https://github.com/ResidentMario/missingno), which allows you to quickly and easily visualize missing data from a pandas dataframe. My preferred visualization is a bar chart, but they have others.

import missingno as msno
import pandas as pd

df = pd.read_csv('some_data.csv')

msno.bar(df.sample(1000))
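
One of those others is the nullity matrix, which shows where in each column the gaps fall:

# one vertical band per column, with white gaps at the missing rows
msno.matrix(df.sample(1000))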
Devalue answered 6/10, 2022 at 7:57 Comment(0)
