How to find which columns contain any NaN value in Pandas dataframe

16

277

Given a pandas dataframe containing possible NaN values scattered here and there:
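
For example, an illustrative frame (borrowed from the accepted answer below):

     a    b  c
0  NaN  7.0  0
1  0.0  NaN  4
2  2.0  NaN  4
3  1.0  7.0  0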

Question: How do I determine which columns contain NaN values? In particular, can I get a list of the column names containing NaNs?

Jidda answered 25/3, 2016 at 18:50 Comment(1)
df.isna().any()[lambda x: x] works for me – Baronage
423

UPDATE: using Pandas 0.22.0

Newer pandas versions have the methods DataFrame.isna() and DataFrame.notna():

In [71]: df
Out[71]:
     a    b  c
0  NaN  7.0  0
1  0.0  NaN  4
2  2.0  NaN  4
3  1.0  7.0  0
4  1.0  3.0  9
5  7.0  4.0  9
6  2.0  6.0  9
7  9.0  6.0  4
8  3.0  0.0  9
9  9.0  0.0  1

In [72]: df.isna().any()
Out[72]:
a     True
b     True
c    False
dtype: bool

as a list of column names:

In [74]: df.columns[df.isna().any()].tolist()
Out[74]: ['a', 'b']

to select those columns (containing at least one NaN value):

In [73]: df.loc[:, df.isna().any()]
Out[73]:
     a    b
0  NaN  7.0
1  0.0  NaN
2  2.0  NaN
3  1.0  7.0
4  1.0  3.0
5  7.0  4.0
6  2.0  6.0
7  9.0  6.0
8  3.0  0.0
9  9.0  0.0

OLD answer:

Try using isnull():

In [97]: df
Out[97]:
     a    b  c
0  NaN  7.0  0
1  0.0  NaN  4
2  2.0  NaN  4
3  1.0  7.0  0
4  1.0  3.0  9
5  7.0  4.0  9
6  2.0  6.0  9
7  9.0  6.0  4
8  3.0  0.0  9
9  9.0  0.0  1

In [98]: pd.isnull(df).sum() > 0
Out[98]:
a     True
b     True
c    False
dtype: bool

or, as @root proposed, a clearer version:

In [5]: df.isnull().any()
Out[5]:
a     True
b     True
c    False
dtype: bool

In [7]: df.columns[df.isnull().any()].tolist()
Out[7]: ['a', 'b']

to select a subset - all columns containing at least one NaN value:

In [31]: df.loc[:, df.isnull().any()]
Out[31]:
     a    b
0  NaN  7.0
1  0.0  NaN
2  2.0  NaN
3  1.0  7.0
4  1.0  3.0
5  7.0  4.0
6  2.0  6.0
7  9.0  6.0
8  3.0  0.0
9  9.0  0.0
Regulator answered 25/3, 2016 at 18:54 Comment(8)
Thanks for the response! I am looking to get a list of the column names (I updated my question accordingly), do you know how? – Jidda
Do you know a good way to select all columns with a specific value instead of null values? – Zoography
Nevermind! Simply replace .isnull() with .isin(['xxx']) to search for values instead of nulls: df.columns[df.isin(['xxx']).any()].tolist() – Zoography
@gregorio099, I'd do it this way: df.columns[df.eq(search_for_value).any()].tolist() – Regulator
Nice answer, already upvoted. Idea: can you add the new functions isna, notna? – Transcurrent
How might one go about saving the results of df.columns[df.isna().any()].tolist() to a new column in df by row? I tried df['new_col']=df.columns[df.isna().any()].tolist() but got an error. – Standice
Nice! But how to do it for every row, creating an additional column with the string of the column names with NaN in that row? – Cyn
@AntonMakarov, I think you may want to open a new question with a small reproducible example ;) – Regulator
45

You can use df.isnull().sum(). It shows all columns and the total number of NaNs in each.
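
For example, on the frame from the accepted answer, a sketch of the expected output:

In [1]: df.isnull().sum()
Out[1]:
a    1
b    2
c    0
dtype: int64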

Jungly answered 21/11, 2017 at 17:18 Comment(1)
Do you have a quick approach for setting conditions based on this method? For example, if col4, col5 and col6 are null: df = df[["col1","col2","col3"]] – Osculate
27

I had a problem with too many columns to inspect visually on the screen, so a short list comprehension that filters and returns the offending columns is

nan_cols = [i for i in df.columns if df[i].isnull().any()]

if that's helpful to anyone

Adding to that, if you want to filter out columns having more NaN values than a threshold, say 85%, then use

nan_cols85 = [i for i in df.columns if df[i].isnull().sum() > 0.85*len(df)]
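
A minimal sketch of the threshold variant, assuming a hypothetical frame with one mostly-empty column:

import pandas as pd
import numpy as np

df = pd.DataFrame({'mostly_nan': [np.nan] * 5,
                   'few_nan': [1.0, np.nan, 3.0, 4.0, 5.0]})

# 5/5 NaN in 'mostly_nan' exceeds the 85% cutoff; 1/5 in 'few_nan' does not
nan_cols85 = [i for i in df.columns if df[i].isnull().sum() > 0.85 * len(df)]
print(nan_cols85)  # ['mostly_nan']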

Stranglehold answered 7/8, 2019 at 7:25 Comment(0)
18

This worked for me:

1. For getting the columns (names) having at least one null value:

data.columns[data.isnull().any()]

2. For getting those columns together with their null counts:

data[data.columns[data.isnull().any()]].isnull().sum()

[Optional] 3. For getting the null counts as percentages:

data[data.columns[data.isnull().any()]].isnull().sum() * 100 / data.shape[0]
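
For instance, with data bound to the 10-row frame from the accepted answer, these give (a sketch of the expected output):

data.columns[data.isnull().any()]
# Index(['a', 'b'], dtype='object')

data[data.columns[data.isnull().any()]].isnull().sum()
# a    1
# b    2
# dtype: int64

data[data.columns[data.isnull().any()]].isnull().sum() * 100 / data.shape[0]
# a    10.0
# b    20.0
# dtype: float64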
Lyris answered 17/6, 2020 at 16:25 Comment(0)
9
df.columns[df.isnull().any()].tolist()

It will return the names of the columns that contain null values.

Sassenach answered 9/1, 2021 at 2:3 Comment(0)
8

I know this is a very well-answered question, but I wanted to add a slight adjustment: this version returns only the columns containing nulls, while still showing their counts.

As a one-liner:

pd.isnull(df).sum()[pd.isnull(df).sum() > 0]

Description

1. Count the nulls in each column:
   null_count_ser = pd.isnull(df).sum()
2. Build a True/False series describing whether each column had nulls:
   is_null_ser = null_count_ser > 0
3. Use that boolean series to filter out the columns without nulls:
   null_count_ser[is_null_ser]

Example Output

name          5
phone         187
age           644
Brachiate answered 22/11, 2021 at 16:14 Comment(0)
6

In datasets with a large number of columns, it's even better to see how many columns contain null values and how many don't.

print("No. of columns containing null values")
print(len(df.columns[df.isna().any()]))

print("No. of columns not containing null values")
print(len(df.columns[df.notna().all()]))

print("Total no. of columns in the dataframe")
print(len(df.columns))

For example, my dataframe contained 82 columns, of which 19 contained at least one null value.

Further, you can also automatically remove columns and rows depending on which has more null values. Here is code that does this:

df = df.drop(df.columns[df.isna().sum() > len(df.columns)], axis=1)
df = df.dropna(axis=0).reset_index(drop=True)

Note: the above code removes all of your null values. If you need the null values, process them first.

Florrie answered 7/10, 2019 at 5:2 Comment(0)
3

I use these three lines of code to print out the column names that contain at least one null value:

for column in dataframe:
    if dataframe[column].isnull().any():
        print('{0} has {1} null values'.format(column, dataframe[column].isnull().sum()))
Homeric answered 7/12, 2018 at 16:48 Comment(0)
3

This is one of the methods:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan],
                   'b': [np.nan, 1, np.nan],
                   'c': [np.nan, 2, np.nan],
                   'd': [np.nan, np.nan, np.nan]})
print(pd.isnull(df).sum())

This prints:

a    1
b    2
c    2
d    3
dtype: int64

Cheapjack answered 23/6, 2021 at 12:33 Comment(0)
2

Both of these should work:

df.isnull().sum()
df.isna().sum()

The DataFrame methods isna() and isnull() are identical; isnull() is simply an alias of isna().

Note: empty strings '' are not considered NA; isna() returns False for them.
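
A quick check of that note, as a minimal sketch:

import pandas as pd
import numpy as np

s = pd.Series(['', np.nan, 'x'])
print(s.isna().tolist())  # [False, True, False] -- the empty string is not NA
print(s.isna().sum())     # 1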

Fibro answered 6/5, 2019 at 22:0 Comment(0)
1

df.isna() returns True for NaN values and False for the rest. So, doing:

df.isna().any()

will return True for any column containing a NaN and False for the rest.

Goaltender answered 4/11, 2020 at 14:21 Comment(0)
0

To see just the columns containing NaNs and just the rows containing NaNs:

isnulldf = df.isnull()
columns_containing_nulls = isnulldf.columns[isnulldf.any()]
rows_containing_nulls = df[isnulldf[columns_containing_nulls].any(axis='columns')].index
only_nulls_df = df[columns_containing_nulls].loc[rows_containing_nulls]
print(only_nulls_df)
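
Assuming df is the 10-row frame from the accepted answer, only rows 0-2 contain NaNs, so this prints:

     a    b
0  NaN  7.0
1  0.0  NaN
2  2.0  NaN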
Tridimensional answered 9/7, 2021 at 15:50 Comment(0)
0

import numpy as np

features_with_na = [features for features in dataframe.columns
                    if dataframe[features].isnull().sum() > 0]

for feature in features_with_na:
    print(feature, np.round(dataframe[feature].isnull().mean() * 100, 4), '% missing values')

print(features_with_na)

It will print the percentage of missing values for each such column in the dataframe (mean() alone gives a fraction, hence the * 100).

Manes answered 8/8, 2021 at 17:19 Comment(0)
0

If you want to write it as a one-liner (could be useful if functions need to be called sequentially in a pipeline), then you can do so using either pipe() or passing a callable to loc[]. pipe() can be used to get the columns with NaN values as well.

df.isna().any().pipe(lambda x: x.index[x])

df.isna().any().loc[lambda x: x].index

A working example:

df = pd.DataFrame({
    'a': [1, 2, pd.NA],
    'b': [10, 20, 30],
    'c': [pd.NA, 'B', 'C']
})


df.isna().any().pipe(lambda x: x.index[x])  # Index(['a', 'c'], dtype='object')
df.isna().any().loc[lambda x: x].index      # Index(['a', 'c'], dtype='object')


df.isna().any().pipe(lambda x: df.loc[:, x])


      a     c
0     1  <NA>
1     2     B
2  <NA>     C

If you want the opposite, i.e. columns without any NaN, then notna().all() can be used instead of isna().any().

df.notna().all().pipe(lambda x: x.index[x])  # Index(['b'], dtype='object')
Wilterdink answered 18/9, 2023 at 18:21 Comment(0)
0

If you are looking to print the columns alongside their respective null counts:

null_cols = [i for i in df.columns if df[i].isnull().any()] 
for i in null_cols:
    print(i,df[i].isnull().sum())
Max answered 20/2 at 17:59 Comment(0)
-2

This code finds the columns containing NaN values and returns a list of the column names:

na_names = df.isnull().any()
list(na_names.where(na_names == True).dropna().index)

If you want to find columns whose values are all NaNs, you can replace any with all.
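
A minimal sketch of the all variant, using a hypothetical frame with one fully empty column:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, np.nan], 'b': [np.nan, np.nan]})

na_names = df.isnull().all()
print(list(na_names.where(na_names == True).dropna().index))  # ['b']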

Iva answered 26/1, 2022 at 6:50 Comment(0)
