Pandas: print column name with missing values
Asked Answered
P

8

38

I am trying to print or to get list of columns name with missing values. E.g.

data1 data2 data3  
1     3     3  
2     NaN   5  
3     4     NaN  

I want to get ['data2', 'data3']. I wrote following code:

print('\n'.join(map(
    lambda x : str(x[1])
    ,(filter(lambda z: z[0] != False, zip(train.isnull().any(axis=0), train.columns.values)))
)))

It works well, but I think should be simpler way.

Pervert answered 21/5, 2016 at 18:37 Comment(1)
Thank you all for answers. They all work, but I decided to check execution time. I got following values: [col for col in df.columns if df[col].isnull().any()] Time elapsed: 0:00:00.391752 df.columns[df.isnull().any()] Time elapsed: 0:00:00.396459 df.loc[:, df.isnull().any()] Time elapsed: 0:00:00.430103 I decided to use second variant, because it is short and fast both.Pervert
F
113

df.isnull().any() generates a boolean array (True if the column has a missing value, False otherwise). You can use it to index into df.columns:

df.columns[df.isnull().any()]

will return a list of the columns which have missing values.


df = pd.DataFrame({'A': [1, 2, 3], 
                   'B': [1, 2, np.nan], 
                   'C': [4, 5, 6], 
                   'D': [np.nan, np.nan, np.nan]})

df
Out: 
   A    B  C   D
0  1  1.0  4 NaN
1  2  2.0  5 NaN
2  3  NaN  6 NaN

df.columns[df.isnull().any()]
Out: Index(['B', 'D'], dtype='object')

df.columns[df.isnull().any()].tolist()  # to get a list instead of an Index object
Out: ['B', 'D']
Freebooter answered 21/5, 2016 at 19:50 Comment(0)
S
18

Oneliner -

[col for col in df.columns if df[col].isnull().any()]
Sucking answered 21/5, 2016 at 18:47 Comment(0)
D
14
import numpy as np
import pandas as pd

raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', np.nan, np.nan, 'Milner', 'Cooze'], 
        'age': [22, np.nan, 23, 24, 25], 
        'sex': ['m', np.nan, 'f', 'm', 'f'], 
        'Test1_Score': [4, np.nan, 0, 0, 0],
        'Test2_Score': [25, np.nan, np.nan, 0, 0]}
results = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'Test1_Score', 'Test2_Score'])

results 
'''
  first_name last_name   age  sex  Test1_Score  Test2_Score
0      Jason    Miller  22.0    m          4.0         25.0
1        NaN       NaN   NaN  NaN          NaN          NaN
2       Tina       NaN  23.0    f          0.0          NaN
3       Jake    Milner  24.0    m          0.0          0.0
4        Amy     Cooze  25.0    f          0.0          0.0
'''

You can use following function, which will give you output in Dataframe

  • Zero Values
  • Missing Values
  • % of Total Values
  • Total Zero Missing Values
  • % Total Zero Missing Values
  • Data Type

Just copy and paste following function and call it by passing your pandas Dataframe

def missing_zero_values_table(df):
        zero_val = (df == 0.00).astype(int).sum(axis=0)
        mis_val = df.isnull().sum()
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
        mz_table = mz_table.rename(
        columns = {0 : 'Zero Values', 1 : 'Missing Values', 2 : '% of Total Values'})
        mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
        mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
        mz_table['Data Type'] = df.dtypes
        mz_table = mz_table[
            mz_table.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.\n"      
            "There are " + str(mz_table.shape[0]) +
              " columns that have missing values.")
#         mz_table.to_excel('D:/sampledata/missing_and_zero_values.xlsx', freeze_panes=(1,0), index = False)
        return mz_table

missing_zero_values_table(results)

Output

Your selected dataframe has 6 columns and 5 Rows.
There are 6 columns that have missing values.

             Zero Values  Missing Values  % of Total Values  Total Zero Missing Values  % Total Zero Missing Values Data Type
last_name              0               2               40.0                          2                         40.0    object
Test2_Score            2               2               40.0                          4                         80.0   float64
first_name             0               1               20.0                          1                         20.0    object
age                    0               1               20.0                          1                         20.0   float64
sex                    0               1               20.0                          1                         20.0    object
Test1_Score            3               1               20.0                          4                         80.0   float64

If you want to keep it simple then you can use following function to get missing values in %

def missing(dff):
    print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))


missing(results)
'''
Test2_Score    40.0
last_name      40.0
Test1_Score    20.0
sex            20.0
age            20.0
first_name     20.0
dtype: float64
'''
Damask answered 19/6, 2019 at 15:38 Comment(0)
V
8

Another alternative:

df.loc[:, df.isnull().any()]
Vedda answered 21/5, 2016 at 20:21 Comment(0)
F
0
# Developing a loop to identify and remove columns where more than 50% of the values are missing#

 i = 0

 count_of_columns_removed = 0

 a = np.array([50,60,70,80,90,100])

 percent_NA = 0

for i in app2.columns:

    percent_NA = round(100*(app2[i].isnull().sum()/len(app2.index)),2)     
    # Replace app2 with relevant name

    if percent_NA >= a.all():
        print(i)
        app2 = app2.drop(columns=i)
        count_of_columns_removed += 1

print(count_of_columns_removed)
Fusiform answered 15/6, 2019 at 21:55 Comment(1)
Hi Vinit, welcome! Be sure to add some text with the code as it might help other understand your answer.Recondition
S
0

For a dataframe df

missing = df.isnull().sum()
print(missing)
Scalise answered 22/7, 2020 at 19:56 Comment(0)
B
0
df.columns[df.isnull().any()].index
Brumaire answered 27/8, 2020 at 13:14 Comment(1)
Welcome to Stack Overflow! Please read How to Answer. Refrain from answering code-only and always remember that you are not only solving the problem at hand, but also educate any future readers of this question, such as yourself given the question was posted 4 years ago. Please edit the answer to contain some explanation as to why this would solve the problem at hand.Crackbrained
W
0

To get the names of the column names which does NOT have any missing values:

set(df.columns[df.isnull().mean()==0])
Wenda answered 17/4, 2021 at 19:25 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.