Remove NaN/NULL columns in a Pandas dataframe?

Asked 1/6, 2012 at 22:5 Answered 12/5, 2021 at 23:33

I have a DataFrame in pandas, and several of its columns contain only null values. Is there a built-in function that will let me remove those columns?

Holmen answered 1/6, 2012 at 22:5 Comment(1)

could you maybe accept the answer? This will mark the question as resolved and help other users as well. – Gaitan 1/11, 2016 at 9:16

123

Yes, dropna. See http://pandas.pydata.org/pandas-docs/stable/missing_data.html and the DataFrame.dropna docstring:

Definition: DataFrame.dropna(self, axis=0, how='any', thresh=None, subset=None)
Docstring:
Return object with labels on given axis omitted where alternately any
or all of the data are missing

Parameters
----------
axis : {0, 1}
how : {'any', 'all'}
    any : if any NA values are present, drop that label
    all : if all values are NA, drop that label
thresh : int, default None
    int value : require that many non-NA values
subset : array-like
    Labels along other axis to consider, e.g. if you are dropping rows
    these would be a list of columns to include

Returns
-------
dropped : DataFrame

The specific command to run would be:

df = df.dropna(axis=1, how='all')
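
As a minimal sketch of what this call does, here is a small made-up frame (the column names a, b and c are illustrative and not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical example data: column "b" is entirely NaN, "c" is only partly NaN.
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [np.nan, np.nan, np.nan],
    "c": [1.0, np.nan, 3.0],
})

# Drop only the columns in which every value is missing.
cleaned = df.dropna(axis=1, how="all")
print(list(cleaned.columns))
```

Column c survives because how='all' only drops columns with no non-missing values at all; how='any' would drop it as well.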

Kaine answered 2/6, 2012 at 4:52 Comment(4)

can you specify the 'dropna' value? for example could you drop rows that are all zeros? – Joelie 10/10, 2012 at 19:15

You could either tell the pandas IO parsers that 0 is the NaN value in your input tables, or preprocess the frame like this: df[df==0] = np.nan ; df=df.dropna(axis=1,how='all') – Gaygaya 11/12, 2012 at 1:50

For inplace: df.dropna(axis=1,how='all',inplace=True) – Lissettelissi 22/11, 2018 at 0:33

I used df=df.dropna(axis=1,how='all') but it removed all my df columns. Other columns are not entirely empty. – Jonajonah 6/1, 2020 at 23:17

Another solution would be to create a boolean dataframe with True values at not-null positions and then take the columns having at least one True value. This removes columns with all NaN values.

df = df.loc[:,df.notna().any(axis=0)]

If you want to remove columns having at least one missing (NaN) value:

df = df.loc[:,df.notna().all(axis=0)]

This approach is particularly useful for removing columns that contain empty strings, zeros, or basically any given value. For example:

df = df.loc[:,(df!='').all(axis=0)]

removes columns having at least one empty string.
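
The same masking idea can be sketched for zeros. The data and the column names x, y and z below are made up for illustration; the mask keeps every column that has at least one non-zero value, so all-zero columns are dropped:

```python
import pandas as pd

# Hypothetical example data: column "x" contains only zeros.
df = pd.DataFrame({
    "x": [0, 0, 0],
    "y": [0, 5, 0],
    "z": [1, 2, 3],
})

# (df != 0) is True wherever a value is non-zero;
# .any(axis=0) is True for columns with at least one such value.
nonzero_cols = df.loc[:, (df != 0).any(axis=0)]
print(list(nonzero_cols.columns))
```

Swapping .any for .all would instead drop every column that contains a zero anywhere, which mirrors the notna().any versus notna().all distinction above.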

Extemporaneous answered 12/5, 2021 at 23:33 Comment(0)

Here is a simple function that you can use directly by passing it a dataframe and a threshold (a percentage):

df
'''
     pets   location     owner     id
0     cat  San_Diego     Champ  123.0
1     dog        NaN       Ron    NaN
2     cat        NaN     Brick    NaN
3  monkey        NaN     Champ    NaN
4  monkey        NaN  Veronica    NaN
5     dog        NaN      John    NaN
'''

def rmissingvaluecol(dff, threshold):
    # Percentage of missing values in each column
    pct_missing = 100 * dff.isnull().sum() / len(dff.index)
    # Names of the columns to keep: those below the threshold
    l = list(pct_missing[pct_missing < threshold].index)
    print("# Columns having more than %s percent missing values:" % threshold, (dff.shape[1] - len(l)))
    # sorted() makes the reported order deterministic
    print("Columns:\n", sorted(set(dff.columns) - set(l)))
    return l


rmissingvaluecol(df,1) #Here threshold is 1%, which means we drop columns in which 1% or more of the values are missing

#output
'''
# Columns having more than 1 percent missing values: 2
Columns:
 ['id', 'location']
'''

Now create a new dataframe excluding these columns:

l = rmissingvaluecol(df,1)
df1 = df[l]

PS: You can change threshold as per your requirement

Bonus step

You can find the percentage of missing values for each column (optional)

def missing(dff):
    # Percentage of missing values per column, highest first
    print(round(dff.isnull().sum() * 100 / len(dff), 2).sort_values(ascending=False))

missing(df)

#output
'''
id          83.33
location    83.33
owner        0.00
pets         0.00
dtype: float64
'''

Belsen answered 19/6, 2019 at 15:15 Comment(1)

This answer is inferior: df.dropna(..., thresh) already implements this, we just need to calculate the right value. And you don't need to create any new dataframe, you just do df.dropna(..., inplace=True). – Gyrostat 9/9, 2019 at 23:59
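
A rough sketch of the thresh approach mentioned in this comment, using the sample data from the answer. The 50% cutoff and the name min_non_na_fraction are assumptions for illustration. Note that thresh counts the required number of non-missing values, so the rule here is "keep columns that are at least 50% populated", which is not exactly the same rule as the function above:

```python
import math

import numpy as np
import pandas as pd

# Sample data reconstructed from the answer above.
df = pd.DataFrame({
    "pets": ["cat", "dog", "cat", "monkey", "monkey", "dog"],
    "location": ["San_Diego"] + [np.nan] * 5,
    "owner": ["Champ", "Ron", "Brick", "Champ", "Veronica", "John"],
    "id": [123.0] + [np.nan] * 5,
})

# Assumption for illustration: keep columns that are at least 50% non-missing.
min_non_na_fraction = 0.5
thresh = math.ceil(len(df) * min_non_na_fraction)  # minimum count of non-NA values

# Columns with fewer than `thresh` non-NA values are dropped.
df1 = df.dropna(axis=1, thresh=thresh)
print(list(df1.columns))
```

Here location and id each have only one non-missing value out of six rows, so both fall below the computed thresh and are dropped.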

-2

Function for removing all null columns from the data frame:

import pandas as pd

def Remove_Null_Columns(df):
    # Copy over only the columns that are not entirely null
    dff = pd.DataFrame()
    for cl in df.columns:
        if df[cl].isnull().sum() == len(df[cl]):
            pass
        else:
            dff[cl] = df[cl]
    return dff

This function will remove all Null columns from the df.

Hardigg answered 29/6, 2018 at 6:41 Comment(1)

Please, if you answer something, at least follow a proper style guide like PEP 8... Also, pandas offers the dropna() function, so this is not a good answer... – Groundmass 4/9, 2018 at 11:38
