I have a DataFrame in pandas and several of the columns have all null values. Is there a built-in function which will let me remove those columns?
Yes, dropna. See http://pandas.pydata.org/pandas-docs/stable/missing_data.html and the DataFrame.dropna docstring:
Definition: DataFrame.dropna(self, axis=0, how='any', thresh=None, subset=None)
Docstring:
Return object with labels on given axis omitted where alternately any
or all of the data are missing

Parameters
----------
axis : {0, 1}
how : {'any', 'all'}
    any : if any NA values are present, drop that label
    all : if all values are NA, drop that label
thresh : int, default None
    int value : require that many non-NA values
subset : array-like
    Labels along other axis to consider, e.g. if you are dropping rows
    these would be a list of columns to include

Returns
-------
dropped : DataFrame
The specific command to run would be:
df=df.dropna(axis=1,how='all')
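To make the effect of that command concrete, here is a minimal sketch on a small made-up DataFrame (the column names "a", "b" and "c" are hypothetical, not from the question). Only the column in which every value is missing is dropped; a column with some missing values is kept.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: "b" is entirely NaN, "c" is only partly NaN.
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [np.nan, np.nan, np.nan],
    "c": [1.0, np.nan, 3.0],
})

# Drop only the columns in which every value is missing.
result = df.dropna(axis=1, how="all")
print(list(result.columns))  # ['a', 'c']
```

Note that dropna returns a new object here, which is why the result has to be assigned.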
df[df==0] = np.nan ; df=df.dropna(axis=1,how='all') – Gaygaya
df.dropna(axis=1,how='all',inplace=True) – Lissettelissi
df=df.dropna(axis=1,how='all') but it removed all my df columns. Other columns are not entirely empty. – Jonajonah
Another solution would be to create a boolean DataFrame with True values at non-null positions and then take the columns having at least one True value. This removes columns with all NaN values.
df = df.loc[:,df.notna().any(axis=0)]
If you want to remove columns having at least one missing (NaN) value:
df = df.loc[:,df.notna().all(axis=0)]
This approach is particularly useful for removing columns containing empty strings, zeros or basically any given value. For example:
df = df.loc[:,(df!='').all(axis=0)]
removes columns having at least one empty string.
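The same mask pattern can be sketched for zeros. The example below uses a made-up, all-numeric DataFrame (the column names are hypothetical) and shows the difference between any and all: any keeps a column if at least one entry passes the test, all keeps it only if every entry does.

```python
import pandas as pd

# Hypothetical numeric data: "x" is all zeros, "y" contains one zero, "w" has none.
df = pd.DataFrame({
    "x": [0, 0, 0],
    "y": [1, 0, 2],
    "w": [3, 4, 5],
})

# Keep columns with at least one non-zero entry (drops only the all-zero "x").
no_all_zero = df.loc[:, (df != 0).any(axis=0)]

# Keep columns with no zero at all (drops "x" and "y").
no_any_zero = df.loc[:, (df != 0).all(axis=0)]

print(list(no_all_zero.columns))  # ['y', 'w']
print(list(no_any_zero.columns))  # ['w']
```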
Here is a simple function which you can use directly by passing the DataFrame and a threshold:
df
'''
     pets   location     owner     id
0     cat  San_Diego     Champ  123.0
1     dog        NaN       Ron    NaN
2     cat        NaN     Brick    NaN
3  monkey        NaN     Champ    NaN
4  monkey        NaN  Veronica    NaN
5     dog        NaN      John    NaN
'''
def rmissingvaluecol(dff, threshold):
    # Percentage of missing values in each column
    pct_missing = 100 * dff.isnull().sum() / len(dff.index)
    # Keep the columns whose missing percentage does not exceed the threshold
    l = list(pct_missing[pct_missing <= threshold].index)
    print("# Columns having more than %s percent missing values:" % threshold, (dff.shape[1] - len(l)))
    print("Columns:\n", list(set(dff.columns) - set(l)))
    return l

rmissingvaluecol(df, 1)  # Here the threshold is 1%, which means we are going to drop columns having more than 1% missing values
#output
'''
# Columns having more than 1 percent missing values: 2
Columns:
['id', 'location']
'''
Now create a new DataFrame excluding these columns:
l = rmissingvaluecol(df,1)
df1 = df[l]
PS: You can change the threshold as per your requirement.
Bonus step
You can find the percentage of missing values for each column (optional):
def missing(dff):
    print(round((dff.isnull().sum() * 100 / len(dff)), 2).sort_values(ascending=False))

missing(df)
#output
'''
id 83.33
location 83.33
owner 0.00
pets 0.00
dtype: float64
'''
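The thresh parameter listed in the dropna docstring can express a similar percentage cutoff, since it is the minimum number of non-NA values a column must have to be kept. The sketch below rebuilds the example DataFrame and converts a percentage into that count; the name max_missing_pct and the 50% value are illustrative assumptions, not part of the answer above.

```python
import math

import numpy as np
import pandas as pd

# The example DataFrame from above, rebuilt so the snippet runs on its own.
df = pd.DataFrame({
    "pets": ["cat", "dog", "cat", "monkey", "monkey", "dog"],
    "location": ["San_Diego", np.nan, np.nan, np.nan, np.nan, np.nan],
    "owner": ["Champ", "Ron", "Brick", "Champ", "Veronica", "John"],
    "id": [123.0, np.nan, np.nan, np.nan, np.nan, np.nan],
})

max_missing_pct = 50  # hypothetical cutoff: tolerate at most 50% missing values

# thresh is a count of required NON-missing values, so convert the percentage.
min_non_na = math.ceil(len(df) * (1 - max_missing_pct / 100))

# Columns with fewer than min_non_na non-NA values are dropped.
result = df.dropna(axis=1, thresh=min_non_na)
print(list(result.columns))  # ['pets', 'owner']
```

With six rows and a 50% cutoff, min_non_na is 3, so "location" and "id" (one non-NA value each) are dropped. For cutoffs that do not divide evenly, floating-point rounding in the conversion may matter at the boundary, so it is worth checking the computed count.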
df.dropna(..., thresh) implements this, we just need to calculate the right value. And you don't need to create any new dataframe, you just do df.dropna(..., inplace=True). – Gyrostat
Function for removing all null columns from the data frame:
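def Remove_Null_Columns(df):
    # Start from an empty frame that shares the original index
    dff = pd.DataFrame(index=df.index)
    for cl in df.columns:
        # Skip a column when every one of its values is null
        if df[cl].isnull().sum() == len(df[cl]):
            pass
        else:
            dff[cl] = df[cl]
    return dff
This function returns a new DataFrame without the columns that are entirely null.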