I had to extract NaN information (counts and portions per column) from numerous large datasets, and timing was an issue. So I timed several methods of summarizing the NaNs per column into a separate dataframe, with the column names, NaN counts and NaN portions as its columns:
import numpy as np
import pandas as pd

# create a random dataframe
dfa = pd.DataFrame(np.random.randn(100000, 300))
# mask ~30% of the cells with NaNs
dfa = dfa.mask(np.random.random(dfa.shape) < 0.3)
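As a quick sanity check (not part of the timed code), the overall NaN share should land near the 30% target:

print(dfa.isna().mean().mean())  # per-column NaN fractions, averaged; ~0.30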
With pandas methods only:
%%timeit
nans_dfa = dfa.isna().sum().rename_axis('Columns').reset_index(name='Counts')
nans_dfa["NaNportions"] = nans_dfa["Counts"] / dfa.shape[0]
# Output:
# 10 loops, best of 5: 57.8 ms per loop
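For reference, a variant I did not time that builds the same frame from a single `isna()` pass, using `mean()` to get the portions directly (the name `nans_dfa_alt` is mine):

na = dfa.isna()
nans_dfa_alt = (
    pd.DataFrame({"Counts": na.sum(), "NaNportions": na.mean()})
    .rename_axis("Columns")
    .reset_index()
)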
Using a list comprehension, based on the fine answer from @Mithril:
%%timeit
nan_dfa_loop2 = pd.DataFrame(
    [(col, dfa[dfa[col].isna()].shape[0], dfa[dfa[col].isna()].shape[0] / dfa.shape[0])
     for col in dfa.columns],
    columns=('Columns', 'Counts', 'NaNportions'))
# Output:
# 1 loop, best of 5: 13.9 s per loop
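The slowdown comes from evaluating `dfa[col].isna()` twice per column and paying for a full boolean-indexing pass of the whole dataframe just to read `.shape[0]`. Unrolled, the comprehension amounts to this (a sketch for illustration only):

rows = []
for col in dfa.columns:
    count = dfa[dfa[col].isna()].shape[0]                    # filters all of dfa...
    portion = dfa[dfa[col].isna()].shape[0] / dfa.shape[0]   # ...and filters it again
    rows.append((col, count, portion))
nan_dfa_loop2 = pd.DataFrame(rows, columns=('Columns', 'Counts', 'NaNportions'))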
Using a list comprehension with a nested single-element for clause that binds the count to `n`, so `isna().sum()` is called only once per column (note that the `if n` also skips columns without any NaNs):
%%timeit
nan_dfa_loop1 = pd.DataFrame(
    [(col, n, n / dfa.shape[0])
     for col in dfa.columns
     for n in (dfa[col].isna().sum(),) if n],
    columns=('Columns', 'Counts', 'NaNportions'))
# Output:
# 1 loop, best of 5: 373 ms per loop
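The `for n in (...,)` clause is the single-element-tuple trick for binding a value to a name inside a comprehension. On Python 3.8+ a walrus assignment expresses the same caching; an untimed sketch (the name `nan_dfa_walrus` is mine), with the same `if n` filtering:

nan_dfa_walrus = pd.DataFrame(
    [(col, n, n / dfa.shape[0])
     for col in dfa.columns
     if (n := dfa[col].isna().sum())],   # bind the count once, skip zero-count columns
    columns=('Columns', 'Counts', 'NaNportions'))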
All the above will produce the same dataframe:
Columns Counts NaNportions
0 0 29902 0.29902
1 1 30101 0.30101
2 2 30008 0.30008
3 3 30194 0.30194
4 4 29856 0.29856
... ... ... ...
295 295 29823 0.29823
296 296 29818 0.29818
297 297 29979 0.29979
298 298 30050 0.30050
299 299 30192 0.30192
(The 'Columns' column is redundant for this test dataframe, where the column labels are just integers; it is a placeholder for the attribute names a real-life dataset would have.)
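To double-check that the methods agree, something like the following should pass here. Re-run the three assignments without `%%timeit` first, since the magic does not keep the variables it times; and this assumes every column contains at least one NaN, otherwise the `if n` variant drops rows:

pd.testing.assert_frame_equal(nans_dfa, nan_dfa_loop2)
pd.testing.assert_frame_equal(nans_dfa, nan_dfa_loop1)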