How to find the number of nulls in every column of a polars dataframe?

Asked 10/5, 2023 at 14:26 Answered 10/9, 2023 at 10:50

Solved python pandas dataframe null python-polars

In pandas, one can do:

import pandas as pd

d = {"foo":[1,2,3, None], "bar":[4,None, None, 6]}
df_pandas = pd.DataFrame.from_dict(d)
dict(df_pandas.isnull().sum())

[out]:

{'foo': 1, 'bar': 2}

In polars it's possible to do the same by looping through the columns:

import polars as pl

d = {"foo":[1,2,3, None], "bar":[4,None, None, 6]}
df_polars = pl.from_dict(d)

{col:df_polars[col].is_null().sum() for col in df_polars.columns}

Looping through the columns in polars is particularly painful when using a LazyFrame, because the .collect() then has to be done in chunks to do the aggregation.

Is there a way to find the number of nulls in every column of a polars dataframe without looping through each column?

Crinite answered 10/5, 2023 at 14:26 Comment(2)

Maybe df_polars.collect().null_count()? How does that work with LazyFrame? – Crinite 10/5, 2023 at 14:36

Any way to speed it up, esp. when df_polars.collect() is not the best thing to do for a large dataset? – Crinite 10/5, 2023 at 14:40

Assuming you're not married to the output format, the idiomatic way to do it is...

df.select(pl.all().is_null().sum())

However if you really like the dict output you can easily get it...

df.select(pl.all().is_null().sum()).to_dicts()[0]

The way this works is that inside the select we start with pl.all(), which means all of the columns, and then, much like in the pandas version, we apply is_null, which returns True/False for each value. From that we chain sum, which turns the Trues into 1s and gives you the number of nulls in each column.

There's also the dedicated null_count(), so you don't have to chain is_null().sum(). Thanks to @jqurious for that tip.

Faceless answered 10/5, 2023 at 15:0 Comment(2)

Cool! And the collect comes after the select. Awesome! – Crinite 10/5, 2023 at 15:7

There is also a dedicated .null_count – Furfur 10/5, 2023 at 15:25

If you want row-wise counts, use this instead: df.hstack(df.transpose().select(pl.all().is_null().sum()).transpose().rename({"column_0": "null_count"}))

Shend answered 10/9, 2023 at 10:50 Comment(1)

.sum_horizontal() would be the idiomatic way to do it e.g. df.with_columns(null_count = pl.sum_horizontal(pl.all().is_null())) - transpose is a costly operation and is best avoided if possible. – Furfur 10/9, 2023 at 11:48
