Identifying consecutive NaNs with Pandas
I am reading in a bunch of CSV files (measurement data for water levels over time) to do various analysis and visualizations on them.

Due to various reasons beyond my control, these time series often have missing data, so I do two things:

I count them in total with

Rlength = len(RainD)   # Counts everything, including NaN
Rcount = RainD.count() # Counts only valid numbers
NaN_Number = Rlength - Rcount

and discard the dataset if I have more missing data than a certain threshold:

Percent_Data = Rlength/100
Five_Percent = Percent_Data*5
if NaN_Number > Five_Percent:
    ...

If the number of NaN is sufficiently small, I would like to fill the gaps with

RainD.level = RainD.level.fillna(method='pad', limit=2)

And now for the issue: It's monthly data, so if I have more than two consecutive NaNs, I also want to discard the data, since that would mean that I "guess" a whole season, or even more.

The documentation for fillna doesn't really say what happens when there are more consecutive NaNs than my specified limit=2. But when I look at RainD.describe() before and after ...fillna... and compare it with the base CSV, it's clear that it fills the first two NaNs and then leaves the rest as they are, instead of raising an error.
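That limit behaviour can be seen on a minimal made-up series (a sketch, not from the original question; `ffill(limit=2)` is the modern spelling of `fillna(method='pad', limit=2)`):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])
# Forward-fill at most two values per gap: the third NaN in the run survives.
filled = s.ffill(limit=2)
print(filled.tolist())  # [1.0, 1.0, 1.0, nan, 5.0]
```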

So, long story short:

How do I identify a number of consecutive NaNs with Pandas, without a complicated and time-consuming non-Pandas loop?

Unconquerable answered 12/3, 2015 at 10:54

You can use multiple boolean conditions to test if the current value and previous value are NaN:

In [3]:

df = pd.DataFrame({'a':[1,3,np.nan, np.nan, 4, np.nan, 6,7,8]})
df
Out[3]:
    a
0   1
1   3
2 NaN
3 NaN
4   4
5 NaN
6   6
7   7
8   8
In [6]:

df[(df.a.isnull()) & (df.a.shift().isnull())]
Out[6]:
    a
3 NaN

If you want to find runs of more than 2 consecutive NaNs, you can do the following:

In [38]:

df = pd.DataFrame({'a':[1,2,np.nan, np.nan, np.nan, 6,7,8,9,10,np.nan,np.nan,13,14]})
df
Out[38]:
     a
0    1
1    2
2  NaN
3  NaN
4  NaN
5    6
6    7
7    8
8    9
9   10
10 NaN
11 NaN
12  13
13  14

In [41]:

df.a.isnull().astype(int).groupby(df.a.notnull().astype(int).cumsum()).sum()
Out[41]:
a
1    0
2    3
3    0
4    0
5    0
6    0
7    2
8    0
9    0
Name: a, dtype: int32
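To tie this back to the question's discard logic, the per-group sums can be reduced to a single longest-run length (a sketch; the helper name `max_consecutive_nans` and the threshold of 2 are assumptions matching the question, not part of the original answer):

```python
import numpy as np
import pandas as pd

def max_consecutive_nans(s: pd.Series) -> int:
    # The cumulative count of non-NaN entries stays constant across a run
    # of NaNs, so it serves as a group key for each run.
    groups = s.notnull().cumsum()
    # Sum the NaN flags within each group and take the largest run.
    return int(s.isnull().astype(int).groupby(groups).sum().max())

series = pd.Series([1, 2, np.nan, np.nan, np.nan, 6, 7, 8, 9, 10,
                    np.nan, np.nan, 13, 14])
print(max_consecutive_nans(series))  # 3: the longest NaN run above
```

With that in hand, `if max_consecutive_nans(series) > 2: ...` mirrors the question's discard rule.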
Kwashiorkor answered 12/3, 2015 at 11:10 Comment(1)
How can I get the index of these? – Sleety

If you wish to map this back to the original index, or want a running count of consecutive NaNs, use Ed's answer with cumsum instead of sum. This is particularly useful for visualising NaN groups in time series:

df = pd.DataFrame({'a':[
    1,2,np.nan, np.nan, np.nan, 6,7,8,9,10,np.nan,np.nan,13,14
]})

df.a.isnull().astype(int).groupby(df.a.notnull().astype(int).cumsum()).cumsum()


0     0
1     0
2     1
3     2
4     3
5     0
6     0
7     0
8     0
9     0
10    1
11    2
12    0
13    0
Name: a, dtype: int64

for example,

pd.concat([
        df,
        (
            df.a.isnull().astype(int)
            .groupby(df.a.notnull().astype(int).cumsum())
            .cumsum().to_frame('consec_count')
        )
    ],
    axis=1
)

    a       consec_count
0   1.0     0
1   2.0     0
2   NaN     1
3   NaN     2
4   NaN     3
5   6.0     0
6   7.0     0
7   8.0     0
8   9.0     0
9   10.0    0
10  NaN     1
11  NaN     2
12  13.0    0
13  14.0    0
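This also answers the index question from the comments: rows whose running NaN count exceeds the fill limit can be selected directly (a sketch; the cutoff of 2 is an assumption matching the question's limit):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan, np.nan, np.nan, 6, 7, 8, 9, 10,
                         np.nan, np.nan, 13, 14]})
# Running count of NaNs within each run, as in the answer above.
consec = df.a.isnull().astype(int).groupby(df.a.notnull().astype(int).cumsum()).cumsum()
# Index labels where the running NaN count exceeds 2:
print(consec[consec > 2].index.tolist())  # [4]
```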
Jardena answered 30/10, 2019 at 15:16 Comment(2)
I am working on something similar. How can I forward-fill any gap of length x or less? I have 56,000 rows and 29 columns, so I'm not keen to do it manually. The column with the fewest NaNs still has 10%! Edit: sorry, just saw this: #48800845, however it does not forward-fill, it merely removes them. – Chipman
@AndyThompson - great question. It is probably worth its own question here on SO. – Jardena

If you just want to find the lengths of the consecutive NaNs ...

# usual imports
import pandas as pd
import numpy as np

# fake data
data = pd.Series([np.nan,1,1,1,1,1,np.nan,np.nan,np.nan,1,1,np.nan,np.nan])

# code 
na_groups = data.notna().cumsum()[data.isna()]
lengths_consecutive_na = na_groups.groupby(na_groups).agg(len)
longest_na_gap = lengths_consecutive_na.max()
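As a self-contained check of what the snippet produces on the fake data above (the run lengths were worked out by hand from the series, reusing the answer's own names):

```python
import numpy as np
import pandas as pd

data = pd.Series([np.nan, 1, 1, 1, 1, 1, np.nan, np.nan, np.nan, 1, 1,
                  np.nan, np.nan])
# Each run of NaNs shares one value of the non-NaN cumulative count,
# so grouping by that value yields one group per gap.
na_groups = data.notna().cumsum()[data.isna()]
lengths_consecutive_na = na_groups.groupby(na_groups).agg(len)
print(lengths_consecutive_na.tolist())  # [1, 3, 2]: one length per NaN gap
print(lengths_consecutive_na.max())     # 3: the longest gap
```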
Wellheeled answered 15/6, 2021 at 0:3 Comment(1)
missing_groups? – Genie
