Collapsing rows with NaN entries in pandas dataframe

Asked 19/10, 2018 at 21:0 Answered 19/10, 2018 at 22:13

I have a pandas DataFrame with rows of data::

# objectID        grade  OS     method
object_id_0001    AAA    Mac    organic
object_id_0001    AAA    Mac    NA
object_id_0001    AAA    NA     organic
object_id_0002    NA     NA     NA
object_id_0002    ABC    Win    NA

i.e. there are often multiple entries for the same objectID, and those entries frequently contain NAs.

As such, I'm looking for a way to combine rows on objectID and report the non-NA entries, e.g. the above collapses down to::

object_id_0001    AAA    Mac    organic
object_id_0002    ABC    Win    NA

Hydromagnetics answered 19/10, 2018 at 21:0 Comment(1)

Why don't you use dropna() with the subset arg? pandas.pydata.org/pandas-docs/stable/generated/… – Darmstadt 19/10, 2018 at 21:4

Quick and Dirty

This works and has worked for a long time. However, some consider this behavior a bug that may eventually be fixed. As currently implemented, `first` returns, for each column, the first non-null element in the group, if one exists.

df.groupby('objectID', as_index=False).first()

         objectID grade   OS   method
0  object_id_0001   AAA  Mac  organic
1  object_id_0002   ABC  Win      NaN
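To try this without the original file, the sample from the question can be rebuilt by hand. This is only a sketch: the frame construction is mine, and it assumes the NA entries are real missing values (`np.nan`) rather than the string 'NA'.

```python
import numpy as np
import pandas as pd

# Sample data from the question; NA is assumed to be a true missing value.
df = pd.DataFrame({
    'objectID': ['object_id_0001'] * 3 + ['object_id_0002'] * 2,
    'grade': ['AAA', 'AAA', 'AAA', np.nan, 'ABC'],
    'OS': ['Mac', 'Mac', np.nan, np.nan, 'Win'],
    'method': ['organic', np.nan, 'organic', np.nan, np.nan],
})

# GroupBy.first() skips nulls column by column, so each group collapses
# to its first non-null value per column (NaN if the column is all null).
collapsed = df.groupby('objectID', as_index=False).first()
print(collapsed)
```

Note that object_id_0002 has its all-NA row first, so the order of rows within a group does not matter for this approach.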

`pd.concat`

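# DataFrame.lookup was removed in pandas 2.0, so this version picks the first
# non-null value of each column by position with NumPy (needs `import numpy as np`).
pd.concat([
    pd.DataFrame(
        [d.to_numpy()[d.notna().to_numpy().argmax(axis=0), np.arange(d.shape[1])]],
        columns=d.columns)
    for _, d in df.groupby('objectID')
], ignore_index=True)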

         objectID grade   OS   method
0  object_id_0001   AAA  Mac  organic
1  object_id_0002   ABC  Win      NaN

`stack`

df.set_index('objectID').stack().groupby(level=[0, 1]).head(1).unstack()

               grade   OS   method
objectID                          
object_id_0001   AAA  Mac  organic
object_id_0002   ABC  Win     None

If by chance those are strings ('NA')

df.mask(df.astype(str).eq('NA')).groupby('objectID', as_index=False).first()
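A small check of the string case, as a sketch. The frame `df_str` below is a hypothetical variant of the sample in which missing entries were read in as the literal text 'NA'; `mask` turns those cells into real nulls so that `first` can skip them.

```python
import pandas as pd

# Hypothetical variant of the sample: missing entries are the string 'NA'.
df_str = pd.DataFrame({
    'objectID': ['object_id_0001', 'object_id_0001',
                 'object_id_0002', 'object_id_0002'],
    'grade': ['AAA', 'NA', 'NA', 'ABC'],
    'OS': ['NA', 'Mac', 'NA', 'Win'],
})

# Without masking, 'NA' is an ordinary string and first() keeps it.
naive = df_str.groupby('objectID', as_index=False).first()

# mask() replaces every cell where the condition is True with NaN.
fixed = (
    df_str.mask(df_str.astype(str).eq('NA'))
          .groupby('objectID', as_index=False)
          .first()
)
print(naive)
print(fixed)
```

If the data come from a file, it may be simpler to make sure the reader parses 'NA' as missing in the first place, which would make the mask unnecessary.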

Camellia answered 19/10, 2018 at 21:4 Comment(4)

Nice! quick and dirty indeed ;} (not the down voter, of course) – Coverdale 19/10, 2018 at 21:10

@Camellia okay, interesting. I've just checked the pd.concat method on my real-life datafile, and it runs, but it doesn't choose numerical data over the NaN. I think this happens when the row with the NaNs comes before the row with the data you want. – Hydromagnetics 19/10, 2018 at 21:19

Are your 'NA' actually null values? Or are they strings 'NA'? – Camellia 19/10, 2018 at 21:21

Okay. Pretty sure the pd.concat and stack methods work here. Mega-thanks. – Hydromagnetics 19/10, 2018 at 21:25

This will work: bfill + drop_duplicates

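# Recent pandas versions leave the grouping column out of the bfill result,
# so group on the index and restore objectID afterwards.
df.set_index('objectID').groupby(level=0).bfill().reset_index().drop_duplicates('objectID')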
Out[939]: 
         objectID grade   OS   method
0  object_id_0001   AAA  Mac  organic
3  object_id_0002   ABC  Win      NaN
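Why this works, as a rough sketch on a made-up four-row frame (the data and the names `filled` and `result` are mine): backfilling within each group pulls the next non-null value up into every gap, so the first row of each group ends up holding the first non-null value of each column. Grouping on the index instead of the column is my choice here, so that objectID is still available afterwards regardless of pandas version.

```python
import numpy as np
import pandas as pd

# Made-up data: the wanted values sit below the NaNs within each group.
df = pd.DataFrame({
    'objectID': ['a', 'a', 'b', 'b'],
    'grade': [np.nan, 'AAA', np.nan, np.nan],
    'OS': ['Mac', np.nan, np.nan, 'Win'],
})

# Backfill within each group: a null takes the next non-null value
# below it in the same group, never a value from another group.
filled = df.set_index('objectID').groupby(level=0).bfill().reset_index()

# The first row per objectID now carries the collapsed values.
result = filled.drop_duplicates('objectID')
print(result)
```

Group 'b' has no grade at all, so its grade stays NaN, matching the behavior of the `first` approach.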

Serialize answered 19/10, 2018 at 22:13 Comment(1)

Nice answer (-: – Camellia 19/10, 2018 at 22:13

One alternative, more mechanical way

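import numpy as np

def aggregate(s):
    # unique non-null values of this column within the group
    u = s[s.notnull()].unique()
    if not u.size:
        return np.nan
    # one value: return it as a scalar; several distinct values: keep them all
    return u[0] if u.size == 1 else u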

df.groupby('objectID').agg(aggregate)

               grade   OS   method
objectID                          
object_id_0001   AAA  Mac  organic
object_id_0002   ABC  Win      NaN

Coverdale answered 19/10, 2018 at 21:12 Comment(0)
