Collapsing rows with NaN entries in pandas dataframe
I have a pandas DataFrame with rows of data::

# objectID        grade  OS     method
object_id_0001    AAA    Mac    organic
object_id_0001    AAA    Mac    NA
object_id_0001    AAA    NA     organic
object_id_0002    NA     NA     NA
object_id_0002    ABC    Win    NA

That is, there are often multiple entries for the same objectID, and many of those entries contain NAs.

I'm looking for a way to combine the rows on objectID and report the non-NA entries, e.g. the above collapses down to::

object_id_0001    AAA    Mac    organic
object_id_0002    ABC    Win    NA
Hydromagnetics answered 19/10, 2018 at 21:0 Comment(1)
Why don't you use dropna() with the subset arg? pandas.pydata.org/pandas-docs/stable/generated/… - Darmstadt
C
8

Quick and Dirty

This works and has for a long time, although some consider the behavior a bug that may eventually be changed. As currently implemented, first returns, for each column within each group, the first non-null element if one exists.

df.groupby('objectID', as_index=False).first()

         objectID grade   OS   method
0  object_id_0001   AAA  Mac  organic
1  object_id_0002   ABC  Win      NaN
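
For anyone who wants to reproduce this, here is a minimal self-contained sketch. It rebuilds the sample frame from the question by hand and assumes the NA cells are real missing values (np.nan), not the string 'NA'; the name collapsed is just for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question; NA cells are assumed to be real missing values
df = pd.DataFrame({
    'objectID': ['object_id_0001'] * 3 + ['object_id_0002'] * 2,
    'grade':    ['AAA', 'AAA', 'AAA', np.nan, 'ABC'],
    'OS':       ['Mac', 'Mac', np.nan, np.nan, 'Win'],
    'method':   ['organic', np.nan, 'organic', np.nan, np.nan],
})

# first() skips nulls column by column within each group
collapsed = df.groupby('objectID', as_index=False).first()
print(collapsed)
```

A column that is null in every row of a group (method for object_id_0002) stays null in the result.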

pd.concat

# DataFrame.lookup was removed in pandas 2.0; .at does the same label-based lookup
pd.concat([
    pd.DataFrame([[d.at[i, c] for c, i in d.notna().idxmax().items()]], columns=d.columns)
    for _, d in df.groupby('objectID')
], ignore_index=True)

         objectID grade   OS   method
0  object_id_0001   AAA  Mac  organic
1  object_id_0002   ABC  Win      NaN

stack

# dropna() keeps this working on pandas versions where stack() no longer drops nulls
df.set_index('objectID').stack().dropna().groupby(level=[0, 1]).head(1).unstack()

               grade   OS   method
objectID                          
object_id_0001   AAA  Mac  organic
object_id_0002   ABC  Win     None

If by chance those are strings ('NA')

df.mask(df.astype(str).eq('NA')).groupby('objectID', as_index=False).first()
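
To see why the masking step matters, here is a small sketch on a hypothetical variant of the question's data in which the missing cells are the literal string 'NA'. The names naive and fixed are only for illustration.

```python
import pandas as pd

# Hypothetical variant of the question's data: missing cells are the string 'NA'
df = pd.DataFrame({
    'objectID': ['object_id_0002', 'object_id_0002'],
    'grade':    ['NA', 'ABC'],
    'OS':       ['NA', 'Win'],
    'method':   ['NA', 'NA'],
})

# Without masking, 'NA' is an ordinary string, so first() keeps it
naive = df.groupby('objectID', as_index=False).first()

# mask() turns every cell equal to 'NA' into a real null before grouping
fixed = df.mask(df.astype(str).eq('NA')).groupby('objectID', as_index=False).first()

print(naive)
print(fixed)
```

If the data comes from a file, it may be simpler to have the reader treat 'NA' as missing at load time, but that depends on how the frame was built.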
Camellia answered 19/10, 2018 at 21:4 Comment(4)
Nice! Quick and dirty indeed ;} (not the downvoter, of course) - Coverdale
@Camellia okay, interesting. I've just checked the pd.concat method on my real-life data file, and it runs, but it doesn't choose numerical data over the NaN. I think this happens when the row with the NaNs comes before the row with the data you want. - Hydromagnetics
Are your 'NA' actually null values? Or are they strings 'NA'? - Camellia
Okay. Pretty sure the pd.concat and stack methods work here. Mega-thanks. - Hydromagnetics
S
3

This will work: bfill + drop_duplicates

df.groupby('objectID',as_index=False).bfill().drop_duplicates('objectID')
Out[939]: 
         objectID grade   OS   method
0  object_id_0001   AAA  Mac  organic
3  object_id_0002   ABC  Win      NaN
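
On newer pandas versions the frame returned by groupby(...).bfill() may not include the grouping column, in which case drop_duplicates('objectID') has nothing to work with. Here is a sketch of the same idea that should not depend on that detail, assuming a unique index and real nulls. It uses the back-filled frame only to fill the gaps in the original; the names filled and result are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question; NA cells are assumed to be real missing values
df = pd.DataFrame({
    'objectID': ['object_id_0001'] * 3 + ['object_id_0002'] * 2,
    'grade':    ['AAA', 'AAA', 'AAA', np.nan, 'ABC'],
    'OS':       ['Mac', 'Mac', np.nan, np.nan, 'Win'],
    'method':   ['organic', np.nan, 'organic', np.nan, np.nan],
})

# Back-fill within each group, then use the result to fill nulls in the original;
# fillna aligns on index and column labels, so objectID is left untouched
filled = df.fillna(df.groupby('objectID').bfill())

# The first row of each group now holds the first non-null value of every column
result = filled.drop_duplicates('objectID')
print(result)
```

The original row labels (0 and 3) are kept; add reset_index(drop=True) if a clean index is preferred.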
Serialize answered 19/10, 2018 at 22:13 Comment(1)
Nice answer (-: - Camellia
C
2

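An alternative, more mechanical way:

import numpy as np

def aggregate(s):
    # unique non-null values of this column within the group
    u = s[s.notnull()].unique()
    if not u.size: return np.nan
    # unwrap a single agreed value to a scalar; otherwise keep all distinct values
    return u[0] if u.size == 1 else u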

df.groupby('objectID').agg(aggregate)

                grade   OS      method
objectID            
object_id_0001  AAA     Mac     organic
object_id_0002  ABC     Win     NaN
Coverdale answered 19/10, 2018 at 21:12 Comment(0)
