Computing MAD(mean absolute deviation) GroupBy Pandas
Asked Answered
H

2

9

I have a dataframe:

Type Name Cost
  A   X    545
  B   Y    789
  C   Z    477
  D   X    640
  C   X    435
  B   Z    335
  A   X    850
  B   Y    152

I have all such combinations in my dataframe with Type ['A','B','C','D'] and Names ['X','Y','Z'] . I used the groupby method to get stats on a specific combination together like A-X , A-Y , A-Z .Here's some code:

df = pd.DataFrame({'Type':['A','B','C','D','C','B','A','B'] ,'Name':['X','Y','Z','X','X','Z','X','Y'], 'Cost':[545,789,477,640,435,335,850,152]})
df.groupby(['Name','Type']).agg([mean,std])  
#need to use mad instead of std  

I need to eliminate the observations that are more than 3 MADs away ; something like:

test = df[np.abs(df.Cost-df.Cost.mean())<=(3*df.Cost.mad())]

I am confused with this as df.Cost.mad() returns the MAD for the Cost on the entire data rather than a specific Type-Name category. How could I combine both?

Holding answered 24/4, 2015 at 11:44 Comment(0)
H
4

You can use groupby and transform to create new data series that can be used to filter out your data.

groups = df.groupby(['Name','Type'])
mad = groups['Cost'].transform(lambda x: x.mad())
dif = groups['Cost'].transform(lambda x: np.abs(x - x.mean()))
df2 = df[dif <= 3*mad]

However, in this case, no row is filtered out since the difference is equal to the mean absolute deviation (the groups have only two rows at most).

Harelip answered 24/4, 2015 at 12:11 Comment(1)
thanks , yours and unutbu's answer yield the same result , though yours is more compact.. There is a typo on the last line, should end with a square bracket.. Thanks!!Holding
E
0

You can get your aggregate function on the grouped object:

df["mad"] = df.groupby(['Name','Type'])["Cost"].transform("mad")
df = df.loc[df.mad<3]
Eliseelisee answered 1/1, 2019 at 10:26 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.