I am stuck on how to apply a custom function that calculates the p-value between two groups obtained from a pandas groupby.
Vocabulary
test = 0 ==> test
test = 1 ==> control
Problem setup
import numpy as np
import pandas as pd
import scipy.stats as ss
np.random.seed(100)
N = 15
df = pd.DataFrame({'country': np.random.choice(['A','B','C'], N),
                   'test': np.random.choice([0,1], N),
                   'conversion': np.random.choice([0,1], N),
                   'sex': np.random.choice(['M','F'], N)
                   })
ans = df.groupby(['country','test'])['conversion'].agg(['size','mean']).unstack('test')
ans.columns = ['test_size','control_size','test_mean','control_mean']
         test_size  control_size  test_mean  control_mean
country
A                3             3   0.666667      0.666667
B                1             1   1.000000      1.000000
C                4             3   0.750000      1.000000
Question
Now I want to add a pvalue column holding the p-value between the test and control groups for each country. But in my groupby aggregation each function receives only one series at a time, and I am not sure how to pass both groups' series together to get the p-value.
Done so far:
def get_ttest(x, y):
    # scipy.stats is imported as ss above; Welch's t-test (unequal variances)
    return ss.ttest_ind(x, y, equal_var=False).pvalue
pseudo code:
df.groupby(['country','test'])['conversion'].agg(
['size','mean', some_function_to_get_pvalue])
How can I get the p-value column?
Required Answer
I need to get the values for the pvalue column:
         test_size  control_size  test_mean  control_mean  pvalue
country
A                3             3   0.666667      0.666667       ?
B                1             1   1.000000      1.000000       ?
C                4             3   0.750000      1.000000       ?
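One direction that might work, as a minimal sketch rather than a definitive answer: group by country only, and inside each group split conversion by test so that both samples are available to the t-test at once. The helper name country_pvalue is hypothetical, and returning NaN when either group has fewer than two observations is an assumption made here for illustration (a t-test is not meaningful on a single observation).

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

np.random.seed(100)
N = 15
df = pd.DataFrame({
    'country': np.random.choice(['A', 'B', 'C'], N),
    'test': np.random.choice([0, 1], N),
    'conversion': np.random.choice([0, 1], N),
    'sex': np.random.choice(['M', 'F'], N),
})

def country_pvalue(g):
    """Hypothetical helper: Welch t-test p-value for one country's rows."""
    x = g.loc[g['test'] == 0, 'conversion']  # test group (test == 0)
    y = g.loc[g['test'] == 1, 'conversion']  # control group (test == 1)
    # Assumption: report NaN when a group is too small to test.
    if len(x) < 2 or len(y) < 2:
        return np.nan
    return ss.ttest_ind(x, y, equal_var=False).pvalue

# Same summary table as in the problem setup.
ans = (df.groupby(['country', 'test'])['conversion']
         .agg(['size', 'mean'])
         .unstack('test'))
ans.columns = ['test_size', 'control_size', 'test_mean', 'control_mean']

# One p-value per country; assignment aligns on the country index.
ans['pvalue'] = (df.groupby('country')[['test', 'conversion']]
                   .apply(country_pvalue))
print(ans)
```

Grouping by country alone is the key design choice here: agg hands each function a single column of a single group, whereas apply receives the whole sub-frame, so both the test and control series can be extracted inside the function.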