t test on Pandas dataframes and make a new matrix of resulting p values

I have 3 dataframes containing 7 columns.

df_a
df_b
df_c
df_a.head()

  VSPD1_perc  VSPD2_perc  VSPD3_perc  VSPD4_perc  VSPD5_perc  VSPD6_perc  \
0          NaN         NaN         NaN         NaN         NaN         NaN   
3     0.189588    0.228052    0.268460    0.304063    0.009837           0   
5     0.134684    0.242556    0.449054    0.168816    0.004890           0   
9     0.174806    0.232150    0.381936    0.211108    0.000000           0   
11         NaN         NaN         NaN         NaN         NaN         NaN   

    VSPD7_perc  
0          NaN  
3            0  
5            0  
9            0  
11         NaN 

My goal is to produce a matrix or a dataframe holding the p values from a t-test, testing dataframes df_b and df_c against df_a column for column. That is, column 1 in df_b and column 1 in df_c are each tested against column 1 in df_a, and so on, with df_a serving as the standard for every test. I have found the statistical test in statsmodels (stat.ttest_ind(x1, x2)), but I need help building a matrix out of the resulting p values. Does anyone know how to do this?

Skindeep answered 19/12, 2013 at 7:13 Comment(0)

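Leaving aside proper NaN management, you can do it as simply as t, p = scipy.stats.ttest_ind(df_a.dropna(axis=0), df_b.dropna(axis=0)). With two-dimensional input, ttest_ind works along axis 0 by default, so it returns one p value per column.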

See demo:

>>> import pandas as pd
>>> import scipy.stats
>>> import numpy as np
>>> df_a = pd.read_clipboard()
>>> df_b = df_a + np.random.randn(5, 7) 
>>> df_c = df_a + np.random.randn(5, 7) 
>>> _, p_b = scipy.stats.ttest_ind(df_a.dropna(axis=0), df_b.dropna(axis=0))
>>> _, p_c = scipy.stats.ttest_ind(df_a.dropna(axis=0), df_c.dropna(axis=0))
>>> pd.DataFrame([p_b, p_c], columns = df_a.columns, index = ['df_b', 'df_c'])
      VSPD1_perc  VSPD2_perc  VSPD3_perc  VSPD4_perc  VSPD5_perc  VSPD6_perc  \
df_b    0.425286    0.987956    0.644236    0.552244    0.432640    0.624528
df_c    0.947182    0.911384    0.189283    0.828780    0.697709    0.166956

      VSPD7_perc
df_b    0.546648
df_c    0.206950
Pogge answered 19/12, 2013 at 7:37 Comment(4)
Thank you, building the new frame works perfectly, though I get different p values when I run the t-test manually on, for example, column 1 from df_a against df_b... hmmm - Skindeep
@Skindeep The reason might be the NaNs; for the head you posted, where NaNs fill the entire row, the results are of course identical. - Pogge
Any reason to use vstack and not just pd.DataFrame([p_b, p_c], ...)? - Gilliam
@AndyHayden None exactly, it is legacy. The DataFrame part was simply added a bit later. - Pogge
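
One possible source of the mismatch discussed in the comments: dropna(axis=0) discards a whole row as soon as any one column is NaN, whereas a manual test on a single column only drops the NaNs in that column. The sketch below tests each column separately after dropping NaNs per column. The helper name columnwise_ttest_pvalues and the synthetic data are assumptions made for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd
from scipy import stats


def columnwise_ttest_pvalues(reference, others):
    """Hypothetical helper: one row of p values per frame in `others`,
    one column per column of `reference`."""
    rows = {}
    for name, frame in others.items():
        pvals = []
        for col in reference.columns:
            # Drop NaNs per column, so a missing value in one column
            # does not discard that row for every other column.
            x = reference[col].dropna()
            y = frame[col].dropna()
            _, p = stats.ttest_ind(x, y)
            pvals.append(p)
        rows[name] = pvals
    return pd.DataFrame.from_dict(
        rows, orient="index", columns=list(reference.columns)
    )


# Synthetic stand-ins for df_a, df_b and df_c (assumed data, not the asker's).
rng = np.random.default_rng(0)
cols = ["VSPD1_perc", "VSPD2_perc", "VSPD3_perc"]
df_a = pd.DataFrame(rng.normal(size=(20, 3)), columns=cols)
df_b = pd.DataFrame(rng.normal(size=(20, 3)), columns=cols)
df_c = pd.DataFrame(rng.normal(size=(20, 3)), columns=cols)
df_a.iloc[0, 0] = np.nan  # NaN in one column only
df_b.iloc[3, 1] = np.nan

p_values = columnwise_ttest_pvalues(df_a, {"df_b": df_b, "df_c": df_c})
print(p_values)
```

Computed this way, each entry should agree with a manual scipy.stats.ttest_ind call on the corresponding pair of columns, because both drop exactly the same values.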