I built a Pandas dataframe (example below) indexed by gene name, with sample names as columns and integers as cell values. What I want to do is run an ANOVA (f_oneway(), from scipy.stats) on lists of row values, where each list is defined by the columns belonging to one group of samples. The results would then be stored in a new Pandas dataframe with group names as columns and the same genes as the index.
An example of the dataframe (it's returned from another function in my ):
    import pandas as pd

    counts = {'sample1': [0, 1, 5, 0, 10],
              'sample2': [2, 0, 10, 0, 0],
              'sample3': [0, 0, 0, 1, 0],
              'sample4': [10, 0, 1, 4, 0]}
    data = pd.DataFrame(counts, columns=['sample1', 'sample2', 'sample3', 'sample4'],
                        index=['gene1', 'gene2', 'gene3', 'gene4', 'gene5'])
Groups are imported as arguments by main(), so in this function I have:
    def compare(out_prefix, pops, data):
        import scipy.stats as stats
        sig = pd.DataFrame(index=data.index)
        # each entry of pops is a file with one sample name per line;
        # read together, the groups will look like:
        # groups = [['sample1', 'sample2'], ['sample3', 'sample4']]
        for pop in pops:
            with open(pop) as infile:
                group_s = []
                for spl in infile:
                    group_s.append(spl.replace("\n", ""))
            mean_col = pop.split(".")[0] + "_mean"
            std_col = pop.split(".")[0] + "_std"
            stat_col = pop.split(".")[0] + "_stat"
            p_col = pop.split(".")[0] + "_sig"
            sig[mean_col] = data[group_s].mean(axis=1)
            sig[std_col] = data[group_s].std(axis=1)
            sig[[stat_col, p_col]] = data.apply(lambda row : stats.f_oneway(data.loc[group_s].values.tolist()))
This last line doesn't work and I can't see how it could be done from some googling in the last few days - could someone point me in the right direction?
Ideally, it would write the results of the ANOVA test (F statistic, p-value) per row, for the samples in each group, to columns stat_col and p_col in sig. For gene1 it would feed stats.f_oneway a list of lists of the values for the samples in each group, e.g. [[0, 2], [0, 10]].
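For concreteness, the per-row call described above could be sketched roughly as follows. This is only a sketch under some assumptions: the groups are already in memory as a dict of column lists (the names popA and popB, the helper anova_per_gene and the output column names stat and sig are all made up for illustration), and the whole thing runs as a plain row-wise apply. Note that f_oneway takes each group as a separate positional argument and returns a single F statistic and p-value across all groups, so this gives one pair of values per gene rather than one pair per group.

```python
import pandas as pd
from scipy import stats

counts = {'sample1': [0, 1, 5, 0, 10],
          'sample2': [2, 0, 10, 0, 0],
          'sample3': [0, 0, 0, 1, 0],
          'sample4': [10, 0, 1, 4, 0]}
data = pd.DataFrame(counts,
                    index=['gene1', 'gene2', 'gene3', 'gene4', 'gene5'])

# Hypothetical in-memory groups (in the real function these come from files).
groups = {'popA': ['sample1', 'sample2'],
          'popB': ['sample3', 'sample4']}


def anova_per_gene(data, groups):
    """Run a one-way ANOVA on each row, with one list of values per group."""
    def row_anova(row):
        # For gene1 this builds [[0, 2], [0, 10]].
        samples = [row[cols].tolist() for cols in groups.values()]
        # f_oneway expects the groups unpacked as separate arguments.
        f_stat, p_value = stats.f_oneway(*samples)
        return pd.Series({'stat': f_stat, 'sig': p_value})

    # axis=1 passes each row (gene) to row_anova instead of each column.
    return data.apply(row_anova, axis=1)


result = anova_per_gene(data, groups)
print(result)
```

The returned frame keeps the gene index of data, so it can be joined onto sig or assigned column by column.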
Thanks in advance!