Returning multiple values from pandas apply on a DataFrame

I'm using a Pandas DataFrame to do a row-wise t-test as per this example:

import numpy as np
import pandas as pd
  
df = pd.DataFrame(np.log2(np.random.randn(1000, 4)), columns=["a", "b", "c", "d"]).dropna()

Now, suppose I have "a" and "b" as one group and "c" and "d" as the other, and I'm performing the t-test row-wise. This is fairly trivial with pandas, using apply with axis=1. However, I can either return a DataFrame of the same shape if my function doesn't aggregate, or a Series if it aggregates.

Normally I would just output the p-value (so, aggregation) but I would like to generate an additional value based on other calculations (in other words, return two values). I can of course do two runs, aggregating the p-values first, then doing the other work, but I was wondering if there is a more efficient way to do so as the data is reasonably large.

As an example of the calculation, a hypothetical function would be:

from scipy.stats import ttest_ind

def t_test_and_mean(series, first, second):
    first_group = series[first]
    second_group = series[second]
    _, pvalue = ttest_ind(first_group, second_group)

    mean_ratio = second_group.mean() / first_group.mean()
    
    return (pvalue, mean_ratio)

Then invoked with

df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], axis=1)

Of course, in this case it returns a single Series whose values are (pvalue, mean_ratio) tuples.

Instead, my expected output would be a DataFrame with two columns, one for the first result and one for the second. Is this possible, or do I have to do two runs for the two calculations and then merge them together?

Inductile answered 25/5, 2012 at 8:35 Comment(2)
Why are you using apply in the first place? Your result is a new DataFrame with a shape different from the input (both rows and columns), therefore it's a completely new obj. You could just have t_test_and_mean accept your input dataframe (and the columns to group by) and return a 1-row-2-columns dataframe, without using apply.Carrizales
@Carrizales Right, I ended up doing that in my code, eventually.Inductile
R
104

Returning a Series, rather than a tuple, should produce a new multi-column DataFrame. For example,

return pandas.Series({'pvalue': pvalue, 'mean_ratio': mean_ratio})
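For completeness, here is a minimal sketch of the whole function with that return statement in place. The sample data (20 rows of positive uniform values) is an assumption made only so the example runs without NaNs; the Series is built from a list plus an explicit index so the column order is fixed.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def t_test_and_mean(series, first, second):
    first_group = series[first]
    second_group = series[second]
    _, pvalue = ttest_ind(first_group, second_group)
    mean_ratio = second_group.mean() / first_group.mean()
    # Returning a labelled Series makes apply build one column per label
    return pd.Series([pvalue, mean_ratio], index=["pvalue", "mean_ratio"])

# Assumed sample data: positive values, so the mean ratio is well defined
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(1, 10, size=(20, 4)), columns=["a", "b", "c", "d"])

result = df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], axis=1)
print(result.head())
```

The result should be a DataFrame that shares the index of df, so it can be joined back with df.join(result) if needed.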
Retiring answered 25/5, 2012 at 23:48 Comment(6)
I will retry on Monday, but if I recall correctly it tries to coerce to the original column structure (thus ending up with NAs).Inductile
@garrett - How can I make sure that the Series returned from a function retains its "intended" order? My use case: after returning this Series from a function, I save it to a CSV file using df.to_csv. Other than, of course, the dumb approach of naming the fields A, B, C, D to keep their natural ordering in the CSV file.Torpid
to specify column order, try constructing the series with lists rather than a dict, e.g.: pandas.Series([pvalue, mean_ratio], index=['pvalue', 'mean_ratio'])Retiring
This works, but I cannot understand why passing a Series successfully returns a DataFrame, but passing a DataFrame back does not...Nebulosity
This appears to only work if every column in the "row" being returned as a Series has the same dtype! A series can only hold 1 dtype in its column.Disproportion
This can be done only if we aggregate with df.groupby("col").apply(fct) and not with df.groupby("col")["col2"].agg(fct); otherwise we get a "Must produce aggregated value" error...Vino
C
1

"Better" solutions for apply(axis=1)

apply has a result_type= parameter that can expand a result into a dataframe. For OP's case, that would look like the following (note that the original function doesn't need to be touched):

df[['pvalue', 'mean_ratio']] = df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], result_type='expand', axis=1)
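One detail worth knowing: when the function returns a plain tuple, the expanded frame gets integer column labels, which is why the line above assigns into named columns. A small sketch, using a hypothetical stand-in function (mean_diff_and_ratio is not from the question; it just returns two numbers per row without needing scipy):

```python
import numpy as np
import pandas as pd

def mean_diff_and_ratio(row, first, second):
    # Hypothetical stand-in for t_test_and_mean: two values per row
    first_mean = row[first].mean()
    second_mean = row[second].mean()
    return (second_mean - first_mean, second_mean / first_mean)

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(1, 10, size=(5, 4)), columns=["a", "b", "c", "d"])

expanded = df.apply(
    mean_diff_and_ratio,
    first=["a", "b"],
    second=["c", "d"],
    axis=1,
    result_type="expand",
)
print(expanded.columns.tolist())  # integer labels 0 and 1

# Give the columns meaningful names afterwards
named = expanded.rename(columns={0: "mean_diff", 1: "mean_ratio"})
print(named)
```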

Casting each row into a pandas Series is painfully slow (for a frame with 10k rows, it takes 20 seconds). A faster solution is to convert the values returned from an apply call into a list and cast into a DataFrame once (or assign back to the dataframe). Or use a Python loop for an even faster solution (how that can be written is shown at the end of this post).

For the case in the OP, that would look like the following (again, the original function doesn't need to be altered).

df[['pvalue', 'mean_ratio']] = df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], axis=1).values.tolist()

# or create a new frame
new_df = pd.DataFrame(df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], axis=1).values.tolist(), index=df.index, columns=['pvalue', 'mean_ratio'])
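Timings depend heavily on the machine and the pandas version, so it is probably worth measuring on your own data. Below is a rough timing sketch, not a rigorous benchmark. The stand-in functions (stats_tuple, stats_series), the timed helper, and the 2000-row frame are all assumptions made for illustration.

```python
import time
import numpy as np
import pandas as pd

COLS = ["mean_diff", "mean_ratio"]

def stats_tuple(row, first, second):
    # Hypothetical stand-in for the real per-row calculation
    first_mean = row[first].mean()
    second_mean = row[second].mean()
    return (second_mean - first_mean, second_mean / first_mean)

def stats_series(row, first, second):
    # Same values, but wrapped in a Series for every row
    return pd.Series(stats_tuple(row, first, second), index=COLS)

def timed(label, fn):
    # Run fn once and print the elapsed wall-clock time
    start = time.perf_counter()
    out = fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return out

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(1, 10, size=(2000, 4)), columns=["a", "b", "c", "d"])
kwargs = dict(first=["a", "b"], second=["c", "d"], axis=1)

via_series = timed("return a Series", lambda: df.apply(stats_series, **kwargs))
via_expand = timed("result_type='expand'",
                   lambda: df.apply(stats_tuple, result_type="expand", **kwargs))
via_list = timed("tolist + DataFrame",
                 lambda: pd.DataFrame(df.apply(stats_tuple, **kwargs).tolist(),
                                      index=df.index, columns=COLS))
```

All three variants should hold the same numbers; only the cost of building the result differs.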

Expanding groupby.apply

The same can be done for functions called via groupby.apply as well. Simply convert the result into a list and cast it into a new dataframe. For example, if we call the function in the OP in a groupby call, the result could be fixed up as follows:

# sample data
df = pd.DataFrame(np.random.randn(1000, 4), columns=["a", "b", "c", "d"])
df['grouper'] = list(range(10))*100

# perform groupby
x = df.groupby('grouper').apply(t_test_and_mean, first="a", second="c")
# convert the groupby result into a list and cast into a dataframe
# in order to not lose any information, assign index and axis name appropriately
agg_df = pd.DataFrame(x.tolist(), index=x.index, columns=['pvalue', 'mean_ratio']).rename_axis(x.index.name)
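A self-contained variant of the same pattern that runs without scipy, using a hypothetical stand-in function (mean_diff_and_ratio is an assumption, not the OP's function). Selecting the value columns before apply is also an assumption on my part; it keeps the grouping column out of the frame passed to the function.

```python
import numpy as np
import pandas as pd

def mean_diff_and_ratio(group, first, second):
    # Hypothetical stand-in: two aggregate values per group
    first_mean = group[first].mean()
    second_mean = group[second].mean()
    return (second_mean - first_mean, second_mean / first_mean)

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(1, 10, size=(100, 4)), columns=["a", "b", "c", "d"])
df["grouper"] = list(range(10)) * 10

# One tuple per group, indexed by the group key
x = df.groupby("grouper")[["a", "c"]].apply(mean_diff_and_ratio, first="a", second="c")

# Split the tuples into columns; reusing x.index keeps the group labels
agg_df = pd.DataFrame(x.tolist(), index=x.index, columns=["mean_diff", "mean_ratio"])
print(agg_df)
```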


Other solutions for apply(axis=1)

Another solution (slower than converting to list) is to chain an .apply(pd.Series) call:

df[['pvalue', 'mean_ratio']] = df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], axis=1).apply(pd.Series)

Since .apply(axis=1) is syntactic sugar for a Python loop, the biggest speed up would be to convert the frame into a list, re-write the function into one that works with Python lists and just use a list comprehension (this would speed up the process about 6 times). For the example in the OP, that would look like:

def t_test_and_mean_on_lists(first_group, second_group):
    _, pvalue = ttest_ind(first_group, second_group)
    mean_ratio = np.mean(second_group) / np.mean(first_group)
    return (pvalue, mean_ratio)

df[['pvalue', 'mean_ratio']] = [t_test_and_mean_on_lists(ab, cd) for ab, cd in zip(df[['a','b']].values.tolist(), df[['c','d']].values.tolist())]
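A different technique from the list comprehension above: for this particular calculation the Python loop can be dropped altogether, because scipy's ttest_ind accepts an axis argument and numpy can take the row means directly. This is only a sketch under the assumption that each row is compared the same way (a, b against c, d); the sample data is made up.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Assumed sample data: positive values, so the mean ratio is well defined
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(1, 10, size=(50, 4)), columns=["a", "b", "c", "d"])

first = df[["a", "b"]].to_numpy()
second = df[["c", "d"]].to_numpy()

# axis=1 runs one t-test per row and returns an array of p-values
_, pvalues = ttest_ind(first, second, axis=1)
mean_ratios = second.mean(axis=1) / first.mean(axis=1)

result = pd.DataFrame({"pvalue": pvalues, "mean_ratio": mean_ratios}, index=df.index)
print(result.head())
```

This only helps when the per-row function is made of operations that already vectorize; for arbitrary Python logic the list comprehension remains the general option.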

A version compiled with numba.njit could be faster still, but scipy's ttest_ind cannot be compiled in nopython mode, so the t-test would have to be reimplemented first; that's outside the scope of this question.

Checkerwork answered 3/3, 2023 at 23:48 Comment(0)
