Returning multiple values from pandas apply on a DataFrame

Asked 25/5, 2012 at 8:35 Answered 3/3, 2023 at 23:48

python pandas dataframe function pandas-apply

I'm using a Pandas DataFrame to do a row-wise t-test as per this example:

import numpy as np
import pandas as pd
  
df = pd.DataFrame(np.log2(np.random.randn(1000, 4)), columns=["a", "b", "c", "d"]).dropna()

Now, suppose "a" and "b" form one group and "c" and "d" the other, and I perform the t-test row-wise. This is fairly trivial with pandas, using apply with axis=1. However, I can return either a DataFrame of the same shape if my function doesn't aggregate, or a Series if it does.

Normally I would just output the p-value (so, aggregation) but I would like to generate an additional value based on other calculations (in other words, return two values). I can of course do two runs, aggregating the p-values first, then doing the other work, but I was wondering if there is a more efficient way to do so as the data is reasonably large.

As an example of the calculation, a hypothetical function would be:

from scipy.stats import ttest_ind

def t_test_and_mean(series, first, second):
    first_group = series[first]
    second_group = series[second]
    _, pvalue = ttest_ind(first_group, second_group)

    mean_ratio = second_group.mean() / first_group.mean()
    
    return (pvalue, mean_ratio)

Then invoked with

df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], axis=1)

Of course, in this case it returns a single Series whose values are tuples.

Instead, my expected output would be a DataFrame with two columns, one for the first result and one for the second. Is this possible, or do I have to do two runs for the two calculations and then merge them together?

Inductile answered 25/5, 2012 at 8:35 Comment(2)

Why are you using apply in the first place? Your result is a new DataFrame with a shape different from the input (both rows and columns), therefore it's a completely new obj. You could just have t_test_and_mean accept your input dataframe (and the columns to group by) and return a 1-row-2-columns dataframe, without using apply. – Carrizales 28/5, 2012 at 10:48

@Carrizales Right, I ended up doing that in my code, eventually. – Inductile 28/5, 2012 at 14:21

104

Returning a Series, rather than tuple, should produce a new multi-column DataFrame. For example,

return pandas.Series({'pvalue': pvalue, 'mean_ratio': mean_ratio})
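As a minimal sketch of this, the snippet below uses a hypothetical stand-in function, diff_and_ratio (not from the question; it skips the SciPy t-test so that only NumPy and pandas are needed), and returns a labelled Series from each row. The Series index becomes the column names of the resulting DataFrame.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(5, 4)), columns=["a", "b", "c", "d"])

def diff_and_ratio(row, first, second):
    # Hypothetical stand-in for t_test_and_mean: two scalars per row.
    first_mean = row[first].mean()
    second_mean = row[second].mean()
    # Returning a Series (not a tuple) makes apply build one column per label.
    return pd.Series({"mean_diff": second_mean - first_mean,
                      "mean_ratio": second_mean / first_mean})

result = df.apply(diff_and_ratio, first=["a", "b"], second=["c", "d"], axis=1)
print(result.shape)  # (5, 2)
```

Building the Series from a dict or from a list plus index= gives the same columns; the second form makes the intended column order explicit.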

Retiring answered 25/5, 2012 at 23:48 Comment(6)

I will retry on Monday, but if I recall correctly it tries to coerce to the original column structure (thus ending up with NAs). – Inductile 26/5, 2012 at 8:9

@garrett - How can I make sure that the Series returned from a function retains its "intended" order? My use case: after returning this Series from a function, I save it to a CSV file using df.to_csv. Other than, of course, being dumb and naming them A, B, C, D to retain their natural ordering in the CSV file. – Torpid 21/5, 2014 at 1:37

to specify column order, try constructing the series with lists rather than a dict, e.g.: pandas.Series([pvalue, mean_ratio], index=['pvalue', 'mean_ratio']) – Retiring 21/5, 2014 at 2:34

This works, but I cannot understand why passing a Series successfully returns a DataFrame, but passing a DataFrame back does not... – Nebulosity 26/5, 2015 at 19:31

This appears to only work if every column in the "row" being returned as a Series has the same dtype! A series can only hold 1 dtype in its column. – Disproportion 13/4, 2016 at 21:14

This can be done only if we aggregate with df.groupby("col").apply(fct) and not by df.groupby("col")["col2"].agg(fct), if not we get a Must produce aggregated value error... – Vino 30/6, 2021 at 14:5

"Better" solutions for `apply(axis=1)`

apply has a result_type= parameter that can expand a list-like result into a dataframe. For OP's case, that would look like the following (note that the original function doesn't need to be touched):

df[['pvalue', 'mean_ratio']] = df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], result_type='expand', axis=1)
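To see what result_type='expand' changes, here is a small sketch with a hypothetical tuple-returning function, diff_and_ratio (a stand-in for t_test_and_mean that needs only NumPy and pandas). Without the parameter the tuples stay packed in a Series; with it, each tuple position becomes a column labelled 0, 1, and so on.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(5, 4)), columns=["a", "b", "c", "d"])

def diff_and_ratio(row, first, second):
    # Hypothetical stand-in for t_test_and_mean: returns a plain tuple.
    first_mean = row[first].mean()
    second_mean = row[second].mean()
    return (second_mean - first_mean, second_mean / first_mean)

# Default behaviour: one Series whose values are tuples.
plain = df.apply(diff_and_ratio, first=["a", "b"], second=["c", "d"], axis=1)

# With result_type="expand": a DataFrame with integer column labels.
expanded = df.apply(diff_and_ratio, first=["a", "b"], second=["c", "d"],
                    axis=1, result_type="expand")
raw_columns = list(expanded.columns)
print(raw_columns)  # [0, 1]

# Give the expanded columns meaningful names.
expanded.columns = ["mean_diff", "mean_ratio"]
```

Assigning to df[['pvalue', 'mean_ratio']] as in the line above names the columns in the same step.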

Casting each row into a pandas Series is painfully slow (for a frame with 10k rows, it takes 20 seconds). A faster solution is to convert the values returned from an apply call into a list and cast into a DataFrame once (or assign back to the dataframe). Or use a Python loop for an even faster solution (how that can be written is shown at the end of this post).

For the case in the OP, that would look like the following (again, the original function shouldn't be altered).

df[['pvalue', 'mean_ratio']] = df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], axis=1).values.tolist()

# or create a new frame
new_df = pd.DataFrame(df.apply(t_test_and_mean, first=["a", "b"], second=["c", "d"], axis=1).values.tolist(), index=df.index, columns=['pvalue', 'mean_ratio'])
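The two faster routes mentioned above can be sketched side by side. This is not the answer author's own loop, only one way it might be written; diff_and_ratio is a hypothetical stand-in for t_test_and_mean that needs only NumPy and pandas, and no timings are claimed here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(5, 4)), columns=["a", "b", "c", "d"])
out_cols = ["mean_diff", "mean_ratio"]

def diff_and_ratio(row, first, second):
    # Hypothetical stand-in for t_test_and_mean: returns a plain tuple.
    first_mean = row[first].mean()
    second_mean = row[second].mean()
    return (second_mean - first_mean, second_mean / first_mean)

# Route 1: let apply produce a Series of tuples, then build the frame once.
pairs = df.apply(diff_and_ratio, first=["a", "b"], second=["c", "d"], axis=1).tolist()
via_list = pd.DataFrame(pairs, index=df.index, columns=out_cols)

# Route 2: a plain Python loop over NumPy arrays, skipping apply entirely.
first_vals = df[["a", "b"]].to_numpy()
second_vals = df[["c", "d"]].to_numpy()
rows = []
for f, s in zip(first_vals, second_vals):
    rows.append((s.mean() - f.mean(), s.mean() / f.mean()))
via_loop = pd.DataFrame(rows, index=df.index, columns=out_cols)
```

Both routes construct the DataFrame a single time from a list of tuples, which is the point of the approach: no per-row Series objects are created for the output.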

Expanding `groupby.apply`

The same can be done for functions called via groupby.apply as well. Simply convert the result into a list and cast it into a new dataframe. For example, if we call the function in the OP in a groupby call, the result could be fixed up as follows:

# sample data
df = pd.DataFrame(np.random.randn(1000, 4), columns=["a", "b", "c", "d"])
df['grouper'] = list(range(10))*100

# perform groupby
x = df.groupby('grouper').apply(t_test_and_mean, first="a", second="c")
# convert the groupby result into a list and cast into a dataframe
# in order to not lose any information, assign index and axis name appropriately
agg_df = pd.DataFrame(x.tolist(), index=x.index, columns=['pvalue', 'mean_ratio']).rename_axis(x.index.name)
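A self-contained version of the same pattern, as a sketch: diff_and_ratio is a hypothetical stand-in for t_test_and_mean (no SciPy needed), and the value columns are selected before apply so that the function only ever sees "a" and "c", whichever pandas version handles the grouping column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(12, 4)), columns=["a", "b", "c", "d"])
df["grouper"] = [0, 1, 2] * 4

def diff_and_ratio(group, first, second):
    # Hypothetical stand-in for t_test_and_mean; here `group` is a sub-DataFrame
    # and `first` / `second` are single column names.
    first_mean = group[first].mean()
    second_mean = group[second].mean()
    return (second_mean - first_mean, second_mean / first_mean)

# One tuple per group, collected into a Series indexed by the group key.
x = df.groupby("grouper")[["a", "c"]].apply(diff_and_ratio, first="a", second="c")

# Unpack the tuples into columns, keeping the group keys as the index.
agg_df = pd.DataFrame(x.tolist(), index=x.index, columns=["mean_diff", "mean_ratio"])
print(agg_df.shape)  # (3, 2)
```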

"Better" solutions for `apply(axis=1)`
