Broadcast groupby result weighted sum(a)/sum(b) as new column in original DataFrame
Asked Answered
W

6

5

I am trying to create a new column 'ab_weighted' in a Pandas dataframe based on two columns 'a','b' in that dataframe grouped by 'c'.

Specifically, I am trying to replicate the output from this R code:

library(data.table)

df = data.table(a = 1:6, 
            b = 7:12,
            c = c('q', 'q', 'q', 'q', 'w', 'w')
            )

df[, ab_weighted := sum(a)/sum(b), by = "c"]
df[, c('c', 'a', 'b', 'ab_weighted')]

Output:

enter image description here

So far, I tried the following in Python:

import pandas as pd

df = pd.DataFrame({'a':[1,2,3,4,5,6],
               'b':[7,8,9,10,11,12],
               'c':['q', 'q', 'q', 'q', 'w', 'w']
              })

df.groupby(['c'])['a', 'b'].apply(lambda x: sum(x['a'])/sum(x['b']))

Output:

enter image description here

When I change the apply in the code above to transform I get an error: TypeError: an integer is required

transform() works fine, if I use only a single column though:

import pandas as pd

df = pd.DataFrame({'a':[1,2,3,4,5,6],
               'b':[7,8,9,10,11,12],
               'c':['q', 'q', 'q', 'q', 'w', 'w']
              })

 df.groupby(['c'])['a', 'b'].transform(lambda x: sum(x))

But obviously, this is not the same answer:

enter image description here

Is there a way to get the result from my R data.table code in Pandas without having to generate intermediate columns (i.e. use pandas transform to directly generate the final column (ab_weighted = sum(a)/sum(b))?

Wadsworth answered 12/12, 2018 at 16:13 Comment(1)
One issue is the first brackets are unwanted but the second ones are missing: df.groupby(['c'])['a', 'b']... should read df.groupby('c')[['a', 'b']]...Tarragona
P
5

You're one step away.

v = df.groupby('c')[['a', 'b']].transform('sum')
df['ab_weighted'] = v.a / v.b

df
   a   b  c  ab_weighted
0  1   7  q     0.294118
1  2   8  q     0.294118
2  3   9  q     0.294118
3  4  10  q     0.294118
4  5  11  w     0.478261
5  6  12  w     0.478261
Planking answered 12/12, 2018 at 16:20 Comment(0)
P
7

Just fixing your code using map,R and pandas still have different , which mean not every R function you can find a replacement in pandas

df.c.map(df.groupby(['c'])['a', 'b'].apply(lambda x: sum(x['a'])/sum(x['b'])))
Out[67]: 
0    0.294118
1    0.294118
2    0.294118
3    0.294118
4    0.478261
5    0.478261
Name: c, dtype: float64
Pinfeather answered 12/12, 2018 at 16:21 Comment(0)
P
5

You're one step away.

v = df.groupby('c')[['a', 'b']].transform('sum')
df['ab_weighted'] = v.a / v.b

df
   a   b  c  ab_weighted
0  1   7  q     0.294118
1  2   8  q     0.294118
2  3   9  q     0.294118
3  4  10  q     0.294118
4  5  11  w     0.478261
5  6  12  w     0.478261
Planking answered 12/12, 2018 at 16:20 Comment(0)
C
2

This would also work. I'm not sure why but I get a bug if I let apply return a Series rather than a Dataframe.

df['ab_weighted'] = \
df.groupby('c', group_keys = False)['a', 'b'].apply(
    lambda x: pd.Series(x.a.sum()/x.b.sum(), 
                        index = x.index).to_frame()
).iloc[:,0]
print(df)

# output 
#    a   b  c  ab_weighted
# 0  1   7  q     0.294118
# 1  2   8  q     0.294118
# 2  3   9  q     0.294118
# 3  4  10  q     0.294118
# 4  5  11  w     0.478261
# 5  6  12  w     0.478261
Cavalryman answered 4/4, 2020 at 14:1 Comment(0)
U
0

This works well:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a':[1,2,3,4,5,6],
               'b':[7,8,9,10,11,12],
               'c':['q', 'q', 'q', 'q', 'w', 'w']
              })

def groupby_transform(df: pd.DataFrame, group_by_column: str, lambda_to_apply) -> np.array:
    """
    Groupby and transform. Returns a column for the original dataframe.
    :param df: Dataframe.
    :param group_by_column: Column(s) to group by.
    :param lambda_to_apply: Lambda.
    :return: Column to append to original dataframe.
    """
    df = df.reset_index(drop=True)  # Dataframe index is now strictly in order of the rows in the original dataframe.
    values = df.groupby(group_by_column).apply(lambda_to_apply)
    values.sort_index(level=1, inplace=True)  # Sorts result into order of original rows in dataframe (as groupby will undo that order when it groups).
    result = np.array(values)  # Sort rows into same order as original dataframe.
    if result.shape[0] == 1:  # e.g. if shape is (1,1003), make it (1003,).
        result = result[0]
    return result  # Return column.


df["result"] = groupby_transform(df, "c", lambda x: x["a"].shift(1) + x["b"].shift(1))

Output:

   a   b  c  result
0  1   7  q     NaN
1  2   8  q     8.0
2  3   9  q    10.0
3  4  10  q    12.0
4  5  11  w     NaN
5  6  12  w    16.0

And the same as above as a Pandas extension:

@pd.api.extensions.register_dataframe_accessor("ex")
class GroupbyTransform:
    """
    Groupby and transform. Returns a column for the original dataframe.
    """
    def __init__(self, pandas_obj):
        self._validate(pandas_obj)
        self._obj = pandas_obj

    @staticmethod
    def _validate(obj):
        # TODO: Check that dataframe is sorted, throw if not.
        pass

    def groupby_transform(self, group_by_column: str, lambda_to_apply):
        """
        Groupby and transform. Returns a column for the original dataframe.
        :param df: Dataframe.
        :param group_by_column: Column(s) to group by.
        :param lambda_to_apply: Lambda.
        :return: Column to append to original dataframe.
        """
        df = self._obj.reset_index(drop=True)  # Dataframe index is now strictly in order of the rows in the original dataframe.
        values = df.groupby(group_by_column).apply(lambda_to_apply)
        values.sort_index(level=1, inplace=True)  # Sorts result into order of original rows in dataframe (as groupby will undo that order when it groups).
        result = np.array(values)
        if result.shape[0] == 1:  # e.g. if shape is (1,1003), make it (1003,).
            result = result[0]
        return result

And this gives the same output as before:

df["result"] = df.ex.groupby_transform("c", lambda x: x["a"].shift(1) + x["b"].shift(1))
Ukrainian answered 28/3, 2021 at 20:3 Comment(1)
This is serious overkill to the OP's question, and pretty confusing. The pandas maintainers know for a long time now that people want better functionality on groupby(); it just seems very hard to implement that enhance.Tarragona
T
0

Here's the solution; see answers below on the pandas quirks why you couldn't get an exact equivalent to R's data.table:

df['ab_weighted'] = df.groupby('c').transform(pd.Series.sum, axis=0).apply(lambda x: x.a/x.b, axis=1)

   a   b  c  ab_weighted
0  1   7  q    0.294118
1  2   8  q    0.294118
2  3   9  q    0.294118
3  4  10  q    0.294118
4  5  11  w    0.478261
5  6  12  w    0.478261

Answers:

  • unlike R's data.table expressions, pd.transform() only works on an individual column at-a-time. So it could individually sum the 'a' column, then the 'b' column, but it couldn't see both the sum of 'a' and sum of 'b', which involves reading across a row.
  • why did we not do df.groupby('c')[['a','b']].sum() like you were trying? Because then .sum() collapses(/aggregates) the entire group into one row, and discards the indices; not what we want if we want to assign back to the source dataframe. Hence we do the .transform(pd.Series.sum, axis=0).
  • then we do .apply(..., axis=1) or .agg() to compute the ratio a/b
  • finally we can assign back to new column df['ab_weighted'] since we kept our original indices
  • we could instead have used .assign(): df.groupby('c').transform(pd.Series.sum, axis=0).assign(ab_weighted = lambda x: x.a/x.b), but that's annoying since 'c' got dropped (pandas currently has an ongoing issue with groupby(..., as_index=False) not working).
  • another pandas trick with complex expressions used to be to df.merge(df_abw, on='c') the intermediate result, say df_abw = df.groupby('c').apply(pd.Series.sum, axis=0).apply(lambda x: x.a/x.b, axis=1).rename('ab_weighted')
  • and here's a short one-liner solution that doesn't preserve indices:

_

df.groupby('c')[['a','b']].sum().assign(ab_weighted = lambda x: x.a/x.b)

    a   b  ab_weighted
c                     
q  10  34     0.294118
w  11  23     0.478261
Tarragona answered 18/11, 2023 at 2:14 Comment(0)
U
-1

Update 2021-03-28: I do not recommend this answer; I would recommend my other as it is much cleaner and more efficient.

Try the answer from @BENY. If it doesn't work, it is probably due to different indexes.

The solution below is ugly, and more complicated, but it should give enough clues to get this working with any dataframe, not just the toy ones. This is one area of pandas where the API is undeniably awkward and error prone, and sometimes there is simply no clean way of getting any valid results without a lot of jumping through hoops.

The trick is to ensure that the common indexes are available and have the same name.

df = pd.DataFrame({'a':[1,2,3,4,5,6],
               'b':[7,8,9,10,11,12],
               'c':['q', 'q', 'q', 'q', 'w', 'w']
              })

df.reset_index(drop=True, inplace=True)

values = df.groupby(['c']).apply(lambda x: sum(x['a'])/sum(x['b']))
# Convert result to dataframe.
df_to_join = values.to_frame()

# Ensure indexes have common names.
df_to_join.index.set_names(["index"], inplace=True)
df.set_index("c", inplace=True)
df.index.set_names(["index"], inplace=True)

# Set column name of result we want.
df_to_join.rename(columns={0: "ab_weighted"}, inplace=True, errors='raise')

# Join result of groupby to original dataframe.
df_result = df.merge(df_to_join, on=["index"])
print(df_result)

# output 
       a   b  ab_weighted
index                    
q      1   7     0.294118
q      2   8     0.294118
q      3   9     0.294118
q      4  10     0.294118
w      5  11     0.478261
w      6  12     0.478261

And to convert index back into a column c:

df_result.reset_index(inplace=True)
df_result.rename(columns={"index": "c"}, inplace=True)
Ukrainian answered 28/3, 2021 at 13:3 Comment(1)
This is a fairly confusing answer that gives a misleadingly bad impression of pandas; people have posted working two-line solutions that work on pandas >=1.6.Tarragona

© 2022 - 2024 — McMap. All rights reserved.