How to group dataframe rows into list in pandas groupby

Given a dataframe, I want to group by the first column and get the second column as lists within each row, so that a dataframe like:

a b
A 1
A 2
B 5
B 5
B 4
C 6

becomes

A [1,2]
B [5,5,4]
C [6]

How do I do this?

Seppala answered 6/3, 2014 at 8:31 Comment(1)
Imagine a scenario where I want to add another A record if the aggregate of A's element list exceeds 10. How would I accomplish this?Udo

You can do this using groupby to group on the column of interest and then apply list to every group:

In [1]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
        df

Out[1]: 
   a  b
0  A  1
1  A  2
2  B  5
3  B  5
4  B  4
5  C  6

In [2]: df.groupby('a')['b'].apply(list)
Out[2]: 
a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

In [3]: df1 = df.groupby('a')['b'].apply(list).reset_index(name='new')
        df1
Out[3]: 
   a        new
0  A     [1, 2]
1  B  [5, 5, 4]
2  C        [6]
Bushing answered 6/3, 2014 at 10:28 Comment(17)
This takes a lot of time if the dataset is huge, say 10 million rows. Is there any faster way to do this? The number of unique values in 'a' is, however, around 500k.Seppala
groupby is notoriously slow and memory hungry. What you could do is sort by column A, then find the idxmin and idxmax (probably storing these in a dict) and use them to slice your dataframe; that would be faster, I think.Bushing
When I tried this solution with my problem (having multiple columns to group by and to group), it didn't work - pandas raised 'Function does not reduce'. Then I used tuple, following the second answer here: #19531068 . See the second answer in #27439523 for an explanation.Chokedamp
This solution is good, but is there a way to store a set instead of a list, meaning can I remove the duplicates and then store it?Moshemoshell
You mean df.groupby('a')['b'].apply(lambda x: list(set(x)))?Bushing
You don't need to use groupby. Just take the set of values in column 'a', subset the dataframe for 'A', 'B', etc., then fetch column 'b' in each subset and put those values in a list.Gnotobiotics
But how would the code be written if you had another column c which also had numbers that had to be put in a list?Luminal
@PoeteMaudit Sorry, I don't understand what you're asking, and asking questions in comments is bad form on SO. Are you asking how to concatenate multiple columns into a single list?Bushing
No worries, I was asking for this basically: https://mcmap.net/q/73145/-how-to-group-dataframe-rows-into-list-in-pandas-groupbyLuminal
In pandas 0.23.x, apply did not work for me; I needed to use the agg function.Reclinate
@EdwardAung this still works for me using pandas version '0.24.2', you'd have to post an example where this failsBushing
If you make it df.groupby('a')['b'].apply(list).apply(pd.Series) you get columns with one entry each instead of one column with lists, which can be very useful.Grackle
Empirically, I found .apply(np.array) to be slightly faster on my 25K dataset.Jordanjordana
I understand how groupby and apply work in this example. You get a SeriesGroupBy object. But can someone explain why calling reset_index on this magically creates a dataframe?Sweetmeat
Is there a way to do it for multiple columns at a time? df1 = df.groupby('a')['b','c'].apply(list).reset_index(name='new')Phenol
How can I select a subset of the elements of rows? For example, I have some NaN values in cells and I don't want .apply(list) to add all elements including NaN. How can I do this?Hogwash
Is there a way to do the opposite? Where you have the dataframe grouped by and you want to ungroup?Hellenist

A handy way to achieve this would be:

df.groupby('a').agg({'b':lambda x: list(x)})

Look into writing Custom Aggregations: https://www.kaggle.com/akshaysehgal/how-to-group-by-aggregate-using-py
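As a comment below notes, plain agg(list) is enough for this case. A hedged sketch of a richer custom aggregation, mixing a built-in reducer with a lambda (assuming the frame also has a column c, which the question's example does not):

df.groupby('a').agg({
    'c': 'first',                        # first value per group
    'b': lambda x: x.unique().tolist(),  # de-duplicated list per group
})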

Belamy answered 27/9, 2018 at 6:28 Comment(3)
lambda args: f(args) is equivalent to fSkaw
Actually, just agg(list) is enough. Also see here.Kick
!! I was just googling for some syntax and realised my own notebook was referenced for the solution lol. Thanks for linking this. Just to add, since 'list' is not a series function, you will have to either use it with apply df.groupby('a').apply(list) or use it with agg as part of a dict df.groupby('a').agg({'b':list}). You could also use it with lambda (which I recommend) since you can do so much more with it. Example: df.groupby('a').agg({'c':'first', 'b': lambda x: x.unique().tolist()}) which lets you apply a series function to the col c and a unique then a list function to col b.Friedrich

If performance is important, go down to the numpy level:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 60, 600), 'b': [1, 2, 5, 5, 4, 6] * 100})

def f(df):
    # sort by the key column so equal keys become adjacent
    keys, values = df.sort_values('a').values.T
    # unique keys, plus the index of each key's first occurrence
    ukeys, index = np.unique(keys, True)
    # split the values at each boundary where a new key starts
    arrays = np.split(values, index[1:])
    df2 = pd.DataFrame({'a': ukeys, 'b': [list(a) for a in arrays]})
    return df2

Tests:

In [301]: %timeit f(df)
1000 loops, best of 3: 1.64 ms per loop

In [302]: %timeit df.groupby('a')['b'].apply(list)
100 loops, best of 3: 5.26 ms per loop
Stiffen answered 2/3, 2017 at 8:42 Comment(2)
How could we use this if we are grouping by two or more keys e.g. with .groupby([df.index.month, df.index.day]) instead of just .groupby('a')?Aeolic
@Aeolic I have added an answer below which you might want to check out. It does also handle grouping with multiple columnsBaez

To solve this for several columns of a dataframe:

In [5]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6],'c'
   ...: :[3,3,3,4,4,4]})

In [6]: df
Out[6]: 
   a  b  c
0  A  1  3
1  A  2  3
2  B  5  3
3  B  5  4
4  B  4  4
5  C  6  4

In [7]: df.groupby('a').agg(lambda x: list(x))
Out[7]: 
           b          c
a                      
A     [1, 2]     [3, 3]
B  [5, 5, 4]  [3, 4, 4]
C        [6]        [4]

This answer was inspired by Anamika Modi's answer. Thank you!

Roborant answered 31/10, 2018 at 16:25 Comment(1)
What should I do if one of the columns is itself a list?Cuttie

Use any of the following groupby and agg recipes.

# Setup
df = pd.DataFrame({
  'a': ['A', 'A', 'B', 'B', 'B', 'C'],
  'b': [1, 2, 5, 5, 4, 6],
  'c': ['x', 'y', 'z', 'x', 'y', 'z']
})
df

   a  b  c
0  A  1  x
1  A  2  y
2  B  5  z
3  B  5  x
4  B  4  y
5  C  6  z

To aggregate multiple columns as lists, use any of the following:

df.groupby('a').agg(list)
df.groupby('a').agg(pd.Series.tolist)

           b          c
a                      
A     [1, 2]     [x, y]
B  [5, 5, 4]  [z, x, y]
C        [6]        [z]

To group-listify a single column only, convert the groupby to a SeriesGroupBy object, then call SeriesGroupBy.agg:

df.groupby('a').agg({'b': list})  # 4.42 ms 
df.groupby('a')['b'].agg(list)    # 2.76 ms - faster

a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object
Kick answered 24/4, 2019 at 22:35 Comment(9)
Are the methods above guaranteed to preserve order? Meaning, will elements from the same row (but different columns, b and c in your code above) have the same index in the resulting lists?Improve
@Improve oh, good question. Yes and no. GroupBy sorts the output by the grouper key values. However the sort is generally stable so the relative ordering per group is preserved. To disable the sorting behavior entirely, use groupby(..., sort=False). Here, it'd make no difference since I'm grouping on column A which is already sorted.Kick
I'm sorry, I don't understand your answer. Can you explain in more detail? I think this deserves its own question.Improve
This is a very good answer! Is there also a way to make the values of the list unique? Something like .agg(pd.Series.tolist.unique) maybe?Rintoul
@FedericoGentile you can use a lambda. Here's one way: df.groupby('a')['b'].agg(lambda x: list(set(x)))Kick
@Kick Hi ColdSpeed! Is there a way to aggregate the list values of each column into one column? Instead of separate b and c columns, just create a single column that has all the values. Thank you.Ulmaceous
@Ulmaceous Not sure, perhaps you want df.groupby('a').agg(lambda x: x.to_numpy().ravel().tolist())Kick
@Federico Gentile df.groupby('a')['b'].agg("unique")Goering
This is the best answer.Gridiron

It is time to use agg instead of apply.

Given

df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c': [1,2,5,5,4,6]})

If you want multiple columns stacked into lists, the result is a pd.DataFrame:

df.groupby('a')[['b', 'c']].agg(list)
# or 
df.groupby('a').agg(list)

If you want a single column as a list, the result is a pd.Series:

df.groupby('a')['b'].agg(list)
#or
df.groupby('a')['b'].apply(list)

Note that producing a pd.DataFrame is about 10x slower than producing a pd.Series when you only aggregate a single column; use the DataFrame form in the multi-column case.
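
A rough benchmark sketch of that claim (absolute timings will vary with machine and pandas version):

%timeit df.groupby('a')[['b']].agg(list)  # DataFrame result
%timeit df.groupby('a')['b'].agg(list)    # Series result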

Libbielibbna answered 6/5, 2020 at 8:22 Comment(0)

As you were saying, the groupby method of a pd.DataFrame object can do the job.

Example

 L = ['A','A','B','B','B','C']
 N = [1,2,5,5,4,6]

 import pandas as pd
 df = pd.DataFrame(list(zip(L, N)), columns=list('LN'))


 groups = df.groupby(df.L)

 groups.groups
      {'A': [0, 1], 'B': [2, 3, 4], 'C': [5]}

which gives an index-wise description of the groups.

To get elements of single groups, you can do, for instance

 groups.get_group('A')

     L  N
  0  A  1
  1  A  2

  groups.get_group('B')

     L  N
  2  B  5
  3  B  5
  4  B  4
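
To get from here to the lists in the question, a minimal sketch iterating the GroupBy object:

# build {key: list of N values} from the GroupBy object
lists = {key: grp['N'].tolist() for key, grp in groups}
# {'A': [1, 2], 'B': [5, 5, 4], 'C': [6]}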
Unconditional answered 6/3, 2014 at 10:12 Comment(0)

Just a supplement. pandas.pivot_table is more general and seems more convenient:

"""data"""
df = pd.DataFrame( {'a':['A','A','B','B','B','C'],
                    'b':[1,2,5,5,4,6],
                    'c':[1,2,1,1,1,6]})
print(df)

   a  b  c
0  A  1  1
1  A  2  2
2  B  5  1
3  B  5  1
4  B  4  1
5  C  6  6
"""pivot_table"""
pt = pd.pivot_table(df,
                    values=['b', 'c'],
                    index='a',
                    aggfunc={'b': list,
                             'c': set})
print(pt)
           b       c
a                   
A     [1, 2]  {1, 2}
B  [5, 5, 4]     {1}
C        [6]     {6}
Intercede answered 29/3, 2021 at 11:55 Comment(0)

If you are looking for a unique list while grouping multiple columns, this could probably help:

df.groupby('a').agg(lambda x: list(set(x))).reset_index()
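
On the question's dataframe this collapses the duplicate 5s. A sketch of the expected shape of the result (set() does not guarantee order, so the order inside each list may differ):

   a       b
0  A  [1, 2]
1  B  [4, 5]
2  C     [6]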
Clown answered 4/7, 2019 at 17:7 Comment(0)

The easiest way I have found to achieve the same thing, at least for one column, is similar to Anamika's answer, just with the tuple syntax for the aggregate function.

df.groupby('a').agg(b=('b','unique'), c=('c','unique'))
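
Note that 'unique' returns arrays of the distinct values per group. If plain lists of all values are wanted instead, a similar sketch using the same named-aggregation syntax would be:

df.groupby('a').agg(b=('b', list), c=('c', list))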
Iatry answered 22/5, 2020 at 12:34 Comment(0)

Building upon @B.M's answer, here is a more general version, updated to work with newer library versions (numpy 1.19.2, pandas 1.2.1). This solution can also deal with multi-indices.

However, this is not heavily tested; use with caution.

If performance is important go down to numpy level:

import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame({'a': np.random.randint(0, 10, 90), 'b': [1,2,3]*30, 'c':list('abcefghij')*10, 'd': list('hij')*30})


def f_multi(df,col_names):
    if not isinstance(col_names,list):
        col_names = [col_names]
        
    values = df.sort_values(col_names).values.T

    col_idcs = [df.columns.get_loc(cn) for cn in col_names]
    other_col_names = [name for idx, name in enumerate(df.columns) if idx not in col_idcs]
    other_col_idcs = [df.columns.get_loc(cn) for cn in other_col_names]

    # split df into indexing columns (=keys) and data columns (=vals)
    keys = values[col_idcs,:]
    vals = values[other_col_idcs,:]
    
    # list of tuple of key pairs
    multikeys = list(zip(*keys))
    
    # remember unique key pairs and their indices
    ukeys, index = np.unique(multikeys, return_index=True, axis=0)
    
    # split data columns according to those indices
    arrays = np.split(vals, index[1:], axis=1)

    # resulting list of subarrays has same number of subarrays as unique key pairs
    # each subarray has the following shape:
    #    rows = number of non-grouped data columns
    #    cols = number of data points grouped into that unique key pair
    
    # prepare multi index
    idx = pd.MultiIndex.from_arrays(ukeys.T, names=col_names) 

    list_agg_vals = dict()
    for tup in zip(*arrays, other_col_names):
        col_vals = tup[:-1] # first entries are the subarrays from above 
        col_name = tup[-1]  # last entry is data-column name
        
        list_agg_vals[col_name] = col_vals

    df2 = pd.DataFrame(data=list_agg_vals, index=idx)
    return df2

Tests:

In [227]: %timeit f_multi(df, ['a','d'])

2.54 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [228]: %timeit df.groupby(['a','d']).agg(list)

4.56 ms ± 61.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
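
A usage sketch: since the function wraps a bare column name into a list, single-column grouping works through the same code path:

# group by a single column
f_multi(df, 'a')

# group by several columns (yields a MultiIndex)
f_multi(df, ['a', 'd'])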


Results:

For random seed 0 one would get:

(screenshot of the resulting multi-indexed dataframe omitted)

Baez answered 2/2, 2021 at 22:12 Comment(1)
Great answer. Please share example, if you need only one column, and not multipleDetraction

Let us use df.groupby with a dict comprehension and the Series constructor:

pd.Series({x : y.b.tolist() for x , y in df.groupby('a')})
Out[664]: 
A       [1, 2]
B    [5, 5, 4]
C          [6]
dtype: object
Cite answered 30/11, 2018 at 20:59 Comment(0)

Sorting consumes O(n log(n)) time, which is the most time-consuming operation in the solutions suggested above.

For a simple case (a single column), pd.Series.to_list would work and can be considered more efficient, unless you consider other frameworks.

e.g.

import pandas as pd
from string import ascii_lowercase
import random

def generate_string(case=4):
    return ''.join([random.choice(ascii_lowercase) for _ in range(case)])

df = pd.DataFrame({'num_val': [random.randint(0, 100) for _ in range(20000000)],
                   'string_val': [generate_string() for _ in range(20000000)]})


%timeit df.groupby('string_val').agg({'num_val':pd.Series.to_list})

For 20 million records it takes about 17.2 seconds, compared to apply(list), which takes about 19.2 s, and a lambda function, which takes about 20.6 s.
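
For reference, the apply(list) and lambda variants mentioned above would be timed with something like:

%timeit df.groupby('string_val')['num_val'].apply(list)
%timeit df.groupby('string_val').agg({'num_val': lambda x: list(x)})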

Ruderal answered 2/9, 2021 at 9:49 Comment(0)

Just to add to the previous answers: in my case, I wanted the list and other functions like min and max. The way to do that is:

df = pd.DataFrame({
    'a':['A','A','B','B','B','C'], 
    'b':[1,2,5,5,4,6]
})

df=df.groupby('a').agg({
    'b':['min', 'max',lambda x: list(x)]
})

#then flattening and renaming if necessary
df.columns = df.columns.to_flat_index()
df.rename(columns={('b', 'min'): 'b_min', ('b', 'max'): 'b_max', ('b', '<lambda_0>'): 'b_list'},inplace=True)
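
Alternatively, a sketch using named aggregation (available since pandas 0.25) that produces flat column names directly and avoids the flatten-and-rename step; the names b_min, b_max and b_list are arbitrary choices:

df = df.groupby('a').agg(
    b_min=('b', 'min'),
    b_max=('b', 'max'),
    b_list=('b', list),
)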
Bibbs answered 13/9, 2022 at 14:28 Comment(0)

Here I have grouped elements with "|" as a separator:

import pandas as pd

df = pd.read_csv('input.csv')

df
Out[1]:
  Area  Keywords
0    A         1
1    A         2
2    B         5
3    B         5
4    B         4
5    C         6

df.dropna(inplace=True)
df['Area'] = df['Area'].apply(lambda x: x.lower().strip())
print(df.columns)
# astype(str) is needed because str.join only accepts strings
df_op = df.groupby('Area').agg({"Keywords": lambda x: "|".join(x.astype(str))})

df_op.to_csv('output.csv')
df_op
Out[2]:
      Keywords
Area
a          1|2
b        5|5|4
c            6
Bronwynbronx answered 10/6, 2019 at 11:33 Comment(0)

Answer based on @EdChum's comment on his answer. The comment is this:

groupby is notoriously slow and memory hungry. What you could do is sort by column A, then find the idxmin and idxmax (probably storing these in a dict) and use them to slice your dataframe; that would be faster, I think.

Let's first create a dataframe with 500k categories in the first column and a total df shape of 20 million rows, as mentioned in the question.

df = pd.DataFrame(columns=['a', 'b'])
df['a'] = (np.random.randint(low=0, high=500000, size=(20000000,))).astype(str)
df['b'] = list(range(20000000))
print(df.shape)
df.head()
# Sort data by first column 
df.sort_values(by=['a'], ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)

# Create a temp column
df['temp_idx'] = list(range(df.shape[0]))

# Take all values of b in a separate list
all_values_b = list(df.b.values)
print(len(all_values_b))
# For each category in column a, find min and max indexes
gp_df = df.groupby(['a']).agg({'temp_idx': [np.min, np.max]})
gp_df.reset_index(inplace=True)
gp_df.columns = ['a', 'temp_idx_min', 'temp_idx_max']

# Now create final list_b column, using min and max indexes for each category of a and filtering list of b. 
gp_df['list_b'] = gp_df[['temp_idx_min', 'temp_idx_max']].apply(lambda x: all_values_b[x[0]:x[1]+1], axis=1)

print(gp_df.shape)
gp_df.head()
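
If the helper columns are not needed afterwards, a small cleanup sketch:

# drop the temporary index-range columns
gp_df = gp_df.drop(columns=['temp_idx_min', 'temp_idx_max'])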

The above code takes 2 minutes for 20 million rows and 500k categories in the first column.

Archangel answered 23/8, 2020 at 8:56 Comment(0)
