Efficient way to get group names in pandas
Asked Answered
I have a .csv file with around 300,000 rows. I group it by a particular column; each group has around 140 members (2,138 groups in total).

I am trying to generate a NumPy array of the group names. At the moment I build the list with a for loop, but it takes a while to process everything.

import numpy as np
import pandas as pd

df = pd.read_csv('file.csv')
grouped = df.groupby('col1')
group_names = []
for name, group in grouped:
    group_names.append(name)
group_names = np.array(group_names, dtype=object)

I am wondering if there is a more efficient way to do this, whether by using a pandas module or directly converting the names into a numpy array.

Obloquy answered 14/6, 2018 at 14:34 Comment(0)
The fastest way is most likely to call unique on the column you are grouping by, which gives you all unique values. The output is an array of your group names.

group_names = df.col1.unique()
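One caveat worth noting (a minimal sketch with made-up data; only the column name `col1` comes from the question): `unique()` returns values in order of first appearance, while `groupby` sorts its keys by default, so the two orderings can differ.

```python
import pandas as pd

# Made-up data; 'col1' is the column name from the question.
df = pd.DataFrame({'col1': ['b', 'a', 'b', 'c', 'a']})

# unique() preserves order of first appearance...
print(list(df['col1'].unique()))        # ['b', 'a', 'c']

# ...while groupby sorts the group keys by default.
print(list(df.groupby('col1').groups))  # ['a', 'b', 'c']
```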
Sevigny answered 14/6, 2018 at 14:35 Comment(0)
groupby objects have a .groups attribute:

groups = df.groupby('col1').groups

This returns a dict mapping each group name to the index labels of the rows in that group.

Example:

In[257]:
df = pd.DataFrame({'a':list('aabcccc'), 'b':np.random.randn(7)})
groups = df.groupby('a').groups
groups

Out[257]: 
{'a': Int64Index([0, 1], dtype='int64'),
 'b': Int64Index([2], dtype='int64'),
 'c': Int64Index([3, 4, 5, 6], dtype='int64')}

groups.keys()
Out[258]: dict_keys(['a', 'b', 'c'])
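If, as in the question, you want the group names as a NumPy array rather than a dict, the keys can be wrapped directly (a sketch reusing the example frame above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': list('aabcccc'), 'b': np.random.randn(7)})

# The dict keys are the group names; wrap them in an object array
# to match what the question's loop produced.
group_names = np.array(list(df.groupby('a').groups), dtype=object)
print(group_names)  # ['a' 'b' 'c']
```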
Baalbeer answered 14/6, 2018 at 14:37 Comment(2)
is the dict guaranteed to be ordered the same way as iterating?Salinas
When I have a df with few groups that have a lot of members, df.groupby(*args) takes 4ms for me, while df.groupby(*args).groups takes 240ms. Is this because the first expression doesn't actually separate out the groups yet? If not, and if you're only interested in the group names (the keys of the dict), might there be something that only returns the names and skips returning the indices of each group? @sacuL 's answer is much faster, but it only works if you want to group by a single column.Williwaw
