Python Pandas: How can I group by and assign an id to all the items in a group?
Asked Answered
H

4

6

I have df:

domain           orgid
csyunshu.com    108299
dshu.com        108299
bbbdshu.com     108299
cwakwakmrg.com  121303
ckonkatsunet.com    121303

I would like to add a new column with replaces domain column with numeric ids per orgid:

domain           orgid   domainid
csyunshu.com    108299      1
dshu.com        108299      2
bbbdshu.com     108299      3
cwakwakmrg.com  121303      1
ckonkatsunet.com 121303     2

I have already tried this line but it does not give the result I want:

df.groupby('orgid').count['domain'].reset_index()

Can anybody help?

Hillhouse answered 17/3, 2016 at 14:19 Comment(1)
"future" duplicate here: #41595203Melisa
W
10

You can call rank on the groupby object and pass param method='first':

In [61]:
df['domainId'] = df.groupby('orgid')['orgid'].rank(method='first')
df

Out[61]:
             domain   orgid  domainId
0      csyunshu.com  108299         1
1          dshu.com  108299         2
2       bbbdshu.com  108299         3
3    cwakwakmrg.com  121303         1
4  ckonkatsunet.com  121303         2

If you want to overwrite the column you can do:

df['domain'] = df.groupby('orgid')['orgid'].rank(method='first')
Windswept answered 17/3, 2016 at 14:21 Comment(2)
Do you really need the lambda here? Isn't rank a method of SeriesGroupBy objects?Skelp
@Skelp good point, it's unnecessary as I just tested so have removed itWindswept
C
1

You can use LabelEncoder from sklearn.preprocessing like :

df["domain"] = LabelEncoder().fit_transform(df.domain)
Corroborate answered 31/10, 2016 at 13:28 Comment(0)
M
0

It's very intuitive with dplyr

df %>% group_by(orgid) %>% mutate(domainid=row_number())

I am porting this to python, with datar:

from datar.all import *

df = tibble(
    domain=['csyunshu.com', 'dshu.com', 'bbbdshu.com', 'cwakwakmrg.com', 'ckonkatsunet.com'],
    orgid=[108299,108299,108299,121303,121303]
)

df >> group_by(f.orgid) >> mutate(domainid=row_number())
# <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ff728cba490>

df >> group_by(f.orgid) >> mutate(domainid=row_number()) >> showme()
[2021-03-13 00:55:12][datar][   INFO] # [DataFrameGroupBy] Groups: ['orgid'] (2)
#              domain   orgid  domainid
# 0      csyunshu.com  108299         0
# 1          dshu.com  108299         1
# 2       bbbdshu.com  108299         2
# 3    cwakwakmrg.com  121303         0
# 4  ckonkatsunet.com  121303         1

df >> group_by(f.orgid) >> mutate(domainid=row_number()+1) >> showme()
[2021-03-13 00:55:26][datar][   INFO] # [DataFrameGroupBy] Groups: ['orgid'] (2)
#              domain   orgid  domainid
# 0      csyunshu.com  108299         1
# 1          dshu.com  108299         2
# 2       bbbdshu.com  108299         3
# 3    cwakwakmrg.com  121303         1
# 4  ckonkatsunet.com  121303         2
Melosa answered 13/3, 2021 at 7:59 Comment(0)
L
0

A similar approach to EdChum's, except a better function than rank() would be cumcount() for multiple reasons:

  1. You don't have to pull out the grouping column.
  2. You don't have to specify additional parameters.
  3. cumcount() returns integers rather than floats, which is probably what you want for an id.
  4. It starts from 0, which is useful. Of course, if you wanted it to start from one you just need to add a + 1 at the end.

Here's the code:

In [1]:
df['domainId'] = df.groupby('orgid').cumcount()
df

Out[1]:
             domain   orgid  domainId
0      csyunshu.com  108299         0
1          dshu.com  108299         1
2       bbbdshu.com  108299         2
3    cwakwakmrg.com  121303         0
4  ckonkatsunet.com  121303         1
Longwinded answered 29/3 at 22:27 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.