How to number/label data-table by group-number from group_by?

Asked 12/4, 2014 at 4:38 Answered 25/2, 2021 at 19:47

I have a tbl_df where I want to group_by(u, v) for each distinct integer combination observed with (u, v).

EDIT: this was subsequently resolved by adding the (now-deprecated) group_indices() back in dplyr 0.4.0

a) I then want to assign each distinct group some arbitrary distinct number label=1,2,3... e.g. the combination (u,v)==(2,3) could get label 1, (1,3) could get 2, and so on. How to do this with one mutate(), without a three-step summarize-and-self-join?

dplyr has a neat function n(), but that gives the number of elements within its group, not the overall number of the group. In data.table this would simply be called .GRP.

b) Actually, what I really want is to assign a string/character label ('A','B',...). But numbering groups by integers is good enough, because I can then use integer_to_label(i) as below. Unless there's a clever way to merge these two? But don't sweat this part.

library(dplyr)
set.seed(1234)

# Helper fn for mapping integer 1..26 to character label
integer_to_label <- function(i) { substr("ABCDEFGHIJKLMNOPQRSTUVWXYZ",i,i) }

df <- tibble::as_tibble(data.frame(u=sample.int(3,10,replace=TRUE), v=sample.int(4,10,replace=TRUE)))

# Want to label/number each distinct group of unique (u,v) combinations
df %>% group_by(u,v) %>% mutate(label = n()) # WRONG: n() is the number of elements within its group, not the overall number of the group

   u v
1  2 3
2  1 3
3  1 2
4  2 3
5  1 2
6  3 3
7  1 3
8  1 2
9  3 1
10 3 4

KLUDGE 1: could do df %>% group_by(u,v) %>% summarize(label = n()), then self-join
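For reference, the three-step summarize-and-join route that the question wants to avoid might look roughly like the sketch below. It is not the asker's exact code: distinct() and row_number() stand in for the summarize step, the data is typed in by hand from the table printed above, and the names group_labels and labelled are made up for illustration.

```r
library(dplyr)

# Data copied from the table printed in the question
df <- tibble(
  u = c(2L, 1L, 1L, 2L, 1L, 3L, 1L, 1L, 3L, 3L),
  v = c(3L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 1L, 4L)
)

# Steps 1 and 2: one row per distinct (u, v), in order of first
# appearance, then number those rows
group_labels <- df %>%
  distinct(u, v) %>%
  mutate(label = row_number())

# Step 3: join the labels back onto the original rows
# (left_join keeps the row order of df)
labelled <- df %>% left_join(group_labels, by = c("u", "v"))
```

With this data, (2,3) gets label 1 and (1,3) gets label 2, matching the example numbering given in part a).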

Bellabelladonna answered 12/4, 2014 at 4:38 Comment(7)

@Randy-Lai and I both solved it, separately. Randy's is a cleaner idiom that lends itself to multiple mutate/summarize(...) actions. I found interaction(u,v, drop=T) – Bellabelladonna 12/4, 2014 at 23:30

What do you need this for? – Chromophore 14/4, 2014 at 23:11

@hadley: my particular reason is as stated in the question: I want to assign each distinct (u,v)-group some arbitrary (ordered) numbering=1,2,3... so I can ultimately assign them string labels 'A','B','C'... (my purpose is to subsequently refer to them by shorthand, in modeling and graphing) – Bellabelladonna 18/11, 2014 at 22:49

@hadley: but in general this is a useful feature, and data.table package implements .GRP for this. Any chance we can have something in dplyr please? :) – Bellabelladonna 18/11, 2014 at 22:51

next version will have group_indices() – Chromophore 19/11, 2014 at 15:59

@Chromophore Thanks! New in 0.4.0 (1/2015) – Bellabelladonna 16/3, 2015 at 13:36

@SamFirke: thanks for the updates and answer, but please leave my ancient cave scribblings in the question. Also, don't delete the comparison to data.table, that's all useful too. – Bellabelladonna 25/2, 2021 at 21:59

Updated answer

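# Closure that returns 1, 2, 3, ... on successive calls
get_group_number = function(){
    i = 0
    function(){
        i <<- i+1
        i
    }
}
group_number = get_group_number()
# mutate() evaluates the expression once per group, so each group gets the
# next integer, recycled across that group's rows.
# Create a fresh counter before each run; otherwise numbering continues.
df %>% group_by(u,v) %>% mutate(label = group_number())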

You can also consider the following slightly unreadable version

group_number = (function(){i = 0; function() i <<- i+1 })()
df %>% group_by(u,v) %>% mutate(label = group_number())

Or, using the iterators package:

library(iterators)

counter = icount()
df %>% group_by(u,v) %>% mutate(label = nextElem(counter))

Periodontics answered 12/4, 2014 at 5:24 Comment(12)

No, this is wrong. I'm not looking for the row-number within a group. I'm looking for the group-number (the equivalent of data.table .GRP). Since we have 7 unique combinations of (u,v) in this example, the output labels should be 1:7 (in some arbitrary order) – Bellabelladonna 12/4, 2014 at 5:29

Sorry, I didn't pay much attention to your question. I have updated the answer with a dirty solution... – Periodontics 12/4, 2014 at 5:35

not bad but that's essentially just a generator function that returns incrementing integers... surely we can obviate it? – Bellabelladonna 12/4, 2014 at 5:39

^ Does R not do generator functions? (like Python yield?) Without having to manually save state inside your fn? – Bellabelladonna 12/4, 2014 at 7:25

you remind me of iterators package. I have never used it before. (And see the updated solution). But it is essentially equivalent to my original method. – Periodontics 12/4, 2014 at 7:32

^ Wow that's awesome! Best answer. Did you see my new one using interaction(u,v)? Can you figure out how to reorder the levels in increasing order? – Bellabelladonna 12/4, 2014 at 8:1

i think you will get the correct order if you order df. – Periodontics 12/4, 2014 at 8:7

Assume we want to preserve the order of df (we do, my real case is more complicated). It would be clunky to dplyr::arrange(u,v) then do this group-numbering then revert to dplyr::arrange(<previous-variable-ordering>) – Bellabelladonna 12/4, 2014 at 8:9

may be factor(interaction(sort(df$u),sort(df$v))) (I didn't test it). – Periodontics 12/4, 2014 at 8:14

Naw... I've been trying many things unsuccessfully for a while now. Might post as a separate question. – Bellabelladonna 12/4, 2014 at 8:40

Solved - see my updated answer. And question link below. – Bellabelladonna 12/4, 2014 at 9:27

Update: New group_indices_ in 0.4.0 (1/2015) – Bellabelladonna 16/3, 2015 at 13:37

For current dplyr versions (1.0.0 and higher)

Since version 1.0, dplyr has a new cur_group_id function for that:

df %>% 
    group_by(u, v) %>% 
    mutate(label = cur_group_id()) ...

For previous dplyr versions (before 1.0.0; the function is deprecated but still available as of 1.0.10)

dplyr has a group_indices() function that you can use like this:

df %>% 
    mutate(label = group_indices(., u, v)) %>% 
    group_by(label) ...

Raillery answered 16/3, 2015 at 11:13 Comment(2)

group_indices() uses the (alphabetical) ordering of the grouping variable though, is there any way of using it to preserve the ordering in the table, or applying your own? – Vitamin 17/9, 2019 at 12:52

Note that group_indices() was deprecated in dplyr 1.0.0 and has been replaced with cur_group_id(). – Metaprotein 19/4, 2023 at 19:1
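On the ordering question raised in the comments: one possible way to number groups by first appearance in the table, rather than by sorted key, is match() against the unique keys. This is only a sketch, assuming dplyr 1.0.0 or later and the data printed in the question; the key column and the names by_value and by_appearance are invented for illustration.

```r
library(dplyr)

# Data copied from the table printed in the question
df <- tibble(
  u = c(2L, 1L, 1L, 2L, 1L, 3L, 1L, 1L, 3L, 3L),
  v = c(3L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 1L, 4L)
)

# Sorted-key numbering: groups are ordered by u, then v
by_value <- df %>%
  group_by(u, v) %>%
  mutate(label = cur_group_id()) %>%
  ungroup()

# First-appearance numbering: build a key per row, then look up the
# position of each key among the unique keys (unique() keeps first-seen order)
by_appearance <- df %>%
  mutate(key = paste(u, v),
         label = match(key, unique(key))) %>%
  select(-key)
```

To apply a custom order instead, the same match() call could be given a hand-written vector of keys in place of unique(key).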

Another approach using data.table would be

library(data.table)
setDT(df)[,label:=.GRP, by = c("u", "v")]

which results in:

    u v label
 1: 2 1     1
 2: 1 3     2
 3: 2 1     1
 4: 3 4     3
 5: 3 1     4
 6: 1 1     5
 7: 3 2     6
 8: 2 3     7
 9: 3 2     6
10: 3 4     3

Jarrett answered 23/8, 2016 at 18:9 Comment(0)

As of dplyr version 1.0.4 (the function was introduced in 1.0.0), cur_group_id() has replaced the older function group_indices().

Call it on the grouped data.frame:

df %>%
  group_by(u, v) %>%
  mutate(label = cur_group_id())

# A tibble: 10 x 3
# Groups:   u, v [6]
       u     v label
   <int> <int> <int>
 1     2     2     4
 2     2     2     4
 3     1     3     2
 4     3     2     6
 5     1     4     3
 6     1     2     1
 7     2     2     4
 8     2     4     5
 9     3     2     6
10     2     4     5
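For part (b) of the question, the group number can be turned into a character label in the same mutate() by indexing base R's built-in LETTERS vector. This is a sketch under the same assumption of dplyr 1.0.0 or later, using the data printed in the question rather than the data shown just above; the name lettered is invented for illustration. It only covers up to 26 groups, beyond which the index returns NA.

```r
library(dplyr)

# Data copied from the table printed in the question
df <- tibble(
  u = c(2L, 1L, 1L, 2L, 1L, 3L, 1L, 1L, 3L, 3L),
  v = c(3L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 1L, 4L)
)

# cur_group_id() gives the group's integer id; LETTERS maps 1..26 to "A".."Z"
lettered <- df %>%
  group_by(u, v) %>%
  mutate(label = LETTERS[cur_group_id()]) %>%
  ungroup()
```

This replaces the integer_to_label() helper from the question for the single-letter case.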

Nail answered 25/2, 2021 at 19:47 Comment(0)
