How to number/label data-table by group-number from group_by?
B

6

22

I have a tbl_df where I want to group_by(u, v) for each distinct integer combination observed with (u, v).


EDIT: this was subsequently resolved by adding the (now-deprecated) group_indices() back in dplyr 0.4.0


a) I then want to assign each distinct group some arbitrary distinct number label=1,2,3... e.g. the combination (u,v)==(2,3) could get label 1, (1,3) could get 2, and so on. How to do this with one mutate(), without a three-step summarize-and-self-join?

dplyr has a neat function n(), but that gives the number of elements within each group, not the index of the group among all groups. In data.table this would simply be called .GRP.

b) Actually, what I really want is to assign a string/character label ('A','B',...). But numbering groups by integers is good enough, because I can then use integer_to_label(i) as below. Unless there's a clever way to merge these two? But don't sweat this part.

library(dplyr)

set.seed(1234)

# Helper fn for mapping integer 1..26 to character label
integer_to_label <- function(i) { substr("ABCDEFGHIJKLMNOPQRSTUVWXYZ",i,i) }

df <- tibble::as_tibble(data.frame(u=sample.int(3,10,replace=TRUE), v=sample.int(4,10,replace=TRUE)))

# Want to label/number each distinct group of unique (u,v) combinations
df %>% group_by(u,v) %>% mutate(label = n()) # WRONG: n() is the number of elements within each group, not the index of the group

   u v
1  2 3
2  1 3
3  1 2
4  2 3
5  1 2
6  3 3
7  1 3
8  1 2
9  3 1
10 3 4

KLUDGE 1: could do df %>% group_by(u,v) %>% summarize(label = n()) , then self-join
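
For reference, the self-join kludge described above might look roughly like the following sketch. The names `labels` and `labelled` are illustrative, and `distinct()` plus `row_number()` stands in for the `summarize()` step, so this is one possible reading of the kludge rather than a recommended approach.

```r
library(dplyr)

set.seed(1234)
df <- tibble(u = sample.int(3, 10, replace = TRUE),
             v = sample.int(4, 10, replace = TRUE))

# Steps 1 and 2: one row per distinct (u, v) combination, then number those rows
labels <- df %>%
  distinct(u, v) %>%
  mutate(label = row_number())

# Step 3: join the labels back onto the original rows (row order of df is kept)
labelled <- df %>%
  left_join(labels, by = c("u", "v"))

print(labelled)
```

Here the labels follow the order in which each combination first appears in `df`, which is the kind of arbitrary-but-stable numbering the question asks for.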
Bellabelladonna answered 12/4, 2014 at 4:38 Comment(7)
@Randy-Lai and I both solved it, separately. Randy's is a cleaner idiom that lends itself to multiple mutate/summarize(...) actions. I found interaction(u,v, drop=T) - Bellabelladonna
What do you need this for? - Chromophore
@hadley: my particular reason is as stated in the question: I want to assign each distinct (u,v)-group some arbitrary (ordered) numbering=1,2,3... so I can ultimately assign them string labels 'A','B','C'... (my purpose is to subsequently refer to them by shorthand, in modeling and graphing) - Bellabelladonna
@hadley: but in general this is a useful feature, and data.table package implements .GRP for this. Any chance we can have something in dplyr please? :) - Bellabelladonna
next version will have group_indices() - Chromophore
@Chromophore Thanks! New in 0.4.0 (1/2015) - Bellabelladonna
@SamFirke: thanks for the updates and answer, but please leave my ancient cave scribblings in the question. Also, don't delete the comparison to data.table, that's all useful too. - Bellabelladonna
P
6

Updated answer

get_group_number = function(){
    i = 0
    function(){
        i <<- i+1
        i
    }
}
group_number = get_group_number()
df %>% group_by(u,v) %>% mutate(label = group_number())

You can also consider the following slightly unreadable version

group_number = (function(){i = 0; function() i <<- i+1 })()
df %>% group_by(u,v) %>% mutate(label = group_number())

using iterators package

library(iterators)

counter = icount()
df %>% group_by(u,v) %>% mutate(label = nextElem(counter))
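
All three variants above rest on the same idea: a function that remembers how many times it has been called. A minimal base-R sketch of that closure pattern follows; `make_counter`, `next_id` and `other` are illustrative names, not part of dplyr or iterators.

```r
# Factory that returns a counter function with its own private state
make_counter <- function() {
  i <- 0L
  function() {
    i <<- i + 1L  # updates i in the enclosing environment, not a global
    i
  }
}

next_id <- make_counter()
first_three <- c(next_id(), next_id(), next_id())
print(first_three)

# A second counter starts from scratch; the first one keeps its state
other <- make_counter()
other_first <- other()
next_after <- next_id()
print(c(other_first, next_after))
```

Because the counter keeps its state between calls, it has to be recreated before labelling a second data frame, otherwise the numbering continues where it left off. Using it for group numbers also assumes that mutate() evaluates the expression exactly once per group.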
Periodontics answered 12/4, 2014 at 5:24 Comment(12)
No, this is wrong. I'm not looking for the row-number within a group. I'm looking for the group-number (the equivalent of data.table .GRP). Since we have 7 unique combinations of (u,v) in this example, the output labels should be 1:7 (in some arbitrary order) - Bellabelladonna
Sorry, I didn't pay much attention to your question. I have updated the answer with a dirty solution... - Periodontics
Not bad, but that's essentially just a generator function that returns incrementing integers... surely we can obviate it? - Bellabelladonna
^ Does R not do generator functions? (like Python yield?) Without having to manually save state inside your fn? - Bellabelladonna
You reminded me of the iterators package. I have never used it before. (And see the updated solution.) But it is essentially equivalent to my original method. - Periodontics
^ Wow that's awesome! Best answer. Did you see my new one using interaction(u,v)? Can you figure out how to reorder the levels in increasing order? - Bellabelladonna
I think you will get the correct order if you order df. - Periodontics
Assume we want to preserve the order of df (we do, my real case is more complicated). It would be clunky to dplyr::arrange(u,v) then do this group-numbering then revert to dplyr::arrange(<previous-variable-ordering>) - Bellabelladonna
Maybe factor(interaction(sort(df$u),sort(df$v))) (I didn't test it). - Periodontics
Naw... I've been trying many things unsuccessfully for a while now. Might post as a separate question. - Bellabelladonna
Solved - see my updated answer. And question link below. - Bellabelladonna
Update: New group_indices_ in 0.4.0 (1/2015) - Bellabelladonna
R
55

For current dplyr versions (1.0.0 and higher)

Since version 1.0, dplyr has a new cur_group_id function for that:

df %>% 
    group_by(u, v) %>% 
    mutate(label = cur_group_id()) ...
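
To cover part (b) of the question, the group id can be used directly as an index into R's built-in `LETTERS` vector. A minimal sketch, assuming dplyr 1.0.0 or later and at most 26 groups; the data frame reproduces the ten rows printed in the question, and the column names `id` and `label` are illustrative.

```r
library(dplyr)

# The ten (u, v) rows shown in the question
df <- tibble(
  u = c(2L, 1L, 1L, 2L, 1L, 3L, 1L, 1L, 3L, 3L),
  v = c(3L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 1L, 4L)
)

labelled <- df %>%
  group_by(u, v) %>%
  mutate(id = cur_group_id(),    # integer group number, one per (u, v) group
         label = LETTERS[id]) %>% # map 1, 2, 3, ... to "A", "B", "C", ...
  ungroup()

print(labelled)
```

The ids follow the sorted order of the grouping keys, so (1,2) becomes "A" even though (2,3) appears first in the data.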
    

For earlier dplyr versions (before 1.0.0; the function is deprecated, but still available as of 1.0.10)

dplyr has a group_indices() function that you can use like this:

df %>% 
    mutate(label = group_indices(., u, v)) %>% 
    group_by(label) ...
Raillery answered 16/3, 2015 at 11:13 Comment(2)
group_indices() uses the (alphabetical) ordering of the grouping variable though. Is there any way of using it to preserve the ordering in the table, or applying your own? - Vitamin
Note that group_indices() was deprecated in dplyr 1.0.0 and has been replaced with cur_group_id(). - Metaprotein
J
11

Another approach using data.table would be

require(data.table)
setDT(df)[,label:=.GRP, by = c("u", "v")]

which results in:

    u v label
 1: 2 1     1
 2: 1 3     2
 3: 2 1     1
 4: 3 4     3
 5: 3 1     4
 6: 1 1     5
 7: 3 2     6
 8: 2 3     7
 9: 3 2     6
10: 3 4     3
Jarrett answered 23/8, 2016 at 18:9 Comment(0)
N
9

As of dplyr version 1.0.4, the function cur_group_id() (available since dplyr 1.0.0) has replaced the older function group_indices().

Call it on the grouped data.frame:

df %>%
  group_by(u, v) %>%
  mutate(label = cur_group_id())

# A tibble: 10 x 3
# Groups:   u, v [6]
       u     v label
   <int> <int> <int>
 1     2     2     4
 2     2     2     4
 3     1     3     2
 4     3     2     6
 5     1     4     3
 6     1     2     1
 7     2     2     4
 8     2     4     5
 9     3     2     6
10     2     4     5
Nail answered 25/2, 2021 at 19:47 Comment(0)
B
2

Updating my answer with three different ways:

A) A neat non-dplyr solution using interaction(u,v):

> df$label <- factor(interaction(df$u,df$v, drop=T))
 [1] 1.3 2.3 2.2 2.4 3.2 2.4 1.2 1.2 2.1 2.1
 Levels: 2.1 1.2 2.2 3.2 1.3 2.3 2.4

> match(df$label, levels(df$label)[ rank(unique(df$label)) ] )
 [1] 1 2 3 4 5 4 6 6 7 7
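
If numbering by sorted (u, v) is acceptable, the rank()/match() step can be avoided. The sketch below is an alternative, not the method above: it relies on the `lex.order` argument of base R's interaction() and uses the ten rows printed in the question as illustrative data.

```r
# The ten (u, v) rows shown in the question
df <- data.frame(
  u = c(2L, 1L, 1L, 2L, 1L, 3L, 1L, 1L, 3L, 3L),
  v = c(3L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 1L, 4L)
)

# lex.order = TRUE orders the levels by u first, then v;
# drop = TRUE removes combinations that never occur
f <- interaction(df$u, df$v, drop = TRUE, lex.order = TRUE)

# The integer codes of the factor are the group numbers
df$label <- as.integer(f)

print(levels(f))
print(df)
```

The row order of `df` is untouched; only the numbering follows the sorted keys.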

B) Making Randy's neat fast-and-dirty generator-function answer more compact:

get_next_integer = function(){
  i = 0
  function(u,v){ i <<- i+1 }
}
get_integer = get_next_integer() 

df %>% group_by(u,v) %>% mutate(label = get_integer())

C) Also here is a one-liner using a generator function abusing a global variable assignment from this:

i <- 0
generate_integer <- function() { return(assign('i', i+1, envir = .GlobalEnv)) }

df %>% group_by(u,v) %>% mutate(label = generate_integer())

rm(i)
Bellabelladonna answered 12/4, 2014 at 6:14 Comment(4)
The reason that I used get_group_number is to avoid using a global variable. I think it is in general not a good idea to change global variables inside a function... but it works anyway. - Periodontics
I compacted yours and put it at the top of my answer. An assignment evaluates (invisibly) to the value assigned, hence we can simply say function(u,v){ i <<- i+1 } - Bellabelladonna
I also found a neat three-liner non-dplyr way with interaction(u,v), and added that at top. - Bellabelladonna
I also solved the incremental-order issue with interaction(... drop=T) per this subquestion - Bellabelladonna
P
2

I don't have enough reputation for a comment, so I'm posting an answer instead.

The solution using factor() is a good one, but it has the disadvantage that group numbers are assigned after factor() alphabetizes its levels. The same behaviour happens with dplyr's group_indices(). Perhaps you would like the group numbers to be assigned from 1 to n based on the current group order. In which case, you can use:

my_tibble %>% mutate(group_num = as.integer(factor(group_var, levels = unique(.$group_var))) )
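
The same order-of-first-appearance idea extends to two grouping columns. A base-R sketch, assuming the (u, v) data printed in the question; `key` is an illustrative helper column and the "." separator is an arbitrary choice that is safe for integer keys.

```r
# The ten (u, v) rows shown in the question
df <- data.frame(
  u = c(2L, 1L, 1L, 2L, 1L, 3L, 1L, 1L, 3L, 3L),
  v = c(3L, 3L, 2L, 3L, 2L, 3L, 3L, 2L, 1L, 4L)
)

# Build one key per row, then number the keys by first appearance
key <- paste(df$u, df$v, sep = ".")
df$group_num <- match(key, unique(key))

print(df)
```

match() returns the position of each key in unique(key), and unique() keeps keys in the order they are first seen, so the first row always gets group number 1.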
Puncheon answered 26/6, 2018 at 22:13 Comment(1)
Thanks. As I noted in the question, this was all solved by adding group_indices() back in dplyr 0.4.0 in 2015 - Bellabelladonna
