Create group names for consecutive values

This looks like an easy task, but I can't figure out a simpler way. I have the vector x below and need to create group names for runs of consecutive identical values. My attempt uses rle; are there better ideas?

# data
x <- c(1,1,1,2,2,2,3,2,2,1,1)

# make groups
r <- rle(x)
rep(paste0("Group_", seq_along(r$lengths)), r$lengths)
# [1] "Group_1" "Group_1" "Group_1" "Group_2" "Group_2" "Group_2" "Group_3" "Group_4"
# [9] "Group_4" "Group_5" "Group_5"

Benitez answered 14/6, 2016 at 10:12 Comment(6)

Why not use paste directly? paste0('groupe_', c(1,1,1,2,2,2,3,2,2,1,1)) – Stomy 14/6, 2016 at 10:18

Because the last two groups would be 2 and 1 instead of 4 and 5 if you paste directly. – Marable 14/6, 2016 at 10:18

@MamounBenghezal please check the expected output, first 1 is a Group_1, and last 1 is a Group_5 – Benitez 14/6, 2016 at 10:19

Nice attempt. A key line in the source code of rle makes use of diff as @Roland did below. – Revulsion 14/6, 2016 at 13:12

But.. having done that, how do you map these Group_x names to the actual values & run lengths? That is, what's the point of this exercise? – Relish 14/6, 2016 at 14:7

@CarlWitthoft names are in the same order as the values, so direct map, i.e.: names(x) <- myGroups. My actual data is data.frame, so I can apply the same and create a Group column for aggregate functions down the line. – Benitez 14/6, 2016 at 14:12
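For what it's worth, here is a minimal sketch of the data.frame workflow described in the last comment. The data frame df, its value column, and the choice of sum as the aggregate are made up for illustration; they are not from the original post.

```r
# Hypothetical data frame: x is the vector from the question, value is invented
df <- data.frame(x = c(1, 1, 1, 2, 2, 2, 3, 2, 2, 1, 1),
                 value = 1:11)

# Build one group label per run of consecutive identical x values
r <- rle(df$x)
df$Group <- rep(paste0("Group_", seq_along(r$lengths)), r$lengths)

# Aggregate within each run; sum is only an example aggregate
agg <- aggregate(value ~ Group, data = df, FUN = sum)
agg
```

Because the labels are built per run rather than per value, the first and last runs of 1 end up in different rows of agg.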

Using diff and cumsum:

paste0("Group_", cumsum(c(1, diff(x) != 0)))
#[1] "Group_1" "Group_1" "Group_1" "Group_2" "Group_2" "Group_2" "Group_3" "Group_4" "Group_4" "Group_5" "Group_5"

(If your values are floating point values, you might have to avoid != and use a tolerance instead.)
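A rough sketch of what the tolerance-based variant might look like. The example vector x_fp and the choice of tol are assumptions for illustration; sqrt(.Machine$double.eps) is simply the default tolerance used by all.equal(), and a different threshold may suit your data better.

```r
# Example values where 0.3 and 0.1 + 0.2 differ only by rounding error
x_fp <- c(0.3, 0.1 + 0.2, 0.5, 0.5, 1.0)

# Assumed tolerance (same value all.equal() uses by default)
tol <- sqrt(.Machine$double.eps)

# A new group starts wherever the jump from the previous value exceeds tol
groups_fp <- paste0("Group_", cumsum(c(1, abs(diff(x_fp)) > tol)))
groups_fp
```

With diff(x_fp) != 0 the first two elements would land in different groups; with the tolerance they are treated as one run.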

Tetratomic answered 14/6, 2016 at 10:32 Comment(4)

If they might not be numeric - paste0("Group_", cumsum(c(TRUE, head(x,-1)!=tail(x,-1)))) – Cardenas 14/6, 2016 at 10:33

My numbers have no floating points, so != should be OK, but what do you mean by tolerance? – Benitez 14/6, 2016 at 10:39

abs(diff(x)) < tol with tol based on help(".Machine"). – Tetratomic 14/6, 2016 at 10:40

Nice - I'm guessing this is faster than rle(x) and processing the output from that. OTOH, I would want to know how to map the group names to the runs, in which case might as well use rle(x)$lengths . – Relish 14/6, 2016 at 14:8
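On mapping group names back to the runs, one possible sketch is a small lookup table built from the rle output. The name runs and its column names are hypothetical.

```r
x <- c(1, 1, 1, 2, 2, 2, 3, 2, 2, 1, 1)

# rle() returns the value and the length of each run, in order
r <- rle(x)

# One row per run: group name, the run's value, and how long it is
runs <- data.frame(group  = paste0("Group_", seq_along(r$lengths)),
                   value  = r$values,
                   length = r$lengths)
runs
```

Joining this table to the per-element labels on group recovers the value and run length for every element.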

Using rleid from data.table:

library(data.table)

rleid(x, prefix = "Group_")
#[1] "Group_1" "Group_1" "Group_1" "Group_2" "Group_2" "Group_2" "Group_3" "Group_4" "Group_4" "Group_5" "Group_5"

Marable answered 14/6, 2016 at 10:25 Comment(0)

Using cumsum but not relying on the data being numeric:

paste0("Group_", 1 + c(0, cumsum(x[-length(x)] != x[-1])))
#[1] "Group_1" "Group_1" "Group_1" "Group_2" "Group_2" "Group_2" "Group_3" "Group_4" "Group_4" "Group_5" "Group_5"
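To illustrate the non-numeric case, here is the same expression applied to a character vector. The vector x_chr is made up for this example.

```r
# Hypothetical character input; diff() would fail here, but != works
x_chr <- c("a", "a", "b", "b", "a", "c", "c")

# Compare each element with its predecessor and count the changes
groups_chr <- paste0("Group_", 1 + c(0, cumsum(x_chr[-length(x_chr)] != x_chr[-1])))
groups_chr
```

One caveat: if the vector contains NA, the comparison returns NA and cumsum propagates it to every later element, so missing values would need to be handled first.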

Pasteurizer answered 14/6, 2016 at 13:4 Comment(0)

group() from groupdata2 can create groups from a list of group starting points, using the l_starts method. By setting n to auto, it automatically finds group starts:

x <- c(1,1,1,2,2,2,3,2,2,1,1)
groupdata2::group(x, n = "auto", method = "l_starts")

## # A tibble: 11 x 2
## # Groups:   .groups [5]
##     data .groups
##    <dbl> <fct>  
##  1     1 1      
##  2     1 1      
##  3     1 1      
##  4     2 2      
##  5     2 2      
##  6     2 2      
##  7     3 3      
##  8     2 4      
##  9     2 4      
## 10     1 5      
## 11     1 5

There's also the differs_from_previous() function which finds values, or indices of values, that differ from the previous value by some threshold(s).

library(groupdata2)

# The values to start groups at
differs_from_previous(x, threshold = 1,
                      direction = "both")
## [1] 2 3 2 1

# The indices to start groups at
differs_from_previous(x, threshold = 1,
                      direction = "both",
                      return_index = TRUE)
## [1] 4 7 8 10

Beach answered 26/7, 2019 at 1:32 Comment(0)
