How to do range grouping on a column using dplyr?

I want to group a data.table based on ranges of a column's values. How can I do this with the dplyr library?

For example, my data.table looks like this:

library(data.table)
library(dplyr)
DT <- data.table(A=1:100, B=runif(100), Amount=runif(100, 0, 100))

Now I want to split DT into 20 groups based on 0.05-wide intervals of column B, and count how many rows fall into each group. For example, all rows with a column B value in the range [0, 0.05) form one group; all rows with B in [0.05, 0.1) form another group, and so on. Is there an efficient way to do this grouping?

Thank you very much.

----------------------------- More questions on akrun's answer: Thanks akrun for your answer. I have a follow-up question about the cut function. If my DT is as below:

DT <- data.table(A=1:10, B=c(0.01, 0.04, 0.06, 0.09, 0.1, 0.13, 0.14, 0.15, 0.17, 0.71)) 

by using the following code:

DT %>% 
  group_by(gr=cut(B, breaks= seq(0, 1, by = 0.05), right=F) ) %>% 
  summarise(n= n()) %>%
  arrange(as.numeric(gr))

I expect to see results like this:

          gr n
1   [0,0.05) 2
2 [0.05,0.1) 2
3 [0.1,0.15) 3
4 [0.15,0.2) 2
5 [0.7,0.75) 1

but the result I got is like this:

          gr n
1   [0,0.05) 2
2 [0.05,0.1) 2
3 [0.1,0.15) 4
4 [0.15,0.2) 1
5 [0.7,0.75) 1 

It looks like the value 0.15 is not allocated correctly. Any thoughts on this?

Illuminator answered 4/9, 2015 at 3:8 Comment(3)
If the initial object is a data.table, we can use data.table methods: DT[, .N, .(gr = cut(B, breaks = seq(0, max(B), by = 0.05)))] (Katha)
FYI, it is nice to use set.seed() when producing random example data, so that we are all looking at the same data. (Elan)
Updated the post, check if that works for you. (Katha)

We can use cut to do the grouping. We create the 'gr' column within group_by, use summarise to count the number of rows in each group with n(), and order the output (arrange) based on 'gr'.

library(dplyr)
DT %>% 
  group_by(gr = cut(B, breaks = seq(0, 1, by = 0.05))) %>% 
  summarise(n = n()) %>%
  arrange(as.numeric(gr))

As the initial object is a data.table, this can also be done using data.table methods (incorporating @Frank's suggestion to use keyby):

library(data.table)
DT[, .N, keyby = .(gr = cut(B, breaks = seq(0, 1, by = 0.05)))]

EDIT:

Based on the update in the OP's post, we can subtract a small number from the seq breaks. The issue is floating-point representation: seq(0, 1, by = 0.05) computes its breaks as multiples of 0.05, so the break at 0.15 actually lands slightly above the literal 0.15. With right=FALSE, the data value 0.15 therefore falls just inside [0.1, 0.15) instead of [0.15, 0.2). Nudging every break down by .Machine$double.eps fixes the allocation:

lvls <- levels(cut(DT$B, seq(0, 1, by = 0.05)))
DT %>%
  group_by(gr = cut(B, breaks = seq(0, 1, by = 0.05) - .Machine$double.eps,
                    right = FALSE, labels = lvls)) %>% 
  summarise(n = n()) %>% 
  arrange(as.numeric(gr))
#          gr n
#1   (0,0.05] 2
#2 (0.05,0.1] 2
#3 (0.1,0.15] 3
#4 (0.15,0.2] 2
#5 (0.7,0.75] 1
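
You can verify the root cause directly. This is just a diagnostic sketch: it prints the fourth break from seq at full precision, showing that it sits slightly above the literal 0.15:

# The break that seq() generates for 0.15 is not exactly 0.15
print(seq(0, 1, by = 0.05)[4], digits = 17)
# [1] 0.15000000000000002
seq(0, 1, by = 0.05)[4] == 0.15
# [1] FALSE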
Katha answered 4/9, 2015 at 3:18 Comment(4)
@Elan Thanks, I modified it. (Katha)
Hi akrun, I added the new question to this post as it is part of my problem. Once it is addressed I will accept the solution. Thanks. (Illuminator)
@Illuminator Thanks, I thought you were asking a completely new question. (Katha)
This is great. One useful value for my particular data set is Inf, which I just want to share with everyone: with breaks = c(1, 5, 10, Inf), the final Inf break captures everything beyond the last finite break. (Piave)
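
To illustrate that comment, a small hedged sketch (the data values are arbitrary):

cut(c(2, 7, 50), breaks = c(1, 5, 10, Inf))
# [1] (1,5]    (5,10]   (10,Inf]
# Levels: (1,5] (5,10] (10,Inf]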

Adding an alternative data.table solution:

I normally prefer to use round_any (from plyr) over cut:

e.g.

library(plyr)  # provides round_any()

DT[, .N, keyby = round_any(B, 0.05, floor)]

This essentially rounds the data down to the nearest multiple of a number (here 0.05). The third argument says to use floor when rounding, so 0.04 is grouped down into the 0 bin (covering [0, 0.05)) rather than up into the 0.05 bin (covering [0.05, 0.1)). You can also set this third argument to ceiling or to round (the default); a small demonstration follows below.
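
A quick illustration of the floor behaviour; the input values are arbitrary examples, not part of the original answer:

library(plyr)

# Each value is rounded down to the nearest multiple of 0.05
round_any(c(0.04, 0.06, 0.71), 0.05, floor)
# [1] 0.00 0.05 0.70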

For large tables, this solution is faster than akrun's data.table solution (with small tables they are roughly the same speed).

One thing to note is that the output of the two commands differs: the group column from cut is a factor of ranges, while with round_any the group column is a single number (the lower bound of each bin). If you want range-style labels with the round_any approach, see the sketch below.
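
A minimal sketch of rebuilding interval-style labels from the round_any output; the 0.05 bin width and the sprintf label format are assumptions for illustration, not part of the original answer:

library(data.table)
library(plyr)

# Compute each row's bin floor, then format it as a "[lo,hi)" label
lo <- round_any(DT$B, 0.05, floor)
DT[, .N, keyby = .(gr = sprintf("[%.2f,%.2f)", lo, lo + 0.05))]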

Benchmark on a 10M row dataset:

DT <- data.table(A = 1:10000000, B = runif(10000000),
                 Amount = runif(100, 0, 10000000))  # Amount is recycled and unused in the benchmark

library(bench)
library(dplyr)
library(data.table)
library(plyr)

bench::mark(
  dplyr = DT %>%
    group_by(gr = cut(B, breaks = seq(0, 1, by = 0.05))) %>%
    summarise(n = n()) %>%
    arrange(as.numeric(gr)),
  data_table_cut = DT[, .N, keyby = .(gr = cut(B, breaks = seq(0, 1, by = 0.05)))],
  data_table_round_any = DT[, .N, keyby = round_any(B, 0.05, floor)],
  check = FALSE  # the three results have different shapes, so skip equality checks
)

The output:

# A tibble: 3 × 13
  expression               min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 dplyr                  654ms    654ms      1.53     445MB        0     1     0      654ms
2 data_table_cut         573ms    573ms      1.75     534MB        0     1     0      573ms
3 data_table_round_any   234ms    236ms      4.21     343MB        0     3     0      712ms

So round_any is roughly 2.5 times faster than the data.table cut solution (and 2.7 times faster than the dplyr solution)...

Lefebvre answered 12/12, 2021 at 21:9 Comment(0)
