Creating new groups when the original groups do not have sufficient observations
I have example data as follows:

library(data.table)
sample <- fread("
1,0,2,NA,cat X, type 1
3,4,3,1,cat X, type 2
1,0,2,2,cat X, type 3
3,4,3,0,cat X, type 4
1,0,2,NA,cat Y, type 1
3,4,3,NA,cat Y, type 2
1,0,2,2,cat Y, type 3
3,4,3,35,cat Y, type 4
1,0,2,NA,cat X, type 1
3,4,3,1,cat X, type 2
1,0,2,2,cat X, type 3
3,4,3,NA,cat X, type 4
1,0,2,NA,cat Y, type 1
3,4,3,NA,cat Y, type 2
1,0,2,2,cat Y, type 3
3,4,3,1,cat Y, type 4
1,0,2,4,cat X, type 1
3,4,3,1,cat X, type 2
1,0,2,2,cat X, type 3
3,4,3,2,cat X, type 4
1,0,2,NA,cat Y, type 1
3,4,3,NA,cat Y, type 2
1,0,2,2,cat Y, type 3
3,4,3,2,cat Y, type 4
")

names(sample) <- c("A","B","C", "D", "cat", "type")

sample[, observations := sum(!is.na(D)), by = c("cat", "type")]

    A B C  D   cat   type observations
 1: 1 0 2 NA cat X type 1            1
 2: 3 4 3  1 cat X type 2            3
 3: 1 0 2  2 cat X type 3            3
 4: 3 4 3  0 cat X type 4            2
 5: 1 0 2 NA cat Y type 1            0
 6: 3 4 3 NA cat Y type 2            0
 7: 1 0 2  2 cat Y type 3            3
 8: 3 4 3 35 cat Y type 4            3
 9: 1 0 2 NA cat X type 1            1
10: 3 4 3  1 cat X type 2            3
...
24: 3 4 3  2 cat Y type 4            3

I would like to add the neighbouring group types together if they have fewer than two observations.

For example: add the group of type 1 with only 1 observation to the observations in group 2 (see the first line of the desired output).

Types need to be pooled together until all remaining categories have at least 2 observations. So types 1 and 2 of category Y need to be pooled with type 3.

I am having trouble coming up with a way to write code for this.

Can anyone suggest a good way to automatically create the new types?

I realise that there might be situations in which there are two possible solutions for pooling the groups. However, as long as the groups which are added together are neighbouring groups (so type 1 is not added to type 4), it does not matter which groups are added together.

Desired output:

    A B C  D   cat   type  new_type observations
 1: 1 0 2 NA cat X type 1  type 2          4
 2: 3 4 3  1 cat X type 2  type 2          4
 3: 1 0 2  2 cat X type 3  type 3          3
 4: 3 4 3  0 cat X type 4  type 4          2
 5: 1 0 2 NA cat Y type 1  type 3          3
 6: 3 4 3 NA cat Y type 2  type 3          3
 7: 1 0 2  2 cat Y type 3  type 3          3
 8: 3 4 3 35 cat Y type 4  type 4          3
 9: 1 0 2 NA cat X type 1  type 2          4
10: 3 4 3  1 cat X type 2  type 2          4
...
24: 3 4 3  2 cat Y type 4  type 4          3

Solution does NOT have to use data.table

Russellrusset answered 29/9, 2022 at 10:59 Comment(3)
Why are rows 3-4 singletons in your desired output? You said that if they have fewer than two, they should be combined with a neighboring group.Counts
@Counts Sorry, mistake on my side, they are combined now.Russellrusset
@Waldi Fixed. My apologies, I keep overlooking the mistakes. I think I copied in a wrong table at some point, because of which I confused numbers. I went through the whole table again and I think it is correct now.Russellrusset

With Reduce and the accumulate = TRUE option:

sample[, `:=`(type = last(type), observations = sum(observations)),
       by = .(cat,
              grp = sapply(
                Reduce(
                  f = function(x, y) {
                    grp <- x$grp
                    if (x$nxtgrp) { grp <- grp + 1; x$cumsum <- 0 }
                    list(grp    = grp,
                         cumsum = x$cumsum + y,
                         nxtgrp = (x$cumsum + y) >= 2)
                  },
                  x = observations,
                  init = list(grp = 0, cumsum = 0, nxtgrp = FALSE),
                  accumulate = TRUE
                ),
                function(x) x$grp
              )[-1])
][]

        A     B     C     D    cat   type observations
    <int> <int> <int> <int> <char> <char>        <int>
 1:     1     0     2    NA  cat X type 2            4
 2:     3     4     3     1  cat X type 2            4
 3:     1     0     2     2  cat X type 3            3
 4:     3     4     3     0  cat X type 4            2
 5:     1     0     2    NA  cat Y type 3            3
 6:     3     4     3    NA  cat Y type 3            3
 7:     1     0     2     2  cat Y type 3            3
 8:     3     4     3    35  cat Y type 4            3
 9:     1     0     2    NA  cat X type 2            4
10:     3     4     3     1  cat X type 2            4
11:     1     0     2     2  cat X type 3            3
12:     3     4     3    NA  cat X type 4            2
13:     1     0     2    NA  cat Y type 3            3
14:     3     4     3    NA  cat Y type 3            3
15:     1     0     2     2  cat Y type 3            3
16:     3     4     3     1  cat Y type 4            3
17:     1     0     2     4  cat X type 2            4
18:     3     4     3     1  cat X type 2            4
19:     1     0     2     2  cat X type 3            3
20:     3     4     3     2  cat X type 4            2
21:     1     0     2    NA  cat Y type 3            3
22:     3     4     3    NA  cat Y type 3            3
23:     1     0     2     2  cat Y type 3            3
24:     3     4     3     2  cat Y type 4            3
        A     B     C     D    cat   type observations

The idea is to generate an accumulated list with:

  • the current group: grp
  • the current cumulative sum: cumsum
  • a flag to increment the group on the next row: nxtgrp

As soon as the cumulative number of observations reaches 2, the flag is set. On the next row, cumsum is reset to zero and grp is incremented.

The grp list element can then be used as the by parameter in data.table.
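
The mechanics can be seen on a toy vector of per-row observation counts (the values and the cutoff of 2 are illustrative):

```r
obs <- c(1, 3, 3, 2)  # per-row observation counts within one cat

states <- Reduce(
  f = function(x, y) {
    grp <- x$grp
    # the previous row set the flag: start a new group, reset the running sum
    if (x$nxtgrp) { grp <- grp + 1; x$cumsum <- 0 }
    list(grp    = grp,
         cumsum = x$cumsum + y,
         nxtgrp = (x$cumsum + y) >= 2)  # open a new group on the next row
  },
  x = obs,
  init = list(grp = 0, cumsum = 0, nxtgrp = FALSE),
  accumulate = TRUE
)

# drop the init state, keep one group id per row
grps <- sapply(states, function(s) s$grp)[-1]
grps  # 0 0 1 2 : the first two rows are pooled, the rest stand alone
```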

Another possibility is to achieve the same grouping with a for-loop, either in plain R or in Rcpp:

observations_grp <- function(x) {
  cumsum_i <- 0
  nxtgrp <- FALSE
  n <- length(x)
  grp <- rep(0, n)
  grp_i <- 0
  for (i in 1:n) {
    if (nxtgrp) { grp_i <- grp_i + 1; cumsum_i <- 0 }
    nxtgrp <- (cumsum_i + x[i]) >= 2
    cumsum_i <- cumsum_i + x[i]
    grp[i] <- grp_i
  }
  grp
}
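
A quick standalone check (the function is repeated here so the snippet runs on its own; the input vectors are illustrative):

```r
# same grouping logic as above, repeated for a self-contained example
observations_grp <- function(x) {
  cumsum_i <- 0
  nxtgrp <- FALSE
  n <- length(x)
  grp <- rep(0, n)
  grp_i <- 0
  for (i in 1:n) {
    if (nxtgrp) { grp_i <- grp_i + 1; cumsum_i <- 0 }
    nxtgrp <- (cumsum_i + x[i]) >= 2
    cumsum_i <- cumsum_i + x[i]
    grp[i] <- grp_i
  }
  grp
}

observations_grp(c(1, 3, 3, 2))  # 0 0 1 2 : the count of 1 is pooled forward
observations_grp(c(3, 1))        # 0 1    : a trailing count < 2 is NOT pooled back
```

Note the second call: a group that ends the sequence with fewer than 2 observations stays on its own, which may matter on real data.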

sample[, `:=`(type = last(type), observations = sum(observations)),
       by = .(cat, observations_grp(observations))
][]


A performance comparison shows that Reduce isn't faster than an R loop:

Unit: milliseconds
   expr    min      lq     mean  median      uq    max neval
 Reduce 1.3458 1.45025 1.732185 1.56405 1.73740 6.3339   100
   Loop 1.3374 1.44175 1.685722 1.53120 1.67665 3.7091   100

If you need speed, Rcpp will definitely help a lot.

Nordstrom answered 3/10, 2022 at 15:28 Comment(4)
Hi Waldi, this is great, and I can reproduce your example, but I am having some trouble seeing where cat X/Y comes in. When I run this on my actual data (it's more than fast enough, by the way, so Rcpp won't be necessary), I see that type 1 is sometimes changed to type 4, which should only be possible when there are fewer than two observations for a cat in total; according to my data, that should not be the case. Did I create an example which accidentally does not need cat?Russellrusset
*type 4 to type 1Russellrusset
I overlooked cat, and the example doesn't make a difference on cat: correctedNordstrom
I had some issues applying the solution to my actual data. I managed to recreate the issue and posted a new question: #74094179Russellrusset

Perhaps you can create a helper function like the one below

helper <- function(v) {
  s <- grp <- 0
  y <- vector("numeric", length(v))
  for (i in seq_along(v)) {
    y[i] <- grp
    s <- s + v[i]
    if (s >= 2) {
      s <- 0
      grp <- grp + 1
    }
  }
  y
}

then run

dt <- sample[
  ,
  c(.(grp = helper(observations)), .SD),
  .(id = rleid(cat))
][
  ,
  `:=`(type = last(type), observations = sum(observations)),
  .(id, grp)
][, -(1:2)]

and you will obtain

> dt
    A B C  D   type observations
 1: 1 0 2 NA type 2            4
 2: 3 4 3  1 type 2            4
 3: 1 0 2  2 type 3            3
 4: 3 4 3  0 type 4            2
 5: 1 0 2 NA type 3            3
 6: 3 4 3 NA type 3            3
 7: 1 0 2  2 type 3            3
 8: 3 4 3 35 type 4            3
 9: 1 0 2 NA type 2            4
10: 3 4 3  1 type 2            4
11: 1 0 2  2 type 3            3
12: 3 4 3 NA type 4            2
13: 1 0 2 NA type 3            3
14: 3 4 3 NA type 3            3
15: 1 0 2  2 type 3            3
16: 3 4 3  1 type 4            3
17: 1 0 2  4 type 2            4
18: 3 4 3  1 type 2            4
19: 1 0 2  2 type 3            3
20: 3 4 3  2 type 4            2
21: 1 0 2 NA type 3            3
22: 3 4 3 NA type 3            3
23: 1 0 2  2 type 3            3
24: 3 4 3  2 type 4            3
    A B C  D   type observations
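
One detail worth noting about the grouping above: rleid(cat) assigns a new id to each consecutive run of cat, so non-adjacent blocks of the same cat get separate ids (toy input for illustration):

```r
library(data.table)

# each consecutive run of equal values gets its own id
rleid(c("cat X", "cat X", "cat Y", "cat Y", "cat X"))
# 1 1 2 2 3 : the second run of "cat X" starts a new block
```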
Whitehouse answered 3/10, 2022 at 23:16 Comment(5)
Appending with idx <- c(idx, ....) is inefficient. Declare it as idx <- vector(, length(v)) and fill it in a loop.Badgett
@Badgett Thanks! Good point to speed up. Updated my solution already!Whitehouse
If it's integer, then you are losing some time by using double rather than integer in your helper function: +1 vs +1L.Badgett
@Badgett Sounds interesting! I didn't know integer vs numeric made a difference for speed before, but I've learned it now.Whitehouse
@Whitehouse I had some issues applying your solution to my actual data. I managed to recreate the issue and posted a new question: #74094179 I made the example with Waldi's answer, but I had the same results with your function. If you have any ideas I would be really happy to hear them.Russellrusset

Here is a tidyverse solution. When observations < 2, the highest type level (type 4) is shifted one level down, and every other type is shifted one level up.

library(dplyr)
sample %>% 
  mutate(
    new_type = as.numeric(factor(type)),
    new_type = paste0(
      "type ", 
      ifelse(observations<2,
             ifelse(new_type != max(new_type), new_type + 1, new_type - 1), 
             new_type)
      )
  ) %>% 
  group_by(cat, new_type) %>% 
  mutate(observations = sum(!is.na(D))) %>% 
  ungroup()
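
For reference, the as.numeric(factor(type)) step maps the type labels to their sorted level order (illustrative values; labels like "type 10" would sort lexically and break the ordering):

```r
type <- c("type 1", "type 2", "type 4")
# factor levels are sorted labels; the codes give a numeric rank per label
as.numeric(factor(type))  # 1 2 3
```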

Vienne answered 3/10, 2022 at 17:36 Comment(1)
That doesn't meet the desired output, as this won't move a type more than one level up/down; see row #5 for example.Thermoelectricity
