Split a character vector so that each distinct element is divided into an equal number of bins
E

7

5
x <- rep(c("A","B","C"),times=c(6,8,3))
 "A" "A" "A" "A" "A" "A" "B" "B" "B" "B" "B" "B" "B" "B" "C" "C" "C"

I'm struggling to create a vector that assigns the values of each letter to exactly 3 bins:

       (A A A A A A  B B B B B B B B  C C C)
x_bin = 1 1 2 2 3 3  1 1 1 2 2 2 3 3  1 2 3

In this example, A splits into 3 bins of 2 values each, B into bins of 3, 3 and 2 values, and C into 3 bins of 1 value each.

Is there a function that allows me to do this? I tried with cut and dplyr but cut only works with numeric data and it doesn't cut the way I want.

Elizabetelizabeth answered 10/11, 2023 at 12:49 Comment(0)
D
4

1) We can use ave/cut like this:

ave(x == x, x, FUN = \(x) cut(seq_along(x), 3))
## [1] 1 1 2 2 3 3 1 1 1 2 2 3 3 3 1 2 3

2) Another possibility is unlist/tapply/cut:

unlist(tapply(x, x, \(x) cut(seq_along(x), 3, FALSE)))
## A1 A2 A3 A4 A5 A6 B1 B2 B3 B4 B5 B6 B7 B8 C1 C2 C3 
##  1  1  2  2  3  3  1  1  1  2  2  3  3  3  1  2  3 
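
Note that for B this split is 3, 2, 3 rather than the 3, 3, 2 shown in the question, because cut divides the index range 1..8 into three equal-width intervals. A small check, as a sketch (cut_bin is a name used only for this illustration; x_bin is the target vector from the question):

```r
x <- rep(c("A", "B", "C"), times = c(6, 8, 3))
x_bin <- c(1, 1, 2, 2, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 1, 2, 3)

# Same approach as (2), with the names dropped for easier comparison
cut_bin <- unlist(tapply(x, x, \(v) cut(seq_along(v), 3, labels = FALSE)),
                  use.names = FALSE)

# Bin sizes per letter
table(x, cut_bin)

# Positions where the result differs from the question's x_bin
which(cut_bin != x_bin)
# [1] 12
```

Both results keep the bin sizes within 1 of each other; they only disagree on where the smaller bin goes.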

Update

Minor improvement to (1) and add (2).

Dabchick answered 10/11, 2023 at 14:28 Comment(0)
R
6

We can use ave to group by letter, then rep(1:3, length.out=) to get it to the right length. This guarantees that the numbered groups (per letter) will be either equally balanced or off by no more than 1.

ave(rep(1L, length(x)), x, FUN = function(z) rep(1:3, length.out = length(z)))
#  [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 1 2 3

If you want all 1s first, 2s second, etc, then we can sort them:

ave(rep(1L, length(x)), x, FUN = function(z) sort(rep(1:3, length.out = length(z))))
#  [1] 1 1 2 2 3 3 1 1 1 2 2 2 3 3 1 2 3

Verification:

ave(rep(1L, length(x)), x, FUN = function(z) sort(rep(1:3, length.out = length(z)))) |>
  all.equal(x_bin)
# [1] TRUE
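
The balance claim can also be checked directly by counting how many elements land in each bin per letter. A minimal sketch (bins, sizes and spread are illustrative names, not part of the answer above):

```r
x <- rep(c("A", "B", "C"), times = c(6, 8, 3))

bins <- ave(rep(1L, length(x)), x,
            FUN = function(z) sort(rep(1:3, length.out = length(z))))

# Rows are letters, columns are bins, cells are counts
sizes <- table(x, bins)

# Largest minus smallest bin size within each letter
spread <- apply(sizes, 1, function(r) max(r) - min(r))
spread
# A B C
# 0 1 0

all(spread <= 1)
# [1] TRUE
```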

Data

x <- rep(c("A","B","C"),times=c(6,8,3))
x_bin <- c(1, 1, 2, 2, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 1, 2, 3)
Rhetic answered 10/11, 2023 at 12:54 Comment(1)
sort is a good idea, cheers!Fastback
F
5
  • Try rep within ave
ave(
    seq_along(x),
    x,
    FUN = \(v) {
        rep(1:3,
            each = ceiling(length(v) / 3),
            length.out = length(v)
        )
    }
)
  • Or, another trick with matrix within ave
ave(
    seq_along(x),
    x,
    FUN = \(v)
    col(matrix(nrow = ceiling(length(v) / 3), ncol = 3))[seq_along(v)]
)

which should give

1 1 2 2 3 3 1 1 1 2 2 2 3 3 1 2 3
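
One caveat, as far as I can tell: with each = ceiling(n / 3), the last bin can come up short or be missing altogether when the group size n is not a multiple of 3. A minimal sketch comparing it with the sort-based variant from the other answer (by_each and by_sort are hypothetical helper names used only here):

```r
# Bin assignment for a single group of size n, two ways
by_each <- function(n) rep(1:3, each = ceiling(n / 3), length.out = n)
by_sort <- function(n) sort(rep(1:3, length.out = n))

by_each(4)
# [1] 1 1 2 2      (bin 3 is never used)
by_sort(4)
# [1] 1 1 2 3

by_each(7)
# [1] 1 1 1 2 2 2 3   (sizes 3, 3, 1)
by_sort(7)
# [1] 1 1 1 2 2 3 3   (sizes 3, 2, 2)
```

For the group sizes in the question (6, 8 and 3) both variants give the same result.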
Fastback answered 10/11, 2023 at 12:56 Comment(1)
each= is another good idea :-)Rhetic
T
3
x <- rep(c("A","B","C"),times=c(6,8,3))
xdf <- data.frame(x = x)

library(tidyverse)
xdf |> group_by(x) |> mutate(bin = rep(1:3, length.out = n())) |> arrange(x, bin)

gives

   x       bin
   <chr> <int>
 1 A         1
 2 A         1
 3 A         2
 4 A         2
 5 A         3
 6 A         3
 7 B         1
 8 B         1
 9 B         1
10 B         2
11 B         2
12 B         2
13 B         3
14 B         3
15 C         1
16 C         2
17 C         3
Trimeter answered 10/11, 2023 at 12:58 Comment(0)
K
3

Another way with rle + rep:

with(rle(x),
     sapply(seq(length(values)), 
            \(z) rep(1:3, 
                     each = ceiling(lengths[z] / 3), 
                     length.out = lengths[z]))
     ) |> 
  unlist()

#[1] 1 1 2 2 3 3 1 1 1 2 2 2 3 3 1 2 3
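
Keep in mind that rle works on consecutive runs, so this assumes x is sorted (or at least that each letter forms a single run). A small sketch of what happens otherwise; x2 is made-up data for illustration, and the ave call is the grouping approach from the other answers:

```r
x2 <- c("A", "A", "B", "A", "A")

# rle sees two separate runs of "A", so each run would be binned on its own
r <- rle(x2)
r$values
# [1] "A" "B" "A"
r$lengths
# [1] 2 1 2

# Grouping by value instead treats all four "A"s as one group
by_value <- ave(seq_along(x2), x2,
                FUN = function(v) sort(rep(1:3, length.out = length(v))))
by_value
# [1] 1 1 1 2 3
```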
Kharkov answered 10/11, 2023 at 14:0 Comment(0)
T
3
times <- c(6,8,3)
x <- rep(c("A","B","C"),times=times)
CUT <- ceiling(times / 3)
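# length.out trims each group to its actual size; without it B would get 9 values (18 in total)
x_bin <- unlist(Map(function(k, n) rep(seq(3), each = k, length.out = n), CUT, times))

x_bin
#>  [1] 1 1 2 2 3 3 1 1 1 2 2 2 3 3 1 2 3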

Created on 2023-11-10 with reprex v2.0.2

Training answered 10/11, 2023 at 14:2 Comment(0)
F
3

Try this

> table(x) |> Map(\(...) sort(rep_len(...)), list(1:3), length.out=_) |> unlist()
 [1] 1 1 2 2 3 3 1 1 1 2 2 2 3 3 1 2 3

The number of bins, n = 3, is set by list(1:3). The _ placeholder in length.out=_ requires R 4.2.0 or later.

Farfamed answered 10/11, 2023 at 15:33 Comment(0)
