Split data into groups with min and max size in R
Asked Answered
A

4

6

I want to randomly assign a vector of values, some of which are duplicates, to a smaller number of groups such that at least two and no more than four values are assigned to each group, and the same value is not assigned to the same group.

Example data:

values <- c(2499,2499,2522,2522,2522,2522,2648,2648,2652,2652,2670,2670,2689,2689,2690,2690,2693,2693,2700,2700,2706,2706,2714,2714,2730,2730,2738,2738,2740,2740,2765,2765,2768,2768,2773,2773,2783,2783,2794,2794,2798,2798,2807,2807,2812,2812,2831,2831,2831,2835,2835,2836,2836,2836,2844,2844,2844,2846,2846,2846,2883,2883,2964,2964)

groups <- 1:26

I tried:

split(values, sample(groups, length(values), repl = TRUE))

And this is close to what I want. But sometimes this results in just one value being assigned to a group, or more than four. And sometimes the same value (one of the duplicates) is assigned to the same group.

The desired output would have all the values randomly distributed across groups such that all values in each group are unique (no duplicates), and there is a minimum of two and a maximum of four values in each group.

Amory answered 23/8, 2023 at 1:37 Comment(0)
M
3

To randomly assign group sizes, the idea is to start with a "bank" of 4 (size.max in the function below) of each group ID. Since each group is to have a minimum of 2 (size.min), we take 2 of each group ID from the bank and put them into buckets corresponding to their ID. We then randomly draw from the bank until the sum of the items in the buckets equals length(values). This can be done very quickly using sample and tabulate.

If R is the vector of counts of each unique value from values, and S is the random vector of group sizes, a random assignment of values to groups without any repeated values per group corresponds to a random binary matrix with fixed row and column sums (R and S), where a 1 in row i and column j corresponds to the ith unique value belonging to the jth group. It also corresponds to a random bipartite graph with no multiple edges or loop edges. The sample_degseq function from the igraph package generates these random bipartite graphs.

As a function:

library(igraph) # for sample_degseq

rsplit <- function(values, groups, size.min, size.max) {
  R <- table(values)
  S <- tabulate(
    sample(
      rep(groups, size.max - size.min),
      length(values) - size.min*length(groups)
    ),
    length(groups)
  ) + size.min

  d <- length(R) - length(S)
  
  with(
    as_data_frame(
      sample_degseq( # randomly assign values to groups
        c(R, integer(max(0, -d))),
        c(S, integer(max(0, d))),
        "simple.no.multiple.uniform"
      )
    ),
    split(as(names(R), class(values))[from], groups[to])
  )
}

Run it on the example data:

rsplit(values, groups, 2, 4)
#> $`1`
#> [1] 2846 2522
#> 
#> $`2`
#> [1] 2794 2964
#> 
#> $`3`
#> [1] 2773 2714
#> 
#> $`4`
#> [1] 2783 2740
#> 
#> $`5`
#> [1] 2693 2499 2844
#> 
#> $`6`
#> [1] 2846 2648
#> 
#> $`7`
#> [1] 2522 2812
#> 
#> $`8`
#> [1] 2798 2831
#> 
#> $`9`
#> [1] 2652 2690 2964 2670
#> 
#> $`10`
#> [1] 2738 2835 2844 2652
#> 
#> $`11`
#> [1] 2700 2670
#> 
#> $`12`
#> [1] 2499 2765
#> 
#> $`13`
#> [1] 2648 2846
#> 
#> $`14`
#> [1] 2807 2773 2690 2689
#> 
#> $`15`
#> [1] 2700 2883
#> 
#> $`16`
#> [1] 2883 2835 2812
#> 
#> $`17`
#> [1] 2807 2765
#> 
#> $`18`
#> [1] 2768 2844
#> 
#> $`19`
#> [1] 2706 2836
#> 
#> $`20`
#> [1] 2714 2522
#> 
#> $`21`
#> [1] 2706 2836
#> 
#> $`22`
#> [1] 2794 2836
#> 
#> $`23`
#> [1] 2738 2783
#> 
#> $`24`
#> [1] 2740 2693 2522 2730
#> 
#> $`25`
#> [1] 2689 2831
#> 
#> $`26`
#> [1] 2768 2831 2798 2730
Mcdougald answered 23/8, 2023 at 16:44 Comment(4)
Thanks very much for your help. Both answers worked equally well. The partitionsSample and sample_degseq functions are pretty obscure but also extremely helpful. Thanks again.Amory
interesting implementation with igraph, +1!Freestanding
@JamesDanielJohnston, I updated the sampling algorithm to what I think is a better one and that doesn't use RcppAlgos.Mcdougald
@ThomasIsCoding, I actually got the idea from you!Mcdougald
D
1

With your values and groups defined, I sample between 2 and 4 values, remove duplicates, check that there is at least 2 values after that, resample if not until TRUE and return a list with groups as names

library(tidyverse)
groups <- 1:26
map(
  .x = groups,
  .f = ~{
    # Sample variable length group size 2:4
    res <- sample(values, sample(2:4, 1), replace = TRUE)
    # remove duplicates
    res <- res[!duplicated(res)]
    # check for if length of no dups res is less than 2, if so, resample, repeat above
    while(length(res) < 2){
      res <- sample(values, sample(2:4, 1), replace = TRUE)
      res <- res[!duplicated(res)]
    }
    res 
  } 
) %>% # List with groups as names
  set_names(., groups)
Dariadarian answered 23/8, 2023 at 6:31 Comment(0)
F
1

Idea

I think your questions covers two tasks: random group size and value uniqueness per group. Let's dive into them.

1) Random group size

First of all, from the mathematical point of view, since you have the identical size requirement for each group, i.e., min=2 and max=4, you can analyze that the average groups size is definitely within that interval.

In this case, one greedy and simple thinking is to make the sizes of all groups are as close to each other as possible. In other words:

  1. We first allocate a size to each group, such that all group can evenly have the same max size.
  2. Regarding the remaining size budget, we randomly assign then through all groups.

2) Value uniqueness per group

Given the random group sizes, the next step is to split the vector values in terms of the size distribution and keep the uniqueness of values within each group. The trick thing happens at the duplicates. Actually, we can hash the duplicates and make all the duplicates are moved as far away from each other as possible. What we can do might be something like:

  1. Hash the values (including their duplicates) and reorder the values.
  2. Assign the reordered values to groups by group size, iteratively.

One Base R Implementation

f <- function(val, grp) {
    val <- sort(val)
    # min average size of each group
    p <- length(val) %/% length(grp)
    # number of remaining items to be assiggned
    q <- length(val) %% length(grp)
    # distribution of group sizes, by randomly assigning the remaining items
    g <- p + replace(rep(0, length(grp)), sample(seq_along(grp), q), 1)
    # split by given group sizes
    split(
        # reorder val
        unname(unlist(split(val, ave(val, val, FUN = seq_along)))),
        rep(grp, g)
    )
}

and we will see

> f(values, groups)
$`1`
[1] 2499 2522 2648

$`2`
[1] 2652 2670

$`3`
[1] 2689 2690

$`4`
[1] 2693 2700

$`5`
[1] 2706 2714

$`6`
[1] 2730 2738

$`7`
[1] 2740 2765 2768

$`8`
[1] 2773 2783

$`9`
[1] 2794 2798 2807

$`10`
[1] 2812 2831 2835

$`11`
[1] 2836 2844

$`12`
[1] 2846 2883 2964

$`13`
[1] 2499 2522

$`14`
[1] 2648 2652

$`15`
[1] 2670 2689

$`16`
[1] 2690 2693

$`17`
[1] 2700 2706 2714

$`18`
[1] 2730 2738 2740

$`19`
[1] 2765 2768

$`20`
[1] 2773 2783

$`21`
[1] 2794 2798 2807

$`22`
[1] 2812 2831 2835

$`23`
[1] 2836 2844 2846

$`24`
[1] 2883 2964

$`25`
[1] 2522 2831 2836

$`26`
[1] 2844 2846 2522

In addition, if you want to check the uniqueness of each group, you can try

> any(sapply(f(values, groups), anyDuplicated))
[1] FALSE

which indicates no duplicates in each group.

Freestanding answered 24/8, 2023 at 8:59 Comment(2)
It looks like the group size is somewhat random (exactly 14 2s and 12 3s randomly distributed), but there is no randomness to the assignment. The first group is always either c(2499, 2522) or c(2499, 2522, 2648). It also doesn't seem to guarantee uniqueness within groups. Try it on values <- c(values, rep(3e3, 5)).Mcdougald
@Mcdougald You are right that it is not a "real" random one. As I mentioned in my answer, this is a greedy approach to make all group have almost the same size, and randomness comes from the random assignment of leftovers. If chaos is more preferred, we can always shuffle the resulting groups (both the values per group and the group orders as well)Freestanding
F
0

Here is a more "old-school" fashion (use base R only and loops) to solve the problem:

  1. Generate random group (or partitions) size, constrained by lower and upper bounds of size in the meanwhile.
  2. Keep assigning the duplicates of values to different random partitions (to avoid duplicates within each partition), sequentially, until the assignment of all entries finishes.

Note that, the implementation is just an example to show how it works, but not optimized, thus being probably inefficient when scaling up.

Code Example

f <- function(values, groups, szmin = 2, szmax = 4) {
    # a helper function to create random partitions
    randgrp <- function(n, k, lb = szmin, ub = szmax) {
        res <- c()
        repeat {
            if (k == 0) {
                return(res)
            }
            v <- max(lb, n - (k - 1) * ub):min(ub, n - (k - 1) * lb)
            m <- v[sample(length(v), 1)]
            res <- append(res, m)
            n <- n - m
            k <- k - 1
        }
    }
    # group values
    vlst <- split(values, values)
    repeat {
        grp <- randgrp(length(values), length(groups))
        ids <- c()
        for (i in seq_along(vlst)) {
            p <- c()
            v <- vlst[[i]]
            for (j in seq_along(grp)) {
                if (grp[j] > 0) {
                    p <- append(p, j)
                    grp[j] <- grp[j] - 1
                }
                if (length(p) == length(v)) {
                    break
                }
            }
            ids <- append(ids, p)
        }
        if (length(ids) == length(values)) {
            break
        }
    }
    split(unname(unlist(vlst)), groups[ids])
}

and we could obtain something like

> res 
$`1`
[1] 2499 2522 2648 2652

$`2`
[1] 2499 2522

$`3`
[1] 2522 2648

$`4`
[1] 2522 2652 2670

$`5`
[1] 2670 2689

$`6`
[1] 2689 2690

$`7`
[1] 2690 2693 2700

$`8`
[1] 2693 2700

$`9`
[1] 2706 2714

$`10`
[1] 2706 2714 2730 2738

$`11`
[1] 2730 2738 2740

$`12`
[1] 2740 2765

$`13`
[1] 2765 2768

$`14`
[1] 2768 2773

$`15`
[1] 2773 2783 2794

$`16`
[1] 2783 2794

$`17`
[1] 2798 2807

$`18`
[1] 2798 2807 2812

$`19`
[1] 2812 2831 2835 2836

$`20`
[1] 2831 2835

$`21`
[1] 2831 2836

$`22`
[1] 2836 2844 2846

$`23`
[1] 2844 2846

$`24`
[1] 2844 2846

$`25`
[1] 2883 2964

$`26`
[1] 2883 2964
Freestanding answered 2/9, 2023 at 22:9 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.