Create combinations of a binary vector
S

6

7

I would like to create all possible combinations of a binary vector with a fixed number of 0s and 1s. For example, with a vector of length 5 (dim(v) = 5x1) containing n1 = 3 ones and n0 = 2 zeros, I'd like to get something like:

  1,1,1,0,0
  1,1,0,1,0
  1,1,0,0,1
  1,0,1,1,0
  1,0,1,0,1
  1,0,0,1,1
  0,1,1,1,0
  0,1,1,0,1
  0,1,0,1,1
  0,0,1,1,1

I found some help in the post Create all possible combiations of 0,1, or 2 "1"s of a binary vector of length n, but I would like to generate only the combinations I need, without wasting space (I think the problem will grow exponentially with n).

Shadowgraph answered 6/2, 2015 at 14:33 Comment(2)
A not so efficient approach would be x <- expand.grid(rep(list(0L:1L), 5L)); x[rowSums(x) == 3L,] but I think you want something faster than that. - Slobber
The following may help: #17292591 - Chine
S
8

A slightly faster version of Marat's answer:

f.roland <- function(n, m) {
  ind <- combn(seq_len(n), m)
  ind <- t(ind) + (seq_len(ncol(ind)) - 1) * n
  res <- rep(0, nrow(ind) * n)
  res[ind] <- 1
  matrix(res, ncol = n, nrow = nrow(ind), byrow = TRUE)
}

all.equal(f.2(16, 8), f.roland(16, 8))
#[1] TRUE
library(rbenchmark)
benchmark(f(16,8),f.2(16,8),f.roland(16,8))

#             test replications elapsed relative user.self sys.self user.child sys.child
#2      f.2(16, 8)          100   5.693    1.931     5.670    0.020          0         0
#3 f.roland(16, 8)          100   2.948    1.000     2.929    0.017          0         0
#1        f(16, 8)          100   8.287    2.811     8.214    0.066          0         0
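
The offset line in f.roland is the easiest part to misread, so here is a small walk-through for n = 4, m = 2. It is only an illustrative sketch in base R; the names offs, lin and out are not from the answer.

```r
n <- 4; m <- 2
ind  <- combn(seq_len(n), m)            # m x choose(n, m): one combination per column
offs <- (seq_len(ncol(ind)) - 1) * n    # where each output row starts in a row-major vector
lin  <- t(ind) + offs                   # recycling adds offs[i] to every entry of row i
res  <- numeric(ncol(ind) * n)          # all zeros, one cell per matrix entry
res[lin] <- 1                           # switch on the selected positions in one step
out  <- matrix(res, ncol = n, byrow = TRUE)
out
```

Row i of lin holds the positions of the 1s for combination i, shifted by (i - 1) * n, which is why a single vectorised assignment can fill the whole result.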
Sully answered 6/2, 2015 at 15:17 Comment(7)
For some reason, I can't reproduce your benchmarking results: my benchmarking says that f.2 and f.roland have about the same (within ~1%) performance. Could you please repeat the benchmark a couple of times to make sure the results are consistent? - Salvatore
And, for completeness, could you include the other functions in the benchmark? - Salvatore
@MaratTalipov I've rerun the benchmarks and got the same result. Can't include akrun's function since I don't want to install Bioconductor. - Sully
That's interesting. Could it be an OS-related thing? I use Mac OS (x86_64-apple-darwin13.4.0 (64-bit)). What is your OS? - Salvatore
Under Linux, f.roland is ~50% faster than f.2 (17.6 vs 26.3 s). These benchmarks appear to be highly machine-specific! - Salvatore
@MaratTalipov I'm on a mac. - Sully
On my Linux with R 4.0 f.roland is over 3 times faster than f.2 (16.8 vs 53.1 ms). f.3 is basically identical to f.2 (52.9 ms) and f is 69.3 ms. - Treasonable
S
6

You can try this approach:

f <- function(n=5,m=3)
 t(apply(combn(1:n,m=m),2,function(cm) replace(rep(0,n),cm,1)))

f(5,3)
#       [,1] [,2] [,3] [,4] [,5]
#  [1,]    1    1    1    0    0
#  [2,]    1    1    0    1    0
#  [3,]    1    1    0    0    1
#  [4,]    1    0    1    1    0
#  [5,]    1    0    1    0    1
#  [6,]    1    0    0    1    1
#  [7,]    0    1    1    1    0
#  [8,]    0    1    1    0    1
#  [9,]    0    1    0    1    1
# [10,]    0    0    1    1    1

The idea is to generate all combinations of indices for 1, and then to use them to produce the final result.
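
As a minimal sketch of those two steps for n = 5, m = 3 (the names idx and v are only for illustration):

```r
n <- 5; m <- 3
idx <- combn(1:n, m)                    # m x choose(n, m); each column lists the positions of the 1s
idx[, 2]                                # positions for the second combination
v <- replace(rep(0, n), idx[, 2], 1)    # start from all zeros, set those positions to 1
v
```

f simply repeats the replace() step for every column of idx and transposes the result so that each combination becomes a row.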

Another flavor of the same approach:

f.2 <- function(n=5,m=3)
  t(combn(1:n,m,FUN=function(cm) replace(rep(0,n),cm,1)))

The second approach is about twice as fast:

library(rbenchmark)
benchmark(f(16,8),f.2(16,8))
#         test replications elapsed relative user.self sys.self user.child sys.child
# 2 f.2(16, 8)          100   5.706    1.000     5.688    0.017          0         0
# 1   f(16, 8)          100  10.802    1.893    10.715    0.082          0         0

Benchmark

f.akrun <- function(n=5,m=3) {

  indx <- combnPrim(1:n,m)

  DT <- setDT(as.data.frame(matrix(0, ncol(indx),n)))
  for(i in seq_len(nrow(DT))){
    set(DT, i=i, j=indx[,i],value=1) 
  }
  DT  
}

benchmark(f(16,8),f.2(16,8),f.akrun(16,8))
#            test replications elapsed relative user.self sys.self user.child sys.child
# 2     f.2(16, 8)          100   5.464    1.097     5.435    0.028          0         0
# 3 f.akrun(16, 8)          100   4.979    1.000     4.938    0.037          0         0
# 1       f(16, 8)          100  10.854    2.180    10.689    0.129          0         0

@akrun's solution (f.akrun) is ~10% faster than f.2.

[EDIT] Another approach, which is even faster and simpler:

f.3 <- function(n=5,m=3) t(combn(n,m,tabulate,nbins=n))
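
Why this works, as a short sketch: tabulate() counts how often each integer from 1 to nbins occurs, and the indices within one combination are all distinct, so the counts are exactly the 0/1 vector. The variable name res below is only illustrative.

```r
tabulate(c(1, 2, 4), nbins = 5)         # counts for bins 1..5 of the index set {1, 2, 4}

f.3 <- function(n = 5, m = 3) t(combn(n, m, tabulate, nbins = n))
res <- f.3(5, 3)
dim(res)                                # choose(5, 3) rows, 5 columns
```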
Salvatore answered 6/2, 2015 at 14:48 Comment(2)
I really appreciate your help! - Shadowgraph
f.3 is the best, not highlighted enough imo ;-) - Subtilize
H
2

Here is another approach:

func <- function(n, m) t(combn(n, m, function(a) {z=integer(n);z[a]=1;z}))

func(n = 5, m = 2)

#       [,1] [,2] [,3] [,4] [,5]
#  [1,]    1    1    0    0    0
#  [2,]    1    0    1    0    0
#  [3,]    1    0    0    1    0
#  [4,]    1    0    0    0    1
#  [5,]    0    1    1    0    0
#  [6,]    0    1    0    1    0
#  [7,]    0    1    0    0    1
#  [8,]    0    0    1    1    0
#  [9,]    0    0    1    0    1
# [10,]    0    0    0    1    1
Herald answered 12/4, 2017 at 14:31 Comment(0)
S
2

An approach using RcppAlgos::permuteGeneral().

RcppAlgos::permuteGeneral(1:0, freqs = 3:2)
#       [,1] [,2] [,3] [,4] [,5]
#  [1,]    1    1    1    0    0
#  [2,]    1    1    0    1    0
#  [3,]    1    1    0    0    1
#  [4,]    1    0    1    1    0
#  [5,]    1    0    1    0    1
#  [6,]    1    0    0    1    1
#  [7,]    0    1    1    1    0
#  [8,]    0    1    1    0    1
#  [9,]    0    1    0    1    1
# [10,]    0    0    1    1    1
Sojourn answered 23/9, 2021 at 13:18 Comment(2)
RcppAlgos author here. Since these are permutations of the multiset c(0, 0, 1, 1, 1), you can make use of the freqs parameter. So your call would look like: permuteGeneral(0:1, freqs = c(2, 3)). This avoids generating duplicates and is much more efficient. - Stour
@JosephWood Thanks a lot for pointing this out, indeed much more efficient, edited! - Sojourn
A
1

You could try combnPrim from gRbase along with set from data.table (which could be faster)

# biocLite() has since been retired; on current R, BiocManager replaces it:
# install.packages("BiocManager")
BiocManager::install("gRbase")
library(gRbase)
library(data.table)
n <-5
indx <- combnPrim(1:n,3)

DT <- setDT(as.data.frame(matrix(0, ncol(indx),n)))
 for(i in seq_len(nrow(DT))){
  set(DT, i=i, j=indx[,i],value=1) 
 }
DT
 #   V1 V2 V3 V4 V5
 #1:  1  1  1  0  0
 #2:  1  1  0  1  0
 #3:  1  0  1  1  0
 #4:  0  1  1  1  0
 #5:  1  1  0  0  1
 #6:  1  0  1  0  1
 #7:  0  1  1  0  1
 #8:  1  0  0  1  1
 #9:  0  1  0  1  1
#10:  0  0  1  1  1
Abshier answered 6/2, 2015 at 15:4 Comment(0)
T
0

A slight performance improvement over f.roland when n/m is approximately 2 (for m << n, f.roland wins), using binary tree expansion at the cost of higher memory usage:

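f.krassowski = function(n, m) {
    m_minus_n = m - n
    # each path is a partially filled vector; start with the two 1-element prefixes
    paths = list(
        c(0, rep(NA, n-1)),
        c(1, rep(NA, n-1))
    )
    # number of 1s placed so far in each path
    sums = c(0, 1)
    for (level in 2:n) {
        # despite the name, this is the minimum number of 1s a path must hold
        # after position `level` is filled to still be able to reach m
        upper_threshold = level + m_minus_n

        # extend a path only if it neither exceeds m nor falls too far behind
        is_worth_adding_0 = (sums <= m) & (upper_threshold <= sums)
        is_worth_adding_1 = (sums <= m - 1) & (upper_threshold - 1 <= sums)

        x = paths[is_worth_adding_0]
        y = paths[is_worth_adding_1]

        # seq_along() instead of 1:length(), so an empty selection
        # (e.g. when m == n or m == 0) is skipped rather than indexed as 1:0
        for (i in seq_along(x)) {
            x[[i]][[level]] = 0
        }
        for (i in seq_along(y)) {
            y[[i]][[level]] = 1
        }
        paths = c(x, y)
        sums = c(sums[is_worth_adding_0], sums[is_worth_adding_1] + 1)
    }
    matrix(unlist(paths), byrow=TRUE, nrow=length(paths))
}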

The rows come out in a different order than in the other answers.
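
The pruning idea (only extend a prefix while the remaining positions can still bring the total to exactly m ones) can also be written recursively. The sketch below is not the implementation above and has not been benchmarked; bin_comb is a hypothetical helper and it assumes 0 <= m <= n.

```r
# Hypothetical helper, not from the answers above.
# Rows of the result are all length-n 0/1 vectors with exactly m ones.
bin_comb <- function(n, m) {
  if (m == 0) return(matrix(0L, nrow = 1, ncol = n))   # only zeros left to place
  if (m == n) return(matrix(1L, nrow = 1, ncol = n))   # only ones left to place
  rbind(
    cbind(1L, bin_comb(n - 1, m - 1)),   # first position is 1, m - 1 ones remain
    cbind(0L, bin_comb(n - 1, m))        # first position is 0, m ones remain
  )
}

res <- bin_comb(5, 3)
res
```

Because the two base cases stop a branch as soon as its outcome is forced, no row is ever generated and then thrown away.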

Benchmarking for n/m = 2:

               expr       min        lq     mean    median        uq      max
           f(16, 8) 47.488731 48.182502 52.04539 48.689082 57.558552 65.26211
         f.2(16, 8) 38.291302 39.533287 43.61786 40.513500 48.673713 54.21076
         f.3(16, 8) 38.289619 39.007766 40.21002 39.273940 39.970907 49.02320
       f.989(16, 8) 35.000941 35.199950 38.09043 35.607685 40.725833 49.61785
    f.roland(16, 8) 14.295560 14.399079 15.02285 14.559891 14.625825 23.54574
f.krassowski(16, 8)  9.343784  9.552871 10.20118  9.614251  9.863443 19.70659

Of note, f.3 has the smallest memory footprint:

         expression mem_alloc
           f(16, 8)     5.7MB
         f.2(16, 8)    3.14MB
         f.3(16, 8)    1.57MB
       f.989(16, 8)    3.14MB
    f.roland(16, 8)    5.25MB
f.krassowski(16, 8)    6.37MB
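
For scale, the result itself has choose(n, m) rows of n cells, so its size can be estimated before running anything. This is a rough sketch: result_bytes is a hypothetical helper, and the 8 and 4 bytes per cell are the usual sizes of R doubles and integers, with object headers ignored.

```r
# Hypothetical helper: bytes of data in a choose(n, m) x n matrix
result_bytes <- function(n, m, bytes_per_cell = 8) {
  choose(n, m) * n * bytes_per_cell
}

choose(16, 8)                     # number of rows: 12870
result_bytes(16, 8) / 2^20        # MiB if stored as doubles
result_bytes(16, 8, 4) / 2^20     # MiB if stored as integers
```

Any approach that returns the full matrix needs at least this much memory; the figures in the table additionally include whatever temporaries each function allocates along the way.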

For n/m = 10:

               expr       min        lq      mean    median        uq      max
           f(30, 3) 14.590784 14.819879 15.061327 14.970385 15.238594 15.74435
         f.2(30, 3) 11.886532 12.164719 14.197877 12.267662 12.450575 32.47237
         f.3(30, 3) 11.458760 11.597360 12.741168 11.706475 11.892549 30.36309
       f.989(30, 3) 10.646286 10.861159 12.922651 10.971200 11.106610 30.86498
    f.roland(30, 3)  3.513980  3.589361  4.559673  3.629923  3.727350 21.58201
f.krassowski(30, 3)  8.861349  8.927388 10.430068  9.022631  9.405705 32.70073
Treasonable answered 23/12, 2020 at 19:46 Comment(0)
