Selecting such vector elements so that the sum of elements is exactly equal to the specified value
Asked Answered
F

5

26

I have a vector of random positive integers. I would like to select only those elements of the vector whose sum will be exactly equal to a certain predetermined value.

Let's take an example like this. x=1:5, I am looking for elements whose sum is equal to 14. The solution is of course the vector c(2, 3, 4, 5).

Of course, there may be several solutions. Example 2. x=1:5, I'm looking for elements whose sum is equal to 7. Here, of course, should be the following three solutions:
1.c(2, 5),
2.c(3, 4),
3.c(1, 2, 4).

There may also be a situation where there will be no solutions at all. Example 3. x=c(1, 2, 7), I'm looking for elements whose sum equals 5. Of course, there are no correct solutions here.

Everything seems trivially simple if we have vectors of several elements. Here, I even came up with a few alternative solutions. However, the problem becomes when the size of the vector increases.

My vector looks like this:

x= c(236L, 407L, 51L, 308L, 72L, 9787L, 458L, 5486L, 42L, 4290L,
  31L, 3533L, 1102L, 24L, 100L, 669L, 9352L, 4091L, 2751L, 3324L,
  3193L, 245L, 86L, 98932L, 77L, 13L, 9789L, 91L, 999L, 25L, 25379L,
  9626L, 9092L, 622L, 97L, 57L, 2911L, 6L, 405L, 894L, 1760L, 9634L,
  96L, 9765L, 223L, 765L, 743L, 5960L, 14L, 50L, 89L, 348L, 5875L,
  5L, 58602L, 397L, 1181L, 94L, 954L, 7901L, 836L, 8810L, 52L,
  15L, 48L, 26L, 4L, 66L, 5265L, 80L, 282L, 231L, 76L, 661L, 7604L,
  7406L, 58L, 10L, 903L, 49446L, 80921L, 1L, 876L, 334L, 63L, 796L,
  88L, 413L, 1214L, 2983L, 9518L, 595L, 708L, 53L, 321L, 12L, 634L,
  4910L, 8984L, 465L)

I have to find at least one subset of elements whose sum will be exactly 23745. Unfortunately, I had a complete failure here. Whatever I write is calculated in hours and I don't get any correct result anyway. Does anyone have any idea how this can be solved in R? I will be grateful for even a small hint.

Faustino answered 17/10, 2021 at 21:45 Comment(2)
Hi blrun, it would be nice if you edit your question with what you have already tried. And maybe shortening the introduction a bit, it is nice for the reader if the question is short and to the point.Britt
I kindly ask for forgiveness. I tried to describe the problem as clearly and comprehensively as possible. I did not include my code because it did not lead to anything sensible.Faustino
T
31

I have to admit that your problem inspired me and made me wonder. I decided to face it by creating my own optimization function. And even though you got the answer that uses the gbp package (I didn't know it before), let them share my own function. Here it is findSumm.

findSumm = function(x, sfind, nmax=1, tmax=1){
  if(sum(x)<sfind) stop("Impossible solution! sum(x)<sfind!")

  fTimeSec = function() as.numeric(Sys.time()-l$tstart, units="secs")
  #The current selection of vector element
  sel = c(TRUE, rep(FALSE, length(x)-1))
  #List of intermediate states of the vector sel
  lsel = list()
  #List with a collection of parameters and results
  l = list(
    x = sort(x, TRUE),
    tstart = Sys.time(),
    chosen = list(),
    xfind = list(),
    time = c(),
    stop = FALSE,
    reason = "")

  while(TRUE) {
    #Maximum Runtime Test
    if(fTimeSec()>tmax) {
      l$reason = "Calculation time is greater than tmax.\n"
      l$stop = TRUE
      break
    }

    #Record the solution and test the number of solutions
    if(sum(l$x[sel])==sfind){
      #Save solution
      l$chosen[[length(l$chosen)+1]] = sel
      l$xfind[[length(l$xfind)+1]] = l$x[sel]
      l$time = c(l$time, fTimeSec())

      #Test the number of solutions
      if(length(l$chosen)==nmax){
        l$reason = "Already found nmax solutions.\n"
        l$stop = TRUE
        break
      }
    }

    idx = which(sel)
    if(idx[length(idx)]==length(sel)) {
      if(length(lsel)==0) break
      sel=lsel[[length(lsel)]]
      idx = which(sel)
      lsel[length(lsel)]=NULL
      sel[idx[length(idx)]]=FALSE
      sel[idx[length(idx)]+1]=TRUE
      next
    }

    if(sum(l$x[sel])>=sfind){
      sel[idx[length(idx)]]=FALSE
      sel[idx[length(idx)]+1]=TRUE
      next
    } else {
      lsel[[length(lsel)+1]] = sel  #Save the current state of sel vector
      sel[idx[length(idx)]+1]=TRUE
      next
    }
  }
  if(length(l$chosen)==0 & !l$stop) stop("No solutions!")

  l$reason = paste(l$reason, "Found", length(l$chosen),
                   "solutions in time", signif(fTimeSec(), 3), "seconds.\n")
  cat(l$reason)
  return(l)
}

Let's check how it works

findSumm(1:5, 20)$xfind
#Error in findSumm(1:5, 20) : Impossible solution! sum(x)<sfind!

findSumm(c(1,2,7), 5)$xfind
#Error in findSumm(c(1, 2, 7), 5) : No solutions!

findSumm(1:5, 14, 10, 10)$xfind
# Found 1 solutions in time 0.007 seconds.
# [[1]]
# [1] 5 4 3 2

findSumm(1:5, 5, 10, 10)$xfind
# Found 3 solutions in time 0.001 seconds.
# [[1]]
# [1] 5
#
# [[2]]
# [1] 4 1
#
# [[3]]
# [1] 3 2

findSumm(1:5, 7, 10, 10)$xfind
# Found 3 solutions in time 0.004 seconds.
# [[1]]
# [1] 5 2
#
# [[2]]
# [1] 4 3
#
# [[3]]
# [1] 4 2 1

As you can see it was doing great. Now it's time to check that on your vector x.

x= c(236L, 407L, 51L, 308L, 72L, 9787L, 458L, 5486L, 42L, 4290L,
  31L, 3533L, 1102L, 24L, 100L, 669L, 9352L, 4091L, 2751L, 3324L,
  3193L, 245L, 86L, 98932L, 77L, 13L, 9789L, 91L, 999L, 25L, 25379L,
  9626L, 9092L, 622L, 97L, 57L, 2911L, 6L, 405L, 894L, 1760L, 9634L,
  96L, 9765L, 223L, 765L, 743L, 5960L, 14L, 50L, 89L, 348L, 5875L,
  5L, 58602L, 397L, 1181L, 94L, 954L, 7901L, 836L, 8810L, 52L,
  15L, 48L, 26L, 4L, 66L, 5265L, 80L, 282L, 231L, 76L, 661L, 7604L,
  7406L, 58L, 10L, 903L, 49446L, 80921L, 1L, 876L, 334L, 63L, 796L,
  88L, 413L, 1214L, 2983L, 9518L, 595L, 708L, 53L, 321L, 12L, 634L,
  4910L, 8984L, 465L)

findSumm(x, 23745, 1, 10)$xfind[[1]]
# Already found nmax solutions.
# Found 1 solutions in time 0.008 seconds.
# [1] 9789 9787 4091   77    1

A few comments about my function. My function searches for all possible and valid combinations unless it reaches the number of valid results specified by nmax or the computation takes tmax seconds. In the case of your vector x and the sum you are looking for 23745, the number of correct solutions is enormous. I turned it on for 1 min and got the 37827 results! And the function would still find valid results at a rate of 626 solutions per second perhaps for the next 100 years! Below is a visualization of this process.

l = findSumm(x, 23745, +Inf, 60)
library(tidyverse)
library(ggpmisc)
df = tibble(
  n = 1:length(l$chosen),
  t = l$time
)

df %>% ggplot(aes(t,n))+
  geom_line(size=0.1)+
  geom_smooth(method = lm, formula = y~x)+
  stat_poly_eq(formula = y~x,
               aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")),
               parse = TRUE)

enter image description here

Finally, I decided to check the performance of my function. I did not expect a revelation because it was not written in pure C. However, I must admit that the graph below surprised me very pleasantly!

library(microbenchmark)
ggplot2::autoplot(microbenchmark(findSumm(x, 23745), 
                                 gbp1d_solver_dpp(p = x, w = x, c = 23745L), 
                                 times=100))

enter image description here

As it turned out, my function is almost 4 times faster than gbp1d_solver_dpp. I'm proud!

Since I am very stubborn and curiously inquisitive, I decided to perform one more test on a vector with a length of 1000.

x=c(234L, 1891L, 3187L, 38417L, 2155L, 6857L, 71692L, 463575L, 
    800L, 2195L, 820L, 9735L, 913L, 62685L, 920597L, 864L, 903L, 
    478L, 2828L, 99371L, 3109L, 379L, 8544L, 444L, 772L, 571L, 226L, 
    94L, 378L, 60253L, 10920L, 47626L, 671L, 45163L, 27767L, 62498L, 
    87706L, 4966L, 4615L, 14897L, 261L, 684L, 3780L, 97L, 705L, 7313L, 
    3629L, 436L, 5076L, 3198L, 731L, 56634L, 67411L, 249L, 403L, 
    82728L, 9986L, 643662L, 11045L, 934L, 8154L, 289L, 4452L, 624L, 
    4876L, 86859L, 933L, 2372L, 6493L, 773566L, 6599L, 459L, 2024L, 
    80425L, 591L, 6262L, 35033L, 89607L, 6435L, 14917L, 9559L, 67983L, 
    82365L, 88127L, 466L, 758L, 11605L, 828L, 410L, 557L, 2991L, 
    808L, 8512L, 273605L, 294L, 4666L, 27L, 26337L, 7340L, 682L, 
    46480L, 19903L, 699L, 700L, 58L, 136L, 852L, 909L, 64316L, 9109L, 
    876L, 6382L, 803L, 295L, 9539L, 26271L, 1906L, 23639L, 9022L, 
    9513L, 169L, 65427L, 861864L, 743L, 91L, 9039L, 247L, 58749L, 
    5674L, 65959L, 99126L, 7765L, 5934L, 13881L, 77696L, 66894L, 
    977L, 6279L, 46273L, 919L, 6307L, 316L, 420113L, 61336L, 70L, 
    6148L, 257L, 17804L, 14L, 989L, 16907L, 36L, 25L, 333L, 224L, 
    119L, 4000L, 9438L, 5439L, 748L, 16532L, 4847L, 939L, 9504L, 
    2782L, 424L, 64034L, 5306L, 30247L, 6636L, 3976L, 60588L, 180L, 
    78118L, 1L, 61866L, 9501L, 15834L, 66712L, 77219L, 448L, 612L, 
    5339L, 58413L, 4785L, 2191L, 35711L, 84383L, 6261L, 896L, 24353L, 
    54868L, 288L, 8059L, 867L, 687L, 94667L, 1713L, 1507L, 71048L, 
    882L, 4155L, 97230L, 49492L, 47839L, 793L, 263L, 63160L, 9062L, 
    3518L, 55956L, 6626L, 14619L, 636L, 1127L, 970L, 5512L, 118117L, 
    2370L, 802L, 98333L, 6089L, 1076L, 80L, 305L, 3995L, 437L, 49L, 
    9207L, 2021L, 7554L, 9486L, 33501L, 55745L, 967L, 24857L, 692L, 
    4148L, 464957L, 2381L, 3876L, 3246L, 1478L, 308L, 98068L, 532L, 
    4670L, 7965L, 940L, 467L, 777L, 68749L, 2739L, 23951L, 831L, 
    60763L, 12047L, 75620L, 650L, 69584L, 294122L, 41149L, 9657L, 
    780L, 153054L, 37990L, 16L, 894L, 15500L, 31873L, 3800L, 472L, 
    50989L, 8767L, 8209L, 2929L, 4751L, 38L, 47403L, 64941L, 28042L, 
    49020L, 81785L, 299L, 936L, 63136L, 3L, 42033L, 1750L, 1147L, 
    273L, 62668L, 41L, 5829L, 686L, 511L, 65019L, 842L, 88716L, 96217L, 
    9442L, 6324L, 197L, 55422L, 630L, 665L, 3921L, 726L, 766916L, 
    43944L, 9035L, 573L, 77942L, 29689L, 749L, 95240L, 281L, 1933L, 
    78265L, 812L, 854L, 17445L, 8855L, 2940L, 6057L, 46689L, 999L, 
    381L, 347L, 50199L, 161L, 534L, 804L, 99043L, 13183L, 679L, 432L, 
    38887L, 575L, 781L, 2023L, 187077L, 89498L, 85L, 16780L, 3731L, 
    45904L, 13861L, 3971L, 301L, 4175L, 9427L, 126913L, 845L, 175L, 
    1684L, 9064L, 56647L, 116L, 479672L, 6754L, 441L, 412L, 97091L, 
    4062L, 598L, 146L, 423L, 2715L, 198939L, 80577L, 76385L, 2088L, 
    139L, 647L, 246L, 85002L, 898L, 50939L, 135L, 46388L, 623L, 17928L, 
    63072L, 346L, 78582L, 16691L, 838L, 44L, 5181L, 7918L, 3650L, 
    35L, 8825L, 9758L, 22677L, 9838L, 2239L, 9001L, 96689L, 570L, 
    47373L, 507L, 6378L, 40839L, 11677L, 937874L, 2485L, 22188L, 
    20413L, 13L, 877L, 5578L, 428L, 61L, 3200L, 5444L, 85540L, 640L, 
    94460L, 310L, 6043L, 3771L, 6167L, 476L, 9365L, 1956L, 143L, 
    7841L, 4957L, 3309L, 9317L, 41434L, 97881L, 51853L, 474L, 3098L, 
    7109L, 93976L, 545L, 28475L, 2066L, 4959L, 7410L, 293L, 8246L, 
    43L, 721L, 2260L, 72854L, 100L, 61382L, 107L, 5637L, 891L, 256L, 
    442L, 84440L, 55792L, 195L, 24074L, 19L, 57376L, 59159L, 805253L, 
    193329L, 3636L, 98954L, 968L, 380L, 5203L, 90157L, 71907L, 35497L, 
    41769L, 1683L, 1984L, 5765L, 832L, 411L, 4888L, 9801L, 710L, 
    2325L, 40L, 32927L, 435L, 66L, 66301L, 94776L, 48234L, 28977L, 
    122312L, 48L, 359L, 572L, 753L, 945L, 32241L, 328L, 55976L, 128L, 
    815794L, 57894L, 576L, 60131L, 342448L, 8913L, 33506L, 20448L, 
    58750L, 637L, 82086L, 635710L, 96772L, 272L, 938L, 4863L, 737L, 
    949L, 4804L, 3446L, 92319L, 28883L, 6032L, 53970L, 9394L, 5630L, 
    71583L, 136862L, 23161L, 8545L, 54249L, 213666L, 668L, 893L, 
    881126L, 8252L, 584L, 83L, 13754L, 244156L, 530L, 64574L, 22009L, 
    89204L, 34992L, 85992L, 82697L, 50L, 95845L, 3096L, 42L, 554949L, 
    325L, 2092L, 28L, 3830L, 893583L, 625L, 3740L, 4513L, 9938L, 
    910L, 8868L, 9614L, 41281L, 27915L, 25839L, 4417L, 5730L, 2825L, 
    683L, 550L, 88838L, 9248L, 961L, 2748L, 7259L, 53220L, 2179L, 
    4036L, 46014L, 83725L, 8211L, 6957L, 6886L, 4653L, 6300L, 80437L, 
    135885L, 23745L, 9536L, 78L, 652590L, 1037L, 5293L, 492L, 7467L, 
    71685L, 890L, 5023L, 96524L, 17465L, 53665L, 21508L, 463L, 159L, 
    311L, 764L, 27534L, 71L, 2504L, 270L, 6449L, 13449L, 302L, 88L, 
    3893L, 22007L, 9208L, 680618L, 878L, 14721L, 20L, 322374L, 644L, 
    944669L, 57334L, 233L, 982L, 870L, 950L, 121L, 254L, 4226L, 45L, 
    61823L, 9626L, 58590L, 6552L, 3920L, 68L, 3644L, 35775L, 4591L, 
    636207L, 78314L, 408L, 371L, 984L, 7089L, 4679L, 2233L, 756L, 
    20527L, 178L, 80573L, 589923L, 120L, 7938L, 894842L, 6563L, 569L, 
    91110L, 620L, 786288L, 46022L, 396L, 762533L, 145964L, 7732L, 
    60L, 274L, 87869L, 227L, 6706L, 707L, 955L, 48246L, 771L, 29001L, 
    14224L, 5173L, 20215L, 7566L, 1564L, 733L, 3568L, 3570L, 39256L, 
    925L, 41577L, 348L, 68267L, 151L, 98572L, 1389L, 5421L, 69043L, 
    42434L, 27597L, 53320L, 46051L, 1686L, 59L, 361L, 747579L, 5044L, 
    73873L, 28894L, 8146L, 353L, 2622L, 664L, 349L, 90764L, 8920L, 
    716L, 14903L, 96055L, 89L, 94239L, 416L, 7896L, 232L, 5543L, 
    61664L, 6709L, 2L, 14275L, 2954L, 917416L, 3567L, 42086L, 99956L, 
    86112L, 206L, 64L, 25956L, 57112L, 425L, 6507L, 28034L, 991L, 
    8444L, 140L, 1461L, 68783L, 347633L, 87696L, 593L, 164L, 837L, 
    8793L, 965L, 8811L, 97412L, 351L, 23L, 66808L, 8308L, 14245L, 
    12519L, 3019L, 1920L, 813L, 485L, 979L, 929L, 2970L, 32447L, 
    8962L, 867973L, 40534L, 551L, 20941L, 49413L, 188L, 948L, 9018L, 
    187252L, 3919L, 45963L, 358L, 7211L, 959L, 47L, 4220L, 36086L, 
    1645L, 33056L, 300L, 29682L, 9152L, 431L, 364L, 2211L, 3779L, 
    4633L, 22500L, 33980L, 794L, 84558L, 488L, 732L, 6686L, 15042L, 
    906L, 13553L, 6115L, 153L, 866L, 3624L, 329L, 6875L, 86L, 6298L, 
    57424L, 17582L, 955879L, 40945L, 4858L, 694L, 755L, 499L, 406L, 
    564L, 874L, 1695L, 43961L, 578L, 9063L, 505L, 5856L, 4484L, 76708L, 
    712L, 23348L, 986L, 275L, 996L, 8966L, 220L, 7008L, 849L, 953460L, 
    3062L, 278L, 26L, 8547L, 16895L, 98289L, 815L, 25135L, 956L, 
    370L, 8221L, 72674L, 31711L, 73L, 41667L, 2915L, 797L, 41309L, 
    4257L, 8148L, 5723L, 2124L, 8306L, 53388L, 33520L, 680L, 893759L, 
    40133L, 94791L, 988L, 162L, 79366L, 37625L, 7125L, 50947L, 171L, 
    99558L, 166L, 90717L, 5807L, 606L, 98592L, 59207L, 966L, 61299L, 
    7553L, 9678L, 62322L, 156L, 267L, 8478L, 59554L, 2264L, 28338L, 
    899L, 9719L, 98L, 51403L, 6302L, 265L, 79929L, 101L, 5227L, 972L, 
    145L, 48018L, 90140L, 698L, 8L, 5751L, 26083L, 1295L, 78124L, 
    383L, 2776L, 80204L, 210L, 3422L, 36064L, 46L, 4953L, 20271L, 
    3916L, 767L, 601372L, 56575L, 5237L, 5621L, 6705L, 1191L, 63768L, 
    1016L, 313L, 2285L, 12489L, 2755L, 338L, 7518L, 2630L, 421L, 
    6554L, 306L, 113L, 57197L, 885L, 9445L, 37364L, 86630L, 2460L, 
    715L, 10829L, 9914L, 6635L, 229L, 525L, 839L, 3278L, 969L, 182L, 
    187L, 7022L, 554L, 6489L, 15791L, 4157L, 47048L, 9447L, 152L, 
    1419L, 22618L, 5194L, 609L, 923L, 768L, 6248L, 714L, 1159L, 825893L, 
    53492L, 19731L, 65167L, 96325L, 336L, 4443L, 843L, 62960L, 9788L, 
    35032L, 284L, 4647L, 360L, 11297L, 1515L)

findSumm(x, 9568447L)$xfind[[1]]
# Already found nmax solutions.
# Found 1 solutions in time 0.065 seconds.
# [1] 955879 953460 944669 937874 920597 917416 894842 893759 893583 881126 347633  27597      8      3      1

As you can see my findSumm function works great. It took only 0.065 seconds to extract a subset with a sum equal to 9568447L! Unfortunately, trying to run gbp1d_solver_dpp on such a long vector resulted in the error "size is too large". So I was not able to compare the performance of these two solutions with such a large vector.

Tendentious answered 18/10, 2021 at 20:22 Comment(8)
Brilliant answer Marek! I'm not a computer scientist but I believe there are many potential approaches to this problem (e.g. geeksforgeeks.org/0-1-knapsack-problem-dp-10) and it is an ongoing area of research, see e.g. link.springer.com/article/10.1007/s00366-020-01240-3 / researchgate.net/publication/…Unsavory
Also, if you use a language that is compatible with tail recursion like scheme, F# or haskell, you can process very large lists very quickly e.g. zbray.wordpress.com/2011/11/02/…. Just thought you might be interested.Unsavory
@Marek Fiołka, thank you very much for your really educational solution! I still have a small request. Could you have a look at my other question. So far I have not found a solution and maybe you can help me. I would be extremely obligated.Faustino
Hey @jared_mamrot! I must admit that with your comments you have restored my faith in the idea of Stack Overflow - we all learn from each other. I've been a bit frustrated lately. I have given many answers, but most have remained uninterested. The questioners in no way wanted to let me know if what I had done for them was worth anything in their eyes.Discriminator
As for my function, it is absolutely not recursive! She was like that in the beginning. However, it caused a stack overflow when finding multiple solutions. Well, on Stack Overflow such an effect could be interesting to discuss, but I quickly changed this algorithm from recursive to iterative form. And really not much changed, instead of while(TRUE) it was fFind = function (sel), instead of next it was fFind(sel) and instead of break it was just return(). And there was no need to write any intermediate results, so there was no lsel[[length(lsel)+1]] = sel.Discriminator
Hey @blrun. I'll take a look at your unresolved question. We'll be in touch.Discriminator
"we all learn from each other" - this is primarily why I use SO. I don't mind donating my time to help beginners, but I believe the overall value in solving more challenging problems is worth the effort for my own development (even if users often don't appear to appreciate it). Thanks for donating your time / knowledge here - it is definitely appreciated :)Unsavory
Superb answer with amazing performance! Great work! +1!Setzer
U
16

This task sounds like a 1 dimensional bin packing problem or knapsack problem, in which case there are many resources available online to help guide you.

One potential solution is to use the gbp package, e.g.

#install.packages("gbp")
library(gbp)
#> Loading required package: magrittr
#> Loading required package: data.table
x <- c(236L, 407L, 51L, 308L, 72L, 9787L, 458L, 5486L, 42L, 4290L,
     31L, 3533L, 1102L, 24L, 100L, 669L, 9352L, 4091L, 2751L, 3324L,
     3193L, 245L, 86L, 98932L, 77L, 13L, 9789L, 91L, 999L, 25L, 25379L,
     9626L, 9092L, 622L, 97L, 57L, 2911L, 6L, 405L, 894L, 1760L, 9634L,
     96L, 9765L, 223L, 765L, 743L, 5960L, 14L, 50L, 89L, 348L, 5875L,
     5L, 58602L, 397L, 1181L, 94L, 954L, 7901L, 836L, 8810L, 52L,
     15L, 48L, 26L, 4L, 66L, 5265L, 80L, 282L, 231L, 76L, 661L, 7604L,
     7406L, 58L, 10L, 903L, 49446L, 80921L, 1L, 876L, 334L, 63L, 796L,
     88L, 413L, 1214L, 2983L, 9518L, 595L, 708L, 53L, 321L, 12L, 634L,
     4910L, 8984L, 465L)

test <- gbp1d_solver_dpp(p = x, w = x, c = 23745L)

list_of_selected_items <- x[as.logical(test$k)]
list_of_selected_items
#> [1]  236   51  308  458 5486 4290   31 3533 9352
sum(list_of_selected_items)
#> [1] 23745

Created on 2021-10-18 by the reprex package (v2.0.1)

Unsavory answered 17/10, 2021 at 22:2 Comment(1)
Hey @jared_mamrot. Thanks for pointing out the gbp package. You have a vote from me! I admit that I did not know it before and I would like to read it, especially in variants 2, 3 and 4D. However, as it turned out (see my answer), this function has some limitations and cannot deal with long vectors. And also we can find a much faster algorithm. For example, for one-dimensional vectors.Discriminator
S
12

I think this is a subset sum problem. Below are two options for two different use cases, hope it can help you a bit.

  • Find All Subsets:

If you you want to find all possible subsets, you can the the following base R code, which is written in a recursion manner:

findAllSubsets <- function(v, s, r = c()) {
  if (s == 0) {
    return(list(r))
  } else {
    res <- list()
    v <- v[v<=s]
    for (k in seq_along(v)) {
      if (length(r) == 0 || (tail(r, 1) >= v[k] & sum(v[k:length(v)]) >= s)) {
        res[[k]] <- Recall(v[-k], s - v[k], c(r, v[k]))
      }
    }
    return(unlist(res, recursive = FALSE))
  }
}

and you will see for example

> x <- 1:10

> S <- 23

> findAllSubsets(x, S)
[[1]]
[1] 7 6 4 3 2 1

[[2]]
[1] 7 6 5 3 2

[[3]]
[1] 7 6 5 4 1

[[4]]
[1] 8 5 4 3 2 1

[[5]]
[1] 8 6 4 3 2

[[6]]
[1] 8 6 5 3 1

[[7]]
[1] 8 6 5 4

[[8]]
[1] 8 7 4 3 1

[[9]]
[1] 8 7 5 2 1

[[10]]
[1] 8 7 5 3

[[11]]
[1] 8 7 6 2

[[12]]
[1] 9 5 4 3 2

[[13]]
[1] 9 6 4 3 1

[[14]]
[1] 9 6 5 2 1

[[15]]
[1] 9 6 5 3

[[16]]
[1] 9 7 4 2 1

[[17]]
[1] 9 7 4 3

[[18]]
[1] 9 7 5 2

[[19]]
[1] 9 7 6 1

[[20]]
[1] 9 8 3 2 1

[[21]]
[1] 9 8 4 2

[[22]]
[1] 9 8 5 1

[[23]]
[1] 9 8 6

[[24]]
[1] 10  5  4  3  1

[[25]]
[1] 10  6  4  2  1

[[26]]
[1] 10  6  4  3

[[27]]
[1] 10  6  5  2

[[28]]
[1] 10  7  3  2  1

[[29]]
[1] 10  7  4  2

[[30]]
[1] 10  7  5  1

[[31]]
[1] 10  7  6

[[32]]
[1] 10  8  3  2

[[33]]
[1] 10  8  4  1

[[34]]
[1] 10  8  5

[[35]]
[1] 10  9  3  1

[[36]]
[1] 10  9  4

However, this method does NOT scale well. Thus, it is only recommended if one want to find all possible subsets and have a small size x.

  • Find One Subset:

If you just want one possible subset, you can use the package adagio with its function subsetsum directly, e.g,

subsetS <- function(x, target) {
  v <- x[x <= target]
  with(adagio::subsetsum(v, target), v[inds])
}

such that

> subsetS(x,23745)
[1]  236   51  308  458 5486 4290   31 3533 9352

Benchmarking

set.seed(0)
x <- sample(100000L,200)

target <- 237450
ggplot2::autoplot(microbenchmark(
  findSumm(x, target),
  gbp1d_solver_dpp(p = x, w = x, c = target),
  subsetS(x, target),
  times = 100
))

enter image description here

Setzer answered 19/10, 2021 at 22:26 Comment(0)
C
11

I have to find at least one subset of elements

Reflecting ThomasIsCoding's observation that this is a subset sum problem, specifically the coin changing problem that has a staggering amount of well-documented approaches in many languages, and is widely used to teach recursion. One of these implementations, of Mark, has been adapted to fit to this particular use case. Naturally, an approach using C++ would be much faster, but the fun is to keep it in R.

Note The \(x) and \(e) anonymous function are for R 4.1 and later. You can find your version by running version[c("major", "minor")]. For lower versions, replace \(x) with function(x) and \(e) with function(e).

Edit: big oopsie - forgot to change target parameter. New benchmarks & rewriting of function

cc <- \(x, target, warn = TRUE){
    cc_env <- environment()
    tryCatch(aux(x, target, cc_env), error = \(e){ # R4.1 new anonymous function
        if (.Internal(exists("out", cc_env, "integer", TRUE))){
            return(cc_env[["out"]])
        } 
        if(warn) warning("No subset of x can sum up to target!", call. = F)
        NA
    })
}
aux = \(x, target, env, out = 0L){
    s <- sum(out)
    if(s == target) {
        env[["out"]] <- out[-1L]
        stop(call. = F)
    }
    if(s > target) return() # avoid eternal loop
    for(i in seq_along(x)){
        n = x[i]
        left = x[(i+1L):length(x)]
        aux(left, target, env, c(out, n)) # recursion here
    }
}

The output is a vector of elements of x of which the sum equals y.

> cc(x, y)  
 [1]  236  407   51  308   72 9787  458 5486   42 4290   31 1102   24  100  669  245   86   77   13   91   25   97    6
[24]   14    5   10    1   12

Benchmarks

library(adagio)
library(gbp)
x = c(236L, 407L, 51L, 308L, 72L, 9787L, 458L, 5486L, 42L, 4290L,
     31L, 3533L, 1102L, 24L, 100L, 669L, 9352L, 4091L, 2751L, 3324L,
     3193L, 245L, 86L, 98932L, 77L, 13L, 9789L, 91L, 999L, 25L, 25379L,
     9626L, 9092L, 622L, 97L, 57L, 2911L, 6L, 405L, 894L, 1760L, 9634L,
     96L, 9765L, 223L, 765L, 743L, 5960L, 14L, 50L, 89L, 348L, 5875L,
     5L, 58602L, 397L, 1181L, 94L, 954L, 7901L, 836L, 8810L, 52L,
     15L, 48L, 26L, 4L, 66L, 5265L, 80L, 282L, 231L, 76L, 661L, 7604L,
     7406L, 58L, 10L, 903L, 49446L, 80921L, 1L, 876L, 334L, 63L, 796L,
     88L, 413L, 1214L, 2983L, 9518L, 595L, 708L, 53L, 321L, 12L, 634L,
     4910L, 8984L, 465L)
target <- 23745L
ggplot2::autoplot(microbenchmark::microbenchmark(
  findSumm(x, target),
  gbp::gbp1d_solver_dpp(p = x, w = x, c = target),
  subsetS(x, target),
  cc(x, target, warn = F), # slightly faster than warn = T
  times = 100
))

The customary benchmark plot. For OP's case, on average the recursive approach wins! Benchmark

The primary reason for the dramatic speed-up is the reduction of writing to memory, and avoiding unnecessary function calls. Further, this problem is excellent use case for the use of recursion, ensuring a O(2^n).

> bench::bench_memory(cc(x, target, warn = F))
# A tibble: 1 x 2
  mem_alloc memory              
  <bch:byt> <list>              
1    42.2KB <Rprofmem [140 x 3]>
> bench::bench_memory(subsetS(x, target))
# A tibble: 1 x 2
  mem_alloc memory              
  <bch:byt> <list>              
1     5.4MB <Rprofmem [346 x 3]>

However, the performance of the recursive function depends on the inputs. Case in point, here is a situation where the while approach outperforms recursion.

set.seed(1000)
x <- sample(100000L, 200L)
target <- 237450L

Benchmark 2

In these situations, the current version of cc uses up a lot of memory, leading to a call to gc() which is rather expensive.

Error handling

But what if we feed some input that is not valid? We can compare some of the implementations here.

x <- c(1L, 2L)
target <- 5L # impossible to reach with 1 and 2.

> findSumm(x, target)
Error in findSumm(x, target) : Impossible solution! sum(x)<sfind!

> gbp::gbp1d_solver_dpp(p = x, w = x, c = target)
C++ object <0000023130596180> of class 'gbp1d' <000002313b255230>

> subsetS(x, target)
[1] NA NA
Warning message:
In adagio::subsetsum(v, target) :
  Total sum of 'S' is smaller than 't'; no solution possible.

> cc(x, target)
[1] NA
Warning message:
No subset of x can sum up to target! 

The manner of error handling differs. Should the evaluation return a hard show-stopping error, or should the code execution process not be interrupted? In this case, I think a NA paired with a warning that users can silence is reasonable.

Digging into the adagio package's error NA NA, this seems to come from c(1,2)[NA] which returns a vector of length 2, which is odd since c(1,2,3,4,5)[c(1,2,3, NA)] gives [1] 1 2 3 NA.

How cc works

Try to find out yourself first, and then inspect the following call.

cc_explained <- \(x, target, warn = TRUE){
  cc_env <- environment()
  print(environment())
  print("----")
  tryCatch(aux(x, target, cc_env), error = \(e){ # R4.1 new anonymous function
    if (.Internal(exists("out", cc_env, "integer", TRUE))){
      print("----")
      print(environment())
      return(cc_env[["out"]])
    } 
    if(warn) warning("No subset of x can sum up to target!", call. = F)
    NA
  })
}
aux = \(x, target, env, out = 0L){
  print(environment())
  s <- sum(out)
  if(s == target) {
    lobstr::cst()
    env[["out"]] <- out[-1L]
    stop(call. = F)
  }
  if(s > target) return() # avoid eternal loop
  for(i in seq_along(x)){
    n = x[i]
    left = x[(i+1L):length(x)]
    aux(left, target, env, c(out, n)) # recursion here
  }
}
cc_explained(x,y)

Outline

Environments

Environments power much of R's functionality. For this task, environments are the prefered data structure, as they

  1. Are not copied on modification
  2. Allow lexical scoping from within a function environment In a nutshell, it does not carry the memory overhead like maintaining a list of possible outcomes.

Every time a function is called, a new environment is created that looks like <environment: 0x0000020fe68f06e8>. Because of this, we need to pass on our output value to the environment of the first function in the call stack, gotten by running environment() where the empty space is the NULL argument. This is passed down on each iteration.

Recursion

The call stack lobstr::cst() reveals the recursive nature of cc.

    x
  1. \-global::cc_explained(x, y)
  2.   +-base::tryCatch(...)
  3.   | \-base:::tryCatchList(expr, classes, parentenv, handlers)
  4.   |   \-base:::tryCatchOne(expr, names, parentenv, handlers[[1L]])
  5.   |     \-base:::doTryCatch(return(expr), name, parentenv, handler)
  6.   \-global::aux(x, target, cc_env)
  7.     \-global::aux(left, target, env, c(out, n))
...
 
 33.                                                         \-global::aux(left, target, env, c(out, n))
 34.                                                           \-global::aux(left, target, env, c(out, n))
 35.   

Every time the sum is not equal to the target, aux recalls itself with an updated value for out.

Stopping mechanism

2 different stopping mechanisms are used:

  1. return() returning a NULL if the target is not met.
  2. stop() when a target is met. Just before stop() is called, the cc_env environment is updated with the value for the iterator, out at that point.

the tryCatch() statement in the cc function checks if the recursive call was able to find a solution by checking whether a new variable out exists in cc_env. Finally, the value is retrieved. This action looks like list subsetting, but isn't.

Coucal answered 22/10, 2021 at 16:38 Comment(6)
Almost every answer I gave SO taught me something. Though it was frustrating at times when there was no reaction, newet a short comment from the author of the question. However, I did not think that I would learn so much from this one answer. Thanks to everyone who looked here. Thank you @Donald for a very interesting solution. I have to analyze it in detail! You have a vote from me! (and maybe even a few).Discriminator
One has to have some reputation to be able to post comment / vote. A while ago I read on meta stack exchange that most people arrive here by Google, and most are throwaway accounts. But once in a while there's a nugget! Let me know if any clarifications are neededCoucal
Oh no! You don't want your explanations to deprive me of the pleasure of learning something interesting (through careful analysis of what they have created) from those smarter than me!Discriminator
Thank you very much to everyone who contributed to this topic. And since I already have these 15 reputation points you all have my vote from me!!Faustino
@Marek Fiołka I have revised the function and added an explanation - be careful to not scroll too far down, spoilers down there.Coucal
Great answer with details in many aspects! +1!Setzer
A
4

You can use comboGeneral from the package RcppAlgos (I am the author). E.g.:

library(RcppAlgos)
comboGeneral(x, 5,
             constraintFun = "sum",
             comparisonFun = "==",
             limitConstraints = 23745L,
             upper = 1L)
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1   31 4290 9634 9789

Here is a more general approach:

GetPartition <- function(x, target, n_res = 1L) {
    
    len_x <- length(x)
    n_soln <- 0L
    res <- list()
    
    m <- 1L
    i <- 1L

    while (n_soln < n_res || m > len_x) {
        combs <- comboGeneral(
            x, m, constraintFun = "sum",
            comparisonFun = "==", limitConstraints = target,
            upper = min(choose(len_x, m), n_res - n_soln) 
        )

        if (nrow(combs)) {
            n_soln <- n_soln + nrow(combs)
            res[[i]] <- combs
            i <- i + 1L
        }
        
        m <- m + 1L
    }
    
    return(res)
}

The results are in lexicographical order. For example:

> GetPartition(x, target = 23745L, n_res = 13L)
#> [[1]]
#>      [,1] [,2] [,3] [,4]
#> [1,]   42 4290 9626 9787
#> [2,]   76 5875 8810 8984
#> [3,]   86 4910 8984 9765
#> [4,]   97 5486 8810 9352
#> [5,]   97 5960 7901 9787
#> [6,]  100 4091 9765 9789
#> [7,]  231 4091 9634 9789
#> [8,]  236 4910 8810 9789
#> [9,]  465 5486 8810 8984
#> 
#> [[2]]
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1   31 4290 9634 9789
#> [2,]    1   63 4290 9626 9765
#> [3,]    1   63 4910 8984 9787
#> [4,]    1   77 4091 9787 9789

And it's really fast:

system.time(print(GetPartition(x, 23745L)))
#> [[1]]
#>      [,1] [,2] [,3] [,4]
#> [1,]   42 4290 9626 9787
#> 
#>  user  system elapsed 
#> 0.015   0.000   0.016

Testing on the very large vector in @Marek's answer (I renamed it x_big for clarity):

system.time(print(GetPartition(x_big, 9568447L)))
#> [[1]]
#>        [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]  [,10]  [,11]
#> [1,] 762533 773566 805253 825893 861864 881126 893583 917416 937874 953460 955879
#> 
#>  user  system elapsed 
#> 0.006   0.000   0.006
Agricola answered 26/10, 2021 at 1:9 Comment(1)
Very fast, and very low memory usage - super impressive! Thanks for your answer!Unsavory

© 2022 - 2024 — McMap. All rights reserved.