Efficient code to do set operations with bigz-class values?

The current release of the gmp package does not support set operations such as intersect, setdiff, etc. I'm doing some work with number sequences (see OEIS for examples) and need to handle large collections of large integers. I'm currently stuck using various loops to generate the desired differences or intersections; while I could probably write compiled (Rcpp, etc.) code, I'm hoping to find a way within existing R functions and packages.
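
For concreteness, a minimal illustration of the failure, plus a sketch of the kind of element-by-element loop I'm falling back on (bigz_intersect is just an illustrative name, not an existing function):

library(gmp)

# base intersect() has no bigz method, so it quietly returns a wrong result:
intersect(as.bigz(1:4), as.bigz(3:6))
#Big Integer ('bigz') object of length 2:
#[1] 1 2

# the sort of one-by-one loop I currently use instead (illustration only):
bigz_intersect <- function(a, b) {
  keep <- logical(length(a))
  for (i in seq_along(a)) keep[i] <- any(a[i] == b)
  unique(a[keep])
}
bigz_intersect(as.bigz(1:4), as.bigz(3:6))
#Big Integer ('bigz') object of length 2:
#[1] 3 4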

Gazebo asked 1/6, 2022 at 18:36 Comment(13)
Could you add more detail about the objects you're working with? For example, how long are the collections, and how big are the numbers? gmp doesn't even have a good sort() function, so I think it's going to be tricky.Brycebryn
a pipeline like Rmpfr -> sets -> github EnriquePH/OEIS.R?Countable
@Brycebryn the problem is that bigz-class objects (as well as bigq) do not have a method available for the set operation functions. So I can't do, e.g., intersect even on as.bigz(1:4) and as.bigz(3:6). Number sequences often grow well past max(int), so I have to use extended math.Gazebo
It was the 'factorial' example in the 'Arbitrarily Accurate..' vignette that suggested 'happy with integers'. LMGTFY, which I learned from you, didn't let us down.Countable
@Countable I spoke too soon. mpfr doesn't support base set functions and the sets library doesn't handle mpfr objects. So I'm still stuck.Gazebo
Not even, speaking of sets, as csets? I perhaps wrongly took those as able to consume R objects (again assuming an mpfr number) and class them cset; well, shoot. Am I thinking in the wrong direction? When in doubt, strings? Perhaps mpfr wants an 'as.character'...Countable
@Countable as.cset(bigz_thing) looks promising. I'll report back as I get farther.Gazebo
cset will accept bigz values but can't perform operations on them. I'll be contacting the maintainer about that. I tried intersect(as.character(bigz_stuff),as.character(other_bigzstuff)) but that turns out to be slower than just running a for-loop on the bigz vectors.Gazebo
did you consider going through characters for set operations as.bigz(intersect(as.character(as.bigz("10000000000000000000000000000000000000000")+1:4),as.character(as.bigz("10000000000000000000000000000000000000000")+3:6)))Pregnancy
@Pregnancy I did use that approach, which does work correctly. The drawback is that it's horribly slow. If I do a while- or for- loop to compare against elements of a set one-by-one, it's faster than converting into and out of characters.Gazebo
Are there representative OEIS sequences that would reflect your workflow/needs?Countable
@Countable well, most of them :-) if you go out enough terms. I've run a couple such as Levine (1997) for which the 15th term is 508009471379488821444261986503540 oeis.org/A011784Gazebo
Okay, I was snooping at A000045 and A078140, since I know the term Fibonacci. github EnriquePH/OEIS.R --no-build-vignettes generally works for getting bfiles. max(nchar(A000045$data$A000045)) [1] 418Countable

Use bignum instead of gmp; intersect() then returns a character vector, and reconverting takes time.

library(bignum)

t1 <- biginteger(1:4)
t2 <- biginteger(3:6)
intersect(t1, t2)
#[1] "3" "4"

biginteger(intersect(t1, t2))
#<biginteger[2]>
#[1] 3 4

Store the bigz values in a list instead of a vector; on a plain bigz vector, intersect() gives a wrong result, while on lists it works:

library(gmp)

intersect(as.bigz(1:4), as.bigz(3:6))
#Big Integer ('bigz') object of length 2:
#[1] 1 2

intersect(as.list(as.bigz(1:4)), as.list(as.bigz(3:6)))
#[[1]]
#Big Integer ('bigz') :
#[1] 3
#
#[[2]]
#Big Integer ('bigz') :
#[1] 4

setdiff(as.list(as.bigz(1:4)), as.list(as.bigz(3:6)))
#[[1]]
#Big Integer ('bigz') :
#[1] 1
#
#[[2]]
#Big Integer ('bigz') :
#[1] 2

f3 <- as.list(as.bigz(c(3,5,9,6,4)))
f4 <- as.list(as.bigz(c(6,7,8,10,9)))
intersect(f3, f4)
#[[1]]
#Big Integer ('bigz') :
#[1] 9
#
#[[2]]
#Big Integer ('bigz') :
#[1] 6

Unfortunately this is much slower than converting it to character.

For intersect, duplicated() could be used, but this is also slower than converting to character:

. <- c(unique(as.bigz(1:4)), unique(as.bigz(3:6)))
.[duplicated(.)]
#Big Integer ('bigz') object of length 2:
#[1] 3 4

And comparing the elements one by one is also slower:

t1 <- unique(as.bigz(1:4))
t2 <- unique(as.bigz(3:6))
t1[sapply(t1, function(x) any(x == t2))]
#Big Integer ('bigz') object of length 2:
#[1] 3 4

Marginally faster in this case is:

t1 <- kit::funique(as.character(as.bigz(1:4)))
t2 <- as.character(as.bigz(3:6));
as.bigz(t1[fastmatch::fmatch(t1, t2, 0L) > 0L])
#Big Integer ('bigz') object of length 2:
#[1] 3 4

And a simple Rcpp version:

library(Rcpp)
sourceCpp(code=r"(
#include <Rcpp.h>
#include <list>
#include <cstring>
using namespace Rcpp;
// Operates directly on the raw words behind a 'bigz' vector: word 0 holds
// the number of elements, and each element record starts with its length
// word and spans length + 2 32-bit words in total.
// [[Rcpp::export]]
RawVector fun(const RawVector &x, const RawVector &y) {
  std::list<uint32_t const*> xAdr;
  std::list<uint32_t const*> yAdr;
  const uint32_t *px = (const uint32_t *) &x[0];
  const uint32_t *py = (const uint32_t *) &y[0];
  uint32_t nextNr = 1;
  // collect addresses of the unique elements of x
  for(uint_fast32_t j=0; j<px[0]; ++j) {
    uint32_t n = px[nextNr] + 2;
    auto i = xAdr.cbegin();
    for(; i != xAdr.cend(); ++i) {
      if(std::memcmp(&px[nextNr], *i, n * sizeof(uint32_t)) == 0) break;
    }
    if(i == xAdr.cend()) xAdr.push_back(&px[nextNr]);
    nextNr += n;
  }
  nextNr = 1;
  // collect addresses of all elements of y
  for(uint_fast32_t j=0; j<py[0]; ++j) {
    yAdr.push_back(&py[nextNr]);
    nextNr += py[nextNr] + 2;
  }
  // keep only the elements of x that also occur in y; l counts the result words
  uint32_t l=1;
  for(auto j = xAdr.cbegin(); j != xAdr.cend(); ) {
    auto i = yAdr.cbegin();
    for(; i != yAdr.cend(); ++i) {
      if(std::memcmp(*j, *i, (**j + 2) * sizeof(uint32_t)) == 0) {
        yAdr.erase(i);
        break;
      }
    }
    if(i == yAdr.cend()) j = xAdr.erase(j);
    else {l += **j + 2; ++j;}
  }
  // assemble the surviving elements into a new raw vector of class 'bigz'
  RawVector res(Rcpp::no_init(l * sizeof(uint32_t)));
  res.attr("class") = "bigz";
  uint32_t *p = (uint32_t *) &res[0];
  p[0] = xAdr.size();
  nextNr = 1;
  for(auto j = xAdr.cbegin(); j != xAdr.cend(); ++j) {
    std::memcpy(&p[nextNr], *j, (**j + 2) * sizeof(uint32_t));
    nextNr += p[nextNr] + 2;
  }
  return res;
}
)")

fun(as.bigz(1:4), as.bigz(3:6))
#Big Integer ('bigz') object of length 2:
#[1] 3 4

Benchmark

library(gmp)
x <- as.bigz("10000000000000000000000000000000000000000")+1:4
y <- as.bigz("10000000000000000000000000000000000000000")+3:6

library(bignum)
xb <- biginteger("10000000000000000000000000000000000000000")+1:4
yb <- biginteger("10000000000000000000000000000000000000000")+3:6

bench::mark(check = FALSE,
         "list" = do.call(c, intersect(as.list(x), as.list(y))),
         "char" = as.bigz(intersect(as.character(x), as.character(y))),
         "dupli" = {. <- c(unique(x), unique(y)); .[duplicated(.)]},
         "loop" = {t1 <- unique(x); t2 <- unique(y); t1[sapply(t1, function(x) any(x == t2))]},
         "own" = {t1 <- as.character(x); t2 <- as.character(y);
           x[!duplicated(t1) & match(t1, t2, 0L) > 0L]},
         "own2" = {t1 <- unique(as.character(x)); t2 <- as.character(y);
           as.bigz(t1[match(t1, t2, 0L) > 0L])},
         "kitFmat" = {t1 <- kit::funique(as.character(x)); t2 <- as.character(y);
           as.bigz(t1[fastmatch::fmatch(t1, t2, 0L) > 0L])},
         "collFmat" = {t1 <- collapse::funique(as.character(x)); t2 <- as.character(y);
           as.bigz(t1[fastmatch::fmatch(t1, t2, 0L) > 0L])},
         "bignum" = biginteger(intersect(xb, yb)),
         "rcppIdx" = fun(x, y),
         )

Result

   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
 1 list       135.89µs 247.15µs     3147.        0B     2.03  1552     1
 2 char         15.4µs  20.79µs    29933.        0B    12.0   9996     4
 3 dupli       54.58µs  86.69µs     7428.      280B     6.18  3604     3
 4 loop        80.11µs 156.36µs     4198.    6.36KB     8.26  2032     4
 5 own         15.46µs  27.09µs    25254.        0B     5.05  9998     2
 6 own2        13.69µs  20.53µs    28952.        0B     8.69  9997     3
 7 kitFmat     13.23µs  20.57µs    30964.        0B     6.19  9998     2
 8 collFmat    16.76µs  21.76µs    26667.    2.49KB     5.33  9998     2
 9 bignum      32.12µs  48.51µs    10784.        0B     8.47  5090     4
10 rcppIdx      2.35µs   3.09µs   212933.    2.49KB     0    10000     0

Here it looks like converting to character and back is currently the fastest way in plain R. Even subsetting a bigz vector is slower than subsetting the character vector and converting it back to bigz.
The Rcpp version is about 6-7 times faster than the fastest other method.
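
The same character round trip also covers the other set operations, e.g. setdiff; a quick sketch with small values (not benchmarked here):

as.bigz(setdiff(as.character(as.bigz(1:4)), as.character(as.bigz(3:6))))
#Big Integer ('bigz') object of length 2:
#[1] 1 2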

Lesalesak answered 11/6, 2022 at 20:47 Comment(7)
Confirmed it is working, good job @Lesalesak +1Comfrey
Also thought it would be a solution, but unfortunately slower than as.character, which according to OP is already too slow. microbenchmark::microbenchmark(intersect(as.list(as.bigz(1:4)), as.list(as.bigz(3:6))), intersect(as.character(as.bigz(1:4)), as.character(as.bigz(3:6))))Pregnancy
impressive benchmarks with so many options, +1!Aisne
This is nicely presented! I will experiment with bignum to see if it can do everything I need in my algorithms (i.e. more than just set-type operations). My benchmarks basically agree with your work, in that a simple for loop to test each element one-by-one remains the fastest I've found.Gazebo
Hmmm I see that intersect(bignum_a, bignum_b) returns the character equivalents. That's unlikely to be a fast operation.Gazebo
Use of bignum works but as with all other approaches so far is far slower than a for loop over bigz value comparisons. I'll accept this answer & give you the bonus because of the excellent details & attempts.Gazebo
I added a simple Rcpp version for comparing the timings with the other options.Lesalesak

I've likely got this wrong, but using mpfr objects provides access to base R intersect, union, and setdiff, while a sort() needs to be wrapped as mpfr(sort(...), precBits):

library(Rmpfr)

f3 <- mpfr(5:9, 53)
f4 <- mpfr(8:12, 53)

intersect(f3,f4)
2 'mpfr' numbers of precision  53   bits 
[1] 8 9

setdiff(f3,f4)
3 'mpfr' numbers of precision  53   bits 
[1] 5 6 7

f3 %in% f4
[1] FALSE FALSE FALSE  TRUE  TRUE

# large integers from vignette
ns <- mpfr(1:24, 120)
fact_ns <- factorial(ns)
fact_ns[20:24]
5 'mpfr' numbers of precision  120   bits 
[1]      2432902008176640000     51090942171709440000   1124000727777607680000
[4]  25852016738884976640000 620448401733239439360000

pasc80 <- chooseMpfr.all(n = 80, 77)[40:49]
pasc80
10 'mpfr' numbers of precision  77   bits 
 [1] 107507208733336176461620 104885081691059684352800  97393290141698278327600
 [4]  86068488962431036661600  72375774809317008101800  57900619847453606481440
 [7]  44054819449149483192400  31869443856831541032800  21910242651571684460050
[10]  14308729894903957198400

mpfr(sort(union(fact_ns[20:24], pasc80)), 77)
15 'mpfr' numbers of precision  77   bits 
 [1]      2432902008176640000     51090942171709440000   1124000727777607680000
 [4]  14308729894903957198400  21910242651571684460050  25852016738884976640000
 [7]  31869443856831541032800  44054819449149483192400  57900619847453606481440
[10]  72375774809317008101800  86068488962431036661600  97393290141698278327600
[13] 104885081691059684352800 107507208733336176461620 6204484017332394393600

so for these operations the sets package is not necessary, assuming your workflow is amenable to Rmpfr-based objects.

As the problem is presented in the context of 'precision', one likely wouldn't want a function that silently promotes or demotes sets to their highest/lowest 'prec', but would rather be intentionally involved in the decision (though, admittedly, I looked for one).

Here, renaming your f3 & f4 below to f7 & f8:

getPrec(f7)[1]
[1] 10
getPrec(f8)[1]
[1] 20
intersect(roundMpfr(f7, 20), f8)
2 'mpfr' numbers of precision  20   bits 
[1] 9 6
intersect(f7, roundMpfr(f8, 10))
2 'mpfr' numbers of precision  10   bits 
[1] 9 6
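
A small sketch of doing that intentionally (the helper name intersect_mpfr and the choice of rounding both operands to the smaller precision are mine, not anything Rmpfr provides):

intersect_mpfr <- function(a, b, precBits = min(getPrec(a), getPrec(b))) {
  # round both operands to a common precision so equal values compare equal
  intersect(roundMpfr(a, precBits), roundMpfr(b, precBits))
}

intersect_mpfr(f7, f8)
2 'mpfr' numbers of precision  10   bits 
[1] 9 6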

So it appears that 'precision handling' is required for set operations, though such additional cycles may be avoided if, upon mpfr creation, the defaults render the inputs the same precision. Using OEIS as inputs:

library(OEIS.R) # git clone of EnriquePH/OEIS.R --no-build-vignettes
A011784 <- OEIS_bfile('A011784')
A078140 <- OEIS_bfile('A078140')
max(nchar(A011784$data$A011784))
[1] 221
max(nchar(A078140$data$A078140))
[1] 228

# so we see precision handling here, perhaps
A011784_228 <- mpfr(A011784$data$A011784, 228)
A078140_228 <- mpfr(A078140$data$A078140, 228)

intersect(A011784_228,A078140_228)
2 'mpfr' numbers of precision  228   bits 
[1] 1 3

Ah, so little in common. And it is probably not that your sequences are in OEIS; rather, you're checking your sequences 'from the wild' for similarity to those, so this doesn't reflect your workflow.

As to using lists:

A011784_bigz <- as.bigz(A011784$data$A011784)
is(A011784_bigz)
[1] "bigz"     "oldClass" "Mnumber"  "mNumber" 
is(A011784_228)
[1] "mpfr"    "list"    "Mnumber" "mNumber" "vector"

So those as.list cycles have already been expended in mpfr creation.

And some related, light reading on primitive sets from recent news.

Countable answered 4/6, 2022 at 22:41 Comment(1)
Unfortunately, you've been misled the same way I was. Try f3 <- mpfr(c(3,5,9,6,4), 10) and f4 <- mpfr(c(6,7,8,10,9), 20) and the result of intersect will be empty because of the difference in precision. So possibly if, prior to each set operation, I were to reset precBits for both objects to the same value, it might work correctly, but that leads to a lot of processing overhead.Gazebo
