Fast bitarray in OCaml
Yet another synthetic benchmark: Sieve of Eratosthenes

C++

#include <vector>
#include <cmath>

void find_primes(int n, std::vector<int>& out)
{
   std::vector<bool> is_prime(n + 1, true);
   int last = sqrt(n);
   for (int i = 2; i <= last; ++i)
   {
      if (is_prime[i])
      {
         for (int j = i * i; j <= n; j += i)
         {
            is_prime[j] = false;
         }
      }
   }

   for (unsigned i = 2; i < is_prime.size(); ++i)
   {
      if (is_prime[i])
      {
         out.push_back(i);
      }
   }
}

OCaml (using Jane Street's Core and Res libraries)

open Core.Std
module Bits = Res.Bits
module Vect = Res.Array

let find_primes n =
  let is_prime = Bits.make (n + 1) true in
  let last = float n |! sqrt |! Float.iround_exn ~dir:`Zero in
  for i = 2 to last do
    if not (Bits.get is_prime i) then () else begin
      let j = ref (i * i) in
      while !j <= n; do
        Bits.set is_prime !j false;
        j := !j + i;
      done;
    end;
  done;
  let ar = Vect.empty () in
  for i = 2 to n do
    if Bits.get is_prime i then Vect.add_one ar i else ()
  done;
  ar

I was surprised that the OCaml version (native) is about 13 times slower than C++. I replaced Res.Bits with Core_extended.Bitarray, but it became ~18 times slower. Why is it so slow? Doesn't OCaml provide fast operations for bit manipulation? Is there any alternative fast implementation of bit arrays?

To be clear: I come from the C++ world and am considering OCaml as a possible alternative for writing performance-critical code. Honestly, such results scare me a bit.

EDIT:

Profiling results

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 50.81      1.26     1.26                             camlRes__pos_1113
  9.72      1.50     0.24                             camlRes__unsafe_get_1117
  6.68      1.66     0.17                             camlRes__unsafe_set_1122
  6.28      1.82     0.16                             camlNopres_impl__set_1054
  6.07      1.97     0.15                             camlNopres_impl__get_1051
  5.47      2.10     0.14 47786824     0.00     0.00  caml_apply3
  3.64      2.19     0.09 22106943     0.00     0.00  caml_apply2
  2.43      2.25     0.06   817003     0.00     0.00  caml_oldify_one
  2.02      2.30     0.05        1    50.00   265.14  camlPrimes__find_primes_64139
  1.21      2.33     0.03                             camlRes__unsafe_get_1041
...
Tweet answered 6/2, 2013 at 14:32
Have you profiled the code to see where it is spending its time? (Pentecost)
Yes. I'm not good enough at OCaml yet, but gprof says the program spends most of its time on bit array manipulation. I tried replacing the bit array with a regular Array and it became only 3.3 times slower than C++. Obviously the bit array is the bottleneck there. (Tweet)
For compiler writers, generating performance-critical code is not easy and, worse, most compiler makers (except the guys at Clang) want to do it their way :-( . My opinion: for performance-critical code these days, stick (in this order) to: C++, (insert Java and Fortran here if you want to die young), JavaScript (but read the optimization guide), Haskell with GHC... Decent but not quite there: mostly everybody else neither using LLVM nor having Microsoft's/Google's budget. Actually, Microsoft's and Google's budgets aren't a guarantee either. (Tremolant)
Sounds like it. I'm pretty sure it can be solved by implementing the bit operation optimization better [or something like that]. But I'm not even at your level in OCaml - I can barely read the code you posted - I just wanted to make sure you were looking at the right thing when trying to find the problem; it's not unusual to "guess wrong" when looking for where the code is slow. (Pentecost)
Not that it matters for performance, but you write weird OCaml code. You do know that you can omit else in a conditional, as long as it has type unit? So there is no need to ever write else (). I tried using Bigarray but it gives results slightly slower than gasche's solution with strings. Oh, and that sqrt is also quite nasty: it introduces unnecessary numerical errors (for large enough n). (Rigmarole)
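The sqrt concern in the comment above can be sidestepped with exact integer arithmetic. A minimal sketch (the function name is illustrative, not from the original code): compute the sieve bound as the integer floor of the square root without any floating point.

```ocaml
(* Integer floor of sqrt n, with no floating point involved:
   find the largest i such that i * i <= n. Assumes n >= 0. *)
let isqrt n =
  let i = ref 0 in
  while (!i + 1) * (!i + 1) <= n do incr i done;
  !i

let () =
  assert (isqrt 16 = 4);
  assert (isqrt 15 = 3);
  assert (isqrt 1_000_000 = 1000)
```

The linear loop is fine here because it only runs up to sqrt n once per sieve; a binary search would do if the bound computation itself were hot.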
Agreed. My OCaml code is extremely weird. I'm just at the very beginning. (Tweet)
OCaml these days is more suited to symbolic processing. You can make things faster by offloading the critical part to a specialized C module with OCaml bindings, though it must be done carefully (there is a small cost to moving values back and forth between languages). (Vanadous)
If you are looking for C-level performance in a functional setting, you could have a look at ATS, which sports ML-like language features as well as C. That said, the answers below demonstrate that OCaml can provide very decent performance. (Vanadous)

Did you try a simple data structure first, before jumping to the sophisticated ones?

On my machine, the following code is only 4x slower than your C++ version (note that I made minimal changes, using an Array as the cache and a list to accumulate results; you could use the array get/set syntactic sugar):

let find_primes n =
  let is_prime = Array.make (n + 1) true in
  let last = int_of_float (sqrt (float n)) in
  for i = 2 to last do
    if not (Array.get is_prime i) then () else begin
      let j = ref (i * i) in
      while !j <= n; do
        Array.set is_prime !j false;
        j := !j + i;
      done;
    end;
  done;
  let ar = ref [] in
  for i = 2 to n do
    if Array.get is_prime i then ar := i :: !ar else ()
  done;
  ar

(4x slower: it takes 4s to compute the first 10_000_000 primes, vs. 1s for g++ -O1 or -O2 on your code)

Realizing that the efficiency of your bitvector solution probably comes from its economical memory layout, I changed the code to use strings instead of arrays:

let find_primes n =
  let is_prime = String.make (n + 1) '0' in
  let last = int_of_float (sqrt (float n)) in
  for i = 2 to last do
    if not (String.get is_prime i = '0') then () else begin
      let j = ref (i * i) in
      while !j <= n; do
        String.set is_prime !j '1';
        j := !j + i;
      done;
    end;
  done;
  let ar = ref [] in
  for i = 2 to n do
    if String.get is_prime i = '0' then ar := i :: !ar else ()
  done;
  ar

This now takes only 2s, which makes it 2x slower than your C++ solution.

Hyacinthhyacintha answered 6/2, 2013 at 20:23
Using strings is a great idea; it avoids the boxing penalty for individual integers. (Claribel)
Yes, I tried a regular Array and got the same result (about 4x slower), but it consumes too much memory. Using String is a great idea. Still, it consumes 8 bits per bit, so I won't be able to compute, say, 10 billion, which is possible with a bit-per-bit implementation. (Tweet)
@JeffreyScofield Could you please explain a bit more about the boxing penalty? Are integers boxed in a regular Array? (Tweet)
They're not boxed, but they give up one bit to mark this fact. I.e., there is a bit set aside to indicate that they aren't boxed. (Claribel)
@JeffreyScofield: I would bet my hat that any potential tagging penalty is lost in the noise of cache misses incurred by these widely spread accesses. I'm quite sure the performance gain of the string (or C++ bitvector) solution comes down to the newfound cache efficiency of a more economical data structure. I have no idea how the proposed bitvector structures were implemented, but my bet would be that the performance cost is due to function calls (because a missing .cmx prevents inlining, or some such) rather than the actual code. (Hyacinthhyacintha)
I suspect you have better intuition than I do. In my tests I saw a 10x slowdown for a straightforward 62-booleans-per-int implementation where inlining seemed possible. (I had all sources in one place.) (Claribel)
The second part was too much guessing and not enough profiling. It turns out I can get a 1.55s runtime with bit fiddling loosely adapted from the Batteries BitSet data structure, but you need to do a bit more than just make sure inlining works (that alone will get you in the 2x-4x ballpark, though). (Hyacinthhyacintha)
Here is the code for the bitset version, now 1.5s on my machine. I don't think it's worth going into this when the String version is quick enough. (Hyacinthhyacintha)

It's not often useful to compare micro-benchmarks like this, but the basic conclusion is probably correct. This is a case where OCaml is at a distinct disadvantage. C++ can access a more or less ideal representation (vector of machine integers). OCaml can make a vector, but can't get at the machine integers directly. So OCaml has to use div and mod where C++ can use shift and mask.
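To make the div-and-mod point concrete, here is an illustrative sketch (not code from either solution): when the chunk size is a power of two, the word index and bit offset reduce to a shift and a mask, whereas addressing by OCaml's 63-bit usable word size forces genuine division.

```ocaml
(* Addressing bit i in 8-bit chunks (a power of two):
   i / 8 is i lsr 3 and i mod 8 is i land 7. *)
let index_pow2 i = (i lsr 3, i land 7)

(* Addressing bit i in 63-bit chunks (OCaml's usable word size):
   63 is not a power of two, so real div/mod is required. *)
let index_63 i = (i / 63, i mod 63)

let () =
  assert (index_pow2 100 = (100 / 8, 100 mod 8));
  assert (index_63 100 = (1, 37))
```

Compilers turn division by a power-of-two constant into exactly this shift/mask pair, which is why the C++ bitvector, dividing by 32 or 64, pays so little for its addressing.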

I reproduced this test (using a different bit vector library) and found that appreciable time in OCaml was spent constructing the result, which isn't a bit array. So the test might not be measuring exactly what you think.

Update

I tried some quick tests packing 32 booleans into a 63-bit int. It does seem to make things go faster, but only a little bit. It's not a perfect test, but it suggests gasche is right that the non-power-of-2 effect is minor.
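A rough reconstruction of the experiment described above (my sketch, not the original test code): pack 32 booleans per OCaml int, so addressing stays shift/mask even though the upper bits of each 63-bit word are wasted.

```ocaml
(* Bit i lives in word (i lsr 5) at offset (i land 31): 32 bits
   per int, leaving the rest of each 63-bit word unused. *)
let create n = Array.make (n / 32 + 1) 0

let get a i = (a.(i lsr 5) lsr (i land 31)) land 1 = 1

let set a i b =
  let w = i lsr 5 and m = 1 lsl (i land 31) in
  a.(w) <- if b then a.(w) lor m else a.(w) land (lnot m)

let () =
  let a = create 100 in
  set a 70 true;
  assert (get a 70);
  assert (not (get a 71));
  set a 70 false;
  assert (not (get a 70))
```

Using 62 bits per word would halve the memory but brings back div/mod by a non-power-of-two, which is the trade-off discussed in the comments below.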

Claribel answered 6/2, 2013 at 18:14
We have to use div and mod regardless of the language to find a word and the offset within it. (Tweet)
If you use the whole machine word you divide and mod by a power of 2, and you can use shifts and masks for these, which are faster. OCaml sets aside 1 bit of each machine word as a boxing tag. That is its disadvantage (div and mod by 31 or 63), and I suspect it accounts for most of the speed differential in this micro-benchmark. (Claribel)
OK, got it! It seems it really is slower due to the tag bits. (Tweet)

It seems Jeffrey Scofield is right: the terrible performance degradation is due to the div and mod operations.

I prototyped a small Bitarray module:

module Bitarray = struct
  type t = { len : int; buf : string }

  let create len x =
    let init = (if x = true then '\255' else '\000') in
    let buf = String.make (len / 8 + 1) init in
    { len = len; buf = buf }

  let get t i =
    let ch = int_of_char (t.buf.[i lsr 3]) in
    let mask = 1 lsl (i land 7) in
    (ch land mask) <> 0

  let set t i b =
    let index = i lsr 3 in
    let ch = int_of_char (t.buf.[index]) in
    let mask = 1 lsl (i land 7) in
    let new_ch = if b then (ch lor mask) else (ch land lnot mask) in
    t.buf.[index] <- char_of_int new_ch
end

It uses a string as a byte array (8 bits per char). Initially I used x / 8 and x mod 8 for bit extraction, and it was 10x slower than the C++ code. Then I replaced them with x lsr 3 and x land 7; now it is only 4x slower than C++.
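A note for readers on current compilers: since OCaml 4.06 strings are immutable by default, so the module above no longer compiles as written. A close adaptation using Bytes (my sketch, same bit layout as the original) looks like this:

```ocaml
(* Bytes-based adaptation of the module above: same 8-bits-per-char
   layout, with mutable Bytes.t replacing the old mutable string. *)
module Bitarray = struct
  type t = { len : int; buf : Bytes.t }

  let create len x =
    let init = if x then '\255' else '\000' in
    { len; buf = Bytes.make (len / 8 + 1) init }

  let length t = t.len

  let get t i =
    let ch = Char.code (Bytes.get t.buf (i lsr 3)) in
    ch land (1 lsl (i land 7)) <> 0

  let set t i b =
    let index = i lsr 3 in
    let ch = Char.code (Bytes.get t.buf index) in
    let mask = 1 lsl (i land 7) in
    let ch' = if b then ch lor mask else ch land (lnot mask) in
    Bytes.set t.buf index (Char.chr ch')
end

let () =
  let t = Bitarray.create 100 true in
  assert (Bitarray.length t = 100);
  assert (Bitarray.get t 42);
  Bitarray.set t 42 false;
  assert (not (Bitarray.get t 42))
```

The lsr/land addressing is unchanged, so the performance characteristics discussed in this answer carry over.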

Tweet answered 6/2, 2013 at 23:44

Please make sure that you install Core including the .cmx file (.cmxa is not enough!), otherwise cross-module inlining will not work. Your profile suggests that certain calls may not have been inlined, which would explain a dramatic loss of efficiency.

Sadly, the Oasis packaging tool, which a lot of OCaml projects use, currently has a bug that prevents it from installing the .cmx file. The Core package is also affected by this problem, probably irrespective of which package manager (Opam, Godi) you use.

Borrell answered 12/2, 2013 at 17:58
