Interpreting a benchmark in C, Clojure, Python, Ruby, Scala and others [closed]
Disclaimer

I know that artificial benchmarks are evil. They can only show results for a very specific, narrow situation. I don't assume that one language is better than another because of some silly benchmark. However, I wonder why the results are so different. Please see my questions at the bottom.

Math benchmark description

The benchmark is a simple math calculation: finding pairs of prime numbers that differ by 6 (so-called sexy primes). E.g. the sexy primes below 100 would be: (5 11) (7 13) (11 17) (13 19) (17 23) (23 29) (31 37) (37 43) (41 47) (47 53) (53 59) (61 67) (67 73) (73 79) (83 89) (97 103)

Results table

In the table: calculation time in seconds. Environment: everything except Factor was run in VirtualBox (Debian unstable amd64 guest, Windows 7 x64 host). CPU: AMD A4-3305M

  Sexy primes up to:        10k      20k      30k      100k
  Bash                    58.00   200.00     [*1]      [*1]
  C                        0.20     0.65     1.42     15.00
  Clojure1.4               4.12     8.32    16.00    137.93
  Clojure1.4 (optimized)   0.95     1.82     2.30     16.00
  Factor                    n/a      n/a    15.00    180.00
  Python2.7                1.49     5.20    11.00    119.00
  Ruby1.8                  5.10    18.32    40.48    377.00
  Ruby1.9.3                1.36     5.73    10.48    106.00
  Scala2.9.2               0.93     1.41     2.73     20.84
  Scala2.9.2 (optimized)   0.32     0.79     1.46     12.01

[*1] - I'm afraid to imagine how much time it would take

Code listings

C:

#include <stdio.h>

int isprime(int x) {
  int i;
  for (i = 2; i < x; ++i)
    if (x%i == 0) return 0;
  return 1;
}

void findprimes(int m) {
  int i;
  for (i = 11; i < m; ++i)
    if (isprime(i) && isprime(i-6))
      printf("%d %d\n", i-6, i);
}

int main(void) {
    findprimes(10*1000);
    return 0;
}

Ruby:

def is_prime?(n)
  (2...n).all?{|m| n%m != 0 }
end

def sexy_primes(x)
  (9..x).map do |i|
    [i-6, i]
  end.select do |j|
    j.all?{|j| is_prime? j}
  end
end

a = Time.now
p sexy_primes(10*1000)
b = Time.now
puts "#{(b-a)*1000} mils"

Scala:

def isPrime(n: Int) =
  (2 until n) forall { n % _ != 0 }

def sexyPrimes(n: Int) = 
  (11 to n) map { i => List(i-6, i) } filter { _ forall(isPrime(_)) }

val a = System.currentTimeMillis()
println(sexyPrimes(100*1000))
val b = System.currentTimeMillis()
println((b-a).toString + " mils")

Scala optimized isPrime (the same idea as in the Clojure optimization):

import scala.annotation.tailrec

@tailrec // Not required, but will warn if optimization doesn't work
def isPrime(n: Int, i: Int = 2): Boolean = 
  if (i == n) true 
  else if (n % i != 0) isPrime(n, i + 1)
  else false

Clojure:

(defn is-prime? [n]
  (every? #(> (mod n %) 0)
    (range 2 n)))

(defn sexy-primes [m]
  (for [x (range 11 (inc m))
        :let [z (list (- x 6) x)]
        :when (every? #(is-prime? %) z)]
      z))

(let [a (System/currentTimeMillis)]
  (println (sexy-primes (* 10 1000)))
  (let [b (System/currentTimeMillis)]
    (println (- b a) "mils")))

Clojure optimized is-prime?:

(defn ^:static is-prime? [^long n]
  (loop [i (long 2)] 
    (if (= (rem n i) 0)
      false
      (if (>= (inc i) n) true (recur (inc i))))))

Python:

import time as time_

def is_prime(n):
  return all((n%j > 0) for j in xrange(2, n))

def primes_below(x):
  return [[j-6, j] for j in xrange(9, x+1) if is_prime(j) and is_prime(j-6)]

a = int(round(time_.time() * 1000))
print(primes_below(10*1000))
b = int(round(time_.time() * 1000))
print(str((b-a)) + " mils")

Factor:

MEMO:: prime? ( n -- ? )
n 1 - 2 [a,b] [ n swap mod 0 > ] all? ;

MEMO: sexyprimes ( n n -- r r )
[a,b] [ prime? ] filter [ 6 + ] map [ prime? ] filter dup [ 6 - ] map ;

5 10 1000 * sexyprimes . .

Bash (zsh):

#!/usr/bin/zsh
function prime {
  for (( i = 2; i < $1; i++ )); do
    if [[ $[$1%i] == 0 ]]; then
      echo 1
      exit
    fi
  done
  echo 0
}

function sexy-primes {
  for (( i = 9; i <= $1; i++ )); do
    j=$[i-6]
    if [[ $(prime $i) == 0 && $(prime $j) == 0 ]]; then
      echo $j $i
    fi
  done
}

sexy-primes 10000

Questions

  1. Why is Scala so fast? Is it because of static typing? Or is it just using the JVM very efficiently?
  2. Why such a huge difference between Ruby and Python? I didn't think these two were so totally different. Maybe my code is wrong. Please enlighten me! Thanks. UPD: Yes, that was an error in my code. Python and Ruby 1.9 are pretty much equal.
  3. Really impressive jump in performance between Ruby versions.
  4. Can I optimize the Clojure code by adding type declarations? Will it help?
Hellenistic answered 25/7, 2012 at 0:20 Comment(28)
1) Your question is not a good fit for SO, imho. 2) Why didn't you provide Clojure and Scala versions?Herwick
@pst I mean language versions, like 2.8, 2.9, ...Herwick
Could you add your compiler options for each language?Exclosure
Not that it has anything to do with your performance checks, but from an algorithm standpoint, you really only need to check up to something like n/2 to know if it's prime or not (although it probably won't make a huge difference).Speaker
If you take a look at shootout.alioth.debian.org you will find that your results are all in the typical relative rangesSchweiker
@Speaker actually up to sqrt(n), but that can take some time to compute. Also your C code prints the primes as it finds them, whereas your other languages compute them into lists and then print them out. While C is unsurprisingly the fastest, you might be able to get it faster.Exterritorial
@Exterritorial -- really, sqrt(n)? I guess that makes sense ... Awesome. I wish I could favorite that comment so I would always remember it (or at least be able to find it). I was just thinking naively which is why I said n/2 (which is just a bit-shift, so that should be fast to compute).Speaker
(And of course the Sieve of Eratosthenes .. but this micro benchmark is pretty much a stress test of iteration and math operations. However, they are still not "fair" as in some are more lazy.)Handiwork
Could I send you a Go source for you to run on your VirtualBox? I am getting a very fast result on my laptop but it's obviously a very different configuration and I would like to know how it compares to the other languages under your configuration.Implacable
I just ran both my Go version and your C version (that look a lot alike) and I practically got the same speed in both of them. I only tried the 100k version: C: 2.723s Go: 2.743s.Implacable
With 1000*1000 (1M instead of 100K): C: 3m35.458s Go: 3m36.259sImplacable
Added a PHP version. 100K: PHP: 1m3.766s Really slow compared to C and Go.Implacable
@Exterritorial -- sqrt() is usually a well-optimized library function. I think the cost of a single sqrt() on a single float value will be much cheaper than doing repeated modulus operations that are not needed, and the bigger the numbers being checked get, the bigger the savings. I just tried the code from my answer, and changed the check for sqrt() to just n - 1 and execution time more than doubled (to 19 seconds).Coumarone
@Speaker -- I tried your optimization (check to n/2), and it cut time to half in all cases. Obviously, proportionally, the performance ratio between languages was the same than before, but I think you proved a point: performance-wise, writing smart code is more important than choosing a fast language.Implacable
@mgilson: wait till you hear about deterministic variants of Miller-Rabin primality test. They are relatively simple to implement e.g., Francky's Python code in the discussion of problem 387 on projecteuler.net.Cupp
For Factor, please take a look at math.primes vocabulary. It already has a very well optimized prime? word.Encompass
You don't need to compute sqrt for this check. You can compute the square of i as in for (i = 2; i * i <= x; ++i) ...Magnetron
could move to codegolf.stack... and try to see which language is faster, or who can implement the fastest for each language, etc. I think that it would be interesting to see better sugestions thereMorton
Performance comparisons are fun, but this one in particular will rely on arbitrary implementation details that are only specific to this problem.Hardee
I suggest you annotate Scala's optimized isPrime with @tailrec, to ensure you are using tail recursion. It is easy to mistakenly do something that prevents tail recursion, and this annotation should warn you if that happens.Alexandria
@DanielC.Sobral Good point! To do that I should import scala.annotation.tailrecHellenistic
@pst: Python code that uses unoptimized Sieve of Eratosthenes runs in 0.03 seconds (30 milliseconds) for 100K i.e., 500 times faster than the C version above.Cupp
@J.F.Sebastian Yeah, that's great. But you are using better algorithm. The point was to understand how good can we optimize the simple (stupid) algorithm in different languages.Hellenistic
@ArtemIce: I understand that. That's why the comment is for @ pst who mentioned this algorithm.Cupp
btw, typed Python version with explicit for-loop is 15-20 times faster (accepted answer for Clojure also uses types). This version uses the exact same algorithm.Cupp
I've posted results of running the benchmark on RPython, Pypy, Cython, Jython, CPython 2.x, CPython 3.Cupp
In revision 11 that pastebin error indicates that you must have removed the : Boolean type annotation from the isPrime method (required on recursive methods due to limitations in type inference). The call site { _ forall isPrime } should be OK with any Scala versionEngineer
I've added your pure C code to the benchmark. Cython-based variant has almost the same performance as C. Note: the only difference between pure Python and Cython variants is @cython.locals(n=int, j=int) decorator that tells Cython about static types i.e., it is similar to optimized Clojure code in that regard (but faster).Cupp
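To make the divisor-bound discussion in the comments concrete, here is a small Python sketch (function names are illustrative, not from the thread) of trial division with the `i * i <= n` cutoff, which avoids computing a square root entirely:

```python
def is_prime_naive(n):
    """Trial division up to n - 1, as in the benchmarked code."""
    return n > 1 and all(n % i for i in range(2, n))

def is_prime_bounded(n):
    """Trial division, but stop once i * i > n: any composite n
    must have a divisor no larger than its square root."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True
```

The two functions agree on every input; the bounded version does O(sqrt(n)) work per call instead of O(n), without ever calling sqrt().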

Rough answers:

  1. Scala's static typing is helping it quite a bit here - this means that it uses the JVM pretty efficiently without too much extra effort.
  2. I'm not exactly sure on the Ruby/Python difference, but I suspect that (2...n).all? in the function is-prime? is likely to be quite well optimised in Ruby (EDIT: sounds like this is indeed the case, see Julian's answer for more detail...)
  3. Ruby 1.9.3 is just much better optimised
  4. Clojure code can certainly be accelerated a lot! While Clojure is dynamic by default, you can use type hints, primitive maths etc. to get close to Scala / pure Java speed in many cases when you need to.

Most important optimisation in the Clojure code would be to use typed primitive maths within is-prime?, something like:

(set! *unchecked-math* true) ;; at top of file to avoid using BigIntegers

(defn ^:static is-prime? [^long n]
  (loop [i (long 2)] 
    (if (zero? (mod n i))
      false
      (if (>= (inc i) n) true (recur (inc i))))))

With this improvement, I get Clojure completing 10k in 0.635 secs (i.e. the second fastest on your list, beating Scala)

P.S. note that you have printing code inside your benchmark in some cases - not a good idea as it will distort the results, especially if using a function like print for the first time causes initialisation of IO subsystems or something like that!
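The P.S. can be illustrated with a minimal Python sketch (the helper names and the smaller limit are mine): do all computation inside the timed region, and print only after the clock has stopped.

```python
import time

def is_prime(n):
    return n > 1 and all(n % i for i in range(2, n))

def sexy_primes(m):
    return [(i - 6, i) for i in range(11, m) if is_prime(i) and is_prime(i - 6)]

start = time.time()
pairs = sexy_primes(2 * 1000)   # timed region: computation only (small limit to keep this quick)
elapsed = time.time() - start

print(pairs)                     # I/O happens outside the measurement
print("{:.0f} mils".format(elapsed * 1000))
```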

Xyloid answered 25/7, 2012 at 1:12 Comment(5)
I don't think the bit about Ruby and Python is necessarily true, but +1 otherwise ..Handiwork
Typing didn't show any measurable stable result, but your new is-prime? shows 2x improvement. ;)Hellenistic
couldn't this be made faster if there was an unchecked-mod ?Evaporation
@Evaporation - probably! not sure how well this gets optimised by the current Clojure compiler, there's probably room for improvement. Clojure 1.4 definitely helps a lot in general for this kind of stuff, 1.5 will probably be even better.Xyloid
(zero? (mod n i)) should be faster than (= (mod n i) 0)Valdavaldas

Here's a fast Clojure version, using the same basic algorithms:

(set! *unchecked-math* true)

(defn is-prime? [^long n]
  (loop [i 2]
    (if (zero? (unchecked-remainder-int n i))
      false
      (if (>= (inc i) n)
        true
        (recur (inc i))))))

(defn sexy-primes [m]
  (for [x (range 11 (inc m))
        :when (and (is-prime? x) (is-prime? (- x 6)))]
    [(- x 6) x]))

It runs about 20x faster than your original on my machine. And here's a version that leverages the new reducers library in 1.5 (requires Java 7 or JSR 166):

(require '[clojure.core.reducers :as r]) ;'

(defn sexy-primes [m]
  (->> (vec (range 11 (inc m)))
       (r/filter #(and (is-prime? %) (is-prime? (- % 6))))
       (r/map #(list (- % 6) %))
       (r/fold (fn ([] []) ([a b] (into a b))) conj)))

This runs about 40x faster than your original. On my machine, that's 100k in 1.5 seconds.

Semiprofessional answered 25/7, 2012 at 5:19 Comment(1)
Using unchecked-remainder-int or just rem instead of mod along with static typing results to 4x performance increase. Nice!Hellenistic

I'll answer just #2, since it's the only one I've got anything remotely intelligent to say about: in your Python code you're creating an intermediate list in is_prime, whereas the .all? in your Ruby is_prime? just iterates.

If you change your is_prime to:

def is_prime(n):
    return all((n%j > 0) for j in range(2, n))

they're on par.

I could optimize the Python further, but my Ruby isn't good enough to know when I've given more of an advantage (e.g., using xrange makes Python win on my machine, but I don't remember if the Ruby range you used creates an entire range in memory or not).

EDIT: Without being too silly, making the Python code look like:

import time

def is_prime(n):
    return all(n % j for j in xrange(2, n))

def primes_below(x):
    return [(j-6, j) for j in xrange(9, x + 1) if is_prime(j) and is_prime(j-6)]

a = int(round(time.time() * 1000))
print(primes_below(10*1000))
b = int(round(time.time() * 1000))
print(str((b-a)) + " mils")

which doesn't change much more, puts it at 1.5s for me, and, with being extra silly, running it with PyPy puts it at .3s for 10K, and 21s for 100K.

Rutharuthann answered 25/7, 2012 at 0:35 Comment(9)
The generator makes a big difference here as it allows the function to bail on the first False (good catch).Speaker
I'm really looking forward to them getting numpy into PyPy ... That's going to be awesome.Speaker
Would you please run my answer in PyPy? I'm curious how much faster that would be.Coumarone
You are totally right about both iterating thing and xrange! I've fixed and now Python and Ruby show equal results.Hellenistic
@Coumarone I'll do so only if you promise to now go out and download PyPy yourself :)! pypy.org/download.html has binaries for all common OSes, and your package manager undoubtedly has it. Anyways, as for your benchmark, with a random lru_cache implementation for 2.7 found on AS, 100K runs in 2.3s.Rutharuthann
@Julian, I have already downloaded PyPy but I didn't successfully get it to bootstrap, and I got busy and never went back. I should try that again... PyPy is really cool.Coumarone
(n%j != 0) or just n % j might be more readable than (n%j > 0)Cupp
a very tiny optimisation: switch is_prime(j) and is_prime(j-6) in primes_below .)Ladida
I've posted results of running the benchmark on RPython, Pypy, Cython, Jython, CPython 2.x, CPython 3. Typed Cython version is 15-20 times faster.Cupp
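The short-circuiting that the comments credit for the speedup can be demonstrated directly: `all()` over a generator stops at the first falsy value, so a composite with a small factor exits almost immediately. A small counting sketch (the `divisions_until_verdict` helper is mine, not part of the original code):

```python
def divisions_until_verdict(n):
    """Count how many trial divisions all() performs before returning."""
    count = 0
    def checks():
        nonlocal count
        for j in range(2, n):
            count += 1
            yield n % j > 0
    all(checks())
    return count

# 1000 = 2 * 500: the generator bails after a single division.
# 997 is prime: every candidate divisor in range(2, 997) must be tried.
```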

You can make the Scala a lot faster by modifying your isPrime method to

  def isPrime(n: Int, i: Int = 2): Boolean = 
    if (i == n) true 
    else if (n % i != 0) isPrime(n, i + 1)
    else false

Not quite as concise but the program runs in 40% of the time!

We cut out the superfluous Range and anonymous Function objects, the Scala compiler recognizes the tail-recursion and turns it into a while-loop, which the JVM can turn into more or less optimal machine code, so it shouldn't be too far off the C version.

See also: How to optimize for-comprehensions and loops in Scala?
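The transformation this answer describes, tail recursion compiled down to a loop, can be sketched in Python, where no such optimization exists and the loop form must be written by hand (function names are illustrative):

```python
def is_prime_rec(n, i=2):
    """Direct transcription of the tail-recursive Scala isPrime (assumes n >= 2)."""
    if i == n:
        return True
    if n % i != 0:
        return is_prime_rec(n, i + 1)
    return False

def is_prime_loop(n):
    """The same logic after the tail call is rewritten as a loop;
    this is roughly what the Scala compiler emits."""
    i = 2
    while i != n:
        if n % i == 0:
            return False
        i += 1
    return True
```

In CPython the recursive form hits the default recursion limit for n around 1000, which is exactly why compiling the tail call into a loop matters.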

Engineer answered 25/7, 2012 at 1:42 Comment(3)
2x improvement. And nice link!Hellenistic
btw this method body is identical to i == n || n % i != 0 && isPrime(n, i + 1), which is shorter, albeit a bit harder to readEngineer
You should have added the @tailrec annotation, to ensure it will make that optimization.Alexandria

Never mind the benchmarks; the problem got me interested and I made some fast tweaks. This uses the lru_cache decorator, which memoizes a function; so when we call is_prime(i-6) we basically get that prime check for free. This change cuts the work roughly in half. Also, we can make the range() calls step through just the odd numbers, cutting the work roughly in half again.

http://en.wikipedia.org/wiki/Memoization

http://docs.python.org/dev/library/functools.html

This requires Python 3.2 or newer to get lru_cache, but could work with an older Python if you install a Python recipe that provides lru_cache. If you are using Python 2.x you should really use xrange() instead of range().

http://code.activestate.com/recipes/577479-simple-caching-decorator/

from functools import lru_cache
import time as time_

@lru_cache()
def is_prime(n):
    return n%2 and all(n%i for i in range(3, n, 2))

def primes_below(x):
    return [(i-6, i) for i in range(9, x+1, 2) if is_prime(i) and is_prime(i-6)]

correct100 = [(5, 11), (7, 13), (11, 17), (13, 19), (17, 23), (23, 29),
        (31, 37), (37, 43), (41, 47), (47, 53), (53, 59), (61, 67), (67, 73),
        (73, 79), (83, 89)]
assert(primes_below(100) == correct100)

a = time_.time()
print(primes_below(30*1000))
b = time_.time()

elapsed = b - a
print("{} msec".format(round(elapsed * 1000)))

The above took only a very short time to edit. I decided to take it one step further, and make the primes test only try prime divisors, and only up to the square root of the number being tested. The way I did it only works if you check numbers in order, so it can accumulate all the primes as it goes; but this problem was already checking the numbers in order so that was fine.

On my laptop (nothing special; processor is a 1.5 GHz AMD Turion II "K625") this version produced an answer for 100K in under 8 seconds.

from functools import lru_cache
import math
import time as time_

known_primes = set([2, 3, 5, 7])

@lru_cache(maxsize=128)
def is_prime(n):
    last = math.ceil(math.sqrt(n))
    flag = n%2 and all(n%x for x in known_primes if x <= last)
    if flag:
        known_primes.add(n)
    return flag

def primes_below(x):
    return [(i-6, i) for i in range(9, x+1, 2) if is_prime(i) and is_prime(i-6)]

correct100 = [(5, 11), (7, 13), (11, 17), (13, 19), (17, 23), (23, 29),
        (31, 37), (37, 43), (41, 47), (47, 53), (53, 59), (61, 67), (67, 73),
        (73, 79), (83, 89)]
assert(primes_below(100) == correct100)

a = time_.time()
print(primes_below(100*1000))
b = time_.time()

elapsed = b - a
print("{} msec".format(round(elapsed * 1000)))

The above code is pretty easy to write in Python, Ruby, etc. but would be more of a pain in C.

You can't compare the numbers on this version against the numbers from the other versions without rewriting the others to use similar tricks. I'm not trying to prove anything here; I just thought the problem was fun and I wanted to see what sort of easy performance improvements I could glean.

Coumarone answered 25/7, 2012 at 1:53 Comment(6)
lru_cache is definitely nifty. For certain classes of problem, such as generating successive Fibonacci numbers, it can give a huge speedup just by adding that one line decorator on the function! Here's a link to a Raymond Hettinger talk that covers lru_cache about 26 minutes in. blip.tv/pycon-us-videos-2009-2010-2011/…Coumarone
By using lru_cache, you're really using a different algorithm rather than the raw code. So the performance is about the algorithm, not the language itself.Its
@Its -- I don't understand what you mean. lru_cache avoids repeating a calculation that was already done recently, and that is all; I don't see how that is "actually us[ing] another algorithm". And Python suffers from being slow, but benefits from having cool stuff like lru_cache; I don't see anything wrong with using the beneficial parts of a language. And I said that one shouldn't compare the run time of my answer against the other languages without making similar changes to the others. So, I don't understand what you mean.Coumarone
@Its is right, but on the other hand the convenience of a higher level language should be allowed unless additional constraints are given. lru_cache will sacrifice memory for speed, and adjusts the algorithmic complexity.Hardee
I copied your final script, downloaded a lru_cache implementation from code.activestate.com/recipes/… and ran your script in PyPy 1.8.0 and it completed in 1.213 seconds. Not too shabby!Tsushima
if you use another algorithm you could try Sieve of Eratosthenes. Python version produced an answer for 100K in under 0.03 seconds (30 ms).Cupp
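For reference, the Sieve of Eratosthenes mentioned here is a different algorithm from the benchmarked trial division (which is the commenters' point); a sketch in Python, with helper names of my choosing:

```python
def sieve_primes(limit):
    """Sieve of Eratosthenes: all primes strictly below limit,
    in O(n log log n) time for the whole range."""
    if limit < 3:
        return []
    is_prime = bytearray([1]) * limit
    is_prime[0] = is_prime[1] = 0
    for i in range(2, int(limit ** 0.5) + 1):
        if is_prime[i]:
            # cross out all multiples of i starting at i*i
            is_prime[i * i::i] = bytearray(len(range(i * i, limit, i)))
    return [i for i in range(limit) if is_prime[i]]

def sexy_primes(limit):
    """Pairs (p, p + 6) with both members prime and p + 6 below limit."""
    prime_set = set(sieve_primes(limit))
    return [(p, p + 6) for p in sorted(prime_set) if p + 6 in prime_set]
```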

Here is my Scala version in both parallel and non-parallel flavours, just for fun: (on my dual-core computer, the parallel version takes 335ms while the non-parallel version takes 655ms)

object SexyPrimes {
  def isPrime(n: Int): Boolean = 
    (2 to math.sqrt(n).toInt).forall{ n%_ != 0 }

  def isSexyPrime(n: Int): Boolean = isPrime(n) && isPrime(n-6)

  def findPrimesPar(n: Int) {
    for(k <- (11 to n).par)
      if(isSexyPrime(k)) printf("%d %d\n",k-6,k)
  }

  def findPrimes(n: Int) {
    for(k <- 11 to n)
      if(isSexyPrime(k)) printf("%d %d\n",k-6,k)
  }


  def timeOf(call : =>Unit) {
    val start = System.currentTimeMillis
    call
    val end = System.currentTimeMillis
    println((end-start)+" mils")
  }

  def main(args: Array[String]) {
    timeOf(findPrimes(100*1000))
    println("------------------------")
    timeOf(findPrimesPar(100*1000))
  }
}

EDIT: Following Emil H's suggestion, I have changed my code to avoid the effects of IO and JVM warmup:

The results on my computer:

List(3432, 1934, 3261, 1716, 3229, 1654, 3214, 1700)

object SexyPrimes {
  def isPrime(n: Int): Boolean = 
    (2 to math.sqrt(n).toInt).forall{ n%_ != 0 }

  def isSexyPrime(n: Int): Boolean = isPrime(n) && isPrime(n-6)

  def findPrimesPar(n: Int) {
    for(k <- (11 to n).par)
      if(isSexyPrime(k)) ()//printf("%d %d\n",k-6,k)
  }

  def findPrimes(n: Int) {
    for(k <- 11 to n)
      if(isSexyPrime(k)) ()//printf("%d %d\n",k-6,k)
  }


  def timeOf(call : =>Unit): Long = {
    val start = System.currentTimeMillis
    call
    val end = System.currentTimeMillis
    end - start 
  }

  def main(args: Array[String]) {
    val xs = timeOf(findPrimes(1000*1000))::timeOf(findPrimesPar(1000*1000))::
             timeOf(findPrimes(1000*1000))::timeOf(findPrimesPar(1000*1000))::
             timeOf(findPrimes(1000*1000))::timeOf(findPrimesPar(1000*1000))::
             timeOf(findPrimes(1000*1000))::timeOf(findPrimesPar(1000*1000))::Nil
    println(xs)
  }
}
Its answered 25/7, 2012 at 6:47 Comment(3)
Is the code affected by jvm warmup? E.g. isSexyPrime might be (more) optimized when called from findPrimesPar and not so much when called from findPrimesGhostwrite
@EmilH Fair enough. I have changed my code to avoid the effect of the io and jvm warmup.Its
Only going up to sqrt(n) is a good optimization, but you're now benchmarking a different algorithm.Engineer

Don't forget Fortran! (Mostly joking, but I would expect performance similar to C.) The statements with exclamation points are optional, but good style. (! is a comment character in Fortran 90)

logical function isprime(n)
IMPLICIT NONE !
integer :: n,i
isprime = .true.
do i=2,n-1
   if(mod(n,i).eq.0) then
      isprime = .false.
      return
   endif
enddo
end

subroutine findprimes(m)
IMPLICIT NONE !
integer :: m,i
logical, external :: isprime

do i=11,m
   if(isprime(i) .and. isprime(i-6))then
      write(*,*) i-6,i
   endif
enddo
end

program main
call findprimes(10*1000)
end
Speaker answered 25/7, 2012 at 12:1 Comment(0)

I couldn't resist doing a few of the most obvious optimizations for the C version, which made the 100k test take 0.3s on my machine (5 times faster than the C version in the question, both compiled with MSVC 2010 /Ox).

#include <stdio.h>

int isprime( int x )
{
    int i, n;
    for( i = 3, n = x >> 1; i <= n; i += 2 )
        if( x % i == 0 )
            return 0;
    return 1;
}

void findprimes( int m )
{
    int i, s = 3; // s is bitmask of primes in last 3 odd numbers
    for( i = 11; i < m; i += 2, s >>= 1 ) {
        if( isprime( i ) ) {
            if( s & 1 )
                printf( "%d %d\n", i - 6, i );
            s |= 1 << 3;
        }
    }
}

int main( void ) {
    findprimes( 10 * 1000 );
    return 0;
}

Here is the identical implementation in Java:

public class prime
{
    private static boolean isprime( final int x )
    {
        for( int i = 3, n = x >> 1; i <= n; i += 2 )
            if( x % i == 0 )
                return false;
        return true;
    }

    private static void findprimes( final int m )
    {
        int s = 3; // s is bitmask of primes in last 3 odd numbers
        for( int i = 11; i < m; i += 2, s >>= 1 ) {
            if( isprime( i ) ) {
                if( ( s & 1 ) != 0 )
                    print( i );
                s |= 1 << 3;
            }
        }
    }

    private static void print( int i )
    {
        System.out.println( ( i - 6 ) + " " + i );
    }

    public static void main( String[] args )
    {
        // findprimes( 300 * 1000 ); // for some JIT training
        long time = System.nanoTime();
        findprimes( 10 * 1000 );
        time = System.nanoTime() - time;
        System.err.println( "time: " + ( time / 10000 ) / 100.0 + "ms" );
    }
}

With Java 1.7.0_04 this runs almost exactly as fast as the C version. Client or server VM doesn't show much difference, except that JIT training seems to help the server VM a bit (~3%) while it has almost no effect with the client VM. The output in Java seems to be slower than in C. If the output is replaced with a static counter in both versions, the Java version runs a little faster than the C version.

These are my times for the 100k run:

  • 319ms C compiled with /Ox and output to >NIL:
  • 312ms C compiled with /Ox and static counter
  • 324ms Java client VM with output to >NIL:
  • 299ms Java client VM with static counter

and the 1M run (16386 results):

  • 24.95s C compiled with /Ox and static counter
  • 25.08s Java client VM with static counter
  • 24.86s Java server VM with static counter

While this does not really answer your questions, it shows that small tweaks can have a noteworthy impact on performance. So to be able to really compare languages you should try to avoid all algorithmic differences as much as possible.

It also gives a hint why Scala seems rather fast. It runs on the Java VM and thus benefits from its impressive performance.
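The `s` bitmask in the C/Java versions above is compact but subtle; here is the same trick transcribed into Python (helper names are mine), which can be checked against the expected output:

```python
def is_odd_prime(x):
    """Trial division by odd divisors up to x >> 1; assumes x is odd, x >= 3."""
    return all(x % i for i in range(3, (x >> 1) + 1, 2))

def sexy_pairs_bitmask(m):
    """Port of the optimized C findprimes: s caches primality of the
    last three odd numbers, so each value is tested exactly once."""
    pairs = []
    s = 3          # seeds bits for 5 (prime), 7 (prime), 9 (not prime)
    i = 11
    while i < m:
        if is_odd_prime(i):
            if s & 1:                  # bit 0: was i - 6 prime?
                pairs.append((i - 6, i))
            s |= 1 << 3                # record that i itself is prime
        i += 2
        s >>= 1                        # slide the window by one odd number
    return pairs
```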

Pili answered 25/7, 2012 at 14:39 Comment(1)
It's faster to go to sqrt(x) instead of x>>1 for the prime check function.Denominational

In Scala, try using Tuple2 instead of List; it should be faster. Just remove the word 'List', since (x, y) is a Tuple2.

Tuple2 is specialized for Int, Long and Double meaning it won't have to box/unbox those raw datatypes. Tuple2 source. List isn't specialized. List source.

Uchish answered 25/7, 2012 at 0:45 Comment(9)
Then you can't call forall on it. I also thought that this might not be the most efficient code (more because a big strict collection is created for large n instead of just using a view), but it is certainly short + elegant, and I was surprised how well it performed despite using a lot of functional style.Schweiker
You are right, I thought 'forAll' was there. Still, there should be a big improvement over List, and it would not be that bad to have those 2 calls.Uchish
it is indeed faster, with def sexyPrimes(n: Int) = (11 to n).map(i => (i-6, i)).filter({ case (i, j) => isPrime(i) && isPrime(j) }) it is about 60% faster here, so should beat the C code :)Schweiker
Hmm, I only get a performance increase of 4 or 5 %Engineer
Yes, I also get huge fluctuations regarding the original test when run repeatedly (I saw speed up from 0% to 100% now :)Schweiker
Why did you put the contents of filter in a case? filter((i, j) => isPrime(i) && isPrime(j)) is enough. I would also try (11 to n) collect { case i if isPrime(i-6) && isPrime(i) => (i-6, i) }Uchish
Because filter receives a Function1 whose argument is a Tuple2 which must be extracted, you cannot use a Function2. I created the tuple first, so the i-6 doesn't appear twice, but certainly your example with collect works as well.Schweiker
Sure, you are right. filter(p => isPrime(p._1) && isPrime(p._2)) would have worked, but it is a bit ugly.Uchish
I find collect substantially slower. Faster is if you do the filter first and then map. withFilter is slightly faster because it doesn't actually create intermediate collections. (11 to n) withFilter (i => isPrime(i - 6) && isPrime(i)) map (i => (i - 6, i))Engineer

Here's the code for the Go (golang.org) version:

package main

import (
    "fmt"
)


func main(){
    findprimes(10*1000)
}

func isprime(x int) bool {
    for i := 2; i < x; i++ {
        if x%i == 0 {
            return false
        }
    }
    return true
}

func findprimes(m int){
    for i := 11; i < m; i++ {
        if isprime(i) && isprime(i-6) {
            fmt.Printf("%d %d\n", i-6, i)
        }
    }
}

It ran just as fast as the C version.

Using an Asus u81a Intel Core 2 Duo T6500 2.1GHz, 2MB L2 cache, 800MHz FSB. 4GB RAM

The 100k version: C: 2.723s Go: 2.743s

With 1000000 (1M instead of 100K): C: 3m35.458s Go: 3m36.259s

But I think that it would be fair to use Go's built in multithreading capabilities and compare that version with the regular C version (without multithreading), just because it's almost too easy to do multithreading with Go.

Update: I did a parallel version using Goroutines in Go:

package main

import (
  "fmt"
  "runtime"
)

func main(){
    runtime.GOMAXPROCS(4)
    printer := make(chan string)
    printer2 := make(chan string)
    printer3 := make(chan string)
    printer4 := make(chan string)
    finished := make(chan int)

    var buffer, buffer2, buffer3 string

    running := 4
    go findprimes(11, 30000, printer, finished)
    go findprimes(30001, 60000, printer2, finished)
    go findprimes(60001, 85000, printer3, finished)
    go findprimes(85001, 100000, printer4, finished)

    for {
      select {
        case i := <-printer:
          // batch of sexy primes received from printer channel 1, print them
          fmt.Printf(i)
        case i := <-printer2:
          // sexy prime list received from channel, store it
          buffer = i
        case i := <-printer3:
          // sexy prime list received from channel, store it
          buffer2 = i
        case i := <-printer4:
          // sexy prime list received from channel, store it
          buffer3 = i
        case <-finished:
          running--
          if running == 0 {
              // all goroutines ended
              // dump buffer to stdout
              fmt.Printf(buffer)
              fmt.Printf(buffer2)
              fmt.Printf(buffer3)
              return
          }
      }
    }
}

func isprime(x int) bool {
    for i := 2; i < x; i++ {
        if x%i == 0 {
            return false
        }
    }
    return true
}

func findprimes(from int, to int, printer chan string, finished chan int){
    str := ""
    for i := from; i <= to; i++ {
        if isprime(i) && isprime(i-6) {
            str = str + fmt.Sprintf("%d %d\n", i-6, i)
      }
    }
    printer <- str
    //fmt.Printf("Finished %d to %d\n", from, to)
    finished <- 1
}

At first the parallelized version ran in an average of 2.743 seconds, exactly the same time as the regular version. One odd thing: my dual-core Kubuntu 64-bit machine never peaked on both cores; it looked like Go was using just one core. This was fixed with a call to runtime.GOMAXPROCS(4), after which the parallelized version completed in 1.706 seconds and used less than 1.5 MB of RAM.

Update: I ran the parallelized version up to 1M numbers. One of my CPU cores was at 100% all the time, while the other wasn't used at all (odd). It took a whole minute more than the C and the regular Go versions. :(

With 1000000 (1M instead of 100K):

C: 3m35.458s Go: 3m36.259s Go using goroutines: 3m27.137s (2m16.125s after the GOMAXPROCS fix)

The 100k version:

C: 2.723s Go: 2.743s Go using goroutines: 1.706s

Tetragon answered 25/7, 2012 at 4:32 Comment(5)
How many cores have you used, btw?Herwick
I have an Asus u81a Intel Core 2 Duo T6500 2.1GHz, 2MB L2 cache, 800MHz FSB. 4GB RAMImplacable
Did you actually compile the C version with optimizations enabled? The default Go compiler does not inline, and will usually suffer a massive performance hit against optimized C in these kinds of comparisons. Add -O3 or better.Hardee
I just did, not before, and the 100K version took the same amount of time with or without -O3Implacable
Same thing for the 1M version. Maybe this particular operations (we are testing a very small subset) are well optimized by default.Implacable
L
4

Just for the fun of it, here is a parallel Ruby version.

require 'benchmark'

num = ARGV[0].to_i

def is_prime?(n)
  (2...n).all?{|m| n%m != 0 }
end

def sexy_primes_default(x)
    (9..x).map do |i|
        [i-6, i]
    end.select do |j|
        j.all?{|m| is_prime? m}
    end
end

def sexy_primes_threads(x)
    partition = (9..x).map do |i|
        [i-6, i]
    end.group_by do |x|
        x[0].to_s[-1]
    end
    threads = Array.new
    partition.each_key do |k|
       threads << Thread.new do
            partition[k].select do |j|
                j.all?{|m| is_prime? m}
            end
        end
    end
    threads.each {|t| t.join}
    threads.map{|t| t.value}.reject{|x| x.empty?}
end

puts "Running up to num #{num}"

Benchmark.bm(10) do |x|
    x.report("default") {a = sexy_primes_default(num)}
    x.report("threads") {a = sexy_primes_threads(num)}
end

On my 1.8GHz Core i5 MacBook Air, the performance results are:

# Ruby 1.9.3
$ ./sexyprimes.rb 100000
Running up to num 100000
                 user     system      total        real
default     68.840000   0.060000  68.900000 ( 68.922703)
threads     71.730000   0.090000  71.820000 ( 71.847346)

# JRuby 1.6.7.2 on JVM 1.7.0_05
$ jruby --1.9 --server sexyprimes.rb 100000
Running up to num 100000
                user     system      total        real
default    56.709000   0.000000  56.709000 ( 56.708000)
threads    36.396000   0.000000  36.396000 ( 36.396000)

# JRuby 1.7.0.preview1 on JVM 1.7.0_05
$ jruby --server sexyprimes.rb 100000
Running up to num 100000
             user     system      total        real
default     52.640000   0.270000  52.910000 ( 51.393000)
threads    105.700000   0.290000 105.990000 ( 30.298000)

It looks like the JVM's JIT is giving Ruby a nice performance boost in the default case, while true multithreading helps JRuby perform 50% faster in the threaded case. What's more interesting is that JRuby 1.7 improves on the JRuby 1.6 score by a healthy 17%!

Lactoscope answered 25/7, 2012 at 13:17 Comment(0)
D
3

Based on x4u's answer, I wrote a Scala version using recursion, and I improved on it by only going up to the square root instead of x/2 in the prime check function. I get ~250ms for 100k and ~600ms for 1M. I went ahead and ran it up to 10M in ~6s.

import scala.annotation.tailrec

var count = 0;
def print(i:Int) = {
  println((i - 6) + " " + i)
  count += 1
}

@tailrec def isPrime(n:Int, i:Int = 3):Boolean = {
  if(n % i == 0) return false;
  else if(i * i > n) return true;
  else isPrime(n = n, i = i + 2)
}      

@tailrec def findPrimes(max:Int, bitMask:Int = 3, i:Int = 11):Unit = {
  if (isPrime(i)) {
    if((bitMask & 1) != 0) print(i)
    if(i + 2 < max) findPrimes(max = max, bitMask = (bitMask | (1 << 3)) >> 1, i = i + 2)
  } else if(i + 2 < max) {
    findPrimes(max = max, bitMask = bitMask >> 1, i = i + 2)
  }
}

val a = System.currentTimeMillis()
findPrimes(max=10000000)
println(count)
val b = System.currentTimeMillis()
println((b - a).toString + " mils")

I also went back and wrote a CoffeeScript (V8 JavaScript) version, which gets ~15ms for 100k, 250ms for 1M, and 6s for 10M, by using a counter (ignoring I/O). If I turn on the output it takes ~150ms for 100k, 1s for 1M, and 12s for 10M. Couldn't use tail recursion here, unfortunately, so I had to convert it back into loops.

count = 0;
print = (i) ->
  console.log("#{i - 6} #{i}")
  count += 1
  return

isPrime = (n) ->
  i = 3
  while i * i <= n  # <=, not <: otherwise perfect squares such as 25 pass as prime
    if n % i == 0
      return false
    i += 2
  return true

findPrimes = (max) ->
  bitMask = 3
  for i in [11..max] by 2
    prime = isPrime(i)
    if prime
      if (bitMask & 1) != 0
        print(i)
      bitMask |= (1 << 3)
    bitMask >>= 1
  return

a = new Date()
findPrimes(1000000)
console.log(count)
b = new Date()
console.log((b - a) + " ms")
Denominational answered 25/7, 2012 at 19:19 Comment(0)
Z
2

The answer to your question #1 is that yes, the JVM is incredibly fast, and yes, static typing helps.

The JVM should be faster than C in the long run, possibly even faster than "normal" assembly language--of course, you can always hand-optimize assembly to beat anything by doing manual runtime profiling and creating a separate version for each CPU; you just have to be amazingly good and knowledgeable.

The reasons for Java's speed are:

The JVM can analyze your code while it runs and hand-optimize it--for instance, if you had a method that could be statically analyzed at compile time to be a pure function, and the JVM noticed you were often calling it with the same parameters, it COULD eliminate the call completely and just inject the result of the last call (I'm not sure Java does exactly this, but it does a lot of things like it).

Due to static typing, the JVM can know a lot about your code at compile time, which lets it pre-optimize quite a bit. It also lets the compiler optimize each class individually, without knowledge of how another class plans to use it. And Java doesn't have arbitrary pointers to memory locations: the JVM KNOWS which values in memory may and may not change, and can optimize accordingly.

Heap allocation is MUCH more efficient than in C; Java's heap allocation is closer to C's stack allocation in speed, yet more versatile. A lot of time has gone into the different algorithms used here; it's an art--for instance, all the objects with a short lifespan (like C's stack variables) are allocated to a "known" free location (no searching for a free spot with enough space) and are all freed together in a single step (like a stack pop).

The JVM can know quirks about your CPU architecture and generate machine code specifically for a given CPU.

The JVM can speed up your code long after you shipped it. Much as moving a program to a new CPU can speed it up, moving it to a new version of the JVM can also give you huge speed improvements tailored to CPUs that didn't even exist when you initially compiled your code--something C physically cannot do without a recompile.

By the way, most of the bad rep for Java's speed comes from the long startup time to load the JVM (someday someone will build the JVM into the OS and this will go away!) and from the fact that many developers are really bad at writing GUI code (especially threaded), which caused Java GUIs to often become unresponsive and glitchy. Simple-to-use languages like Java and VB have their faults amplified by the fact that the capabilities of the average programmer tend to be lower than in more complicated languages.

Zed answered 25/7, 2012 at 17:44 Comment(4)
Saying the JVM's heap allocation is much more efficient than C's is nonsensical, given that the JVM is written in C++.Alexandria
@DanielC.Sobral The language is not as important as the implementation--Java's "heap" implementation code is nothing like C's. Java's is a replaceable multi-stage system, highly optimizable for various targets, with many man-years of research effort including cutting-edge techniques being developed today; C uses a heap--a simple data structure developed ages ago. Java's system is impossible to implement for C, because C allows pointers, so it can never guarantee "safe" moves of arbitrarily allocated memory chunks without language changes (rendering it no longer C).Zed
Safeness is irrelevant--you didn't claim it was safer, you claimed it was more efficient. Furthermore, your description in the comment of how C's "heap" works has no bearing on reality.Alexandria
You must have misunderstood my meaning of "safe"--Java is able to move an arbitrary block of memory at any time because it knows it can; C is unable to optimize memory allocation because there may be a pointer referencing it. Also, a C heap is usually implemented as a heap, which is a data structure. C++ heaps used to be implemented with heap structures like C's (hence the name "heap"). I haven't checked into C++ for a few years, so this may no longer be true, but it is still limited by not being able to rearrange small chunks of the user's allocated memory at will.Zed

© 2022 - 2024 — McMap. All rights reserved.