Java Math.min/max performance

Asked 31/3, 2014 at 1:17 Answered 5/10, 2022 at 10:8

EDIT: maaartinus gave the answer I was looking for and tmyklebu's data on the problem helped a lot, so thanks both! :)

I've read a bit about how HotSpot has some "intrinsics" that injects in the code, specially for Java standard Math libs (from here)

So I decided to give it a try, to see how much difference HotSpot could make against doing the comparison directly (specially since I've heard min/max can compile to branchless asm).

public class OpsMath {
    public static final int max(final int a, final int b) {
        if (a > b) {
            return a;
        }
        return b;
    }
}

That's my implementation. From another SO question I've read that using the ternary operator uses an extra register, I haven't found significant differences between doing an if block and using a ternary operator (ie, return ( a > b ) ? a : b ).

Allocating a 8Mb int array (ie, 2 million values), and randomizing it, I do the following test:

try ( final Benchmark bench = new Benchmark( "millis to max" ) )
    {
        int max = Integer.MIN_VALUE;

        for ( int i = 0; i < array.length; ++i )
        {
            max = OpsMath.max( max, array[i] );
            // max = Math.max( max, array[i] );
        }
    }

I'm using a Benchmark object in a try-with-resources block. When it finishes, it calls close() on the object and prints the time the block took to complete. The tests are done separately by commenting in/out the max calls in the code above.

'max' is added to a list outside the benchmark block and printed later, so to avoid the JVM optimizing the whole block away.

The array is randomized each time the test runs.

Running the test 6 times, it gives these results:

Java standard Math:

millis to max 9.242167 
millis to max 2.1566199999999998
millis to max 2.046396 
millis to max 2.048616  
millis to max 2.035761
millis to max 2.001044

So fairly stable after the first run, and running the tests again gives similar results.

OpsMath:

millis to max 8.65418 
millis to max 1.161559  
millis to max 0.955851 
millis to max 0.946642 
millis to max 0.994543 
millis to max 0.9469069999999999

Again, very stable results after the first run.

The question is: Why? Thats quite a big difference there. And I have no idea why. Even if I implement my max() method exactly like Math.max() (ie, return (a >= b) ? a : b ) I still get better results! It makes no sense.

Specs:

CPU: Intel i5 2500, 3,3Ghz. Java Version: JDK 8 (public march 18 release), x64. Debian Jessie (testing release) x64.

I have yet to try with 32 bit JVM.

EDIT: Self contained test as requested. Added a line to force the JVM to preload Math and OpsMath classes. That eliminates the 18ms cost of the first iteration for OpsMath test.

// Constant nano to millis.
final double TO_MILLIS = 1.0d / 1000000.0d;
// 8Mb alloc.
final int[] array = new int[(8*1024*1024)/4];
// Result and time array.
final ArrayList<Integer> results = new ArrayList<>();
final ArrayList<Double> times = new ArrayList<>();
// Number of tests.
final int itcount = 6;
// Call both Math and OpsMath method so JVM initializes the classes.
System.out.println("initialize classes " + 
OpsMath.max( Math.max( 20.0f, array.length ), array.length / 2.0f ));
    
final Random r = new Random();
for ( int it = 0; it < itcount; ++it )
{
    int max = Integer.MIN_VALUE;
    
    // Randomize the array.
    for ( int i = 0; i < array.length; ++i )
    {
        array[i] = r.nextInt();
    }
    
    final long start = System.nanoTime();
    for ( int i = 0; i < array.length; ++i )
    {
        max = Math.max( array[i], max );
            // OpsMath.max() method implemented as described.
        // max = OpsMath.max( array[i], max );
    }
    // Calc time.
    final double end = (System.nanoTime() - start);
    // Store results.
    times.add( Double.valueOf( end ) );
    results.add( Integer.valueOf(  max ) );
}
// Print everything.
for ( int i = 0; i < itcount; ++i )
{
    System.out.println( "IT" + i + " result: " + results.get( i ) );
    System.out.println( "IT" + i + " millis: " + times.get( i ) * TO_MILLIS );
}

Java Math.max result:

IT0 result: 2147477409
IT0 millis: 9.636998
IT1 result: 2147483098
IT1 millis: 1.901314
IT2 result: 2147482877
IT2 millis: 2.095551
IT3 result: 2147483286
IT3 millis: 1.9232859999999998
IT4 result: 2147482828
IT4 millis: 1.9455179999999999
IT5 result: 2147482475
IT5 millis: 1.882047

OpsMath.max result:

IT0 result: 2147482689
IT0 millis: 9.003616
IT1 result: 2147483480
IT1 millis: 0.882421
IT2 result: 2147483186
IT2 millis: 1.079143
IT3 result: 2147478560
IT3 millis: 0.8861169999999999
IT4 result: 2147477851
IT4 millis: 0.916383
IT5 result: 2147481983
IT5 millis: 0.873984

Still the same overall results. I've tried with randomizing the array only once, and repeating the tests over the same array, I get faster results overall, but the same 2x difference between Java Math.max and OpsMath.max.

Illyrian answered 31/3, 2014 at 1:17 Comment(13)

I get essentially identical code and timings---admittedly on an older JVM and hardware. Post a complete, self-contained benchmark and as many details as you can about your environment if you want folks to look at it. Also, learn to use -XX:+PrintAssembly. It will save your ass when you get confused by things like this. – Electromechanical 31/3, 2014 at 1:47

All right I'll post a self contained test, I don't have a +PrintAssembly enabled JVM though. – Illyrian 31/3, 2014 at 2:43

Is that Benchmark from Caliper (code.google.com/p/caliper)? If not, read this: code.google.com/p/caliper/wiki/JavaMicrobenchmarks – Remonstrant 31/3, 2014 at 2:48

Posted self contained benchmark.It isn't Caliper's Benchmark. I know it isn't a representative case of real world usage (I certainly won't be calling max on 2 million integers any time soon). I'm just asking why the difference is so big (2x difference is quite important, and it scales if you use bigger array sizes). Its a pretty apples-to-apples comparison here, and the results are pretty much the same run after run. – Illyrian 31/3, 2014 at 3:7

Please post the code you're running. final double end = (System.nanoTime() - start; won't compile, so this isn't what you ran. – Point 31/3, 2014 at 3:15

added the parenthesis and changed "AltMath" to "OpsMath" if it makes such a difference :/ OpsMath is the real name of my math class. – Illyrian 31/3, 2014 at 3:20

Yeah, this benchmark isn't measuring what you want it to measure. I'm about to add an answer regardless that picks apart the effects I see (factor two difference in the opposite direction). You should read about the things that go wrong when you try to benchmark Java code without knowing what you're doing; among other things, it's not clear to me that you're timing fully-optimised code rather than interpreted or OSR code. – Electromechanical 31/3, 2014 at 3:46

afaik the compiling threshold is 1 thousand invocations. whatever Math.max optimization HotSpot has under its belt, it should kick in here pretty fast since it doing a couple million calls to it. – Illyrian 31/3, 2014 at 3:58

Math.max is far from the only thing that needs to be inlined and compiled properly. – Electromechanical 31/3, 2014 at 4:1

Out of sheer interest created a JMH-based benchmark - the same timings for Math.max and custom if-based and conditional-based implementations. As expected. – Hectorhecuba 31/3, 2014 at 6:27

@OlegEstekhin: Would you mind sharing it? I've just proposed to kick out the intrinsics. – Consociate 1/4, 2014 at 12:25

By the way, none of those finals has any performance impact – Standifer 15/4, 2014 at 16:14

yep I know, it's just an habit. specially on method parameters. – Illyrian 15/4, 2014 at 22:1

It's hard to tell why Math.max is slower than a Ops.max, but it's easy to tell why this benchmark strongly favors branching to conditional moves: On the n-th iteration, the probability of

Math.max( array[i], max );

being not equal to max is the probability that array[n-1] is bigger than all previous elements. Obviously, this probability gets lower and lower with growing n and given

final int[] array = new int[(8*1024*1024)/4];

it's rather negligible most of the time. The conditional move instruction is insensitive to the branching probability, it always take the same amount of time to execute. The conditional move instruction is faster than branch prediction if the branch is very hard to predict. On the other hand, branch prediction is faster if the branch can be predicted well with high probability. Currently, I'm unsure about the speed of conditional move compared to best and worst case of branching.¹

In your case all but first few branches are fairly predictable. From about n == 10 onward, there's no point in using conditional moves as the branch is rather guaranteed to be predicted correctly and can execute in parallel with other instructions (I guess you need exactly one cycle per iteration).

This seems to happen for algorithms computing minimum/maximum or doing some inefficient sorting (good branch predictability means low entropy per step).

¹ Both conditional move and predicted branch take one cycle. The problem with the former is that it needs its two operands and this takes additional instruction. In the end the critical path may get longer and/or the ALUs saturated while the branching unit is idle. Often, but not always, branches can be predicted well in practical applications; that's why branch prediction was invented in the first place.

As for the gory details of timing conditional move vs. branch prediction best and worst case, see the discussion below in comments. My my own benchmark shows that conditional move is significantly faster than branch prediction when branch prediction encounters its worst case, but I can't ignore contradictory results. We need some explanation for what exactly makes the difference. Some more benchmarks and/or analysis could help.

Consociate answered 31/3, 2014 at 6:42 Comment(16)

i did thought about the probability of taking the same branch almost always after a few iterations, but i didn't knew that conditional moves were a bit slower than predicted branches (and that mispredicted branches were slower than conditional moves). that explains the time difference nicely, thanks! – Illyrian 31/3, 2014 at 8:15

@TheStack: Some time ago I ran into a similar funny thing. The exact timing is hard to predict and JIT doesn't get it always right. I'm going to fix/extend my answer now. – Consociate 31/3, 2014 at 8:34

"while it's a bit slower than a (correctly) predicted branch, it's much faster than a mispredicted one" My measurements say it is another way around: cmov is much slower if the branch can be predicted well whereas only a bit faster if the branch prediction fails. See Why does tree vectorization make this sorting algorithm 2x slower?, the plot at the bottom of the question. To be clear here, I am only arguing "a bit" and "much" in the statement; as it stands now, I find the statement biased. Please remove "a bit" and "much". – Commitment 31/3, 2014 at 13:15

@Ali: The branch misprediction penalty is something like 14 cycles, while both cmov and predicted branch can execute in a single cycle (they both may be free if there are free units at a proper time; this is where the branch has a better chance.). The results you linked ate interesting, but my above link demonstrates that the misprediction penaly is pretty high. I'll edit my post when I get time to think hard about the case you linked. – Consociate 31/3, 2014 at 16:53

@Ali: Answer updated. The overall effect can be like described in your question, but the timing of individual instructions is as I wrote (maybe re-read Yakk's comments). Agreed? – Consociate 1/4, 2014 at 5:43

@Consociate No, sorry. Please re-read that question, starting from "Based on Yakk's comment, I created a test." As for your update: "the branching probability itself converges quickly to either 0 or 1, just like in this question." I think this is the part that you misunderstood. If you re-read the question, you will see that the branching probability is 1-p all the time. Branch prediction obviously breaks down miserably around 0.5 but even then it is only 1.58x slower. It is a maliciously hand-crafted worst-case situation to break down branch prediction as much as possible. – Commitment 1/4, 2014 at 11:4

@Consociate We are having this discussion because I find your answer in general OK but overly biased, suggesting that cmov is the winner. It isn't. It's not at all a clear cut even in the generic situation whether branch prediction or cmov is "better". Please either remove the overly biased statements, suggesting that cmov so much better than branch prediction and make the answer neutral w.r.t. branch prediction or create a not too twisted benchmark that breaks down branch prediction more than my example. We wouldn't be having this discussion if I didn't find your answer valuable. – Commitment 1/4, 2014 at 11:13

@Ali: I indeed have to re-read your question as I indeed missed the "branching probability is 1-p" part. I'm not claiming that cmov is always the winner. I know that there are cases when it loses and they're not rare (such is life, otherwise making a good compiler would be way easier). My benchmark results contradict yours, so I'm unsure what's right. – Consociate 1/4, 2014 at 11:51

@Consociate Maybe JIT / Java adds noise to your benchmark? My example was written in C++ where I find it is easier to reason about execution times. Unfortunately, your benchmark is far from trivial; I would have to re-write that in C++ and analyze it in great detail to understand what's going on in your code. – Commitment 1/4, 2014 at 11:58

@Consociate We could end this discussion if you allowed me to make small changes to your answer and tone it down. The only thing that I find problematic is that it is still easy to misinterpret your answer and after reading it, one would think that cmov is so much better than branch prediction. If I could make it neutral, we would be OK. Otherwise, I need a not too twisted benchmark. In my example, it clearly boils down to 3 instructions, given in the question, side-by-side. That's all. In your example, honestly, I don't see what's going on and it would take me hours to track it down. :( – Commitment 1/4, 2014 at 12:14

@Ali: My benchmark could be simplified to for (boolean x : arr) sum += x ? 1 : 0, but in C++ the fact that bool is actually a small int could come in the way. But something like for (int x : ar) sum += x>0 ? x : 0 should do. I've observed similar behavior in many benchmarks, so I don't think it won't happen in a very simple one. Yes, feel free to edit my answer... though I thought I already made it neutral. Would you mind posting a question like "How big is the misprediction penalty" to let people help us to resolve our discussion? – Consociate 1/4, 2014 at 12:21

@Consociate "Would you mind posting a question like "How big is the misprediction penalty" to let people help us to resolve our discussion?" Yes, it is on my agenda but it won't happen in the near future because I am way too busy to deal with this question. Sorry. :( It would be interesting though. – Commitment 1/4, 2014 at 16:2

@Consociate OK, I have changed your answer. I hope you don't mind and I also hope I have made your already good answer even better. As for your benchmark, it doesn't seem too complicated. Unfortunately, I won't have time to look at it before the week-end / early next week. If it could wait a little, I would appreciate it. Oh, and one more thing popped up: Branch prediction happens in the hardware. Perhaps it is the difference in hardware that causes the difference in timings in case of the two benchmarks (yours and mine). – Commitment 1/4, 2014 at 16:30

@Consociate If you can wait a week or two, I will do measurements on my laptop with your example and get back to you with the results. Maybe it all boils down to difference in hardware. – Commitment 1/4, 2014 at 16:31

@Ali: Don't hurry. I slightly disagree with some things in your edit, but really only slightly. Let's look what we find out next week. I don't think HW can make such a big difference unless you're running some 286. The quality of the branch predictor doesn't matter as no smart algorithm can help (it's just random). And the pipeline is pretty deep. – Consociate 1/4, 2014 at 16:49

@Consociate OK, thanks. Let's freeze this discussion until further measurements. – Commitment 1/4, 2014 at 17:8

When I run your (suitably-modified) code using Math.max on an old (1.6.0_27) JVM, the hot loop looks like this:

0x00007f4b65425c50: mov    %r11d,%edi         ;*getstatic array
                                              ; - foo146::bench@81 (line 40)
0x00007f4b65425c53: mov    0x10(%rax,%rdx,4),%r8d
0x00007f4b65425c58: mov    0x14(%rax,%rdx,4),%r10d
0x00007f4b65425c5d: mov    0x18(%rax,%rdx,4),%ecx
0x00007f4b65425c61: mov    0x2c(%rax,%rdx,4),%r11d
0x00007f4b65425c66: mov    0x28(%rax,%rdx,4),%r9d
0x00007f4b65425c6b: mov    0x24(%rax,%rdx,4),%ebx
0x00007f4b65425c6f: rex mov    0x20(%rax,%rdx,4),%esi
0x00007f4b65425c74: mov    0x1c(%rax,%rdx,4),%r14d  ;*iaload
                                              ; - foo146::bench@86 (line 40)
0x00007f4b65425c79: cmp    %edi,%r8d
0x00007f4b65425c7c: cmovl  %edi,%r8d
0x00007f4b65425c80: cmp    %r8d,%r10d
0x00007f4b65425c83: cmovl  %r8d,%r10d
0x00007f4b65425c87: cmp    %r10d,%ecx
0x00007f4b65425c8a: cmovl  %r10d,%ecx
0x00007f4b65425c8e: cmp    %ecx,%r14d
0x00007f4b65425c91: cmovl  %ecx,%r14d
0x00007f4b65425c95: cmp    %r14d,%esi
0x00007f4b65425c98: cmovl  %r14d,%esi
0x00007f4b65425c9c: cmp    %esi,%ebx
0x00007f4b65425c9e: cmovl  %esi,%ebx
0x00007f4b65425ca1: cmp    %ebx,%r9d
0x00007f4b65425ca4: cmovl  %ebx,%r9d
0x00007f4b65425ca8: cmp    %r9d,%r11d
0x00007f4b65425cab: cmovl  %r9d,%r11d         ;*invokestatic max
                                              ; - foo146::bench@88 (line 40)
0x00007f4b65425caf: add    $0x8,%edx          ;*iinc
                                              ; - foo146::bench@92 (line 39)
0x00007f4b65425cb2: cmp    $0x1ffff9,%edx
0x00007f4b65425cb8: jl     0x00007f4b65425c50

Apart from the weirdly-placed REX prefix (not sure what that's about), here you have a loop that's been unrolled 8 times that does mostly what you'd expect---loads, comparisons, and conditional moves. Interestingly, if you swap the order of the arguments to max, here it outputs the other kind of 8-deep cmovl chain. I guess it doesn't know how to generate a 3-deep tree of cmovls or 8 separate cmovl chains to be merged after the loop is done.

With the explicit OpsMath.max, it turns into a ratsnest of conditional and unconditional branches that's unrolled 8 times. I'm not going to post the loop; it's not pretty. Basically each mov/cmp/cmovl above gets broken into a load, a compare and a conditional jump to where a mov and a jmp happen. Interestingly, if you swap the order of the arguments to max, here it outputs an 8-deep cmovle chain instead. EDIT: As @maaartinus points out, said ratsnest of branches is actually faster on some machines because the branch predictor works its magic on them and these are well-predicted branches.

I would hesitate to draw conclusions from this benchmark. You have benchmark construction issues; you have to run it a lot more times than you are and you have to factor your code differently if you want to time Hotspot's fastest code. Beyond the wrapper code, you aren't measuring how fast your max is, or how well Hotspot understands what you're trying to do, or anything else of value here. Both implementations of max will result in code that's entirely too fast for any sort of direct measurement to be meaningful within the context of a larger program.

Electromechanical answered 31/3, 2014 at 4:30 Comment(4)

So what you're saying is that the timing results are a consequence of how the OpsMath.max changes the JIT output rather than any inherent property of it? (ie, it isn't faster it just changes the environment enough that times are sampled in a way that it looks faster). – Illyrian 31/3, 2014 at 4:50

I don't think so. I believe that your benchmark precisely measures.... the speed of your benchmark. For something more reality-related you really need either JMH or Caliper. Getting a Java microbenchmark right is way harder then you think. But at the first glance I can't see any mistake, you seem to did it well. – Consociate 31/3, 2014 at 6:29

@TheStack: max runs way too fast to be able to benchmark it in isolation on modern pipelined computers. The performance of a given max implementation will depend much more strongly on the instructions around it than the instructions in it. maaartinus seems to have identified the reason for the change; the ugly ratsnest of branches winds up being faster here because only log(n) of them ever get taken. – Electromechanical 31/3, 2014 at 14:28

@TheStack: While this is illustrative in the case of "find the max of my huge array in random order", it doesn't tell you what happens for "find the max of my huge array in some other order" or "which max should I use in my nontrivial piece of code." – Electromechanical 31/3, 2014 at 14:30

Using JDK 8:

java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b132)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode)

On Ubuntu 13.10

I ran the following:

import java.util.Random;
import java.util.function.BiFunction;

public class MaxPerformance {
  private final BiFunction<Integer, Integer, Integer> max;
  private final int[] array;

  public MaxPerformance(BiFunction<Integer, Integer, Integer> max, int[] array) {
    this.max = max;
    this.array = array;
  }

  public double time() {
    long start = System.nanoTime();

    int m = Integer.MIN_VALUE;
    for (int i = 0; i < array.length; ++i) m = max.apply(m, array[i]);

    m = Integer.MIN_VALUE;
    for (int i = 0; i < array.length; ++i) m = max.apply(array[i], m);

    // total time over number of calls to max
    return ((double) (System.nanoTime() - start)) / (double) array.length / 2.0;
  }

  public double averageTime(int repeats) {
    double cumulativeTime = 0;
    for (int i = 0; i < repeats; i++)
      cumulativeTime += time();
    return (double) cumulativeTime / (double) repeats;
  }

  public static void main(String[] args) {
    int size = 1000000;
    Random random = new Random(123123123L);
    int[] array = new int[size];
    for (int i = 0; i < size; i++) array[i] = random.nextInt();

    double tMath = new MaxPerformance(Math::max, array).averageTime(100);
    double tAlt1 = new MaxPerformance(MaxPerformance::max1, array).averageTime(100);
    double tAlt2 = new MaxPerformance(MaxPerformance::max2, array).averageTime(100);

    System.out.println("Java Math: " + tMath);
    System.out.println("Alt 1:     " + tAlt1);
    System.out.println("Alt 2:     " + tAlt2);
  }

  public static int max1(final int a, final int b) {
    if (a >= b) return a;
    return b;
  }

  public static int max2(final int a, final int b) {
    return (a >= b) ? a : b; // same as JDK implementation
  }
}

And I got the following results (average nanoseconds taken for each call to max):

Java Math: 15.443555810000003
Alt 1:     14.968298919999997
Alt 2:     16.442204045

So on a long run it looks like the second implementation is the fastest, although by a relatively small margin.

In order to have a slightly more scientific test, it makes sense to compute the max of pairs of elements where each call is independent from the previous one. This can be done by using two randomized arrays instead of one as in this benchmark:

import java.util.Random;
import java.util.function.BiFunction;
public class MaxPerformance2 {
  private final BiFunction<Integer, Integer, Integer> max;
  private final int[] array1, array2;

  public MaxPerformance2(BiFunction<Integer, Integer, Integer> max, int[] array1, int[] array2) {
    this.max = max;
    this.array1 = array1;
    this.array2 = array2;
    if (array1.length != array2.length) throw new IllegalArgumentException();
  }

  public double time() {
    long start = System.nanoTime();

    int m = Integer.MIN_VALUE;
    for (int i = 0; i < array1.length; ++i) m = max.apply(array1[i], array2[i]);
    m += m; // to avoid optimizations!

    return ((double) (System.nanoTime() - start)) / (double) array1.length;
  }

  public double averageTime(int repeats) {
    // warm up rounds:
    double tmp = 0;
    for (int i = 0; i < 10; i++) tmp += time();
    tmp *= 2.0;

    double cumulativeTime = 0;
    for (int i = 0; i < repeats; i++)
        cumulativeTime += time();
    return cumulativeTime / (double) repeats;
  }

  public static void main(String[] args) {
    int size = 1000000;
    Random random = new Random(123123123L);
    int[] array1 = new int[size];
    int[] array2 = new int[size];
    for (int i = 0; i < size; i++) {
        array1[i] = random.nextInt();
        array2[i] = random.nextInt();
    }

    double tMath = new MaxPerformance2(Math::max, array1, array2).averageTime(100);
    double tAlt1 = new MaxPerformance2(MaxPerformance2::max1, array1, array2).averageTime(100);
    double tAlt2 = new MaxPerformance2(MaxPerformance2::max2, array1, array2).averageTime(100);

    System.out.println("Java Math: " + tMath);
    System.out.println("Alt 1:     " + tAlt1);
    System.out.println("Alt 2:     " + tAlt2);
  }

  public static int max1(final int a, final int b) {
    if (a >= b) return a;
    return b;
  }

  public static int max2(final int a, final int b) {
    return (a >= b) ? a : b; // same as JDK implementation
  }
}

Which gave me:

Java Math: 15.346468170000005
Alt 1:     16.378737519999998
Alt 2:     20.506475350000006

The way your test is set up makes a huge difference on the results. The JDK version seems to be the fastest in this scenario. This time by a relatively large margin compared to the previous case.

Somebody mentioned Caliper. Well if you read the wiki, one the first things they say about micro-benchmarking is not to do it: this is because it's hard to get accurate results in general. I think this is a clear example of that.

Runty answered 31/3, 2014 at 14:19 Comment(9)

(1) Have you inspected the assembly code to make sure it's inlining the target of the BiFunction call in both cases? (2) You sure you aren't timing interpreted or OSR code? I'm almost sure you are. – Electromechanical 31/3, 2014 at 14:35

(1) That's a good point. I haven't. (2) I tried a different version with warm up rounds and I get the same results. The bottom line is that the test conditions are the same for all functions. I thought that might give a more "sterile" environment. However I might be terribly wrong! This is the pain of microbenchmarking. – Runty 31/3, 2014 at 14:38

OSR = On Stack Replacement: azulsystems.com/blog/cliff/… – Hypallage 31/3, 2014 at 14:42

Yeah I looked it up and tried testing with a bunch of warmup rounds (calls to the time() function that don't add to the cumulative sum). I get the same. – Runty 31/3, 2014 at 14:43

The intention was correct: The timing should be averaged to smoot out potential "random" spikes. However, there are (at least) 2 issues: 1. You ignored the result (m) in the first test - so the functions could have been optimized away (you addressed this with your EDIT). And 2: Running the tests only once in predefined order could introduce distortions, because in the first run, the BiFunction is monomorphic (there is only one implementation), in the second it is bimorphic, and the third one is polymorphic (see javaspecialists.eu/archive/Issue158.html ). – Henshaw 31/3, 2014 at 14:44

@Henshaw Great comment! How would you recommend addressing these issues? The tests could be run 3! times (one for each permutation of the three methods) to address #2. Would that do it? – Runty 31/3, 2014 at 14:49

@Electromechanical How would I make sure the assembly code gets inlined? – Runty 31/3, 2014 at 14:50

First of all: In this example, I did not find a noticable difference between the mono/polymorphic cases. This might be due to some neat JIT tricks in Java8, because usually, this can make a significant difference. However, the order of the calls made a difference. The simplest soution for issue #2 would thus be to run only isolated tests for each case. (I think that in Caliper, each test is run by launching a new instance of the JVM!). ... ... ... – Henshaw 31/3, 2014 at 15:19

... Alternatively, one could pull out the code of the main into an own method, and call this method several times, possibly with an increasing size parameter, and ignoring the results of the first few runs. The order itself should then no longer be so important. But as you pointed out in your reference to Caliper: Even then it's hard to make a really meaningful benchmark here. For example, the results also differ significantly with the numbers that are contained in the array (e.g. when you use random.nextInt(10)) - may be related to stackoverflow.com/questions/11227809 ...? – Henshaw 31/3, 2014 at 15:20

Here's a branchless min operation, max can be implemented by replacing diff=a-b with diff=b-a.

public static final long min(final long a, final long b) {
    final long diff = a - b;
    // All zeroes if a>=b, all ones if a<b because the sign bit is propagated
    final long mask = diff >> 63;
    return (a & mask) | (b & (~mask));
}

It should be as fast as streaming the memory because the CPU operations should be hidden by the sequential memory read latency.

Spermatophyte answered 5/10, 2022 at 10:8 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags