Strange branching performance

The results of my benchmark show that performance is worst when the branch has a probability of about 15% (or 85%) rather than 50%.

Any explanation?

[benchmark chart: execution time vs. branch probability]

The code is too long to post in full, but the relevant parts are here:

private int diff(char c) {
    // 0 exactly when c is one of the characters stored in TABLE
    return TABLE[(145538857 * c) >>> 27] - c;
}

@Benchmark int timeBranching(int reps) {
    int result = 0;
    while (reps-- > 0) {
        for (final char c : queries) {
            if (diff(c) == 0) {
                ++result;
            }
        }
    }
    return result;
}

It counts the number of BREAKING_WHITESPACE characters in the given string. The results show a sudden time drop (a performance increase) when the branching probability reaches about 0.20.
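To make the setup concrete, here is a minimal, self-contained sketch of the same idea: a 32-entry table indexed by (145538857 * c) >>> 27, so that diff(c) == 0 exactly for the characters stored in the table, plus an input array with a prescribed fraction of matching characters. The class name, the character set, the table construction and the input generator are assumptions for illustration only; the real TABLE and setup are in the linked code.

import java.util.Random;

public class WhitespaceCountSketch {
    static final int MULT = 145538857;        // multiplier from the question
    static final char[] TABLE = new char[32]; // (MULT * c) >>> 27 is a 5-bit index

    static int diff(char c) {
        return TABLE[(MULT * c) >>> 27] - c;  // 0 iff TABLE[index] == c
    }

    public static void main(String[] args) {
        // Hypothetical set of "matching" characters; the real set differs.
        char[] matching = {' ', '\t', '\n', '\f', '\r'};
        for (char c : matching) {
            int index = (MULT * c) >>> 27;
            if (TABLE[index] != 0) {
                System.out.println("collision at " + index + "; this toy set does not fit the multiplier");
            }
            TABLE[index] = c;
        }

        // Build a query array with a prescribed fraction of matching characters,
        // roughly what the benchmark's seeded setup does.
        Random random = new Random(0);
        double probability = 0.15;
        char[] queries = new char[1 << 20];
        for (int i = 0; i < queries.length; i++) {
            queries[i] = random.nextDouble() < probability
                    ? matching[random.nextInt(matching.length)]
                    : (char) ('a' + random.nextInt(26));
        }

        // The same counting loop as in the benchmark method.
        int result = 0;
        for (char c : queries) {
            if (diff(c) == 0) {
                ++result;
            }
        }
        System.out.println(result + " of " + queries.length + " characters matched");
    }
}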

Here are more details about the drop: varying the seed shows more strange things happening. Note that the black line denoting the minimum and maximum values is very short, except close to the cliff.

[chart: the cliff across different seeds, with min/max bars]

Alternation answered 30/10, 2013 at 17:0 Comment(26)
I can't get rid of the feeling that there's a serial downvoter for all performance questions. Or even two of them. So comment on what's wrong with the question - assuming you have something to say.Alternation
I think i like this question. I would be interested in the answer.Sulphone
downvoters, please explain the reason!Akira
@Alternation Probably because the code is linked instead of posted and there is zero context information. Just my guess though.Janie
As for an explanation, read this: en.wikipedia.org/wiki/Branch_predictor. It's probably different for each CPU architecture.Janie
@Durandal: Zero context is a good point, fixed. I know the Branch predictor stuff, it can't get better at >=20% as the input is random enough. See also the details about the sudden drop - this must be the JIT.Alternation
@Pankaj: In the meantime I've got an idea, but I'll wait for the others' answers.Alternation
Is this one run, or is it averaged across multiple runs?Oby
Could you be seeing the effects of the cache, i.e. some parts of the TABLE are accessed sequentially instead of randomly for some of the benchmarks?Turbofan
@Vivin Paliath: It's a caliper benchmark, doing warmup, multiple runs in separate JVM and all the stuff.Alternation
Well you have a point, it's very odd how the execution time notably peaks at 0.15 and 0.85. But assuming your code has no oddities in the random distribution of the input, only branch prediction seems to be left as a candidate for the culprit.Janie
@jmiserez: The TABLE is pretty small, only 32 entries, which isn't obvious, but can be derived from the right shift, as x >>> 27 is always below 32.Alternation
Quite strange. I would have expected it to peak at 0.50 and then taper off. The performance at 0.20, 0.50, and 0.80 seems to be just a hair less than the performance at 0.05.Oby
Also your input is evenly distributed, correct?Oby
@Vivin Paliath: Evenly with the quality of java.util.Random (which should be good enough for this). And also with the prescribed number of matching chars, see the linked code.Alternation
Completely wild guess, but I noticed that you're always using 0 as a seed. I wonder if the distribution of values it is creating is favorable to the cases with 0.20, 0.50, and 0.80. Have you tested it with just new Random() and without the 0 seed?Oby
I'm trying to replicate your results. Which version of caliper are you using? I'm trying your code with 1.0-beta1 but it can't seem to find com.google.caliper.BeforeExperiment.Oby
@Vivin Paliath: I'm using the version from git, but you can simply leave the annotations out (and to make the method public, or whatever).Alternation
Do you have results for the varying seed similar to the first picture you posted?Oby
@VivinPaliath: Not exactly, but see the linked details.Alternation
@Alternation So you are passing a specific seed? I would try with just new Random() and see what happens and not track the seed at all. Each new invocation of Random() will give you a new generator that is unlikely to be similar to the previous one. Internally the constructor will create a seed.Oby
@Vivin Paliath: I've tried this too, it's about the same.Alternation
Have you inspected the compilation and assembly code to make sure they are close in each case?Deliberate
@assylias: For the example just follow the link given above. I didn't look at the generated assembly, but I will. I believe I know now what's going on.Alternation
@Vivin Paliath: Could you reproduce my results? I've tried varying the seeds and everything.... it looks like the JIT does it wrong, could you confirm this?Alternation
@maartinus I will try when I get back home! Currently out of town.Oby

It looks like a minor JIT bug. For a small branch probability, it generates something like the following, just much more complicated due to unrolling (I'm simplifying a lot):

movzwl 0x16(%r8,%r10,2),%r9d

Get the char: int r9d = queries[r10]

imul   $0x8acbf29,%r9d,%ebx

Multiply: ebx = 145538857 * r9d

shr    $0x1b,%ebx

Shift: ebx >>>= 27

cmp    %edx,%ebx
jae    somewhere

Check bounds: if (ebx >= edx || ebx < 0) goto somewhere (and throw an IndexOutOfBoundsException there); the single unsigned comparison covers both cases.

cmp    %r9d,%ebx
jne    back

Jump back to loop beginning if not equal: if (r9d != ebx) goto back

inc    %eax

Increment the result: eax++

jne    back

Effectively just goto back: after the increment, eax is never zero, so the jump is always taken.

We can see one smart and one dumb thing:

  • The bounds check gets done optimally, with a single unsigned comparison.
  • It's completely redundant, as x >>> 27 is always non-negative and less than the table length (32); see the quick check below.
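The range claim is easy to verify exhaustively; a throwaway check (not part of the original benchmark, just a sanity check) confirms that the index can never leave the table:

public class ShiftRangeCheck {
    public static void main(String[] args) {
        // (145538857 * c) >>> 27 keeps only the top 5 bits of a 32-bit int,
        // so it must always land in [0, 32) regardless of c.
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            int index = (145538857 * c) >>> 27;
            if (index < 0 || index >= 32) {
                throw new AssertionError("out of range for c=" + c + ": " + index);
            }
        }
        System.out.println("index is always in [0, 32), so the bounds check can never fire");
    }
}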

For a branch probability above 20%, these three instructions

cmp    %r9d,%ebx
jne    back
inc    %eax

get replaced by something like

mov    %eax,%ecx   // ecx = result
inc    %ecx        // ecx++
cmp    %r9d,%ebx   // if (r9d == ebx)
cmove  %ecx,%eax   //    result = ecx

using a conditional move instruction. This is one instruction more, but no branching and thus no branch misprediction penalty.

This explains the nearly constant time in the 20-80% range. The ramp below 20% is clearly due to branch mispredictions.

So it looks like the JIT fails to use the proper threshold, which should be about 0.04 rather than 0.18.

[threshold chart]
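If one wanted to take the decision out of the JIT's hands at the source level, a hypothetical branch-free variant of the benchmark method could look like the following. This is not from the original code, and whether the JIT actually emits a conditional move for it would have to be checked in the generated assembly:

@Benchmark int timeBranchless(int reps) {
    int result = 0;
    while (reps-- > 0) {
        for (final char c : queries) {
            // The ternary gives the compiler an obvious conditional-move target.
            result += diff(c) == 0 ? 1 : 0;
        }
    }
    return result;
}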

Alternation answered 30/10, 2013 at 21:6 Comment(7)
Very interesting. But I think your conclusion about it being a JIT glitch is off - consider you run the same method over and over, but the JIT will probably make its decisions based only on the warmup phase. Due to the structure of your benchmark you bait the JIT into assuming a branching ratio based on this warmup, which is not really representative of the true distribution. As for the array bounds check being superfluous, it's probably a very hard problem to generically prove for an index expression that it will stay within certain bounds.Janie
@Durandal: But the warmup phase is very long, and when trying a different seed the same thing happens. If it were not representative, the JIT would have to err in the other direction too, but that never happened.Alternation
@Durandal: I agree that it's hard in general, but here it's pretty trivial: y = x >>> 27 implies y < 32, just like y = z & 31 does. This shifting is probably not common enough for the implementors to care, but it's usually a better idea than masking.Alternation
I believe the warmup phase ends (as far as the JIT is concerned) when it decides to compile the method. Unless it still collects branching data from compiled code (which I think you would have seen in the assembly code), it will get compiled almost immediately, because the loop should make it cross the compilation threshold very quickly. (Edit: so in essence I think the compilation takes place during the first tested branching ratio.) As for the shift, why do you say it's a better idea than masking? Do you mean in your particular case or in general?Janie
@Durandal: It's more complicated: there's tiered compilation, and the profiling happens by periodically sampling the threads' stacks, so you can't see it in the code. *** Shifting the bits down allows you to get the best bits from a multiplication, and the multiplication is the best bit-spreading operation. In this particular case, I can't see any comparably fast solution using masking. In general, I'd suggest a similar processing when computing the index from a hashCode.Alternation
@Alternation But isn't tiered compilation off by default? It goes from bytecode directly to C2 code. I have also never heard of profiling by asynchronous stack analysis. All the cases I am familiar with are about monitoring particular call sites, loops, etc., with profiling code being branched to explicitly by the interpreter as it reaches particular points in the code.Laynelayney
@Marko Topolnik: I don't know if tiered compilation is on or off by default, maybe I should investigate. Concerning profiling by asynchronous stack analysis, I don't know either - I believe it's doable in this case, so I used it as an argument against "the warmup phase ends ... when it decides to compile" only. However, the JIT is doing something wrong here, agreed?Alternation
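To illustrate the shift-versus-mask point from the last few comments: both helpers below reduce a 32-bit hash to an index into a 32-entry table. The constant and the method names are only examples, not taken from any library.

static int indexByMask(int hash) {
    // keeps the low 5 bits; their quality is whatever the hash function left there
    return hash & 31;
}

static int indexByMultiplyAndShift(int hash) {
    // the multiplication spreads the input bits, and the unsigned shift
    // keeps the top 5 bits of the product
    return (hash * 145538857) >>> 27;
}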
