Strange performance drop of JDK8 LocalDate.toEpochDay
I was curious if we finally get a fast datetime library with JDK8. Nearly all LocalDate computations use toEpochDay so I looked at the source and the large number of divisions and branches made me curious if I could do better.

[benchmark results chart]

I eliminated some branching and all but one division, yet the speedup was worse than expected. So my first question is: how is it possible that an algorithm using multiple divisions takes only about 30 cycles (throughput)? Holger's comments seem to have answered this: division by a small constant gets JIT-compiled into a multiplication. I did the same manually and am now consistently beating the original implementation by a factor of 2.
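For illustration, here is a minimal sketch (not my actual code, and `div100` is a hypothetical helper) of the reciprocal-multiplication trick: dividing a non-negative `int` by 100 can be replaced by a 64-bit multiply with ⌈2^37/100⌉ = 1374389535 followed by a right shift by 37, which is essentially what the JIT emits for a constant divisor.

```java
public class Div100 {
    // Magic constant: ceil(2^37 / 100). The rounding error is 28/2^37 per
    // unit of the dividend, which stays below 1/100 for every non-negative
    // 32-bit value, so the truncated result matches plain integer division.
    private static final long MAGIC = 1374389535L;

    static int div100(int x) {            // assumes x >= 0
        return (int) ((x * MAGIC) >>> 37); // one long multiply + one shift
    }

    public static void main(String[] args) {
        // Sanity check against real division over the edge cases.
        for (int x : new int[] {0, 1, 99, 100, 101, 12345, Integer.MAX_VALUE}) {
            if (div100(x) != x / 100) throw new AssertionError("x=" + x);
        }
        System.out.println("ok");
    }
}
```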

The benchmark is pretty straightforward: just iterate through an array of random LocalDates and convert each of them via toEpochDay. Despite the randomness, the results are pretty consistent. The size of the array is a parameter, and my main question is where the big slowdown between 2000 and 30000 elements comes from. There should be some slowdown once the data no longer fit into the L1 cache, but the memory accesses of both algorithms are exactly the same (i.e., none but fetching the date from the array).

The still open question is: how can the behaviour of two simple, memory-access-free implementations of the same function change when iterating over an array? The original algorithm suffers a much bigger slowdown than mine.

My algorithm is probably not worth copying here; it's undocumented, about as cryptic as the original, and has only a very rudimentary test.

Epifaniaepifano answered 5/2, 2014 at 1:56 Comment(0)
I haven't caught the reason directly, but it is certainly a benchmarking-framework shortcoming, something related to GC and per-invocation costs. I see the same performance degradation with JMH, except that the benchmark with 100 dates shows better performance than the one with 2000 dates, too. I tried creating the dates array always at its maximum size and iterating over just the first 100, 2000, or 30000 elements. In that case all versions performed equally (15.3 ± 0.3 ns on my machine).

import org.openjdk.jmh.annotations.*;

import java.time.LocalDate;
import java.util.*;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;


@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(LocalDateBenchmark.ITERATIONS)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class LocalDateBenchmark {
    public static final int MAX_ITERATIONS = 1000000;
    public static final int ITERATIONS = 30000;

    private static final LocalDate MIN_DATE = LocalDate.of(1900, 1, 1);
    private static final LocalDate MAX_DATE = LocalDate.of(2100, 1, 1);
    private static final int DAYS_BETWEEN = (int) (MAX_DATE.toEpochDay() - MIN_DATE.toEpochDay());

    public LocalDate[] dates = new LocalDate[MAX_ITERATIONS];
    private Random random;

    // Fill the whole array once per trial with random dates in [MIN_DATE, MAX_DATE).
    @Setup(Level.Trial)
    public void setUpAll() {
        Random r = ThreadLocalRandom.current();
        for (int i = 0; i < dates.length; ++i) {
            dates[i] = MIN_DATE.plusDays(r.nextInt(DAYS_BETWEEN));
        }
    }

    @Setup(Level.Iteration)
    public void setUpRandom() {
        random = new Random();
    }

    // @GenerateMicroBenchmark is the pre-1.0 JMH spelling of @Benchmark.
    @GenerateMicroBenchmark
    public int timeToEpochDay(LocalDateBenchmark state) {
        int result = 0;
        LocalDate[] dates = state.dates;
        // Walk a random ITERATIONS-sized window of the full array, so the
        // allocated working set is the same regardless of the measured size.
        int offset = random.nextInt(MAX_ITERATIONS - ITERATIONS);
        for (int i = offset; i < offset + ITERATIONS; i++) {
            LocalDate date = dates[i];
            result += date.toEpochDay();
        }
        return result;
    }
}
Resinous answered 5/2, 2014 at 9:41 Comment(3)
I can't confirm it. For me the results hardly changed. Could you post your benchmark?Epifaniaepifano
I see. But this way you're creating cache misses, fighting the prefetcher, and maybe even defeating range-check elimination. That explains why 100 is the loser and maybe also why the speed stays the same for the other sizes. However, I don't think it explains the original case (and I wouldn't call it a framework glitch when it happens with both Caliper and JMH).Epifaniaepifano
@Epifaniaepifano maybe that's because of the cache? Looping through 100 dates stays within L1.Resinous

That's because there are no divisions in the algorithm: all the / 4 are replaced by shifts, and all the / 100 are actually * 0.01's. The divisions are there for readability (hehe). I'm not sure whether this optimization happens during bytecode emission or JIT compilation; it would be interesting to look at the class file and find out.

Stump answered 5/2, 2014 at 2:26 Comment(7)
There's surely no such optimization in the bytecode (javac does nearly nothing in this respect). In general, /4 can't be replaced by >>2 (because of the round-towards-zero semantics of division), but here it can. Sometimes you could replace integer division by floating-point multiplication and rounding, but I'm not sure it helps. For a general long operand this is most probably impossible because double has only a 53-bit significand.Epifaniaepifano
@maaartinus: if you look at the code you will see an if (y >= 0) right before the divisions. So the JVM can prove that the dividend is non-negative in one case, making the replacement of /4 by >>2 applicable, and even in the other case it can generate the right code to fix it up, e.g. -((-y)>>2) will do if you know that y is negative.Cwm
For the division by 100, there's no need for floating point. Since the source values are int and short and the division is done in the long range, "integer division of small numbers" optimizations can apply. Wikipedia has a brief explanation of the concept. Simply said, /100 is replaced by *X/Y with X and Y chosen so that Y is a power of 2 (hence /Y becomes a shift) and X/Y is close enough to 0.01 for the integer results.Cwm
@Holger: Concerning shift instead of division, yes (I even wrote "here it can" in my confused sentence above). However, this is only one of the four divisions.Epifaniaepifano
@Holger: Concerning the division by a small constant: the dividend (year) is long (at least in the version I linked), and you'd need "multi-long" arithmetic to get an exact result. Or find out that the operand is never really big and add a guard condition. I don't know; it might be profitable, but I'd expect it to be way slower than my cheap solution.Epifaniaepifano
@maaartinus: It is long but initialized with a value coming from an int. The other values are even coming from shorts. That’s what I already wrote; it’s the perfect situation for that optimization as the JVM can prove that the upper bits of the long are unused. Maybe that’s the only reason for using long for y and m at this place.Cwm
I'm blind; I was pretty sure that LocalDate.year was long as well. As if representing billions of years wasn't good enough. ;) This all makes sense now. The same optimization should work even for int, as the JVM can do the conversion itself, but maybe it's not smart enough. I guess you answered my first question: there really are no divisions there!Epifaniaepifano
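The shift fixup discussed in the comments above can be checked directly. A small sketch (the `div4` helper name is hypothetical, not code from LocalDate): a plain `>> 2` rounds toward negative infinity, so it only matches `/ 4` for non-negative operands, while `-((-y) >> 2)` restores round-toward-zero for negative ones.

```java
public class ShiftDiv {
    // y / 4 with Java's round-toward-zero semantics, without a division.
    // For y >= 0 a plain arithmetic shift suffices; for y < 0 the shift
    // would round toward negative infinity, so negate, shift, negate back.
    // (Caveat: -y overflows for Long.MIN_VALUE.)
    static long div4(long y) {
        return y >= 0 ? y >> 2 : -((-y) >> 2);
    }

    public static void main(String[] args) {
        // Compare against real division on both sides of zero.
        for (long y = -1000; y <= 1000; y++) {
            if (div4(y) != y / 4) throw new AssertionError("y=" + y);
        }
        System.out.println("ok");
    }
}
```

Note that `-7 >> 2` alone would give -2, not the -1 that `-7 / 4` yields, which is why the guarded form is needed.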
