Extremely slow parsing of time zone with the new java.time API
Asked Answered
S

2

31

I was just migrating a module from the old java dates to the new java.time API, and noticed a huge drop in performance. It boiled down to parsing of dates with timezone (I parse millions of them at a time).

Parsing of date string without a time zone (yyyy/MM/dd HH:mm:ss) is fast - about 2 times faster than with the old java date, about 1.5M operations per second on my PC.

However, when the pattern contains a time zone (yyyy/MM/dd HH:mm:ss z), the performance drops about 15 times with the new java.time API, while with the old API it is about as fast as without a time zone. See the performance benchmark below.

Does anyone have an idea if I can somehow parse these strings quickly using the new java.time API? At the moment, as a workaround, I am using the old API for parsing and then convert the Date to Instant, which is not particularly nice.

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OperationsPerInvocation;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@OutputTimeUnit(TimeUnit.MILLISECONDS)
@BenchmarkMode(Mode.AverageTime)
@OperationsPerInvocation(1)
@Fork(1)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@State(Scope.Thread)
public class DateParsingBenchmark {

    private final int iterations = 100000;

    @Benchmark
    public void oldFormat_noZone(Blackhole bh, DateParsingBenchmark st) throws ParseException {

        SimpleDateFormat simpleDateFormat = 
                new SimpleDateFormat("yyyy/MM/dd HH:mm:ss");

        for(int i=0; i<iterations; i++) {
            bh.consume(simpleDateFormat.parse("2000/12/12 12:12:12"));
        }
    }

    @Benchmark
    public void oldFormat_withZone(Blackhole bh, DateParsingBenchmark st) throws ParseException {

        SimpleDateFormat simpleDateFormat = 
                new SimpleDateFormat("yyyy/MM/dd HH:mm:ss z");

        for(int i=0; i<iterations; i++) {
            bh.consume(simpleDateFormat.parse("2000/12/12 12:12:12 CET"));
        }
    }

    @Benchmark
    public void newFormat_noZone(Blackhole bh, DateParsingBenchmark st) {

        DateTimeFormatter dateTimeFormatter = new DateTimeFormatterBuilder()
                .appendPattern("yyyy/MM/dd HH:mm:ss").toFormatter();

        for(int i=0; i<iterations; i++) {
            bh.consume(dateTimeFormatter.parse("2000/12/12 12:12:12"));
        }
    }

    @Benchmark
    public void newFormat_withZone(Blackhole bh, DateParsingBenchmark st) {

        DateTimeFormatter dateTimeFormatter = new DateTimeFormatterBuilder()
                .appendPattern("yyyy/MM/dd HH:mm:ss z").toFormatter();

        for(int i=0; i<iterations; i++) {
            bh.consume(dateTimeFormatter.parse("2000/12/12 12:12:12 CET"));
        }
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder().include(DateParsingBenchmark.class.getSimpleName()).build();
        new Runner(opt).run();    
    }
}

And the results for 100K operations:

Benchmark                                Mode  Cnt     Score     Error  Units
DateParsingBenchmark.newFormat_noZone    avgt    5    61.165 ±  11.173  ms/op
DateParsingBenchmark.newFormat_withZone  avgt    5  1662.370 ± 191.013  ms/op
DateParsingBenchmark.oldFormat_noZone    avgt    5    93.317 ±  29.307  ms/op
DateParsingBenchmark.oldFormat_withZone  avgt    5   107.247 ±  24.322  ms/op

UPDATE:

I just did some profiling of the java.time classes, and indeed, the time zone parser seems to be implemented quite inefficiently. Just parsing a standalone timezone is responsible for all the slowness.

@Benchmark
public void newFormat_zoneOnly(Blackhole bh, DateParsingBenchmark st) {

    DateTimeFormatter dateTimeFormatter = new DateTimeFormatterBuilder()
            .appendPattern("z").toFormatter();

    for(int i=0; i<iterations; i++) {
        bh.consume(dateTimeFormatter.parse("CET"));
    }
}

There is a class called ZoneTextPrinterParser in the java.time bundle, which is internally making a copy of the set of all available time zones in every parse() call (via ZoneRulesProvider.getAvailableZoneIds()), and this is accountable for 99% of the time spent in the zone parsing.

Well, an answer then might be to write my own zone parser, which would not be too nice either, because then I could not build the DateTimeFormatter via appendPattern().

Southdown answered 19/12, 2015 at 19:21 Comment(16)
I can confirm that in the class ZoneTextPrinterParser line 3718 calls the static method ZoneRulesProvider.getAvailableZoneIds() which is return new HashSet<>(ZONES.keySet());. However this is called only once every parse invocation for a time zone. Creating the set seems to be time consuming as it contains hundreds of objects.Mcdermott
Maybe this code has changed recently? I was testing it with java 1.8.60. And there, the ZoneRulesProvider.getAvailableZoneIds() is invoked on the line 3715, and it is invoked unconditionally every time, not only once per zone.Southdown
I believe that all the source code for all the versions is available online and it will be easy to check. It's just that right now I'm writing an answer for a potential workaround. By the way, I am looking at 1.8.25.Mcdermott
What I found is that in version 8u40-b25 It is 3718 and in version 8-b132 it is 3695. So definitely something changed in the enclosing DateTimeFormatterBuilder, though I'm not sure what and what effect it has on the behavior of the invocation.Mcdermott
Your benchmark is flawed. You want to test the parsing, but you're testing the both the creation of the parser and the parsing. To make the benchmark test correct, you should test the parser creation independently of the parsing. Benchmarks should only ever test one thing. Note: I'm not saying the results will change much, but simply that your benchmark doesn't respect the good practices.Pung
You are right, I am creating the parsers in the benchmark methods and it is not a good practice in general. In this case, however, it does not change the result at all (I parse 100000 times, and create the parser once). I believe that it is appropriate for this case, because it helps everyone to see quickly what is happening there, as all related code is on one place. I preferred readability over strict correctness here.Southdown
@OlivierGrégoire When I went through the source code with a debugger I manually timed each call with a stopwatch and confirmed the OP's results :)Mcdermott
@Mcdermott that's not how benchmarking should be done. Ever. The OP is right about the method, just not correct about the implementation.Pung
@OlivierGrégoire It's a Rolex, I'm pretty sure it's accurate...Mcdermott
@JanXMarek yes, but had you implemented it using good practices, you'd have found and isolated the issue before your update.Pung
@Mcdermott did I miss something? Are you speaking about watches? That's what you meant by "manually"? That's even worse than I thought...Pung
OpenJDK bug 8066291 seems related. The complaint there is from ZoneIdPrinterParser instead, but it appears to be the same issue.Prude
@JeffreyBosboom yes, that's the same issue, and it looks like it has a similar solution - moving the getAvailableZoneIds() call so that it gets invoked less frequently. I do not have an account in that Jira to add a comment though.Southdown
@JeffreyBosboom Yup. The calls are inside getTree of both ZoneIdPrinterParser and ZoneTextPrinterParser. Same issue, good find. I doubt it will be fixed before JDK 9, though.Mcdermott
@Mcdermott You seem to be quite an expert in this area, I am surely missing something important, please correct me if I am wrong. Why does the ZoneRulesProvider.getAvailableZoneIds() make a copy of the zones map into a new hashmap? I know it needs to be threadsafe. But the underlying map is a ConcurrentHashMap. So, it could be wrapped with Collections.unmodifiableSet() instead of being copied. It's better to fix the underlying method (which is a part of the public API BTW), rather than thweaking a handful of particular parsers that depend on it.Southdown
@JanXMarek It it difficult to know why it is written that way exactly, what with this being a concurrency issue that is easy to be wrong about. I'm not sure if you're talking about fixing it by Oracle or by the user. In my answer I suggested subclassing ZoneRulesProvider which would allow you to override this method as well, so you can try to do what you suggested. Oracle will "have to" figure this out by themselves.Mcdermott
M
11

As noted in your question and in my comment, ZoneRulesProvider.getAvailableZoneIds() creates a new set of all the available time zones' string representation (the keys of the static final ConcurrentMap<String, ZoneRulesProvider> ZONES) each time a time zone needs to be parsed.1

Fortunately, a ZoneRulesProvider is an abstract class which is designed to be subclassed. The method protected abstract Set<String> provideZoneIds() is responsible for populating ZONES. Thus, a subclass can provide only the needed time zones if it knows ahead of time of all time zones to be used. Since the class will provide less entries than the default provider, which contains hundreds of entries, it has the potential to significantly reduce the invocation time of getAvailableZoneIds().

The ZoneRulesProvider API provides instructions on how to register one. Note that providers can't be deregistered, only supplemented, so it is not a simple matter of removing the default provider and adding your own. The system property java.time.zone.DefaultZoneRulesProvider defines the default provider. If it returns null (via System.getProperty("...") then the JVM's notorious provider is loaded. Using System.setProperty("...", "fully-qualified name of a concrete ZoneRulesProvider class") one can supply their own provider, which is the one discussed in the 2nd paragraph.

To conclude, I suggest:

  1. Subclass the abstract class ZoneRulesProvider
  2. Implements the protected abstract Set<String> provideZoneIds() with only the needed time zones.
  3. Set the system property to this class.

I did not do it myself, but I am sure it will fail for some reason think it will work.


1 It is suggested in the comments of the question that the exact nature of the invocation might have changed between 1.8 versions.

Edit: more information found

The aforementioned default ZoneRulesProvider is final class TzdbZoneRulesProvider located in java.time.zone. The regions in that class are read from the path: JAVA_HOME/lib/tzdb.dat (in my case it's in the JDK's JRE). That file indeed contains many regions, here is a snippet:

 TZDB  2014cJ Africa/Abidjan Africa/Accra Africa/Addis_Ababa Africa/Algiers 
Africa/Asmara 
Africa/Asmera 
Africa/Bamako 
Africa/Bangui 
Africa/Banjul 
Africa/Bissau Africa/Blantyre Africa/Brazzaville Africa/Bujumbura Africa/Cairo Africa/Casablanca Africa/Ceuta Africa/Conakry Africa/Dakar Africa/Dar_es_Salaam Africa/Djibouti 
Africa/Douala Africa/El_Aaiun Africa/Freetown Africa/Gaborone 
Africa/Harare Africa/Johannesburg Africa/Juba Africa/Kampala Africa/Khartoum 
Africa/Kigali Africa/Kinshasa Africa/Lagos Africa/Libreville Africa/Lome 
Africa/Luanda Africa/Lubumbashi 
Africa/Lusaka 
Africa/Malabo 
Africa/Maputo 
Africa/Maseru Africa/Mbabane Africa/Mogadishu Africa/Monrovia Africa/Nairobi Africa/Ndjamena 
Africa/Niamey Africa/Nouakchott Africa/Ouagadougou Africa/Porto-Novo Africa/Sao_Tome Africa/Timbuktu Africa/Tripoli Africa/Tunis Africa/Windhoek America/Adak America/Anchorage America/Anguilla America/Antigua America/Araguaina America/Argentina/Buenos_Aires America/Argentina/Catamarca  America/Argentina/ComodRivadavia America/Argentina/Cordoba America/Argentina/Jujuy America/Argentina/La_Rioja America/Argentina/Mendoza America/Argentina/Rio_Gallegos America/Argentina/Salta America/Argentina/San_Juan America/Argentina/San_Luis America/Argentina/Tucuman America/Argentina/Ushuaia 
America/Aruba America/Asuncion America/Atikokan America/Atka 
America/Bahia

Then If one finds a way to create a similar file with only the needed zones and load that one instead, the performance issues will probably not surely be resolved.

Mcdermott answered 19/12, 2015 at 23:52 Comment(1)
I see. Thanks for your detailed answer. It helped me to dig a bit deeper and learn something about the java time implementation. There is no way to plug my own parser into the DateTimeFormatter builder via API, and the only way really seems to be bending the ZoneRulesProvider as you suggest. I think I will not go this way for now, it would require some extra deployment configuration, and might interfere with other parts of the system. Using the old API for parsing is all right for now, and maybe one day the new parser will be optimized.Southdown
O
5

This problem is caused by ZoneRulesProvider.getAvailableZoneIds() which copied the set of time-zones each time. Bug JDK-8066291 tracked the issue, and it has been fixed in Java SE 9. It will not be backported to Java SE 8 because the bug fix involved a specifiation change (the method now returns an immutable set instead of a mutable one).

As a side note, some other performance issues with parsing have been backported to Java SE 8, so always use the latest update release.

Oreilly answered 16/3, 2017 at 9:49 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.