SimpleDateFormat succeeds or throws depending on year number
Asked Answered
S

2

5

I'm working with legacy code that tries to parse dates that include an optional time component that's usually padded with zeroes, using the a format string ddMMyy that doesn't really match the input. In the spirit of true legacy code, nobody ever bothered to clean it up because it accidentally does what it's supposed to do. Except, in 2023, it no longer does.

Here's a (drastically simplified) version of the code:

import java.text.ParseException;
import java.text.SimpleDateFormat;

public class WeirdDateFormat {
    public static void main(String...args) throws ParseException {
        var df = new SimpleDateFormat("ddMMyy");
        df.setLenient(false);

        System.out.println(df.parse("09012023000000000"));
        System.out.println(df.parse("09012022000000000"));
    }
}

It prints:

Mon Jan 09 00:00:00 CET 70403584
Exception in thread "main" java.text.ParseException: Unparseable date: "09012022000000000"
    at java.base/java.text.DateFormat.parse(DateFormat.java:399)
    at WeirdDateFormat.main(WeirdDateFormat.java:10)

In other words, the first date (9 January 2023) parses fine, but gives a date with the year 70403584. The second date (9 January 2022) fails to parse and throws an exception.

(If we set lenient to true, the second date doesn't throw but ends up in the year 239492593.)

WTF is happening here? Why does it sometimes fail to parse, and sometimes not? And where do these bizarre year numbers come from?

(Found the issue in production running Java 8, but the behaviour on Java 17 is the same.)

EDIT Yes, I know, I know, I must fix the legacy code. No need to keep telling me, you're preaching to the choir! Unfortunately, my drastically simplified version of the code doesn't show you all the other legacy defects of the code base that also have to be addressed. I just want to understand what's going on here so I'll be better informed when I actually do refactor this code.

Showboat answered 9/1, 2023 at 11:32 Comment(14)
I suspect the difference is some sort of overflow. Create a breakpoint and debug this to find out more. The real question is why you don't solve the actual problem and cut off the time part. And why the pattern contains 2-digit years but your input has 4-digit years.Biosynthesis
ddMMyy doesn't really match your input - but you really shouldn't be using SimpleDateFormat anymore and instead prefer DateTimeFormatter insteadIndices
I changed it to ddMMyyyy and 09012022 and it worksIndices
Yes, I know that SimpleDateFormat is not very good, and I know that the input doesn't match the format, and that the input is weird in the first place. This is just what I encountered in a legacy codebase, and I'm simply curious to understand what's happening in the existing code, before I attempt to refactor.Showboat
So, using yyyy and 2023000000000 it generates a year of 70403584 so I'm "assuming" that's it's trying to consume all the trailing 0sIndices
If this is a bug in your legacy code and you need to fix the bug and test the code again, how much code wold you need to rewrite in order to avoid using SimpleDateFormat here? Certainly an option I would seriously consider.Statuette
I have reproduced. I agree with the others, though, that if 2023 in your string is the year, then the you need yyyy (not yy) to parse it.Statuette
@Showboat As a fellow dutchie, note that your legacy code is intricately bound up with fairly careless conversion between epoch-milli form and y/m/d form (GregorianCalendar and Date contain the careless code, to be clear), and very recent versions of the JVM (if you're on coretto 17, you'll have this mess, most other JDK editions don't, yet, but will soon), Europe/Amsterdam is broken, and any birthdates pre-1940 will shift by a day. Perhaps this disaster will let you claim the resources to fully refactor away from the legacy here.Custody
Which is the desired behaviour in your production code? Asking because both strings begin with 090120 , which appears to me to be correct for 9 January 2020. Are you wanting the exception or do you want parsing without exception? In the latter case, do you need to get the year correct (2023 and 2020 in your two examples)?Statuette
@Custody We're still stuck on Java 8 for the foreseeable future and also we're dealing with dates no earlier than 1990, so unfortunately this probably won't sway the powers that be. But thanks for the heads-up though!Showboat
This parses correctly since Java 9: LocalDateTime.parse("09012020000000000", DateTimeFormatter.ofPattern("ddMMuuuuHHmmssSSS")). Same if year is 2020 instead of 2023. In Java 8 it’s just a little more complicated.Statuette
@OleV.V. Fun fact, the desired behaviour in this case is the exception, which will cause the parser to fall-through to a second date format to attempt, which uses yyyy instead of yy. I honestly can't think of a better example for programming-by-coincidence than this.Showboat
So the quick fix would be to throw the exception manually regardless of the string you get (not knowing the rest of your code it’s hard for me to tell whether there is a better solution, I hope there is).Statuette
@Showboat for what its worth, if you want 'least impact', you could clone SimpleDateFormat and replace line 1940 in SimpleDateFormat, which reads: value = number.intValue(); with value = number.longValue() > Integer.MAX_VALUE ? -1 : number.intValue();, and use this cloned copy...Custody
C
12

OBVIOUS ANSWER: Get rid of this obsolete crud that never worked properly and do it right.

Let's first explain this result

I'm not sure why, but you wanted to know why this happens. I dived into the source of SimpleDateFormat for you.

Given that the yy part is the last, it takes all remaining digits. Thus, the "2022000000000" part of the string is parsed into a Long of value 2022000000000. This is then immediately converted to an int, and that's quite problematic; 2022000000000 overflows and turns into int value -929596416. A standard java.text.CalendarBuilder instance is then told to set its YEAR field to that value (-929596416). Which is fine.

When parsing is done, that builder is asked to produce a GregorianCalendar value. This doesn't work - the GregorianCalendar accepts -929596416 as YEAR value just fine, but SimpleDateFormat then asks this GregCal instance to calculate the time in millis since the epoch, and that fails; an exception throws an exception indicating this. This exception is caught by the SimpleDateFormat code and results in the Unparseable date exception that you are getting.

With 2023, you get the same effect: That is turned into an int without checking if it overflows; that overflows just the same, and results in int value 70403584. GregorianCalendar DOES accept this year. This then results in what you saw: Year 70403584 - which is explained as follows:

long y = 2023000000000L;
int i = (int) y;
System.out.println(i); // prints 70403584

A deeper dive then is: Why is 70403584 fine, and -929596416 isn't?

Mostly, 'because'. The GregCal internal methods getMinimum(field) and getMaximum(field), when passing the YEAR field (constant value 1) are respectively 1 and 292278994. That means 70403584 is accepted, and -929596416 is not. You told it to be non-lenient. "Lenient" here (the old j.u.Calendar stuff) is mostly a silly concept (trying to define what is acceptable in non-lenient mode is virtually impossible. Various utterly ridiculous dates nevertheless are acceptable even in non-lenient mode).

We can verify this:

GregorianCalendar cal = new GregorianCalendar();
cal.setLenient(false);
cal.set(Calendar.YEAR, -5);
System.out.println(cal.getTime());

gives you:

Exception in thread "main" java.lang.IllegalArgumentException: YEAR
    at java.base/java.util.GregorianCalendar.computeTime(GregorianCalendar.java:2609)
    at java.base/java.util.Calendar.updateTime(Calendar.java:3411)
    at java.base/java.util.Calendar.getTimeInMillis(Calendar.java:1805)
    at java.base/java.util.Calendar.getTime(Calendar.java:1776)

THE EXECUTIVE CONCLUSION: If you were expecting lenient mode to reject these patterns, I have some nasty news for you: non-lenient mode does not work and never did and you should not be relying on it. Specifically here, overflows are not checked (you'd think that in non-lenient mode, any overflow of any value means the value is rejected, but, alas), and 2023000000000 so happens to overflow into a ridiculous but nevertheless, acceptable (even in non-lenient) year, whereas 2022000000000 does not.

So how do you fix this?

You can't. SimpleDateFormat and GregorianCalendar are horrible API and broken implementations. The only fix is to ditch it. Use java.time. Make a new formatter using java.time.DateTimeFormatter, parse this value into a LocalDate, and go from there. You'll solve a whole host of timezone related craziness on the fly, too! (Because java.util.Date is lying and doesn't represent dates. It represents instants, hence why .getYear() and company are deprecated, because you can't ask an instant for a year without a timezone, and Date doesn't have one. Calendar is intricately interwoven with it all - hence, storing dates on one timezone and reading them on another causes wonkiness. LocalDate avoids all that).

EDIT: As a fellow dutchie, note that the most recent JDKs break the Europe/Amsterdam timezone (grumble grumble OpenJDK team doesn't understand what damage they are causing) - which means any conversion between epoch-millis and base dates is extra problematic for software running in dutch locales. For example, if you are storing birthdates and you dip through conversion like this, everybody born before 1940 will break and their birthday will shift by a day. LocalDate avoids this by never storing anything as epoch-millis in the first place.

Custody answered 9/1, 2023 at 12:6 Comment(0)
J
7

The reason is integer overflow. But seriously, fix the legacy code.

jshell> (int)9012023000000000L
$1 ==> 496985600

jshell> (int)9012022000000000L
$2 ==> -503014400

A negative year is out of range for the year component. A crazy large year is still considered valid.

Set a breakpoint on line 1543 of SimpleDateFormat.java. The exception itself is thrown in Line 2583 of GregorianCalendar.java:

for (int field = 0; field < FIELD_COUNT; field++) {
    int value = internalGet(field);
    if (isExternallySet(field)) {
        // Quick validation for any out of range values
        if (value < getMinimum(field) || value > getMaximum(field)) {
            throw new IllegalArgumentException(getFieldName(field));
        }
    }
    originalFields[field] = value;
}

The actual "year" value that ends up in this method is -929596416 ((int)2022000000000L). Year 2023 does not throw because the long value coerced into an integer ends up being 70403584 – which is allowed. The GregorianCalendar rejects values smaller than 1 (< 1) and greater than 292278994 (> 292278994).

Jefferey answered 9/1, 2023 at 12:3 Comment(2)
So why the ParseException on 9012022000000000 then?Showboat
@Showboat because 2022 results in a negative value, 2023 results in a positive value.Jefferey

© 2022 - 2024 — McMap. All rights reserved.