Java 21 problem with DateFormat.getDateTimeInstance().format(new Date())

This is my code

import java.util.Date;
import java.text.DateFormat;

class DateTime {
    public static void main(String[] args) {
        String dt = DateFormat.getDateTimeInstance().format(new Date());
        System.out.println(dt);
    }
}

When compiled and executed with Java 21, the call to 'format()' returns a UTF-16 string containing invalid bytes, represented by a question mark:

Oct 3, 2023, 7:01:17?PM

Has anyone else seen this problem? Thanks.

Genesis answered 3/10, 2023 at 23:14 Comment(6)
First, I'd suggest using the newer java.time.* APIs over the older Date/Calendar APIs. - Meteoric
I get the same result. Further inspection shows it is U+202F NARROW NO-BREAK SPACE. My locale is en_US. - Underpay
What is your JVM's locale? - Theotokos
Just tried the same thing with OpenJDK 17, got U+0020 SPACE, as expected. My JDK 21 is also OpenJDK. - Underpay
Looks like a "feature" since Java 20, see bugs.openjdk.org/browse/JDK-8304925. It is related to the Unicode CLDR update 42. - Folkway
Where exactly does that ? character appear, in what tool or app? Is this on a console from the System.out.println? If so, where is that code running, such as Terminal.app on a Mac, or in an IDE like IntelliJ? - Kronstadt

New feature, not a bug

The Answer by David Conrad is correct. What you are seeing is a new feature, not a bug.

New version of CLDR

The localization rules defined in the Unicode Consortium’s Common Locale Data Repository (CLDR) are continually evolving. Modern Java relies upon the CLDR as its main source of localization rules. So new versions of the CLDR bring new behaviors in Java.

This is life in the real world. Never harden your expectation of localized values. Those localizations may change in future versions of the CLDR, Java, and human cultures.

If localization behavior is critical to some logic in your code, write unit tests to verify that behavior.
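
As a rough illustration of such a test, here is a minimal sketch that formats a fixed moment and checks only the parts the surrounding logic would rely on, after mapping any Unicode space character to a plain U+0020. The class name `LocalizedFormatCheck` and its helper methods are hypothetical, chosen for this example; in a real project the check would likely live in your test framework rather than in `main`.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class LocalizedFormatCheck {

    // Format a fixed moment so the result does not depend on the clock.
    static String formatFixedMoment() {
        ZonedDateTime zdt = Instant.parse("2023-10-03T19:01:17Z").atZone(ZoneOffset.UTC);
        DateTimeFormatter f = DateTimeFormatter
                .ofLocalizedDateTime(FormatStyle.MEDIUM)
                .withLocale(Locale.US);
        return zdt.format(f);
    }

    // Replace every code point that Java classifies as a Unicode space
    // character (SPACE, NO-BREAK SPACE, NARROW NO-BREAK SPACE, ...) with U+0020.
    static String normalizeSpaces(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> sb.appendCodePoint(Character.isSpaceChar(cp) ? ' ' : cp));
        return sb.toString();
    }

    public static void main(String[] args) {
        String normalized = normalizeSpaces(formatFixedMoment());
        // Check only the fragment the application logic depends on,
        // not the entire localized string.
        if (!normalized.contains("7:01:17 PM")) {
            throw new AssertionError("Unexpected localized time: " + normalized);
        }
        System.out.println("OK: " + normalized);
    }
}
```

Note that `Character.isSpaceChar` is used rather than `Character.isWhitespace`, because the latter deliberately excludes the non-breaking spaces.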

Detecting NNBSP character

We can verify Conrad’s claim that you are indeed seeing a U+202F NARROW NO-BREAK SPACE (NNBSP). Let's examine each character in your output.

We can inspect each character to get its number assigned by the Unicode Consortium, its code point. Our NNBSP character has a code point of 8,239 decimal, 202F hex.

String dt = DateFormat.getDateTimeInstance ( ).format ( new Date ( ) );
System.out.println ( dt );
String codePoints = dt.codePoints ( ).boxed ( ).toList ( ).toString ( );
System.out.println ( "codePoints = " + codePoints );

When run:

Oct 3, 2023, 6:02:35 PM
codePoints = [79, 99, 116, 32, 51, 44, 32, 50, 48, 50, 51, 44, 32, 54, 58, 48, 50, 58, 51, 53, 8239, 80, 77]

Sure enough, we see the 8239 of our NNBSP is third from the end, before the P and the M.

Change is good

I would like to add a note about this change in the CLDR: it is a good one, and it makes sense. Typographically, the AM/PM of a time-of-day should never be separated from the hours-minutes-seconds. Wrapping AM/PM onto another line makes for clumsy reading, so using a non-breaking space rather than a plain breaking space makes sense. Whether that space should also be "thin" is a judgement I'll leave to the typography experts, but I gather that makes sense as well.

Solution: Fix your console

The immediate solution to your problem of a ? replacement character appearing is to 👉🏾 change the character-encoding of your console app. Whatever console app you are using (which you neglected to mention in your Question) is apparently configured for some archaic character encoding rather than a modern Unicode-savvy character encoding such as UTF-8.

Change the character encoding of your console app (see Comment). Then your errant ? should appear as the true character, a thin non-breaking space.
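
To see why a ? shows up, the following sketch captures the bytes a `PrintStream` emits for the NNBSP under different charsets. The class name `ConsoleEncodingDemo` and the method `bytesPrinted` are made up for this example, and the `PrintStream(OutputStream, boolean, Charset)` constructor it uses assumes Java 10 or later. A charset that cannot represent U+202F substitutes a ? byte, while UTF-8 writes the three bytes E2 80 AF.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class ConsoleEncodingDemo {

    // Return the bytes a PrintStream would write for the text in the given charset.
    static byte[] bytesPrinted(String text, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (PrintStream ps = new PrintStream(buffer, true, charset)) {
            ps.print(text);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String nnbsp = "\u202F";

        // US-ASCII cannot represent U+202F, so a single '?' byte is written.
        System.out.println("US-ASCII bytes: " + bytesPrinted(nnbsp, StandardCharsets.US_ASCII).length);

        // UTF-8 encodes U+202F as three bytes.
        System.out.println("UTF-8 bytes: " + bytesPrinted(nnbsp, StandardCharsets.UTF_8).length);

        // Writing through a UTF-8 PrintStream sends the real character to the console.
        PrintStream utf8Out = new PrintStream(System.out, true, StandardCharsets.UTF_8);
        utf8Out.println("7:01:17" + nnbsp + "PM");
    }
}
```

Wrapping `System.out` this way only controls what Java writes. The console itself must still decode UTF-8 and use a font that contains the character.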


Avoid legacy date-time classes

You are using terribly flawed date-time classes that were years ago supplanted by the modern java.time defined in JSR 310. This use of legacy date-time classes should be avoided, instead using java.time for date-time work.

Your choice of legacy classes is not a factor in the particular issue of your Question. But just FYI, let me show you the modern version of your code.

An Instant object represents a moment as seen in UTC, that is, with an offset from UTC of zero hours-minutes-seconds. You can adjust that moment into a time zone, obtaining a ZonedDateTime. Same point on the timeline, but different wall-clock time/calendar.

Instant instant = Instant.now ( ); // `java.util.Date` was years ago replaced by `java.time.Instant`.
ZoneId z = ZoneId.of ( "Asia/Tokyo" );  // Or, `ZoneId.systemDefault()`.
ZonedDateTime zdt = instant.atZone ( z );
Locale locale = Locale.US;  
DateTimeFormatter f = DateTimeFormatter.ofLocalizedDateTime ( FormatStyle.MEDIUM ).withLocale ( locale );
String output = zdt.format ( f );
System.out.println ( "output = " + output );
System.out.println ( output.codePoints ( ).boxed ( ).toList ( ).toString ( ) );

When run:

output = Oct 4, 2023, 10:21:32 AM
[79, 99, 116, 32, 52, 44, 32, 50, 48, 50, 51, 44, 32, 49, 48, 58, 50, 49, 58, 51, 50, 8239, 65, 77]

We see the same 8239 before the A and the M.

We can examine the characters by their official Unicode names.

output.codePoints ( ).mapToObj ( Character :: getName ).forEach ( System.out :: println );

When run (at a later moment than the example above, so the date and time digits differ):

LATIN CAPITAL LETTER O
LATIN SMALL LETTER C
LATIN SMALL LETTER T
SPACE
DIGIT FIVE
COMMA
SPACE
DIGIT TWO
DIGIT ZERO
DIGIT TWO
DIGIT THREE
COMMA
SPACE
DIGIT ONE
DIGIT ZERO
COLON
DIGIT ZERO
DIGIT TWO
COLON
DIGIT TWO
DIGIT SIX
NARROW NO-BREAK SPACE
LATIN CAPITAL LETTER A
LATIN CAPITAL LETTER M

Notice the NARROW NO-BREAK SPACE, third from last.

And we can examine the characters by their code point in hexadecimal rather than decimal.

output.codePoints ( ).mapToObj ( ( int codePoint ) -> String.format ( "U+%04X" , codePoint ) ).forEach ( System.out :: println );

When run (again at a different moment, so the time digits differ):

U+004F
U+0063
U+0074
U+0020
U+0035
U+002C
U+0020
U+0032
U+0030
U+0032
U+0033
U+002C
U+0020
U+0031
U+0030
U+003A
U+0030
U+0035
U+003A
U+0031
U+0037
U+202F
U+0041
U+004D

Notice the U+202F, third from last.


For Unicode geeks

This topic turns out to be an interesting can of worms for Unicode geeks like me.

Section 1 of the Unicode Consortium document "Proposal to synchronize the Core Specification" explains that character U+202F NARROW NO-BREAK SPACE (NNBSP) has been incorrectly described as a narrow version of U+00A0 NO-BREAK SPACE. This means the Width variation section of the Non-breaking space page on Wikipedia is incorrect. That Unicode document argues that NNBSP is actually a non-breaking version of U+2009 THIN SPACE.

Another interesting note in that document is that the NNBSP character has largely served two purposes. I quote (my bullets):

  • The NNBSP can be used to represent the narrow space occurring around punctuation characters in French typography, which is called an “espace fine insécable.”
  • It is used especially in Mongolian text, before certain grammatical suffixes, to provide a small gap that not only prevents word breaking and line breaking, but also triggers special shaping for those suffixes.

Apparently we can now add a third major use to this list: date-time formats defined by the CLDR.

Kronstadt answered 4/10, 2023 at 1:9 Comment(8)
Another way I use to check the code points that make up a string is dt.codePoints().mapToObj(Character::getName).toList(). Handy. - Underpay
Or dt.codePoints().mapToObj(cp -> String.format("U+%04X", cp)).toList() - Underpay
…and combined: output.codePoints().forEach(codePoint -> System.out.printf( "%s (U+%04X)%n" , Character.getName(codePoint), codePoint)); - Comptometer
For folks on Windows using cmd and wondering "how to fix your console", this process worked for me: Windows Settings > Time & Region > Language & Region > Administrative Language Settings > Change System locale... Then check "Beta: Use Unicode UTF-8 for worldwide language support". You'll also want to make sure your console's font includes the NNBSP character. - Dicho
Thanks. The latest Android Galaxy Tablet update introduced this dumb non-breaking space, and it broke our tests. We already take care of typesetting and preventing the string from breaking. Bad feature. Unwanted. Should have been a pattern for those who need it. - Gang
@Gang Lesson learned: Do not write tests for exact localized values. Localizations change, sometimes through evolution of human language and cultural norms, sometimes through correction of mistakes. Write approximate checks: look for string length to be non-empty, look for particular words/numbers likely to always be present (to not evolve away). - Kronstadt
Lesson learned: I write tests for anything that moves, and we pin down the locale because we have an engineering app and all output must be consistent. NNBSP is a typesetting feature, not part of default date formatting, and, like I said, we already handle the typesetting correctly with CSS 'nowrap'. - Gang
@Gang If you want precise reliable textual representation of date-time values, use only standard formats such as ISO 8601. Localization is for human reading, not machine reading. - Kronstadt

A change made in JDK 20 upgraded the locale data to CLDR version 42 from the Unicode Common Locale Data Repository. That version changed the space before the AM/PM marker from a plain space to U+202F NARROW NO-BREAK SPACE (NNBSP).

Bug JDK-8304925 has been filed, but the workarounds listed amount to: get used to it, ask Unicode to revert the change (unlikely), or

Use the legacy locale data by designating -Djava.locale.providers=COMPAT at the launcher command line. (This option limits some newer functionalities though.)
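
If the goal is output that does not shift with CLDR updates, another option (not one of the workarounds listed in the bug report) is to stop relying on the localized styles. The sketch below shows two approaches under that assumption: an explicit pattern whose separators are literal U+0020 spaces, and ISO 8601 text. The class name `StableOutputDemo` and its methods are hypothetical. Note that with an explicit pattern, the month abbreviation and the AM/PM text still come from locale data.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical demo class, for illustration only.
class StableOutputDemo {

    // Explicit pattern: the spaces and commas are literals chosen here,
    // not taken from the CLDR's localized format styles.
    static final DateTimeFormatter FIXED_PATTERN =
            DateTimeFormatter.ofPattern("MMM d, uuuu, h:mm:ss a", Locale.US);

    static String fixedPattern(ZonedDateTime zdt) {
        return zdt.format(FIXED_PATTERN);
    }

    // ISO 8601 text, suited to logs, data exchange, and machine parsing.
    static String iso(ZonedDateTime zdt) {
        return zdt.format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
    }

    public static void main(String[] args) {
        ZonedDateTime zdt = Instant.now().atZone(ZoneId.systemDefault());
        System.out.println(fixedPattern(zdt));
        System.out.println(iso(zdt));
    }
}
```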

Underpay answered 4/10, 2023 at 0:1 Comment(0)
