New feature, not a bug
The Answer by David Conrad is correct. What you are seeing is a new feature, not a bug.
New version of CLDR
The localization rules defined in the Unicode Consortium’s Common Locale Data Repository (CLDR) are continually evolving. Modern Java relies upon the CLDR as its main source of localization rules. So new versions of the CLDR bring new behaviors in Java.
This is life in the real world. Never harden your expectation of localized values. Those localizations may change in future versions of the CLDR, Java, and human cultures.
If localization behavior is critical to some logic in your code, write unit tests to verify that behavior.
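As a rough illustration of such a test, here is a minimal sketch in plain Java, without a test framework. The class name `LocalizedFormatCheck` and both helper methods are my own inventions for this example, and accepting either a plain SPACE or an NNBSP before the `PM` is an assumption; tighten the check to whatever your own logic actually depends on.

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

// Hypothetical helper class, named only for this sketch.
class LocalizedFormatCheck {

    // Formats a fixed moment, so the result does not depend on the current clock.
    static String formatFixedMoment ( ) {
        ZonedDateTime zdt = ZonedDateTime.of ( 2023 , 10 , 3 , 18 , 2 , 35 , 0 , ZoneId.of ( "UTC" ) );
        DateTimeFormatter f = DateTimeFormatter.ofLocalizedDateTime ( FormatStyle.MEDIUM ).withLocale ( Locale.US );
        return zdt.format ( f );
    }

    // Returns the code point immediately before the last occurrence of the marker text.
    static int separatorBefore ( String formatted , String marker ) {
        int index = formatted.lastIndexOf ( marker );
        if ( index <= 0 ) {
            throw new IllegalArgumentException ( "No character found before marker: " + marker );
        }
        return formatted.codePointBefore ( index );
    }

    public static void main ( String[] args ) {
        String formatted = formatFixedMoment ( );
        int separator = separatorBefore ( formatted , "PM" );
        // Assumption: either a plain SPACE or an NNBSP is acceptable to the calling code.
        boolean acceptable = ( separator == 0x0020 ) || ( separator == 0x202F );
        System.out.println ( formatted + " -> separator " + String.format ( "U+%04X" , separator ) );
        if ( ! acceptable ) {
            throw new AssertionError ( "Unexpected separator before PM: " + String.format ( "U+%04X" , separator ) );
        }
    }
}
```

Run a check like this against each JDK you deploy on. If a future CLDR update changes the separator again, the check fails loudly instead of your logic failing quietly.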
Detecting NNBSP character
We can verify Conrad’s claim that you are indeed seeing a U+202F NARROW NO-BREAK SPACE (NNBSP). Let's examine each character in your output.
We can inspect each character to get its number assigned by the Unicode Consortium, its code point. Our NNBSP character has a code point of 8,239 decimal, 202F hex.
import java.text.DateFormat;
import java.util.Date;

String dt = DateFormat.getDateTimeInstance ( ).format ( new Date ( ) );
System.out.println ( dt );
String codePoints = dt.codePoints ( ).boxed ( ).toList ( ).toString ( );
System.out.println ( "codePoints = " + codePoints );
When run:
Oct 3, 2023, 6:02:35 PM
codePoints = [79, 99, 116, 32, 51, 44, 32, 50, 48, 50, 51, 44, 32, 54, 58, 48, 50, 58, 51, 53, 8239, 80, 77]
Sure enough, we see the 8239 of our NNBSP is third from the end, before the P and the M.
Change is good
I would like to add a note about this change in the CLDR: this change is a good one, and makes sense. Typographically, the AM/PM of a time-of-day should never be separated from the hours-minutes-seconds. Wrapping AM/PM onto another line makes for clumsy reading, so using a non-breaking space rather than a plain breaking space makes sense. Whether that space should also be "thin" is a judgment I'll leave to the typography experts, but I gather that makes sense as well.
Solution: Fix your console
The immediate solution to your problem of a ? replacement character appearing is to 👉🏾 change the character encoding of your console app. Whatever console app you are using (which you neglected to mention in your Question) is apparently configured for some archaic character encoding rather than a modern Unicode-savvy character encoding such as UTF-8.
Change the character encoding of your console app (see Comment). Then your errant ? should appear as the true character, a thin non-breaking space.
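To illustrate where such a ? can come from, here is a minimal sketch. It assumes the cause is a charset that cannot represent U+202F; I cannot confirm that for your particular console. The class name `ConsoleEncodingDemo` and the method `roundTrip` are invented for this example. Java's `String.getBytes(Charset)` substitutes the charset's replacement byte, a question mark in the case of US-ASCII, for any character the charset cannot encode.

```java
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, named only for this sketch.
class ConsoleEncodingDemo {

    // Encodes the text to bytes in the given charset, then decodes those bytes again.
    // This mimics text passing through an output channel configured for that charset.
    static String roundTrip ( String text , Charset charset ) {
        byte[] bytes = text.getBytes ( charset );
        return new String ( bytes , charset );
    }

    public static void main ( String[] args ) {
        String time = "6:02:35\u202FPM";

        // US-ASCII has no NNBSP, so the character is replaced with a question mark.
        System.out.println ( roundTrip ( time , StandardCharsets.US_ASCII ) ); // 6:02:35?PM

        // UTF-8 can represent every Unicode character, so the text survives intact.
        System.out.println ( roundTrip ( time , StandardCharsets.UTF_8 ).equals ( time ) ); // true

        // Writing UTF-8 bytes explicitly (constructor available in Java 10 and later).
        // Whether this displays correctly still depends on the console decoding UTF-8.
        PrintStream utf8Out = new PrintStream ( System.out , true , StandardCharsets.UTF_8 );
        utf8Out.println ( time );
    }
}
```

Emitting UTF-8 from Java helps only if the console on the receiving end is also set to UTF-8, which is why fixing the console configuration is the real solution.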
Avoid legacy date-time classes
You are using terribly flawed date-time classes that were supplanted years ago by the modern java.time classes defined in JSR 310. Avoid the legacy date-time classes; use java.time for all date-time work.
Your choice of legacy classes is not a factor in the particular issue of your Question. But just FYI, let me show you the modern version of your code.
An Instant object represents a moment as seen in UTC, that is, with an offset from UTC of zero hours-minutes-seconds. You can adjust that moment into a time zone, obtaining a ZonedDateTime. Same point on the timeline, but different wall-clock time/calendar.
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

Instant instant = Instant.now ( ); // `java.util.Date` was years ago replaced by `java.time.Instant`.
ZoneId z = ZoneId.of ( "Asia/Tokyo" ); // Or, `ZoneId.systemDefault()`.
ZonedDateTime zdt = instant.atZone ( z );
Locale locale = Locale.US;
DateTimeFormatter f = DateTimeFormatter.ofLocalizedDateTime ( FormatStyle.MEDIUM ).withLocale ( locale );
String output = zdt.format ( f );
System.out.println ( "output = " + output );
System.out.println ( output.codePoints ( ).boxed ( ).toList ( ).toString ( ) );
When run:
output = Oct 4, 2023, 10:21:32 AM
[79, 99, 116, 32, 52, 44, 32, 50, 48, 50, 51, 44, 32, 49, 48, 58, 50, 49, 58, 51, 50, 8239, 65, 77]
We see the same 8239 before the A and the M.
We can examine the characters by their official Unicode names.
output.codePoints ( ).mapToObj ( Character :: getName ).forEach ( System.out :: println );
When run (a separate run, so the date and time digits differ from the output above):
LATIN CAPITAL LETTER O
LATIN SMALL LETTER C
LATIN SMALL LETTER T
SPACE
DIGIT FIVE
COMMA
SPACE
DIGIT TWO
DIGIT ZERO
DIGIT TWO
DIGIT THREE
COMMA
SPACE
DIGIT ONE
DIGIT ZERO
COLON
DIGIT ZERO
DIGIT TWO
COLON
DIGIT TWO
DIGIT SIX
NARROW NO-BREAK SPACE
LATIN CAPITAL LETTER A
LATIN CAPITAL LETTER M
Notice the NARROW NO-BREAK SPACE, third from last.
And we can examine the characters by their code point in hexadecimal rather than decimal.
output.codePoints ( ).mapToObj ( ( int codePoint ) -> String.format ( "U+%04X" , codePoint ) ).forEach ( System.out :: println );
When run (again a separate run, so the digits differ):
U+004F
U+0063
U+0074
U+0020
U+0035
U+002C
U+0020
U+0032
U+0030
U+0032
U+0033
U+002C
U+0020
U+0031
U+0030
U+003A
U+0030
U+0035
U+003A
U+0031
U+0037
U+202F
U+0041
U+004D
Notice the U+202F, third from last.
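Scanning a long list by eye gets tedious. As a convenience, here is a minimal sketch that reports only the code points outside the ASCII range, each with its position, hex value, and Unicode name. The class name `NonAsciiReport` and the method `describe` are invented for this example, and the reported index is a `char` index into the string, not a count of code points.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named only for this sketch.
class NonAsciiReport {

    // Lists every code point above U+007F with its char index, hex value, and Unicode name.
    static List < String > describe ( String text ) {
        List < String > findings = new ArrayList <> ( );
        for ( int i = 0 ; i < text.length ( ) ; ) {
            int codePoint = text.codePointAt ( i );
            if ( codePoint > 0x7F ) {
                findings.add ( String.format ( "index %d: U+%04X %s" , i , codePoint , Character.getName ( codePoint ) ) );
            }
            // Advance by one or two chars, depending on whether this is a supplementary character.
            i += Character.charCount ( codePoint );
        }
        return findings;
    }

    public static void main ( String[] args ) {
        // Sample text built by hand here; in practice pass in your formatted date-time string.
        describe ( "Oct 4, 2023, 10:21:32\u202FAM" ).forEach ( System.out :: println );
    }
}
```

For the sample text, the only finding is the NNBSP just before the AM.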
For Unicode geeks
This topic turns out to be an interesting can of worms for Unicode geeks like me.
Section 1 of the Unicode Consortium document, Proposal to synchronize the Core Specification, explains that character U+202F NARROW NO-BREAK SPACE (NNBSP) has been incorrectly described as a narrow version of U+00A0 NO-BREAK SPACE. This means the Width variation section of the Non-breaking space page on Wikipedia is incorrect. That Unicode document argues that NNBSP is actually a non-breaking version of U+2009 THIN SPACE.
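Java cannot tell us anything about glyph width, so the following sketch does not settle that typographic question. What it does show is how Java classifies these space characters: `Character.isSpaceChar` reports all of them as Unicode space separators, while `Character.isWhitespace` is documented to exclude the non-breaking spaces. The class name `SpaceProperties` and the method `describe` are invented for this example.

```java
// Hypothetical demo class, named only for this sketch.
class SpaceProperties {

    // Summarizes the Unicode name of a code point and how Java classifies it.
    static String describe ( int codePoint ) {
        return String.format (
                "U+%04X %s isSpaceChar=%b isWhitespace=%b" ,
                codePoint ,
                Character.getName ( codePoint ) ,
                Character.isSpaceChar ( codePoint ) ,
                Character.isWhitespace ( codePoint )
        );
    }

    public static void main ( String[] args ) {
        int[] spaces = { 0x0020 , 0x00A0 , 0x2009 , 0x202F };
        for ( int codePoint : spaces ) {
            System.out.println ( describe ( codePoint ) );
        }
        // Prints:
        // U+0020 SPACE isSpaceChar=true isWhitespace=true
        // U+00A0 NO-BREAK SPACE isSpaceChar=true isWhitespace=false
        // U+2009 THIN SPACE isSpaceChar=true isWhitespace=true
        // U+202F NARROW NO-BREAK SPACE isSpaceChar=true isWhitespace=false
    }
}
```

One practical consequence: any code that relies on `Character.isWhitespace` will not treat the NNBSP as whitespace, which is another reason to test rather than assume.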
Another interesting note in that document is that the NNBSP character has largely served two purposes. I quote (my bullets):
- The NNBSP can be used to represent the narrow space occurring around punctuation characters in French typography, which is called an “espace fine insécable.”
- It is used especially in Mongolian text, before certain grammatical suffixes, to provide a small gap that not only prevents word breaking and line breaking, but also triggers special shaping for those suffixes.
Apparently we can now add a third major use to this list: separating the parts of date-time formats defined by the CLDR.
java.time.* APIs over the older Date/Calendar APIs – Meteoric
? character appear, what tool or app? Is this on a console from the System.out.println? If so, where is that code running, such as Terminal.app on a Mac, or in an IDE like IntelliJ? – Kronstadt