The behavior described in the question was caused by a bug in the checker (validator) code, which is now fixed; see https://github.com/validator/galimatias/pull/2. The bug had gone unnoticed because the test suite had no coverage for the case of a relative URL that starts with a slash followed by a code point greater than U+FFFF, such as the U+1F308 🌈 (rainbow) character in the question. So the test suite was also updated to add coverage for that case; see https://github.com/web-platform-tests/wpt/pull/36213.
Incidentally, the reason the U+2B50 (⭐) case wasn’t affected by the bug while the U+1F308 (🌈) case was: Java uses UTF-16, and U+1F308 is in the range of so-called supplementary characters (that is, the set of code points above U+FFFF). So, as noted in a comment above, in UTF-16 the code point U+1F308 is represented by a surrogate pair of two `char` values, while U+2B50 is represented by a single `char` value.
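As a small illustration of that difference, here is a minimal sketch using only standard `java.lang` methods. The class name `SurrogateDemo` and the helper `charUnits` are made up for this example and are not part of the checker.

```java
// Illustrative only: SurrogateDemo and charUnits are hypothetical names.
class SurrogateDemo {
    // Number of UTF-16 char values needed to encode the given code point.
    static int charUnits(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int star = 0x2B50;     // ⭐, inside the Basic Multilingual Plane
        int rainbow = 0x1F308; // 🌈, a supplementary character

        String s = new String(Character.toChars(rainbow));

        System.out.println(charUnits(star));                         // 1
        System.out.println(charUnits(rainbow));                      // 2
        System.out.println(s.length());                              // 2 (char values)
        System.out.println(s.codePointCount(0, s.length()));         // 1 (code point)
        System.out.println(Character.isHighSurrogate(s.charAt(0)));  // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));   // true
    }
}
```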
And the reason the number of `char` values affects how the URL is parsed is that the state machine in the HTML checker’s URL-parsing code maintains a character index and decrements it during state changes. So if it’s handling a URL segment that can contain code points above U+FFFF, it must be careful about how far it decrements the index: by 2 for code points above U+FFFF, and by 1 otherwise.
And to do that, the code has a `decrIdx()` method that calls `Character.charCount()`:

> Determines the number of `char` values needed to represent the specified character (Unicode code point). If the specified character is equal to or greater than 0x10000, then the method returns 2. Otherwise, the method returns 1.
So the code change that got made to the checker replaced a simple `idx--` decrementing of the index value with a smarter `Character.charCount()`-enabled `decrIdx()` call.
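To make the effect of that change concrete, here is a rough sketch of the idea, not the checker’s actual code. It assumes a parser loop that advances the index by `Character.charCount(c)` after each code point and that steps the index back when the current code point has to be re-processed in a new state. The class `ReprocessDemo`, the method `nextIndexAfterReprocess`, and the `surrogateAware` flag are all hypothetical names for this example.

```java
// Hypothetical sketch; the loop structure here is an assumption, not galimatias code.
class ReprocessDemo {
    // Simulates "re-process the current code point in the next state":
    // step the index back, then apply the loop's normal advance past c.
    // Returns the index the parser would read from next.
    static int nextIndexAfterReprocess(String input, int idx, boolean surrogateAware) {
        int c = input.codePointAt(idx);
        // surrogateAware mimics a charCount-based step back; otherwise a plain idx--.
        int back = surrogateAware ? Character.charCount(c) : 1;
        idx -= back;
        idx += Character.charCount(c); // assumed loop advance
        return idx;
    }

    public static void main(String[] args) {
        String star = "/\u2B50";
        String rainbow = "/" + new String(Character.toChars(0x1F308));

        // Index 1 is the first code point after the slash.
        System.out.println(nextIndexAfterReprocess(star, 1, false));    // 1
        System.out.println(nextIndexAfterReprocess(rainbow, 1, false)); // 2
        System.out.println(nextIndexAfterReprocess(rainbow, 1, true));  // 1
    }
}
```

Under these assumptions, the plain `idx--` variant leaves the index at 2 for the 🌈 input, which is the low surrogate in the middle of the pair, while a single-`char` code point such as ⭐ ends up back at index 1 either way.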
… `⭐` and `🌈` is that the latter consists of two surrogate code points in UTF-16. Why that is a problem, and only at the beginning of a path segment, I have no idea. – Hermie
… `/⭐` apparently not affected by the bug? From the fix I’d expect most multi-byte characters to have had the same problem. – Leisure