Why is /🌈 an invalid path when /a🌈 is valid?

I’m trying to understand why certain HTML attributes are failing W3C validation. I encountered this in a real codebase, but here’s a minimal reproduction:

<!DOCTYPE html><html lang="en"><head><title>a</title></head><body>

<img alt="1" src="⭐">
<img alt="2" src="/⭐">
<img alt="3" src="/a⭐">
<img alt="4" src="/a/⭐">
<img alt="5" src="🌈">
<img alt="6" src="/🌈"> <!-- Only this is invalid. -->
<img alt="7" src="/a🌈">
<img alt="8" src="/a/🌈">

</body></html>

The W3C validator reports only one error, affecting the sixth image:

  1. Error: Bad value /🌈 for attribute src on element img: Illegal character in path segment: ? is not allowed.

    <img alt="6" src="/🌈">
    

Why is only that one a problem, and not the others? What’s different about it?

Sneakbox answered 25/9, 2022 at 15:57 Comment(9)
The difference between ⭐ and 🌈 is that the latter consists of two surrogate code units in UTF-16. Why that is a problem, and only at the beginning of a path segment, I have no idea.Hermie
You may want to submit this as a possible bug via their GitHub: github.com/validator/validator/issuesPenitence
Please be aware that, despite being hosted on w3.org, the validator is a somewhat unofficial project that is supported by one person IIRC, and so they may be slow to implement validation support for new things, especially if they are esoteric and not practically used in the wild (like using emoji in file paths). The W3C working groups certainly don't wait for the validator to be updated before they publish new versions of their specifications.Protostele
By the way, when I become aware of a bug like this one that’s in a part of the code that intends to fully conform to the relevant specs but doesn’t, I pretty much stop whatever else I’m doing and work on it until it’s fixed — which is usually within a few hours of when I first find out about the bug. I only very accidentally came across this SO question a few hours ago. So if/when you ever come across some other problem in the checker that you think might be a bug, please really do raise an issue at github.com/validator/validator/issues, as others here have suggested.Draggle
Also, if/when you do ever post other questions here on SO about the behavior of the checker, please tag them with the w3c-validation tag. I watch that tag — and any time somebody posts there, I get notified within 15 minutes (and there are currently 77 other people watching that tag too). And if it’s about URL validity and you’re unsure what the expected behavior should actually be, the WHATWG Matrix room at matrix.to/#/#whatwg:matrix.org is a good place to ask. And for URL questions here, url-parsing is a helpful tag to use.Draggle
@sideshowbarker The w3c-validation tag had been on this question already; I'm not sure why @Protostele removed it, probably because he thinks it's not an official w3c projectHermie
Thank you! I’d planned to leave this question up for a week before investigating any further. One thing I still don’t understand: Why was /⭐ apparently not affected by the bug? From the fix I’d expect most multi-byte characters to have had the same problem.Leisure
@Hermie I removed it because it's a bad tag. If you want to use a tag to indicate a question is about validation, just use validation.Protostele
@Protostele There's nothing bad about the tag, it's clearly useful. And no, the question is not about some arbitrary validation, but specifically about the w3c validation services that check whether something adheres to the w3c standards. Please take this discussion to meta before removing the tag again.Hermie

The behavior described in the question was caused by a bug in the checker (validator) code that’s fixed now; see https://github.com/validator/galimatias/pull/2. The bug had gone unnoticed because the test suite had no coverage for the case of a relative URL that starts with a slash followed by a code point greater than U+FFFF, like the U+1F308 🌈 (rainbow) character in the question. So the test suite was also updated to add coverage for that case; see https://github.com/web-platform-tests/wpt/pull/36213.


Incidentally, the reason the U+2B50 (⭐) case wasn’t affected by the bug while the U+1F308 (🌈) case was: Java strings use UTF-16, and U+1F308 is a so-called supplementary character (that is, a code point above U+FFFF). So, as noted in a comment above, in UTF-16 the code point U+1F308 is represented by a surrogate pair of two char values, while U+2B50 is represented by a single char value.
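To see that difference concretely, here is a small standalone sketch. It is not taken from the checker; the class and method names (Utf16Units, utf16Units) are made up for illustration, and it only uses standard java.lang methods.

```java
class Utf16Units {
    // Number of UTF-16 char values Java needs to encode one code point.
    static int utf16Units(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        int star = 0x2B50;     // ⭐, inside the Basic Multilingual Plane
        int rainbow = 0x1F308; // 🌈, a supplementary character

        System.out.println(utf16Units(star));    // 1
        System.out.println(utf16Units(rainbow)); // 2

        // The path from the question, "/🌈", built from the code point.
        String path = "/" + new String(Character.toChars(rainbow));
        System.out.println(path.length());                             // 3 char values
        System.out.println(path.codePointCount(0, path.length()));     // 2 code points
        System.out.println(Character.isHighSurrogate(path.charAt(1))); // true
        System.out.println(Character.isLowSurrogate(path.charAt(2)));  // true
    }
}
```

So "/🌈" is three char values long even though it holds only two code points, which is what makes index arithmetic on it easy to get wrong.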

And the reason the number of char values affects how the URL is parsed is that the state machine in the HTML checker’s URL-parsing code maintains a character index and decrements it during state changes. So, if it’s handling a URL segment that can contain code points above U+FFFF, it must be careful about how far it decrements the index: by 2 for code points above U+FFFF, and by 1 otherwise.

And to do that, the code has a decrIdx() method that calls Character.charCount():

Determines the number of char values needed to represent the specified character (Unicode code point). If the specified character is equal to or greater than 0x10000, then the method returns 2. Otherwise, the method returns 1.

So the code change that got made to the checker replaced a simple idx-- decrementing of the index value with a smarter Character.charCount()-enabled decrIdx() call.
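The effect of that change can be illustrated with a rough sketch. This is not the checker’s actual code: the class CursorSketch and both helper methods are hypothetical, and the use of codePointBefore() to find the preceding code point is my assumption about one reasonable way to do it.

```java
class CursorSketch {
    // Hypothetical naive step back: always moves one char value (like idx--).
    static int naiveStepBack(String input, int idx) {
        return idx - 1;
    }

    // Hypothetical code-point-aware step back: moves over the whole
    // preceding code point, using Character.charCount() to decide 1 or 2.
    static int codePointStepBack(String input, int idx) {
        return idx - Character.charCount(input.codePointBefore(idx));
    }

    public static void main(String[] args) {
        String input = "/\uD83C\uDF08"; // "/🌈" written as its surrogate pair
        int idx = input.length();       // 3, just past the rainbow

        int naive = naiveStepBack(input, idx);     // 2: lands inside the pair
        int aware = codePointStepBack(input, idx); // 1: lands at the pair's start

        // Reading from the naive position yields a lone low surrogate.
        System.out.println(Integer.toHexString(input.codePointAt(naive))); // df08
        // Reading from the aware position yields the full code point.
        System.out.println(Integer.toHexString(input.codePointAt(aware))); // 1f308
    }
}
```

For "/⭐" both helpers return the same index, since U+2B50 occupies a single char value, which is consistent with that case never having triggered the bug.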

Draggle answered 3/10, 2022 at 1:47 Comment(1)
“Java uses UTF-16” This is the missing piece for me. I’d read through the state machine and couldn’t figure out how it would work in UTF-8.Leisure
