What does 'case insensitive' mean in RFC 3986 with respect to non-English characters?

RFC 3986 specifies that the host component of a URI is 'case insensitive'. However, it doesn't specify what 'case insensitive' means in terms of UCS or UTF-8 characters.

Examples given in the RFC (e.g. "<HTTP://www.EXAMPLE.com/> is equivalent to <http://www.example.com/>") allow us to infer that 'case insensitive' means at least that the characters A-Z are considered equivalent to the characters 32 code points ahead of them in ASCII (and therefore UTF-8), i.e. a-z. However, no mention is made of how characters outside this range should be treated. So, given a non-encoded, non-normalised registered name of www.OLÉ.com, I see three potential forms of normalisation permitted by the RFC:

  1. Lower case to www.olé.com then percent encode to www.ol%E9.com
  2. Lower case only A-Z characters to www.olÉ.com and then percent encode to www.ol%C9.com
  3. Percent encode to www.OL%C9.com, and then lower case the non-percent encoded parts to www.ol%C9.com, producing the same result as 2.
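
To make the three options concrete, here is a minimal sketch in Python. It assumes UTF-8 percent-encoding, which is what `urllib.parse.quote` produces by default, so É shows up as `%C3%89` and é as `%C3%A9` rather than the single-byte `%C9` and `%E9` used in the list above. The helper names `lower_ascii_only` and `lower_outside_escapes` are made up for this illustration.

```python
import re
from urllib.parse import quote

HOST = "www.OLÉ.com"

def lower_ascii_only(text):
    """Lower-case A-Z only, leaving every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)

def lower_outside_escapes(text):
    """Lower-case ASCII letters, but leave %XX triplets exactly as they are."""
    return re.sub(
        r"%[0-9A-Fa-f]{2}|[A-Z]",
        lambda m: m.group() if m.group().startswith("%") else m.group().lower(),
        text,
    )

option_1 = quote(HOST.lower())                 # Unicode-aware lower-casing, then encode
option_2 = quote(lower_ascii_only(HOST))       # ASCII-only lower-casing, then encode
option_3 = lower_outside_escapes(quote(HOST))  # encode, then lower-case the rest

print(option_1)  # www.ol%C3%A9.com
print(option_2)  # www.ol%C3%89.com
print(option_3)  # www.ol%C3%89.com
```

Options 2 and 3 agree with each other and differ from option 1 only in the escaped bytes, which is exactly the ambiguity being asked about.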

So the question is: Which is correct? If it's case 1., what defines which characters are considered upper case, and which are considered lower case (and which characters don't have a case)?

Gwenn answered 15/10, 2011 at 20:14 Comment(7)
Why are you percent-encoding? That is not a valid domain name (encoded or not encoded). Perhaps there is something in the stuff relating to punycode? (E.g. does punycode do case-normalization?) - Gulden
The RFC explicitly specifies that percent encoding is valid, and that domain names registered in DNS are not the only kind of registered name that can be used. - Gwenn
The RFC: "When a non-ASCII registered name represents an internationalized domain name intended for resolution via the DNS, the name must be transformed to the IDNA encoding [RFC3490] prior to name lookup. URI producers should provide these registered names in the IDNA encoding, rather than a percent-encoding, if they wish to maximize interoperability with legacy URI resolvers." - Byrom
RFC 3490 builds on top of NAMEPREP (RFC 3491) and PUNYCODE (RFC 3492), and NAMEPREP takes you to STRINGPREP (RFC 3454). And RFC 3454 section 3.2 "Case folding" gives you the answer on what "case insensitive" means in the IDN (Internationalized Domain Names) context. - Byrom
@MihaiNita: I think your comments would make a good answer. - Resupinate
Agreed: @MihaiNita, if you make this an answer, I'll accept it. - Gwenn
@MihaiNita RFC 3490 (IDNA2003) has been superseded by RFC 5890 (IDNA2008). The latter does away with the NAMEPREP stage and simply disallows all uppercase characters. RFC 5895 suggests that applications should use the standard Unicode case mapping algorithm to convert IDNs to lowercase. - Epitome

DNS compares ASCII hostnames case-insensitively, so hostnames are conventionally lowercased before they are resolved.

It is not possible to have non-ASCII (e.g. UTF-8 encoded) characters in DNS hostnames (RFC 1123). However, a workaround has been put in place with "internationalized domain names", which rely on an encoding commonly known as Punycode.

Punycode enables non-ASCII characters to be represented by ASCII characters:

"non-ASCII characters are represented by ASCII characters that are allowed in host name labels (letters, digits, and hyphens)."

-- https://www.ietf.org/rfc/rfc3492.txt
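
As a small illustration, Python ships a `punycode` codec in its standard library. The sketch below assumes that codec is an acceptable stand-in for the RFC 3492 encoding step on its own; it performs no case mapping and does not add the `xn--` prefix, which is prepended by hand here.

```python
label = "olé"

# The built-in "punycode" codec covers only the RFC 3492 encoding step.
encoded = label.encode("punycode")
print(encoded)  # b'ol-cja'

# IDNA marks an encoded label by prepending the ACE prefix "xn--".
ace_label = "xn--" + encoded.decode("ascii")
print(ace_label)  # xn--ol-cja

# The transformation is reversible.
decoded = encoded.decode("punycode")
print(decoded)  # olé
```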

As for the example that you have provided in your question (www.olé.com), the domain name that would be resolved is not www.ol%E9.com.

If you are getting percent signs in your domain name, it means that you have URL-encoded the hostname, which is not correct, at least not for resolving.

For example, an a tag that looks like this will work correctly:

<a href="//www.ol%C3%A9.com">Click Here</a>

However, DNS will not resolve www.ol%C3%A9.com itself, but rather the domain name after conversion to punycode:

Example

www.ol%C3%A9.com

becomes

www.olé.com

which in punycode translates to:

www.xn--ol-cja.com
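
The same chain of conversions can be sketched in Python. This assumes the percent-encoded bytes are UTF-8 and uses the built-in `idna` codec, which implements the older IDNA2003 rules (RFC 3490) rather than IDNA2008, so treat it as an approximation of what a browser does rather than a reference.

```python
from urllib.parse import unquote

escaped_host = "www.ol%C3%A9.com"

# Undo the percent-encoding; unquote() interprets the bytes as UTF-8 by default.
unicode_host = unquote(escaped_host)
print(unicode_host)  # www.olé.com

# Convert each label to its ASCII-compatible form for the DNS lookup.
dns_host = unicode_host.encode("idna").decode("ascii")
print(dns_host)  # www.xn--ol-cja.com
```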

Web browsers will generally convert uppercase characters to the lowercase version. For example, both www.olé.com and www.olÉ.com translate to the same DNS hostname (www.xn--ol-cja.com), because www.olÉ.com was lowercased to www.olé.com.
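
A rough sketch of that behaviour in Python, assuming a simple `str.lower()` as the lowercasing step (real browsers apply more elaborate mapping rules) and a hypothetical helper named `to_dns_host`. The raw `punycode` codec is included to show that the two spellings only coincide because of the lowercasing.

```python
def to_dns_host(host):
    """Lower-case the host, then convert it with the built-in IDNA2003 codec."""
    return host.lower().encode("idna").decode("ascii")

for host in ("www.olé.com", "www.olÉ.com"):
    print(host, "->", to_dns_host(host))

# Without any lowercasing, the raw Punycode of the two labels differs.
raw_lower = "olé".encode("punycode")
raw_upper = "olÉ".encode("punycode")
print(raw_lower)  # b'ol-cja'
print(raw_upper)  # b'ol-lga'
```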

I recommend two tools to check IDN domain names to see what a domain name looks like once it goes through the punycode translation:

Verisign's IDN tool is much stricter. Try both tools with www.olÉ.com as the input to see what I mean.

The rules for IDNA (Internationalized Domain Names for Applications) are complicated, but there are two main RFCs that are worth a look:

  • Internationalized Domain Names for Applications (IDNA): Background, Explanation, and Rationale
    https://www.rfc-editor.org/rfc/rfc5894
  • The Unicode Code Points and Internationalized Domain Names for Applications
    https://www.rfc-editor.org/rfc/rfc5892

RFC 5894 section 3.1.3 specifies that a character may be disallowed if:

  • The character is an uppercase form or some other form that is mapped to another character by Unicode case folding.
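
To get a feel for that one criterion, the sketch below flags characters that Python's `str.casefold()` would change. This is only an approximation of the bullet above, not the full RFC 5892 derivation: the real tables carry exceptions (for example ß, U+00DF, changes under case folding yet is permitted by IDNA2008). The helper name `case_mapped_chars` is invented for this example.

```python
def case_mapped_chars(label):
    """Return the characters in label that Unicode case folding would change."""
    return [ch for ch in label if ch.casefold() != ch]

print(case_mapped_chars("olÉ"))  # ['É']
print(case_mapped_chars("olé"))  # []
```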
Breezy answered 30/10, 2015 at 3:5 Comment(4)
www.olé.com and www.olÉ.com have different representations in Punycode. But the user agent (browser) typically converts the hostname to lowercase before converting it. - Epitome
www.olÉ.com does not have a representation in punycode. - Breezy
But it does. The Punycode representation is www.xn--ol-lga.com. It's impossible to register this domain, though, because IDNA only allows lowercase names. Punycode can encode arbitrary integer sequences, but most online converters will do some preprocessing on Unicode strings. Try this converter that doesn't preprocess the input. - Epitome
Sorry, I meant to say, www.olÉ.com does not have a valid IDNA representation. - Breezy
