f1 uses the ö character; f2 uses an o plus the diacritic ¨ as a separate combining character.
f1 is in Normal Form C (composed) and f2 is in Normal Form D (decomposed). In general, Normal Form C is the most common on Windows and the web, with the Unicode FAQ describing it as “the best form for general text”. Unfortunately the Apple world plumped for Normal Form D in order to be gratuitously different.
The strings are canonically equivalent by the rules of Unicode equivalence.
What comparison can I do that will show these two strings to be "equal"?
In general, you convert both strings to one Normal Form of your choosing and then compare them. For example in Python:
>>> import unicodedata
>>> a = u'\u00F6'    # ö composed (NFC)
>>> b = u'o\u0308'   # o followed by combining umlaut (NFD)
>>> unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)
True
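If you make this comparison in more than one place it's worth wrapping it up as a tiny helper (a sketch; the canonical_equal name is ours, and NFD works just as well provided both sides use the same form):

import unicodedata

def canonical_equal(s1, s2, form='NFC'):
    # Bring both strings to the same normal form before comparing; the
    # particular form doesn't matter as long as both sides use the same one.
    return unicodedata.normalize(form, s1) == unicodedata.normalize(form, s2)

print(canonical_equal(u'\u00F6', u'o\u0308'))  # True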
Similarly, Java has the Normalizer class, .NET has String.Normalize, and many languages have bindings available for the ICU library, which also offers this feature.
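For instance, from Python the ICU route looks roughly like this (a sketch assuming the PyICU bindings are installed, e.g. via pip install PyICU):

from icu import Normalizer2

# ICU's Normalizer2 exposes a singleton instance per normal form.
nfc = Normalizer2.getNFCInstance()
print(nfc.normalize(u'o\u0308') == nfc.normalize(u'\u00F6'))  # True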
Unfortunately, JavaScript had no native Unicode normalisation ability until ES6 added String.prototype.normalize(). On older engines this means either:
doing it yourself, carting around large Unicode data tables to cover it all in JavaScript (see eg here for an example implementation); or
sending it back to the server-side (eg via XMLHttpRequest), where you've got a better-equipped language to do it (see the sketch below).
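As a sketch of the second option, assuming a Python back end using Flask (the /normalize route is hypothetical):

import unicodedata
from flask import Flask, request

app = Flask(__name__)

@app.route('/normalize', methods=['POST'])
def normalize():
    # Normalise whatever the browser POSTed to NFC and send it back,
    # so the client ends up comparing like with like.
    text = request.get_data(as_text=True)
    return unicodedata.normalize('NFC', text)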
The page declares <meta charset="utf-8"> and the form (a file input is the source of the first string) declares accept-charset="UTF-8". And, of course, the HTTP request and response are also UTF-8. I think this is just a case of different systems (browser vs. server) using different Unicode canonicalization. (Or using versus not using canonicalization.) – Garland