JavaScript string comparison fails when comparing Unicode characters
N

5

18

I want to compare two strings in JavaScript that look the same, and yet the equality operator == returns false. One string contains a special character (e.g. the Danish å).

JavaScript code:

var filenameFromJS = "Designhåndbog.pdf";
var filenameFromServer = "Designhåndbog.pdf";

print(filenameFromJS == filenameFromServer); // This prints false. Why?

The solution: what worked for me is Unicode normalization, as slevithan pointed out.

I forked my original jsfiddle to make a version using the normalization lib suggested by slevithan. Link: http://jsfiddle.net/GWZ8j/1/.

Nomanomad answered 29/5, 2012 at 19:50 Comment(3)
See this article about == vs. === #359994Sizemore
@Sizemore When both operands are of the same type, it does not matter if you use loose or strict comparison.Spartacus
This is also very useful: joelonsoftware.com/2003/10/08/… (What every developer needs to know about unicode and character sets)Tomasz
N
15

Unlike what some other people here have said, this has nothing to do with encodings. Rather, your two strings use different code points to render the same visual characters.

To solve this correctly, you need to perform Unicode normalization on the two strings before comparing them. Unfortunately, JavaScript didn't have this functionality built in when this answer was written (String.prototype.normalize() was added later, in ES2015). Here is a JavaScript library that can perform the normalization for you: https://github.com/walling/unorm
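
As a later comment notes, current engines ship String.prototype.normalize(). The following is a minimal sketch of the comparison, assuming an ES2015+ runtime. The helper name equalsNormalized and the escaped sample strings are illustrative and not taken from the question:

```javascript
// Hypothetical helper: compare two strings after normalizing both to NFC.
function equalsNormalized(a, b) {
    return a.normalize("NFC") === b.normalize("NFC");
}

// The same visible file name, written with different code points.
const composed = "Designh\u00E5ndbog.pdf";    // å as the single code point U+00E5
const decomposed = "Designha\u030Andbog.pdf"; // a followed by U+030A COMBINING RING ABOVE

console.log(composed === decomposed);                // false
console.log(equalsNormalized(composed, decomposed)); // true
```

Writing the strings with \u escapes keeps the two spellings distinct even if an editor or clipboard silently normalizes pasted text.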

Nomenclator answered 29/5, 2012 at 20:3 Comment(4)
Oh, I was hoping not to get this answer :-) That I was just missing the obvious and wouldn't need a library for this simple task. Thanks for the answer I'll give it a try.Nomanomad
You are right, I have missed that CC 8A is the UTF-8 code sequence for U+030A COMBINING RING ABOVE, which is preceded by a. The other string has C3 A5, which encodes U+00E5 LATIN SMALL LETTER A WITH RING ABOVE in UTF-8. IIRC, Mac OS prefers the combining characters, while other OSes prefer the single-glyph form. It should be possible to have the server convert either one, though, so there is no large client-side library necessary.Spartacus
PointedEars, that's not necessarily possible or ideal. E.g., you might not want to do a server round trip just to perform a string comparison, or you might be using JavaScript on the server. @Tougher, there is a proposal to add Unicode normalization to future versions of JavaScript. See strawman:unicode_normalization.Nomenclator
There is now a String#normalize() method natively available in JS.Therapeutic
W
6

The JavaScript equality operator == will appear to fail under the following circumstances. In all cases it is programmer error, not a bug in JavaScript.

  1. The two strings do not contain the same number and sequence of characters.

  2. There is whitespace or a newline before, within or after one string. Call trim() on both and look closely at both strings.

  3. Surprise typecasting. The programmer is comparing datatypes that are incompatible.

  4. There are Unicode characters which look identical to other Unicode characters but are in fact different characters or different code point sequences.
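
One way to check for cases 1, 2 and 4 is to print the code points of each string instead of trusting how they render. This is a rough sketch; the helper name codePoints is made up for illustration:

```javascript
// Hypothetical helper: list the code points of a string as U+XXXX labels.
function codePoints(s) {
    // Array.from iterates a string by code point, not by UTF-16 code unit.
    return Array.from(s, (ch) =>
        "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
    ).join(" ");
}

console.log(codePoints("\u00E5"));  // U+00E5
console.log(codePoints("a\u030A")); // U+0061 U+030A
console.log(codePoints("a "));      // U+0061 U+0020 (trailing space made visible)
```

If the two dumps differ, the strings are not equal as far as == is concerned, whatever they look like on screen.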

Whitefly answered 29/10, 2013 at 3:17 Comment(3)
+1, because this answer is way more informative than the accepted one and doesn't contain something with nodeJS or jQuery.Mulford
in this case number 4 was the culpritFurlong
Different unicode normalisation is not about different characters, but means different unicode code point sequences were used to refer to the same character.Amora
N
1

Unicode is a complex thing. It has two different ways to represent characters such as á, é, etc.

Certain Unicode characters can be represented in a composed and a decomposed form. For example, the German umlaut-u ü can be represented either by the single character ü (U+00FC) or by u followed by a combining diaeresis (U+0308), which a text renderer then combines.

(The Wikipedia article on Unicode equivalence has more details.)

As you can see in the URL-encoded version of each string, the hex bytes that make up the character differ between the two versions.
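
A small sketch of that difference, using escaped sample characters that are assumed here rather than taken from the question's actual strings:

```javascript
// encodeURIComponent percent-encodes the UTF-8 bytes of each non-ASCII character.
const composedRing = "\u00E5";    // å as one code point
const decomposedRing = "a\u030A"; // a + COMBINING RING ABOVE

console.log(encodeURIComponent(composedRing));   // %C3%A5
console.log(encodeURIComponent(decomposedRing)); // a%CC%8A
```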

In JavaScript, you can use String.prototype.normalize() to get a normalized form of a string.

For example:

var normalizedFilenameFromJS = "Designhåndbog.pdf".normalize();
var normalizedFilenameFromServer = "Designhåndbog.pdf".normalize();

console.log(normalizedFilenameFromJS === normalizedFilenameFromServer); // This prints true

.normalize() can be called with a parameter to specify the normalization form; see the linked Mozilla Developer article for available options.
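
As a hedged illustration of that parameter, the sketch below uses the four form names defined by the Unicode standard ("NFC", "NFD", "NFKC", "NFKD"); the sample strings are chosen for illustration only:

```javascript
const decomposedU = "u\u0308"; // u + U+0308 COMBINING DIAERESIS

console.log(decomposedU.length);                  // 2
console.log(decomposedU.normalize("NFC").length); // 1, composed into U+00FC
console.log("\u00FC".normalize("NFD").length);    // 2, decomposed again

// Calling normalize() without an argument uses "NFC".
console.log(decomposedU.normalize() === "\u00FC"); // true

// The compatibility forms also fold characters such as the "fi" ligature (U+FB01).
console.log("\uFB01".normalize("NFC") === "\uFB01"); // true, ligature kept
console.log("\uFB01".normalize("NFKC") === "fi");    // true, ligature expanded
```

For plain equality checks on file names, NFC on both sides is usually enough; the compatibility forms change more than just composition, so they are better reserved for search or matching where that folding is wanted.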

Nobe answered 29/5, 2012 at 19:54 Comment(5)
JFTR: Unicode is not UTF-8. Unicode is a standard for a character set and several encodings; UTF-8 is one of those encodings.Spartacus
Now you are saying that UTF-8 was a character set. It is not. I am also rather certain that your premise is false: a UTF-8 code sequence may not begin with 0xCC.Spartacus
You're right, I should have called it "encoding", as it appears (w3.org/TR/html4/charset.html). The HTML code is <meta charset=UTF-8> (HTML5) or <meta http-equiv=Content-Type content='text/html; charset=UTF-8'> however, so that's somewhat misleading.Nobe
Yes, I guess we will have to live with that mistake from the early Internet drafts (I'm talking RFC 822 and friends here) for a long time to come.Spartacus
I was wrong about 0xCC. Richard Ishida's excellent Unicode tools proved it.Spartacus
R
0

I had this same problem.

Adding

<meta charset="UTF-8">

to the HTML file fixed the issue.

In my case the templating engine was baking a JSON string into the HTML file. This string was UTF-8 encoded.

While the template was also a UTF-8 file, the browser was treating the string I wrote into the template as a Latin-1 encoded string until I added the meta tag.

I was comparing the typed-in string to one of the JSON object's items (location.title == "Mühle").
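
The effect can be reproduced outside the browser. This is only a sketch of the mechanism, assuming a runtime with TextEncoder and TextDecoder (modern browsers, Node.js); it does not reproduce the templating setup described above:

```javascript
const original = "M\u00FChle"; // "Mühle"

// The UTF-8 bytes of the string: 4D C3 BC 68 6C 65
const utf8Bytes = new TextEncoder().encode(original);

// Misreading those bytes as Latin-1 maps every byte to one character.
const misread = String.fromCharCode(...utf8Bytes);
console.log(misread);              // MÃ¼hle
console.log(misread === original); // false

// Decoding the same bytes as UTF-8 restores the string.
const reread = new TextDecoder("utf-8").decode(utf8Bytes);
console.log(reread === original);  // true
```

Note that this is an encoding mismatch, which is a different problem from the normalization mismatch in the original question, even though both make == return false.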

Roofer answered 6/8, 2017 at 21:12 Comment(0)
A
0

Let the browser normalize Unicode for you. This approach worked for me:

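// Requires jQuery. Note that .html(s) parses s as HTML, so only pass trusted strings.
function normalizeUnicode(s) {
    // Round-trip the string through a hidden DOM element.
    const div = $('<div style="display: none"></div>').html(s).appendTo('body');
    const res = div.html();
    div.remove();
    return res;
}

normalizeUnicode(unicodeVal1) == normalizeUnicode(unicodeVal2);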
Ajax answered 11/8, 2021 at 14:53 Comment(0)
