Javascript string comparison fails when comparing unicode characters

Asked 29/5, 2012 at 19:50 Answered 11/8, 2021 at 14:53

Solved javascript string unicode data-transfer unicode-normalization

I want to compare two strings in JavaScript that are the same, and yet the equality operator == returns false. One string contains a special character (e.g. the Danish å).

JavaScript code:

var filenameFromJS = "Designhåndbog.pdf";
var filenameFromServer = "Designhåndbog.pdf";

print(filenameFromJS == filenameFromServer); // This prints false. Why?

The solution: what worked for me is Unicode normalization, as slevithan pointed out.

I forked my original jsfiddle to make a version using the normalization lib suggested by slevithan. Link: http://jsfiddle.net/GWZ8j/1/.

Nomanomad answered 29/5, 2012 at 19:50 Comment(3)

See this article about == vs. === #359994 – Sizemore 29/5, 2012 at 19:53

@Sizemore When both operands are of the same type, it does not matter if you use loose or strict comparison. – Spartacus 29/5, 2012 at 20:1

This is also very useful: joelonsoftware.com/2003/10/08/… (What every developer needs to know about unicode and character sets) – Tomasz 29/3, 2018 at 10:48

Unlike what some other people here have said, this has nothing to do with encodings. Rather, your two strings use different code points to render the same visual characters.
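To make the difference concrete, here is a minimal sketch that writes both forms of å with explicit escapes, so the result does not depend on how this page or your editor stored the character. The variable names and the codePoints helper are illustrative assumptions, not part of the question.

```javascript
// Precomposed form: U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
const composed = "Designh\u00E5ndbog.pdf";
// Decomposed form: "a" followed by U+030A COMBINING RING ABOVE
const decomposed = "Designha\u030Andbog.pdf";

// Hypothetical helper: list the code points of a string as U+XXXX labels.
function codePoints(s) {
  return Array.from(s, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(composed === decomposed);            // false
console.log(composed.length, decomposed.length); // 17 18
console.log(codePoints("\u00E5").join(" "));     // U+00E5
console.log(codePoints("a\u030A").join(" "));    // U+0061 U+030A
```

Both strings render the same on screen, but they differ in length and in code points, which is why == and === report them as unequal.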

To solve this correctly, you need to perform Unicode normalization on the two strings before comparing them. Unfortunately, JavaScript doesn't have this functionality built in. Here is a JavaScript library that can perform the normalization for you: https://github.com/walling/unorm

Nomenclator answered 29/5, 2012 at 20:3 Comment(4)

Oh, I was hoping not to get this answer :-) That I was just missing the obvious and wouldn't need a library for this simple task. Thanks for the answer I'll give it a try. – Nomanomad 29/5, 2012 at 20:21

You are right, I missed that CC 8A is the UTF-8 code sequence for U+030A COMBINING RING ABOVE, which is preceded by a. The other string has C3 A5, which encodes U+00E5 LATIN SMALL LETTER A WITH RING ABOVE in UTF-8. IIRC, Mac OS prefers the combining characters, while other OSes prefer the single-glyph form. It should be possible to have the server convert either one, though, so no large client-side library is necessary. – Spartacus 29/5, 2012 at 21:47

PointedEars, that's not necessarily possible or ideal. E.g., you might not want to do a server round trip just to perform a string comparison, or you might be using JavaScript on the server. @Tougher, there is a proposal to add Unicode normalization to future versions of JavaScript. See strawman:unicode_normalization. – Nomenclator 30/5, 2012 at 3:56

There is now a String#normalize() method natively available in JS. – Therapeutic 20/4, 2022 at 9:14

The JavaScript equality operator == will appear to fail under the following circumstances. In all cases it is programmer error, not a bug in JavaScript.

1. The two strings do not contain the same number and sequence of characters.
2. There is whitespace or a newline before, within or after one string. Call the trim() method on both and look closely at both strings.
3. Surprise typecasting. The programmer is comparing datatypes that are incompatible.
4. There are Unicode characters which look identical to other Unicode characters but are in fact different Unicode characters.
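The checks in the list above can be sketched as a small diagnostic. explainMismatch is a hypothetical helper written for illustration; it assumes String.prototype.normalize() is available (see the comment above about native support), and trim() only catches whitespace at the ends of a string, not within it.

```javascript
// Hypothetical helper: report why two values that look alike compare unequal.
function explainMismatch(a, b) {
  if (a === b) return "equal";
  if (typeof a !== typeof b) return "different types";
  if (typeof a !== "string") return "different values";
  // Whitespace or newlines at the start or end of either string.
  if (a.trim() === b.trim()) return "leading or trailing whitespace";
  // Same text, written with different code point sequences.
  if (a.normalize("NFC") === b.normalize("NFC")) {
    return "different Unicode normalization";
  }
  return "different characters";
}

console.log(explainMismatch("abc\n", "abc"));    // leading or trailing whitespace
console.log(explainMismatch("1", 1));            // different types
console.log(explainMismatch("\u00E5", "a\u030A")); // different Unicode normalization
// Latin "a" (U+0061) vs. Cyrillic "а" (U+0430): lookalikes that stay different.
console.log(explainMismatch("a", "\u0430"));     // different characters
```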

Whitefly answered 29/10, 2013 at 3:17 Comment(3)

+1, because this answer is way more informative than the accepted one and doesn't contain something with nodeJS or jQuery. – Mulford 21/2, 2014 at 16:13

in this case number 4 was the culprit – Furlong 21/8, 2015 at 18:56

Different unicode normalisation is not about different characters, but means different unicode code point sequences were used to refer to the same character. – Amora 2/1, 2018 at 22:14

UTF-8 is a complex thing. The charset has two different ways to encode characters such as á, é etc.

Certain Unicode characters can be represented in a composed and decomposed form. For example, the German umlaut-u ü can be represented either by the single character ü or by u followed by ¨, which a text renderer would then combine.

(The Wikipedia article on Unicode equivalence has more details.)

As you can already see in the URL-encoded version, the hex bytes that make up the character differ between the two versions.
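A short sketch of that byte-level difference, using explicit escapes for the two forms of å. The utf8Hex helper is a hypothetical name introduced here; encodeURIComponent and TextEncoder are standard APIs.

```javascript
const composed = "\u00E5";    // å as a single code point
const decomposed = "a\u030A"; // a + COMBINING RING ABOVE

// Hypothetical helper: UTF-8 bytes of a string as space-separated hex.
function utf8Hex(s) {
  return Array.from(new TextEncoder().encode(s), (b) =>
    b.toString(16).toUpperCase().padStart(2, "0")
  ).join(" ");
}

console.log(encodeURIComponent(composed));   // %C3%A5
console.log(encodeURIComponent(decomposed)); // a%CC%8A
console.log(utf8Hex(composed));              // C3 A5
console.log(utf8Hex(decomposed));            // 61 CC 8A
```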

In JavaScript, you can use String.prototype.normalize() to get a normalized form of a string.

For example:

var normalizedFilenameFromJS = "Designhåndbog.pdf".normalize();
var normalizedFilenameFromServer = "Designhåndbog.pdf".normalize();

console.log(normalizedFilenameFromJS === normalizedFilenameFromServer); // This prints true

.normalize() can be called with a parameter to specify the normalization form; see the linked Mozilla Developer article for available options.
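As a rough illustration of the form parameter, the sketch below builds the two filenames with explicit escapes so the composed and decomposed forms are guaranteed to differ. The sameText helper is a hypothetical name; "NFC" is the default form when no argument is passed.

```javascript
const composed = "Designh\u00E5ndbog.pdf";    // å as U+00E5
const decomposed = "Designha\u030Andbog.pdf"; // a + U+030A

console.log(composed === decomposed); // false

// NFC composes where possible; NFD decomposes. Either works for comparison,
// as long as both sides use the same form.
console.log(composed.normalize("NFC") === decomposed.normalize("NFC")); // true
console.log(composed.normalize("NFD") === decomposed.normalize("NFD")); // true
console.log(decomposed.normalize("NFC").length); // 17
console.log(composed.normalize("NFD").length);   // 18

// Hypothetical helper: compare two strings after normalizing both to NFC.
function sameText(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(sameText(composed, decomposed)); // true
```

The compatibility forms "NFKC" and "NFKD" go further and also fold characters such as ligatures, which may or may not be what you want when comparing filenames.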

Nobe answered 29/5, 2012 at 19:54 Comment(5)

JFTR: Unicode is not UTF-8. Unicode is a standard for a character set and several encodings; UTF-8 is one of those encodings. – Spartacus 29/5, 2012 at 20:2

Now you are saying that UTF-8 was a character set. It is not. I am also rather certain that your premise is false: a UTF-8 code sequence may not begin with 0xCC. – Spartacus 29/5, 2012 at 20:12

You're right, I should have called it "encoding", as it appears (w3.org/TR/html4/charset.html). The HTML code is <meta charset=UTF-8> (HTML5) or <meta http-equiv=Content-Type content='text/html; charset=UTF-8'> however, so that's somewhat misleading. – Nobe 29/5, 2012 at 20:24

Yes, I guess we will have to live with that mistake from the early Internet drafts (I'm talking RFC 822 and friends here) for a long time to come. – Spartacus 29/5, 2012 at 21:14

I was wrong about 0xCC. Richard Ishida's excellent Unicode tools proved it. – Spartacus 29/5, 2012 at 21:49

I had this same problem.

Adding

<meta charset="UTF-8">

to the HTML file fixed the issue.

In my case the templating engine was baking a JSON string into the HTML file. This string was in Unicode.

While the template was also a Unicode file, the JS engine was treating the string I wrote into the template as a Latin-1 encoded string until I added the meta tag.

I was comparing the typed-in string to one of the JSON object's items (location.title == "Mühle").
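A rough sketch of what such a charset mismatch does to a string. It simulates a Latin-1 reading by mapping each UTF-8 byte to the code point of the same value; whether this matches exactly what happened in the setup described above is an assumption.

```javascript
const expected = "M\u00FChle"; // "Mühle"

// The bytes as they would be stored in a UTF-8 encoded file.
const utf8Bytes = new TextEncoder().encode(expected);

// Simulated Latin-1 decoding: one byte becomes one character.
const misread = Array.from(utf8Bytes, (b) => String.fromCharCode(b)).join("");

console.log(misread);              // MÃ¼hle
console.log(misread === expected); // false

// Decoding the same bytes as UTF-8 gives the original string back.
const reread = new TextDecoder("utf-8").decode(utf8Bytes);
console.log(reread === expected);  // true
```

Unlike the normalization cases elsewhere in this thread, the misread string here does not even look the same when printed, which is a quick way to tell the two problems apart.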

Roofer answered 6/8, 2017 at 21:12 Comment(0)

Let the browser normalize unicode for you. This approach worked for me:

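// Requires jQuery ($).
function normalizeUnicode(s) {
    // Parse s as HTML inside a hidden, temporary div.
    let div = $('<div style="display: none"></div>').html(s).appendTo('body');
    // Read the markup back as the browser serializes it.
    let res = div.html();
    div.remove();
    return res;
}

// unicodeVal1 and unicodeVal2 are the two strings to compare.
normalizeUnicode(unicodeVal1) == normalizeUnicode(unicodeVal2)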

Ajax answered 11/8, 2021 at 14:53 Comment(0)
