JavaScript Unicode normalization

I'm under the impression that the JavaScript interpreter assumes the source code it is interpreting has already been normalized. What, exactly, does the normalizing? It can't be the text editor, otherwise the plaintext representation of the source would change. Is there some "preprocessor" that does the normalization?

Sextain answered 14/10, 2011 at 19:23 Comment(1)
I believe the browser engine is what handles it, which is why there are discrepancies between browsers in what they do and don't support. – Sewell
E
15

No, as of ECMAScript 5 there is no Unicode normalization applied automatically to JavaScript source or strings, and none is even available to scripts. All characters remain unchanged as their original code points, potentially in a non-normalized form.

For example, try:

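<script type="text/javascript">
    // Escapes are used so that copy/paste cannot silently normalize the literals.
    var a = 'caf\u00E9';        // precomposed é (U+00E9)
    var b = 'cafe\u0301';       // e followed by combining acute accent (U+0301)
    alert(a + ' ' + a.length);  // café 4
    alert(b + ' ' + b.length);  // café 5
    alert(a == b);              // false
</script>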

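Update: ECMAScript 6 (ES2015) introduced Unicode normalization for JavaScript strings through String.prototype.normalize(). It still has to be called explicitly; nothing is normalized automatically.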

Ectoparasite answered 14/10, 2011 at 21:25 Comment(1)
It should be pointed out that JavaScript PREDATES UTF-16 and actually exposes UCS-2. (What it uses internally may or may not be UTF-16, but it kicks UCS-2 out.) – Thomasina
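
To illustrate the comment above about 16-bit units, here is a minimal sketch of how a character outside the Basic Multilingual Plane looks from script. The variable names are made up for illustration, and codePointAt and Array.from assume an ES2015 or later engine.

```javascript
// U+1F4A9 is outside the Basic Multilingual Plane, so the string holds a surrogate pair.
var pile = '\uD83D\uDCA9';

var codeUnits = pile.length;                       // length counts 16-bit code units
var firstUnit = pile.charCodeAt(0).toString(16);   // the high surrogate on its own
var codePoint = pile.codePointAt(0).toString(16);  // ES2015: the full code point
var codePoints = Array.from(pile).length;          // ES2015: iteration goes by code point

console.log(codeUnits, firstUnit, codePoint, codePoints); // 2 d83d 1f4a9 1
```

None of this involves normalization: length and charCodeAt simply report the stored code units as they are.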
D
18

ECMAScript 6 introduces String.prototype.normalize() which takes care of Unicode normalization for you.
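
As a rough sketch of what the method does, the snippet below runs the four standard normalization forms (NFC, NFD, NFKC, NFKD) over a couple of sample strings. The variable names are only illustrative, and it assumes an engine that ships normalize() natively.

```javascript
var composed = 'caf\u00E9';     // é as a single code point (U+00E9)
var decomposed = 'cafe\u0301';  // e followed by combining acute accent (U+0301)

// NFC composes where possible, NFD decomposes.
var nfcLength = decomposed.normalize('NFC').length;  // 4
var nfdLength = composed.normalize('NFD').length;    // 5
var sameAfterNfc = composed.normalize('NFC') === decomposed.normalize('NFC'); // true

// The compatibility forms (NFKC, NFKD) additionally fold characters such as
// the "fi" ligature (U+FB01) into their plain equivalents.
var ligature = '\uFB01';
var ligatureNfc = ligature.normalize('NFC');    // unchanged: '\uFB01'
var ligatureNfkc = ligature.normalize('NFKC');  // 'fi'

console.log(nfcLength, nfdLength, sameAfterNfc, ligatureNfkc);
```

Calling normalize() with no argument is the same as passing 'NFC'.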

unorm is a JavaScript polyfill for this method, so you can use String.prototype.normalize() even in environments that don't support it natively. When this answer was written no engine did; current engines ship it.

For more information on how and when to use Unicode normalization in JavaScript, see JavaScript has a Unicode problem – Accounting for lookalikes.
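
One common place to apply it is at the boundary where strings become lookup keys. The sketch below is only an illustration under that assumption: normalizeKey and addUnique are hypothetical helpers, not part of any library, and Set assumes an ES2015 engine.

```javascript
// Hypothetical helper: the canonical form used for every key.
function normalizeKey(text) {
  return text.normalize('NFC');
}

var seen = new Set();

// Hypothetical helper: returns true if the name was new, false if an
// equivalent spelling (after NFC normalization) was already stored.
function addUnique(name) {
  var key = normalizeKey(name);
  if (seen.has(key)) {
    return false;
  }
  seen.add(key);
  return true;
}

var first = addUnique('caf\u00E9');    // true: first time this name is seen
var second = addUnique('cafe\u0301');  // false: same text, different code points

// Normalization does not unify lookalikes from different scripts:
// Latin "a" (U+0061) and Cyrillic "а" (U+0430) stay distinct in every form.
var latinVsCyrillic = '\u0061'.normalize('NFKC') === '\u0430'.normalize('NFKC'); // false

console.log(first, second, latinVsCyrillic);
```

Normalizing once on input, rather than at every comparison, keeps stored keys consistent with each other.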

Dermatogen answered 9/12, 2013 at 14:20 Comment(0)
R
11

If you're using node.js, there is a unorm library for this.

https://github.com/walling/unorm

Ramires answered 11/12, 2011 at 15:56 Comment(0)
S
1

I've updated @bobince's answer:

var cafe4 = 'caf\u00E9';
var cafe5 = 'cafe\u0301';

console.log(
  cafe4 + ' ' + cafe4.length,              // café 4
  cafe5 + ' ' + cafe5.length,              // café 5
  cafe4 === cafe5,                         // false
  cafe4.normalize() === cafe5.normalize()  // true
);
Schuller answered 6/1, 2018 at 18:53 Comment(0)
