How do I make toLowerCase() and toUpperCase() consistent across browsers?

Are there JavaScript polyfill implementations of String.toLowerCase() and String.toUpperCase(), or other methods in JavaScript that can work with Unicode characters and are consistent across browsers?

Background info

Running the following gives different results in different browsers, or even between versions of the same browser (e.g. Firefox 54 vs. 55):

document.write(String.fromCodePoint(223).normalize("NFKC").toLowerCase().toUpperCase().toLowerCase())

In Firefox 55 it gives you ss, in Firefox 54 it gives you ß.
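When comparing browsers, it can help to print the code points of the result rather than relying on how the string renders. A minimal sketch (the helper name `describeRoundTrip` is made up for illustration):

```javascript
// Hypothetical helper: runs the same round trip as above and lists the
// resulting code points, so "ss" and "ß" cannot be confused visually.
function describeRoundTrip(input) {
  const result = input.normalize('NFKC').toLowerCase().toUpperCase().toLowerCase();
  const codePoints = Array.from(result, (char) =>
    'U+' + char.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
  return { result, codePoints };
}

// The output depends on the engine, which is exactly the problem described here.
console.log(describeRoundTrip('\xDF'));
```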

Generally this is fine, and mechanisms such as locales handle a lot of the cases you'd want. However, when you need consistent behavior across platforms, such as when talking to BaaS systems, a consistent implementation can greatly simplify interactions where you're essentially processing internal data on the client.

Osana answered 26/11, 2018 at 19:48 Comment(9)
Have you tried the same with toLocaleLowerCase()? developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/… - Vicegerent
This doesn't work because I need consistent behavior across browsers (all clients need the same result) and hence need it to be locale independent. - Osana
Why are you even converting to and fro? - Vicegerent
To the close voters, this is definitely on-topic as a specific programming/algorithm question for JavaScript/Unicode as it pertains to different browser implementations. - Osana
Have you checked the changelog for FF55? - Vicegerent
@connexo, I'm updating my answer to #48096563 to include details on Unicode case folding as the current approach results in different data/results across clients. This isn't an issue in languages like Python/Java as you can ensure a consistent implementation, JavaScript in browsers is uniquely broken for this case. This issue is not specific to FF, but browsers in general - e.g. Chrome vs IE are different. - Osana
There have been changes as of ES6 regarding the handling of Unicode 16 characters, also reflecting in a change in the spec of how to implement toLowerCase(): ecma-international.org/ecma-262/6.0/… vs ecma-international.org/ecma-262/5.1/#sec-15.5.4.16 - Vicegerent
Note that String.fromCodePoint(223) can be written as a string literal to simplify the example a bit: '\xDF'. - Heeler
@MathiasBynens I was using fromCodePoint as I was cheating in my testing by passing data URL schemes to browsershots and it helped me avoid URL escaping that garbled the JavaScript. I know, terrible :) - Osana

Note that this issue only seems to affect outdated versions of Firefox, so unless you explicitly need to support those old versions, you could choose to just not bother at all. The behavior for your example is the same in all modern browsers (since the change in Firefox). This can be verified using jsvu + eshost:

$ jsvu # Update installed JavaScript engine binaries to the latest version.

$ eshost -e '"\xDF".normalize("NFKC").toLowerCase().toUpperCase().toLowerCase()'
#### Chakra
ss

#### V8 --harmony
ss

#### JavaScriptCore
ss

#### V8
ss

#### SpiderMonkey
ss

#### xs
ss

But you asked how to solve this problem, so let’s continue.

Step 4 of https://tc39.github.io/ecma262/#sec-string.prototype.tolowercase states:

Let cuList be a List where the elements are the result of toLowercase(cpList), according to the Unicode Default Case Conversion algorithm.

This Unicode Default Case Conversion algorithm is specified in section 3.13 Default Case Algorithms of the Unicode standard.

The full case mappings for Unicode characters are obtained by using the mappings from SpecialCasing.txt plus the mappings from UnicodeData.txt, excluding any of the latter mappings that would conflict. Any character that does not have a mapping in these files is considered to map to itself.

[…]

The following rules specify the default case conversion operations for Unicode strings. These rules use the full case conversion operations, Uppercase_Mapping(C), Lowercase_Mapping(C), and Titlecase_Mapping(C), as well as the context-dependent mappings based on the casing context, as specified in Table 3-17.

For a string X:

  • R1 toUppercase(X): Map each character C in X to Uppercase_Mapping(C).
  • R2 toLowercase(X): Map each character C in X to Lowercase_Mapping(C).
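To make rule R2 concrete, here is a minimal sketch of a per-character lowercase conversion. The table `lowercaseMapping` is a hypothetical two-entry excerpt, not real generated data, and the context-dependent mappings from Table 3-17 are deliberately left out:

```javascript
// Hypothetical, deliberately tiny excerpt of a full lowercase mapping table.
const lowercaseMapping = new Map([
  ['A', 'a'],
  ['\u0130', 'i\u0307'], // one code point can map to several code points
]);

// R2 toLowercase(X): map each character C in X to Lowercase_Mapping(C).
// Characters without an entry map to themselves. Context-dependent
// mappings are not handled in this sketch.
function toLowercaseSketch(string) {
  let result = '';
  for (const char of string) { // iterates by code point, not by UTF-16 code unit
    result += lowercaseMapping.has(char) ? lowercaseMapping.get(char) : char;
  }
  return result;
}

console.log(toLowercaseSketch('A\u0130b') === 'ai\u0307b'); // true
```

rule R1 works the same way with an uppercase table.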

Here’s an example from SpecialCasing.txt, with my annotation added below:

00DF  ; 00DF   ; 0053 0073; 0053 0053;                      # LATIN SMALL LETTER SHARP S
<code>; <lower>; <title>  ; <upper>  ; (<condition_list>;)? # <comment>

This line says that U+00DF ('ß') lowercases to U+00DF ('ß'), titlecases to U+0053 U+0073 ('Ss'), and uppercases to U+0053 U+0053 ('SS').
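A line in that format could be parsed roughly like this. This is only a sketch: the function name `parseSpecialCasingLine` and the shape of the returned object are my own choices for illustration.

```javascript
// Converts a space-separated list of hex code points into a string.
const hexListToString = (hexList) =>
  hexList === ''
    ? ''
    : String.fromCodePoint(...hexList.split(/\s+/).map((hex) => parseInt(hex, 16)));

// Hypothetical parser for a single SpecialCasing.txt line.
// Returns null for blank lines and comment-only lines.
function parseSpecialCasingLine(line) {
  const withoutComment = line.split('#')[0].trim();
  if (withoutComment === '') return null;
  // <code>; <lower>; <title>; <upper>; (<condition_list>;)?
  const [code, lower, title, upper, conditions = ''] =
    withoutComment.split(';').map((field) => field.trim());
  return {
    character: hexListToString(code),
    lower: hexListToString(lower),
    title: hexListToString(title),
    upper: hexListToString(upper),
    conditions: conditions === '' ? [] : conditions.split(/\s+/),
  };
}

const sharpS = parseSpecialCasingLine(
  '00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S'
);
console.log(sharpS.upper); // SS
```

Entries with a non-empty `conditions` list only apply in certain contexts or locales, so a locale-independent library would probably want to skip them.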

Here’s an example from UnicodeData.txt, with my annotation added below:

0041  ; LATIN CAPITAL LETTER A; Lu;0;L;;;;;N;; ;        ; 0061   ;
<code>; <name>                ; <ignore>       ; <upper>; <lower>; <title>

This line says that U+0041 ('A') lowercases to U+0061 ('a'). It doesn’t have an explicit uppercase mapping, meaning it uppercases to itself.

Here’s another example from UnicodeData.txt:

0061  ; LATIN SMALL LETTER A; Ll;0;L;;;;;N;; ; 0041   ;        ; 0041
<code>; <name>              ; <ignore>       ; <upper>; <lower>; <title>

This line says that U+0061 ('a') uppercases to U+0041 ('A'). It doesn’t have an explicit lowercase mapping, meaning it lowercases to itself.
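For UnicodeData.txt, the simple case mappings sit in the last three semicolon-separated fields: index 12 is the uppercase mapping, index 13 the lowercase mapping, and index 14 the titlecase mapping. A sketch of reading one line (the name `parseUnicodeDataLine` is hypothetical):

```javascript
// Hypothetical parser for a single UnicodeData.txt line.
// An empty mapping field means the character maps to itself.
function parseUnicodeDataLine(line) {
  const fields = line.split(';').map((field) => field.trim());
  const fromHex = (hex) => String.fromCodePoint(parseInt(hex, 16));
  const character = fromHex(fields[0]);
  return {
    character,
    upper: fields[12] === '' ? character : fromHex(fields[12]),
    lower: fields[13] === '' ? character : fromHex(fields[13]),
  };
}

// Logs an object with character 'A', upper 'A', and lower 'a'.
console.log(parseUnicodeDataLine('0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;'));
```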

You could write a script that parses these two files, reads each line following these examples, and builds lowercase/uppercase mappings. You could then turn those mappings into a small JavaScript library that provides spec-compliant toLowerCase/toUpperCase functionality.
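A rough sketch of such a script's core, assuming the contents of both files have already been loaded into strings. The function name `buildCaseMaps` is made up, and conditional SpecialCasing.txt entries (those with a condition list) are skipped here, which is a simplification on my part:

```javascript
// Converts a space-separated list of hex code points into a string.
const hexToString = (hexList) =>
  String.fromCodePoint(...hexList.trim().split(/\s+/).map((hex) => parseInt(hex, 16)));

// Hypothetical builder: simple mappings from UnicodeData.txt first, then
// unconditional SpecialCasing.txt entries override any conflicting ones.
function buildCaseMaps(unicodeDataText, specialCasingText) {
  const lower = new Map();
  const upper = new Map();
  for (const line of unicodeDataText.split('\n')) {
    const fields = line.split(';').map((field) => field.trim());
    if (fields.length < 15) continue; // not a complete UnicodeData.txt record
    const character = hexToString(fields[0]);
    if (fields[12] !== '') upper.set(character, hexToString(fields[12]));
    if (fields[13] !== '') lower.set(character, hexToString(fields[13]));
  }
  for (const rawLine of specialCasingText.split('\n')) {
    const line = rawLine.split('#')[0].trim(); // drop trailing comment
    if (line === '') continue;
    const fields = line.split(';').map((field) => field.trim());
    if (fields[4]) continue; // has a condition list: skipped in this sketch
    const character = hexToString(fields[0]);
    if (fields[1] !== '') lower.set(character, hexToString(fields[1]));
    if (fields[3] !== '') upper.set(character, hexToString(fields[3]));
  }
  return { lower, upper };
}
```

Any character missing from the resulting maps is treated as mapping to itself.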

This seems like a lot of work. Depending on the old behavior in Firefox and what exactly changed (?), you could probably limit the work to just the special mappings in SpecialCasing.txt. (I’m assuming that only the special casings changed in Firefox 55, based on the example you provided.)

// Instead of…
function normalize(string) {
  const normalized = string.normalize('NFKC');
  const lowercased = normalized.toLowerCase();
  return lowercased;
}

// …one could do something like:
function lowerCaseSpecialCases(string) {
  // TODO: replace all SpecialCasing.txt characters with their lowercase
  // mapping.
  return string.replace(/TODO/g, fn);
}
function normalize(string) {
  const normalized = string.normalize('NFKC');
  const fixed = lowerCaseSpecialCases(normalized); // Workaround for old Firefox 54 behavior.
  const lowercased = fixed.toLowerCase();
  return lowercased;
}

I wrote a script that parses SpecialCasing.txt and generates a JS library that implements the lowerCaseSpecialCases functionality mentioned above (as toLower) as well as toUpper. Here it is: https://gist.github.com/mathiasbynens/a37e3f3138069729aa434ea90eea4a3c

Depending on your exact use case, you might not need toUpper and its corresponding regex and map at all. Here’s the full generated library:

const reToLower = /[\u0130\u1F88-\u1F8F\u1F98-\u1F9F\u1FA8-\u1FAF\u1FBC\u1FCC\u1FFC]/g;
const toLowerMap = new Map([
  ['\u0130', 'i\u0307'],
  ['\u1F88', '\u1F80'],
  ['\u1F89', '\u1F81'],
  ['\u1F8A', '\u1F82'],
  ['\u1F8B', '\u1F83'],
  ['\u1F8C', '\u1F84'],
  ['\u1F8D', '\u1F85'],
  ['\u1F8E', '\u1F86'],
  ['\u1F8F', '\u1F87'],
  ['\u1F98', '\u1F90'],
  ['\u1F99', '\u1F91'],
  ['\u1F9A', '\u1F92'],
  ['\u1F9B', '\u1F93'],
  ['\u1F9C', '\u1F94'],
  ['\u1F9D', '\u1F95'],
  ['\u1F9E', '\u1F96'],
  ['\u1F9F', '\u1F97'],
  ['\u1FA8', '\u1FA0'],
  ['\u1FA9', '\u1FA1'],
  ['\u1FAA', '\u1FA2'],
  ['\u1FAB', '\u1FA3'],
  ['\u1FAC', '\u1FA4'],
  ['\u1FAD', '\u1FA5'],
  ['\u1FAE', '\u1FA6'],
  ['\u1FAF', '\u1FA7'],
  ['\u1FBC', '\u1FB3'],
  ['\u1FCC', '\u1FC3'],
  ['\u1FFC', '\u1FF3']
]);
const toLower = (string) => string.replace(reToLower, (match) => toLowerMap.get(match));

const reToUpper = /[\xDF\u0149\u01F0\u0390\u03B0\u0587\u1E96-\u1E9A\u1F50\u1F52\u1F54\u1F56\u1F80-\u1FAF\u1FB2-\u1FB4\u1FB6\u1FB7\u1FBC\u1FC2-\u1FC4\u1FC6\u1FC7\u1FCC\u1FD2\u1FD3\u1FD6\u1FD7\u1FE2-\u1FE4\u1FE6\u1FE7\u1FF2-\u1FF4\u1FF6\u1FF7\u1FFC\uFB00-\uFB06\uFB13-\uFB17]/g;
const toUpperMap = new Map([
  ['\xDF', 'SS'],
  ['\uFB00', 'FF'],
  ['\uFB01', 'FI'],
  ['\uFB02', 'FL'],
  ['\uFB03', 'FFI'],
  ['\uFB04', 'FFL'],
  ['\uFB05', 'ST'],
  ['\uFB06', 'ST'],
  ['\u0587', '\u0535\u0552'],
  ['\uFB13', '\u0544\u0546'],
  ['\uFB14', '\u0544\u0535'],
  ['\uFB15', '\u0544\u053B'],
  ['\uFB16', '\u054E\u0546'],
  ['\uFB17', '\u0544\u053D'],
  ['\u0149', '\u02BCN'],
  ['\u0390', '\u0399\u0308\u0301'],
  ['\u03B0', '\u03A5\u0308\u0301'],
  ['\u01F0', 'J\u030C'],
  ['\u1E96', 'H\u0331'],
  ['\u1E97', 'T\u0308'],
  ['\u1E98', 'W\u030A'],
  ['\u1E99', 'Y\u030A'],
  ['\u1E9A', 'A\u02BE'],
  ['\u1F50', '\u03A5\u0313'],
  ['\u1F52', '\u03A5\u0313\u0300'],
  ['\u1F54', '\u03A5\u0313\u0301'],
  ['\u1F56', '\u03A5\u0313\u0342'],
  ['\u1FB6', '\u0391\u0342'],
  ['\u1FC6', '\u0397\u0342'],
  ['\u1FD2', '\u0399\u0308\u0300'],
  ['\u1FD3', '\u0399\u0308\u0301'],
  ['\u1FD6', '\u0399\u0342'],
  ['\u1FD7', '\u0399\u0308\u0342'],
  ['\u1FE2', '\u03A5\u0308\u0300'],
  ['\u1FE3', '\u03A5\u0308\u0301'],
  ['\u1FE4', '\u03A1\u0313'],
  ['\u1FE6', '\u03A5\u0342'],
  ['\u1FE7', '\u03A5\u0308\u0342'],
  ['\u1FF6', '\u03A9\u0342'],
  ['\u1F80', '\u1F08\u0399'],
  ['\u1F81', '\u1F09\u0399'],
  ['\u1F82', '\u1F0A\u0399'],
  ['\u1F83', '\u1F0B\u0399'],
  ['\u1F84', '\u1F0C\u0399'],
  ['\u1F85', '\u1F0D\u0399'],
  ['\u1F86', '\u1F0E\u0399'],
  ['\u1F87', '\u1F0F\u0399'],
  ['\u1F88', '\u1F08\u0399'],
  ['\u1F89', '\u1F09\u0399'],
  ['\u1F8A', '\u1F0A\u0399'],
  ['\u1F8B', '\u1F0B\u0399'],
  ['\u1F8C', '\u1F0C\u0399'],
  ['\u1F8D', '\u1F0D\u0399'],
  ['\u1F8E', '\u1F0E\u0399'],
  ['\u1F8F', '\u1F0F\u0399'],
  ['\u1F90', '\u1F28\u0399'],
  ['\u1F91', '\u1F29\u0399'],
  ['\u1F92', '\u1F2A\u0399'],
  ['\u1F93', '\u1F2B\u0399'],
  ['\u1F94', '\u1F2C\u0399'],
  ['\u1F95', '\u1F2D\u0399'],
  ['\u1F96', '\u1F2E\u0399'],
  ['\u1F97', '\u1F2F\u0399'],
  ['\u1F98', '\u1F28\u0399'],
  ['\u1F99', '\u1F29\u0399'],
  ['\u1F9A', '\u1F2A\u0399'],
  ['\u1F9B', '\u1F2B\u0399'],
  ['\u1F9C', '\u1F2C\u0399'],
  ['\u1F9D', '\u1F2D\u0399'],
  ['\u1F9E', '\u1F2E\u0399'],
  ['\u1F9F', '\u1F2F\u0399'],
  ['\u1FA0', '\u1F68\u0399'],
  ['\u1FA1', '\u1F69\u0399'],
  ['\u1FA2', '\u1F6A\u0399'],
  ['\u1FA3', '\u1F6B\u0399'],
  ['\u1FA4', '\u1F6C\u0399'],
  ['\u1FA5', '\u1F6D\u0399'],
  ['\u1FA6', '\u1F6E\u0399'],
  ['\u1FA7', '\u1F6F\u0399'],
  ['\u1FA8', '\u1F68\u0399'],
  ['\u1FA9', '\u1F69\u0399'],
  ['\u1FAA', '\u1F6A\u0399'],
  ['\u1FAB', '\u1F6B\u0399'],
  ['\u1FAC', '\u1F6C\u0399'],
  ['\u1FAD', '\u1F6D\u0399'],
  ['\u1FAE', '\u1F6E\u0399'],
  ['\u1FAF', '\u1F6F\u0399'],
  ['\u1FB3', '\u0391\u0399'],
  ['\u1FBC', '\u0391\u0399'],
  ['\u1FC3', '\u0397\u0399'],
  ['\u1FCC', '\u0397\u0399'],
  ['\u1FF3', '\u03A9\u0399'],
  ['\u1FFC', '\u03A9\u0399'],
  ['\u1FB2', '\u1FBA\u0399'],
  ['\u1FB4', '\u0386\u0399'],
  ['\u1FC2', '\u1FCA\u0399'],
  ['\u1FC4', '\u0389\u0399'],
  ['\u1FF2', '\u1FFA\u0399'],
  ['\u1FF4', '\u038F\u0399'],
  ['\u1FB7', '\u0391\u0342\u0399'],
  ['\u1FC7', '\u0397\u0342\u0399'],
  ['\u1FF7', '\u03A9\u0342\u0399']
]);
const toUpper = (string) => string.replace(reToUpper, (match) => toUpperMap.get(match));
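One possible way to use these helpers is to apply the special-casing replacement first and then let the native method handle everything else. The sketch below uses a two-entry excerpt of the uppercase map so that it runs on its own; `consistentUpperCase` and the `...Excerpt` names are hypothetical.

```javascript
// Two-entry excerpt of the generated toUpper data, for illustration only.
const reToUpperExcerpt = /[\xDF\uFB01]/g;
const toUpperMapExcerpt = new Map([
  ['\xDF', 'SS'],
  ['\uFB01', 'FI'],
]);
const toUpperExcerpt = (string) =>
  string.replace(reToUpperExcerpt, (match) => toUpperMapExcerpt.get(match));

// Apply the special casings explicitly, then defer to the native method
// for all remaining characters.
function consistentUpperCase(string) {
  return toUpperExcerpt(string).toUpperCase();
}

console.log(consistentUpperCase('stra\xDFe')); // STRASSE
```

The same pattern applies to toLower followed by the native toLowerCase().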
Heeler answered 27/11, 2018 at 19:8 Comment(0)
