How to avoid the browser's Unicode normalization when submitting a form with Unicode text

Asked 24/6, 2012 at 10:12 Answered 16/9, 2015 at 18:17

Solved forms unicode normalization unicode-normalization

When rendering the following Unicode text in HTML, it turns out that the browser (Google Chrome) applies some form of Unicode normalization when posting the data back to the server (probably Normalization Form C).

But with Biblical Hebrew text (בְּרִיךְ הוּא), this can easily break the text, as outlined here (page 9).

Is there any way to avoid the browser's automatic text normalization?

I wrote a blog post that describes the issue I'm facing in more detail: http://blog.hibernatingrhinos.com/12449/would-it-be-possible-to-have-a-web-browser-based-editor-for-an-hebrew-text

Noontide answered 24/6, 2012 at 10:12 Comment(10)

@Hans no. Why do you think so? – Noontide 24/6, 2012 at 11:14

Can't you simply apply the workaround described in the same document? – Judicious 24/6, 2012 at 11:38

And which specific browsers are you asking about? There is no single standardized API for "disable normalization when submitting forms", as far as I know. Individual browsers may or may not have an option to control this. And do you want a way for your website to disable normalization, or a way for the user of the browser to configure his browser to not normalize? – Judicious 24/6, 2012 at 11:42

No, I don't have any algorithm to automate adding the Combining Grapheme Joiner character, and even if I had one, I want to avoid normalization altogether, in order to preserve the character meanings. – Noontide 24/6, 2012 at 11:45

What makes you think that Google Chrome normalizes text when posting form data? Can you please provide an example? – Novation 24/6, 2012 at 15:18

@JukkaK.Korpela I tried to look up the data on the server and nothing matched. It turned out that the text had changed somehow. Then I compared the original string with the one I get from the browser after a successful post-back. Then I found that normalizing the text is part of the HTML spec, w3.org/TR/charmod-norm. – Noontide 25/6, 2012 at 7:3

@FitzchakYitzchaki, the W3C document is a working draft for a policy document, effectively from year 2005, now being revised (note the text “This version of this document was published to indicate the Internationalization Core Working Group's intention to substantially alter or replace the recommendations found here with very different recommendations in the near future.”). I don’t think browsers apply it or intentionally perform any normalization on form data. Please provide an example for analysis. It is quite possible that normalization is performed server-side when storing the data. – Novation 25/6, 2012 at 10:7

@JukkaK.Korpela no, the normalization is done by the browser. I'm writing a blog post that shows this in much more detail. – Noontide 25/6, 2012 at 10:9

Blog post is here blog.hibernatingrhinos.com/12449/… – Noontide 25/6, 2012 at 11:35

I ran across this here, though looking at the bug it appears to be fixed now. – Aerify 5/5, 2017 at 1:44

This seems to be a feature/bug in WebKit browsers (Chrome, Safari); they normalize form data to NFC, which means, among other things, reordering consecutive combining marks to a “canonical” order. This was new to me, and bad news in cases like this. The worst thing is that different browsers behave differently.

Using a simplified version of your test case http://blog.hibernatingrhinos.com/12449/would-it-be-possible-to-have-a-web-browser-based-editor-for-an-hebrew-text (using a server-side script that just echoes the raw data), I noticed that Chrome and Safari reorder the diacritic marks in U+05E9 U+05C1 U+05B5 (SHIN, SHIN DOT, TSERE), whereas IE, Firefox, and Opera do not.

I also ran a simple test with Latin letter e followed by combining diaeresis U+0308. WebKit browsers convert it to the single character ë, as per NFC rules, whereas other browsers keep the character pair intact.
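Both effects can be reproduced outside a form with JavaScript's built-in `String.prototype.normalize`, which applies the standard NFC rules. This is only a sketch to illustrate what NFC does to the two test strings; the helper name `codePoints` is made up for the example.

```javascript
// Helper (hypothetical name): list a string's code points as U+XXXX.
function codePoints(str) {
    return Array.from(str, function (ch) {
        return "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
    }).join(" ");
}

// SHIN, SHIN DOT, TSERE: the order used in the Hebrew test case.
var hebrew = "\u05E9\u05C1\u05B5";
// Latin e followed by COMBINING DIAERESIS.
var latin = "e\u0308";

var hebrewNfc = hebrew.normalize("NFC");
var latinNfc = latin.normalize("NFC");

console.log(codePoints(hebrew), "->", codePoints(hebrewNfc));
console.log(codePoints(latin), "->", codePoints(latinNfc));
```

NFC moves TSERE (combining class 15) in front of SHIN DOT (combining class 24), and composes e plus the diaeresis into U+00EB.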

This seems to be an intentional feature, ever since 2006; https://bugs.webkit.org/show_bug.cgi?id=8769 proudly announces this as part of a bug fix! This might explain the status of the W3C policy document; its current version is WebKit-minded in this issue, but other browser vendors either aren’t interested or knowingly oppose the idea of “early normalization.”

I don’t think there is a way to prevent this. But you could warn users against using Chrome and Safari. You could even use a hidden field containing a simple problem case, then check server side whether it was transmitted as-is, and tell the user to change browser if it isn’t.
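A minimal sketch of the hidden-field check, assuming the server runs JavaScript and has already decoded the form value as UTF-8 into a string. The names `CANARY` and `canaryIntact`, and the idea of using the SHIN, SHIN DOT, TSERE sequence as the problem case, are assumptions for illustration.

```javascript
// Known problem case: SHIN, SHIN DOT, TSERE (deliberately not in NFC order).
// The page would put this value in a hidden form field.
var CANARY = "\u05E9\u05C1\u05B5";

// Returns true if the hidden field arrived exactly as it was sent.
// `submitted` is the decoded value of that (hypothetical) hidden field.
// JavaScript's === compares code units and performs no normalization.
function canaryIntact(submitted) {
    return submitted === CANARY;
}
```

If `canaryIntact` returns false, the browser (or something else on the way) has altered the text, and the user can be told so.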

Fixing the order server-side isn’t simple, because common normalization routines apparently do not support the order needed. You could normalize to fully decomposed form (NFD), then reorder combining marks using your own code for the purpose. Perhaps simpler and safer, you could just run an ad hoc replacement routine that replaces sequences of combining marks with other sequences. This would be safer because it would not affect characters other than those you want to affect, whereas NFD decomposes Latin letters with diacritics, among other things.
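The ad hoc replacement idea could look roughly like the sketch below. The table `MARK_FIXES` and the function name are hypothetical, and the single entry only covers the test case from this answer; which sequences need restoring, and to what order, depends on the text being edited.

```javascript
// Hypothetical replacement table: [sequence as NFC leaves it, order wanted].
// Only the pair from the test case is listed here.
var MARK_FIXES = [
    ["\u05B5\u05C1", "\u05C1\u05B5"] // TSERE, SHIN DOT -> SHIN DOT, TSERE
];

// Replace every occurrence of each listed sequence; all other
// characters (including Latin letters with diacritics) are untouched.
function restoreMarkOrder(text) {
    return MARK_FIXES.reduce(function (result, pair) {
        return result.split(pair[0]).join(pair[1]);
    }, text);
}
```

This assumes the wanted order is always the same for a given pair of marks, which may not hold for every text.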

According to Unicode principles, canonically equivalent strings (e.g., differing only in the order of consecutive diacritic marks) are different representations of the same data but distinct as sequences of Unicode characters (code points); they are not expected to differ in presentation, but they may, and often do. Generally, you should not expect programs to treat canonically equivalent strings as different, though programs may make a difference. See Unicode Normalization FAQ.
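To make the distinction concrete: the two orders of the Hebrew marks are different code point sequences, but they compare equal once both sides are normalized. A small sketch (the function name `canonicallyEquivalent` is made up for the example):

```javascript
// Two strings are canonically equivalent if they normalize to the same
// sequence; NFD is used here, NFC would give the same answer.
function canonicallyEquivalent(a, b) {
    return a.normalize("NFD") === b.normalize("NFD");
}

var asTyped = "\u05E9\u05C1\u05B5";    // SHIN, SHIN DOT, TSERE
var reordered = "\u05E9\u05B5\u05C1";  // SHIN, TSERE, SHIN DOT

console.log(asTyped === reordered);                     // distinct sequences
console.log(canonicallyEquivalent(asTyped, reordered)); // same data
```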

The FAQ entry claims that the problems of biblical Hebrew have been solved by the introduction of COMBINING GRAPHEME JOINER. Although it prevents the reordering in Chrome, it’s a clumsy method, and it may mess up rendering (it does in web browsers; diacritic marks may get badly misplaced).

Novation answered 25/6, 2012 at 13:8 Comment(6)

I think that this is more a bug than a feature, since the normalization occurs not on text rendering but on form submit. At this point, normalization decisions should be made by the server, not the browser. – Noontide 26/6, 2012 at 7:26

I created an issue for that, code.google.com/p/chromium/issues/… – Noontide 26/6, 2012 at 9:43

+1: "But you could warn users against using Chrome and Safari." Usually users are warned against using IE6-8. – Ashliashlie 26/9, 2012 at 19:33

I am also facing the same (bug?). Posted here code.google.com/p/chromium/issues/detail?id=117128#. So what's the solution? Alter characters on the server side? I wrote a script for that too. – Radiotelegraph 21/11, 2012 at 0:11

Is it not better to store all strings in the database in some fixed normalization form? Browsers know how to render/edit UTF-8 in any normalization you give them. Send out UTF-8 in, say, NFC, then let them edit, etc., and then normalize to NFC on storage. Settling on a normalization allows searching; not doing it means returning the same byte pattern the user typed. It's your choice. We usually store verbatim, then when searching use special non-ASCII old-school search algorithms. – Forestall 10/5, 2017 at 19:59

Tom Andersen, the issue here was that a particular non-NFC form, entered by the user, was needed for correct rendering (of Biblical Hebrew text). So it was about preserving the order as written, and browser intervention (early normalization) makes that impossible. – Novation 11/5, 2017 at 4:20

It is possible to avoid the string normalization by sending a Uint8Array rather than a string. First, get the UTF-8 data of your string as a Uint8Array as described here by @Moshev:

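function utf8AbFromStr(str) {
    // encodeURIComponent percent-encodes the UTF-8 bytes; unescape (a
    // deprecated function) turns each %XX back into one char per byte.
    var strUtf8 = unescape(encodeURIComponent(str));
    var ab = new Uint8Array(strUtf8.length);
    for (var i = 0; i < strUtf8.length; i++) {
        // Every char code is now a single byte value (0-255).
        ab[i] = strUtf8.charCodeAt(i);
    }
    return ab;
}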

Then you can POST that Uint8Array with plain XHR or your favorite Ajax library. If you're using jQuery, keep in mind that you need to specify processData: false to prevent jQuery from trying to stringify it and undoing all of your hard work.
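A sketch of the plain-XHR variant, under some assumptions: the function names `utf8Bytes` and `postRaw`, the `application/octet-stream` content type, and any URL passed in are all illustrative, and the server is assumed to read the request body as raw bytes and decode it as UTF-8 itself. `TextEncoder` is used here in place of the loop above.

```javascript
// Encode a string as UTF-8 bytes without touching its code point order.
function utf8Bytes(str) {
    return new TextEncoder().encode(str);
}

// Send the bytes as the raw request body (browser only, since it relies
// on XMLHttpRequest). The content type is an assumption for this sketch.
function postRaw(url, text) {
    var xhr = new XMLHttpRequest();
    xhr.open("POST", url);
    xhr.setRequestHeader("Content-Type", "application/octet-stream");
    xhr.send(utf8Bytes(text));
    return xhr;
}
```

Because the body is a typed array rather than a string or form field, the browser has no text to normalize before sending.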

Broomrape answered 16/9, 2015 at 18:17 Comment(1)

Can nowadays be greatly simplified by using a TextEncoder: utf8AbFromStr = (str) => new TextEncoder().encode(str);. – Amadis 10/9, 2020 at 11:50

You can manipulate the text on the client side before you submit. If inserting a Combining Grapheme Joiner works, you could insert it via JavaScript.

As a starting point, here's a JSFiddle that gets the characters letter by letter (tested in Safari and it doesn't normalize text): http://jsfiddle.net/TmtnA/
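One possible way to insert the Combining Grapheme Joiner (U+034F) from JavaScript is sketched below. It is not taken from the JSFiddle; the names are made up, and it simply puts a CGJ between every two consecutive Hebrew accents or points, whether or not NFC would have reordered them. As noted in another answer, the CGJ may disturb rendering, so the server might want to strip it again after receiving the text.

```javascript
var CGJ = "\u034F"; // COMBINING GRAPHEME JOINER

// Hebrew accents and points (combining marks only; punctuation such as
// MAQAF U+05BE and SOF PASUQ U+05C3 is left out).
var HEBREW_MARK = "[\\u0591-\\u05BD\\u05BF\\u05C1\\u05C2\\u05C4\\u05C5\\u05C7]";
var BETWEEN_MARKS = new RegExp("(" + HEBREW_MARK + ")(?=" + HEBREW_MARK + ")", "g");

// Insert a CGJ after every mark that is directly followed by another mark.
function insertCgj(text) {
    return text.replace(BETWEEN_MARKS, "$1" + CGJ);
}

var guarded = insertCgj("\u05E9\u05C1\u05B5"); // SHIN, SHIN DOT, CGJ, TSERE
console.log(guarded.normalize("NFC") === guarded);
```

The CGJ has combining class 0, so NFC will not move marks across it.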

Rizzio answered 5/9, 2012 at 10:46 Comment(0)
