How to avoid the browser's Unicode normalization when submitting a form with Unicode text
N

3

16

When the following Unicode text is rendered in HTML, it turns out that the browser (Google Chrome) applies some form of Unicode normalization when posting the data back to the server (probably to Normalization Form C).

But with Biblical Hebrew text (בְּרִיךְ הוּא), this can easily break the text, as outlined here (page 9).

Is there any way to avoid the browser's automatic text normalization?

I wrote a blog post that describes the issue I'm facing in more detail: http://blog.hibernatingrhinos.com/12449/would-it-be-possible-to-have-a-web-browser-based-editor-for-an-hebrew-text

Noontide answered 24/6, 2012 at 10:12 Comment(10)
@Hans no. Why do you think so?Noontide
Can't you simply apply the workaround described in the same document?Judicious
And which specific browsers are you asking about? There is no single standardized API for "disable normalization when submitting forms", as far as I know. Individual browsers may or may not have an option to control this. And do you want a way for your website to disable normalization, or a way for the user of the browser to configure his browser to not normalize?Judicious
No, I don't have any algorithm to automate adding the CombiningGraphemeJoiner char, and even if I had, I want to avoid the normalization altogether, in order to preserve the character meanings.Noontide
What makes you think that Google Chrome normalizes text when posting form data? Can you please provide an example?Novation
@JukkaK.Korpela On the server, I tried to look up the data and nothing matched. It turned out that the text had changed somehow. Then I compared the original string with the one that I get from the browser after a successful post-back. Then I found that normalizing the text is part of the HTML spec, w3.org/TR/charmod-norm.Noontide
@FitzchakYitzchaki, the W3C document is a working draft for a policy document, effectively from year 2005, now being revised (note the text “This version of this document was published to indicate the Internationalization Core Working Group's intention to substantially alter or replace the recommendations found here with very different recommendations in the near future.”). I don’t think browsers apply it or intentionally perform any normalization on form data. Please provide an example for analysis. It is quite possible that normalization is performed server-side when storing the data.Novation
@JukkaK.Korpela no, the normalization is done by the browser. I'm writing a blog post that shows this, with much more detail.Noontide
Blog post is here blog.hibernatingrhinos.com/12449/…Noontide
I ran across this here, though looking at the bug it appears to be fixed now.Aerify
N
13

This seems to be a feature/bug in WebKit browsers (Chrome, Safari); they normalize form data to NFC, which means, among other things, reordering consecutive combining marks to a “canonical” order. This was new to me, and bad news in cases like this. The worst thing is that different browsers behave differently.

Using a simplified version of your test case http://blog.hibernatingrhinos.com/12449/would-it-be-possible-to-have-a-web-browser-based-editor-for-an-hebrew-text (using a server-side script that just echoes the raw data), I noticed that Chrome and Safari reorder the diacritic marks in U+05E9 U+05C1 U+05B5 (SHIN, SHIN DOT, TSERE), whereas IE, Firefox, and Opera do not.

I also ran a simple test with Latin letter e followed by combining diaeresis U+0308. WebKit browsers convert it to the single character ë, as per NFC rules, whereas other browsers keep the character pair intact.
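Both effects can be reproduced outside of form submission with the standard `String.prototype.normalize` method. This is only a minimal sketch of what NFC does to the two test strings mentioned above; the helper name `codePoints` is made up for illustration, and the snippet says nothing about what any particular browser does when posting a form.

```javascript
// List the code points of a string as "U+XXXX" labels.
function codePoints(str) {
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// SHIN, SHIN DOT, TSERE: the order used in the test case.
const hebrew = "\u05E9\u05C1\u05B5";
// Latin e followed by COMBINING DIAERESIS.
const latin = "e\u0308";

// NFC puts TSERE (combining class 15) before SHIN DOT (combining class 24).
console.log(codePoints(hebrew.normalize("NFC")).join(" ")); // U+05E9 U+05B5 U+05C1

// NFC composes the pair into the single character U+00EB.
console.log(codePoints(latin.normalize("NFC")).join(" ")); // U+00EB
```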

This seems to be an intentional feature, ever since 2006; https://bugs.webkit.org/show_bug.cgi?id=8769 proudly announces this as part of a bug fix! This might explain the status of the W3C policy document; its current version is WebKit-minded in this issue, but other browser vendors either aren’t interested or knowingly oppose the idea of “early normalization.”

I don’t think there is a way to prevent this. But you could warn users against using Chrome and Safari. You could even use a hidden field containing a simple problem case, then check server side whether it was transmitted as-is, and tell the user to change browser if it isn’t.
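The hidden-field check could look roughly like the following on a JavaScript server. This is a sketch under assumptions: the names `CANARY` and `checkCanary` are hypothetical, and it presumes the page was served with the canary value in a hidden field and that the server framework hands you the decoded field value unchanged.

```javascript
// The problem case placed in the hidden field: SHIN, SHIN DOT, TSERE.
const CANARY = "\u05E9\u05C1\u05B5";

// Compare the value that came back with the value that was sent.
// JavaScript's === compares code units, so canonically equivalent
// but reordered strings count as different here.
function checkCanary(received) {
  return {
    intact: received === CANARY,
    // True when the value matches what NFC normalization would produce.
    looksNormalized: received !== CANARY && received === CANARY.normalize("NFC"),
  };
}

// Example: a value that was reordered in transit.
console.log(checkCanary("\u05E9\u05B5\u05C1"));
```

If `intact` is false, the server can show the warning instead of storing text whose mark order can no longer be trusted.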

Fixing the order server-side isn’t simple, because common normalization routines apparently do not support the order needed. You could normalize to fully decomposed form (NFD), then reorder combining marks using your own code for the purpose. Perhaps simpler and safer, you could just run an ad hoc replacement routine that replaces sequences of combining marks with other sequences. This would be safer because it would not affect characters other than those you want to affect, whereas NFD decomposes Latin letters with diacritics, among other things.
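An ad hoc replacement routine of that kind might be sketched as below. The table contains a single illustrative pair (TSERE followed by SHIN DOT, swapped back to SHIN DOT followed by TSERE); which pairs actually need restoring, and in which direction, depends on the text, so the table contents and the names `REPLACEMENTS` and `restoreMarkOrder` are assumptions rather than a complete solution.

```javascript
// Map from the sequence produced by normalization to the sequence wanted.
// Only one illustrative entry; a real table would be built for the text at hand.
const REPLACEMENTS = new Map([
  ["\u05B5\u05C1", "\u05C1\u05B5"], // TSERE + SHIN DOT -> SHIN DOT + TSERE
]);

// Replace every listed mark sequence; all other characters are left untouched.
function restoreMarkOrder(text) {
  let result = text;
  for (const [from, to] of REPLACEMENTS) {
    result = result.split(from).join(to);
  }
  return result;
}

// SHIN + TSERE + SHIN DOT becomes SHIN + SHIN DOT + TSERE again.
console.log(restoreMarkOrder("\u05E9\u05B5\u05C1") === "\u05E9\u05C1\u05B5"); // true
```

Unlike a pass through NFD, this leaves precomposed Latin letters such as ë as they are. Note that it rewrites every occurrence of a listed sequence, including ones the user typed in that order deliberately.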

According to Unicode principles, canonically equivalent strings (e.g., differing only in the order of consecutive diacritic marks) are different representations of the same data but distinct as sequences of Unicode characters (code points); they are not expected to differ in presentation, but they may, and often do. Generally, you should not expect programs to treat canonically equivalent strings as different, though programs may make a difference. See Unicode Normalization FAQ.

The FAQ entry claims that the problems of biblical Hebrew have been solved by the introduction of COMBINING GRAPHEME JOINER. Although it prevents the reordering in Chrome, it’s a clumsy method, and it may mess up rendering (it does in web browsers; diacritic marks may get badly misplaced).

Novation answered 25/6, 2012 at 13:8 Comment(6)
I think that this is more a bug than a feature, since the normalization occurs not on text rendering, but on form submit. At this point, normalization decisions should be the server's, not the browser's.Noontide
I created an issue for that, code.google.com/p/chromium/issues/…Noontide
+1: "But you could warn users against using Chrome and Safari." Usually user is warned about using ie6-8.Ashliashlie
I am also facing the same (bug?). Posted here code.google.com/p/chromium/issues/detail?id=117128#. So what's the solution? Alter characters on the server side? I wrote a script for that too.Radiotelegraph
Is it not better to store all strings in the database in some determined normalization? Browsers know how to render/edit UTF-8 in any normalization you give them. Send out UTF-8 in (say) NFC, then let them edit, etc., and then normalize to NFC on storage. Settling on a normalization allows searching; not doing so means returning the same byte pattern the user typed. It's your choice. We usually store verbatim, then when searching use special non-ASCII old-school search algorithms.Forestall
Tom Andersen, the issue here was that a particular non-NFC form, entered by the user, was needed for correct rendering (of Biblical Hebrew text). So it was about preserving the order as written, and browser intervention (early normalization) makes that impossible.Novation
B
4

It is possible to avoid the string normalization by sending a Uint8Array rather than a string. First, get the UTF-8 data of your string as a Uint8Array as described here by @Moshev:

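function utf8AbFromStr(str) {
    // encodeURIComponent percent-encodes the UTF-8 bytes of str;
    // unescape turns each %XX back into one char with that byte value.
    var strUtf8 = unescape(encodeURIComponent(str));
    var ab = new Uint8Array(strUtf8.length);
    for (var i = 0; i < strUtf8.length; i++) {
        // Each char now holds a single byte (0-255).
        ab[i] = strUtf8.charCodeAt(i);
    }
    return ab;
}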

Then you can POST that Uint8Array with plain XHR or your favorite Ajax library. If you're using jQuery, keep in mind that you need to specify processData: false to prevent jQuery from trying to stringify it and undoing all of your hard work.
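As a rough sketch of the sending side without jQuery: the snippet below encodes the text with the standard `TextEncoder` and posts the bytes with `fetch`. The function names `encodeBody` and `postRawUtf8`, the URL you pass in, and the `Content-Type` value are all assumptions for illustration; the server has to read the raw request body itself, since this is not a regular form-encoded submission.

```javascript
// Turn the string into its UTF-8 bytes; TextEncoder does not normalize.
function encodeBody(text) {
  return new TextEncoder().encode(text);
}

// POST the raw bytes. The endpoint and the content type are hypothetical
// and must match whatever the server expects.
function postRawUtf8(url, text) {
  return fetch(url, {
    method: "POST",
    headers: { "Content-Type": "text/plain; charset=utf-8" },
    body: encodeBody(text),
  });
}

// SHIN, SHIN DOT, TSERE keep their order in the byte stream.
console.log(Array.from(encodeBody("\u05E9\u05C1\u05B5")));
```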

Broomrape answered 16/9, 2015 at 18:17 Comment(1)
Can nowadays be greatly simplified by using a TextEncoder: utf8AbFromStr = (str) => new TextEncoder().encode(str);.Amadis
R
0

You can manipulate the text on the client side before you submit. If inserting a Combining Grapheme Joiner works, you could insert it via JavaScript.
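One possible way to do that insertion, sketched under assumptions: the function below walks the text and puts a COMBINING GRAPHEME JOINER (U+034F) between two adjacent Hebrew combining marks only when NFC would otherwise swap them. The name `protectMarkOrder` and the choice to limit the check to the Hebrew marks block are my own; whether the extra character is acceptable for rendering and storage is a separate question.

```javascript
const CGJ = "\u034F"; // COMBINING GRAPHEME JOINER

// Hebrew accents and points (excluding the punctuation in the same block).
const HEBREW_MARK = /^[\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7]$/;

// Insert CGJ between adjacent Hebrew marks that NFC would reorder.
function protectMarkOrder(text) {
  const chars = Array.from(text);
  let out = "";
  for (let i = 0; i < chars.length; i++) {
    out += chars[i];
    const next = chars[i + 1];
    if (next !== undefined && HEBREW_MARK.test(chars[i]) && HEBREW_MARK.test(next)) {
      const pair = chars[i] + next;
      // The pair changes under NFC exactly when it is out of canonical order.
      if (pair.normalize("NFC") !== pair) {
        out += CGJ;
      }
    }
  }
  return out;
}

const guarded = protectMarkOrder("\u05E9\u05C1\u05B5");
// The guarded string survives normalization unchanged.
console.log(guarded.normalize("NFC") === guarded); // true
```

The server would then need to strip the joiners again if the stored text should not contain them.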

As a starting point, here's a JSFiddle that gets the characters letter by letter (tested in Safari and it doesn't normalize text): http://jsfiddle.net/TmtnA/

Rizzio answered 5/9, 2012 at 10:46 Comment(0)
