HTML5 History.pushState mangles URLs containing percent-encoded non-ASCII (Unicode) chars

In an open-source web app, we have JS code that performs an Ajax update (it uses jQuery, which is not relevant here). After the page update, it calls the HTML5 history interface via History.pushState, in the following code:

var updateHistory = function(url) {
    var context = { state:1, rand:Math.random() };
    /* -----> before the problem call <------- */
    History.pushState( context, "Questions", url );
    /* -----> after the problem call <------- */
    setTimeout(function (){
        /* HACK: For some weird reason, sometimes something overrides the above pushState, so we re-apply it.
                 This might be caused by some other JS plugin.
                 The delay of 10msec allows the other plugin to override the URL.
        */
        History.replaceState( context, "Questions", url );
    }, 10);
};

[Please note: the full code segment is provided for context, the HACK part is not the issue of this question]

The app is i18n'ed and uses URL-encoded Unicode segments in its URLs, so just before the marked problem call in the above code, the url argument contains (as inspected in Firebug):

"/%D8%A7%D9%84%D8%A3%D8%B3%D8%A6%D9%84%D8%A9/scope:all/sort:activity-desc/page:1/"

The encoded segment is utf-8 in percent encoding. The URL in the browser window is: (just for completeness, doesn't really matter)

http://<base-url>/%D8%A7%D9%84%D8%A3%D8%B3%D8%A6%D9%84%D8%A9/

Just after the call, the URL displayed in the browser window changes to:

http://<base-url>/%C3%98%C2%A7%C3%99%C2%84%C3%98%C2%A3%C3%98%C2%B3%C3%98%C2%A6%C3%99%C2%84%C3%98%C2%A9/scope:all/sort:activity-desc/page:1/

The URL-encoded segment is now mojibake, the result of using the wrong encoding at some level. The correct URL would have been:

http://<base-url>/%D8%A7%D9%84%D8%A3%D8%B3%D8%A6%D9%84%D8%A9/scope:all/sort:activity-desc/page:1/

This behavior has been tested on both FF and Chrome.

The history interface specs don't mention anything about encoded URLs, but I assume the default standard for URL formation (UTF-8, percent-encoding, etc.) applies to URLs passed to the interface's functions.

Any idea what's going on here?

Edit:

I wasn't paying attention to the uppercase H in History - this code is actually using the History.js wrapper for the history interface. I replaced with a direct call to history.pushState (notice the lowercase h) without going through the wrapper, and the code is working as expected as far as I can tell. The issue with the original code still stands - so an issue with the History.js library it seems.
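
A minimal sketch of the direct call described in the edit above, assuming a browser that implements the native History API. The helper name pushNative and its historyObj parameter are hypothetical (they are not part of the app or of History.js); the history object is passed in so the snippet can also be exercised outside a browser.

```javascript
// Hypothetical helper, not from the original app or from History.js.
// historyObj is normally window.history.
function pushNative(historyObj, state, title, url) {
    // Feature-detect the native API; older browsers lack pushState.
    if (historyObj && typeof historyObj.pushState === "function") {
        historyObj.pushState(state, title, url); // the helper does not alter url
        return true;
    }
    return false; // the caller has to fall back (full page load, polyfill, ...)
}

// In a browser:
// pushNative(window.history, { state: 1 }, "Questions",
//     "/%D8%A7%D9%84%D8%A3%D8%B3%D8%A6%D9%84%D8%A9/scope:all/sort:activity-desc/page:1/");
```

Returning false rather than throwing leaves the fallback decision to the caller, which is where the trade-off against a polyfill shows up.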

Luhey answered 17/6, 2012 at 4:24 Comment(10)
URLs don't accept ':'; those need to be URL-encoded to keep the browser happy.Niemeyer
I just tried with encoded colons, but didn't help. The issue is in the Unicode segment - I think.Luhey
I'd either look at the content encoding in the header of the page/JS or, if you can't change that, look to urlencode before pushing to history.Spiritualist
Thanks! Turns out we were doing the exact same thing. We were calling "H"istory instead of "h"istory. Thanks!!!Reeve
. . History.js expects unencoded URLs, but even considering that, it encodes your string in a really weird way...Ichneumon
@DiegoNunes It does look weird at first glance, but there's a simple enough explanation. :) Basically, it's using an old unescape function that takes multibyte UTF-8 characters and interprets them as several single-byte characters. The weird URL after the call is what you get when that string is re-encoded.Placencia
Jordan, I tried to establish a pattern but really couldn't. Copy these lines and paste in a text editor one above the other: %D____8 %A____7 %D____9 %8____4 %D____8 %A____3 %D____8 %B____3 %D____8 %A____6 %D____9 %8____4 %D____8 %A____9 and %C3%9 8 %C2%A 7 %C3%9 9 %C2%8 4 %C3%9 8 %C2%A 3 %C3%9 8 %C2%B 3 %C3%9 8 %C2%A 6 %C3%9 9 %C2%8 4 %C3%9 8 %C2%A 9.Ichneumon
The spaces help and you can clearly see that there is a repeating pattern (always a %C3%9 for a %D and %C2%B for a %B, for example). But even then I couldn't find the exact algorithm or logic that the encode function is using, nor could I repeat this buggy behavior in a controlled environment trying to call "encodeURI" multiple times and even looking through History.js source code. If you have any tips, I would love to hear them, just for the sake of learning.Ichneumon
@DiegoNunes Try running window.encodeURI(window.unescape('%D8%A7%D9%84%D8%A3%D8%B3%D8%A6%D9%84%D8%A9')) in your JS console; the output should be the same as History.js' output. My answer below goes into more detail, if that helps!Placencia
@DiegoNunes More detail: Take the first octet, %D8. This is bigger than 7F, so it gets split into two bytes when encoded in UTF-8, formatted as 110xxxxx 10xxxxxx. (See the Wikipedia page on UTF-8 for why.) So first we convert it to binary and get 11011000. Now, substitute those bits for the xs in the format pattern, using leading zeroes to pad it out: 11000011 10011000. Convert to hex, and you'll get C3 98. Voilà!Placencia

Update

As Doug S explains in the comments below, the latest version of History.js includes a fix for this behaviour. He also found that my solution caused double-encoding when used in browsers (such as IE 9 and below) which require the hash fallback, so I recommend that instead of using the fix detailed below, just download the latest version.

I've kept my original answer below, since it does explain what's going on in much more detail.


Basel found a resolution of sorts, but there's still some confusion about what's happening under the hood. This answer goes into detail about the problem and suggests a better fix. (You can skip straight to the fix if you want.)

The problem

First, open your browser's JS console and run this:

window.encodeURI(window.unescape('%D8%A7%D9%84%D8%A3%D8%B3%D8%A6%D9%84%D8%A9'))

Does that look familiar? It should—that's what your URL is being mangled to. The problem lies in the implementation of History.unescapeString, specifically this line:

tmp = window.unescape(result);

window.unescape is a DOM Level 0 function—which is to say, an unstandardised relic from the hoary days of Netscape 2. It uses the escaping rules defined in RFC 2396, according to which characters outside of the unreserved range (alphanumerics and a small set of punctuation symbols) are encoded as octets.

This works fine for the US-ASCII range, but most characters take more than one byte in UTF-8. Since URIs have no built-in way of declaring the character set in use, window.unescape simply assumes that each %XX octet is one character and blithely mangles any multibyte sequence.

In this example, the first letter in your URL is the Arabic letter alef (ا), represented by two bytes: 0xD8 0xA7. window.unescape interprets these as two separate characters: 0x00 0xD8 (Ø—capital O with stroke) and 0x00 0xA7 (§—section sign).
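
The misreading can be reproduced step by step in a JS console. This is only a sketch of the built-in calls in isolation, not History.js's actual code path; utf8PairFor is a hypothetical helper added here to show the bit arithmetic, and it only handles values from 0x80 to 0xFF.

```javascript
var encoded = "%D8%A7%D9%84%D8%A3%D8%B3%D8%A6%D9%84%D8%A9";

// unescape turns every %XX into ONE character with code 0xXX,
// so the 14 UTF-8 bytes become 14 separate characters.
var misread = unescape(encoded);

// Re-encoding writes each of those characters as two UTF-8 bytes.
var mangled = encodeURI(misread);

// decodeURIComponent reads the %XX sequence as UTF-8 instead,
// giving the 7 Arabic letters that were intended.
var intended = decodeURIComponent(encoded);

// Hypothetical helper: the two UTF-8 bytes (110xxxxx 10xxxxxx)
// for a character code in the range 0x80..0xFF.
function utf8PairFor(code) {
    return [0xC0 | (code >> 6), 0x80 | (code & 0x3F)];
}

console.log(misread.length);        // 14
console.log(intended.length);       // 7
console.log(mangled.slice(0, 12));  // %C3%98%C2%A7
console.log(utf8PairFor(0xD8).map(function (b) {
    return b.toString(16);
}).join(" "));                      // c3 98
```

The value of mangled is the same string that shows up in the address bar after the History.pushState call in the question.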

This is a known issue with History.js.

The fix

As noted above by the asker, the issue can be sidestepped by using the native implementation of the History API instead of the History.js wrapper, i.e. history.pushState instead of History.pushState.

This works for browsers that support the History API, but loses the benefit of having a polyfill for those that don't. Fortunately, there's a better fix. Open up the History.js source you're referencing and find this line (~1059 in my copy):

tmp = window.unescape(result);

Replace it with:

tmp = window.unescape(encodeURIComponent(result));

Or, if you're using the compressed source, replace a.unescape(c) with a.unescape(encodeURIComponent(c)).
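
To see what the patched expression does, the two variants can be compared on a percent-encoded segment. This is a sketch of the expressions in isolation, under the assumption that the input is already percent-encoded; it does not reproduce the surrounding History.js code, and the variable names are illustrative.

```javascript
var segment = "%D8%A7%D9%84%D8%A3%D8%B3%D8%A6%D9%84%D8%A9";

// Original expression: every %XX is decoded to a single character.
var before = unescape(segment);

// Patched expression: encodeURIComponent first escapes each "%" as "%25",
// and unescape then turns "%25" back into "%", so the segment survives.
var after = unescape(encodeURIComponent(segment));

console.log(before === segment); // false
console.log(after === segment);  // true
```

Note that on a string that already contains raw non-ASCII characters, the patched expression yields their UTF-8 bytes as individual characters, so it is not a general-purpose decoder.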

To test this change, I ran the History.js HTML5 jQuery test suite on a local web server inside an Arabic-named directory. Before making the change, test 14 failed; after the change, all tests passed.

Credit

Though I found the problem and solution independently, Damien Antipa deserves credit for finding it first and making a pull request with the fix.

Placencia answered 16/5, 2013 at 11:57 Comment(4)
Thanks for the very nice explanation, Jordan. I really didn't know that quirk with unescape.Ichneumon
This fix actually caused a bug for me. (The fix to modify the source to "a.unescape(encodeURIComponent(c))".) With IE 9 and older, which use the hash fallback for URLs, it caused encoding/escaping to occur multiple times. However, I was able to fix it all the right way by downloading the latest version of History.js, which has fixed this original issue.Trencherman
@DougS Thanks for that, Doug; I'll update my answer to mention both that the fix may break things and that the latest version of History.js now has the fix. (I'm really surprised at myself for not testing in older IE; I almost always would, so not sure how I slipped up this time!)Placencia
Yeah a.unescape(encodeURIComponent(c)) it works fine. Saved a day.Idolatrous

I'm still able to reproduce this in the following case:

History.pushState(null, null, "?" + some_Unicode_String_Or_A_String_With_Whitespace);
document.location.hash += "&someStuff";

In this case both the _suid parameter and &someStuff get removed. If the string contains no Unicode characters or whitespace (so no % characters), this does not happen.

This workaround worked for me:

History.pushState(null, null, "?" + some_Unicode_String_Or_A_String_With_Whitespace + "&someStuff");
Zebrawood answered 9/1, 2014 at 9:5 Comment(2)
That's interesting. Are you having this problem with the latest version of History.js? This was supposedly fixed in the library—if it's not, one of us should probably open a new issue to let them know! :)Placencia
PS: could you throw together a jsFiddle or test page demonstrating the issue?Placencia
