Non-ascii characters in URL

Asked 21/6, 2013 at 19:24 Answered 30/3, 2023 at 3:4

url special-characters non-ascii-characters

I ran into a problem that I've never seen before: my client is adding files to a project we built, and some of the filenames contain special characters because some of the words are Spanish.

For example, a file I'm testing has an á in it. I am referencing that image in a CSS file as a background image, but it doesn't show up in Safari. It does show up in FF and Chrome.

As a test I pasted the link directly into the browser and got the same result: it works in FF and Chrome, but Safari throws an error. So I guess the language characters are what is breaking it?

Firefox converts the following URL, changing the á to a%CC%81, and loads the image.

http://www.themediacouncil.com/test/nonascii/LA-MAR_Cebiche-Clássico_foto-Henrique-Peron-470x120-1371827671.jpg

You can see it breaks above... but FF and Chrome convert that to: http://www.themediacouncil.com/test/nonascii/LA-MAR_Cebiche-Cla%CC%81ssico_foto-Henrique-Peron-470x120-1371827671.jpg

You can also see this in action here: http://jsfiddle.net/Md4gZ/2/

.testbox { width:340px; height:100px; background:url('http://www.themediacouncil.com/test/nonascii/LA-MAR_Cebiche-Clássico_foto-Henrique-Peron-470x120-1371827671.jpg') no-repeat top left; }

So what's the right way to handle this? I'm developing in PHP and WordPress. I'd rather not have to tell the client to go back and replace all the files that have special characters.

Any help is appreciated. Thanks!

Untwine answered 21/6, 2013 at 19:24 Comment(0)

I believe what is becoming the standard is to convert non-ascii characters to UTF-8 byte sequences, and include those sequences as %HH hex codes in the URL. The á character is U+00E1 (Unicode), which in UTF-8 makes the two bytes 0xC3 0xA1. Hence, Clássico would become Cl%C3%A1ssico.
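As a rough illustration of that convention, the sketch below uses `urllib.parse.quote` from Python's standard library. Python is an assumption made here only for brevity; it is not part of the original setup, where PHP's `rawurlencode` applied to a UTF-8 string plays a similar role. The precomposed á is written as an escape so that the code point is unambiguous.

```python
from urllib.parse import quote

# U+00E1 is the precomposed "á"; the escape makes the code point explicit.
name = "Cl\u00e1ssico"

# By default, quote() percent-encodes the UTF-8 bytes of each non-ASCII character.
encoded = quote(name)
print(encoded)  # Cl%C3%A1ssico
```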

The conversion you report from Firefox, Cla%CC%81ssico, did this slightly differently: it changed the á into a followed by U+0301, the COMBINING ACUTE ACCENT character. In UTF-8, U+0301 makes 0xCC 0x81.

Which representation you should choose, the precomposed “á” (U+00E1) or “a followed by combining accent” (U+0061 U+0301), depends on what the web server needs in order to match the right file. In your case, maybe the filename actually contains the combining-character accent, and that's why it worked (hard to tell).
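The two representations correspond to the Unicode normalization forms NFC (precomposed) and NFD (decomposed). A minimal sketch, assuming Python and its standard `unicodedata` module, shows that both forms look the same on screen but produce different percent-encoded URLs:

```python
import unicodedata
from urllib.parse import quote

precomposed = "Cl\u00e1ssico"                            # NFC: single code point U+00E1
decomposed = unicodedata.normalize("NFD", precomposed)  # "a" followed by U+0301

print(quote(precomposed))  # Cl%C3%A1ssico
print(quote(decomposed))   # Cla%CC%81ssico

# The two strings render identically but compare unequal...
print(precomposed == decomposed)  # False
# ...until both are normalized to the same form.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

If the form used in the filename on disk is unknown, normalizing both the stored name and the requested name to one form before comparing is one way to make them match.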

Another, older, way to handle non-ascii latin characters is to use an 8-bit latin charset representation (ISO-8859-1 or something similar, such as Windows-1252) and encode that as one byte. That would make Clássico into Cl%E1ssico. But since this only works for latin charsets, and is ambiguous for some of their characters, it is hopefully and probably disappearing.
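For comparison, here is a small sketch of the older single-byte convention, again assuming Python's `urllib.parse`. It also shows the ambiguity: a receiver that assumes UTF-8 cannot decode the Latin-1 form back to the original name.

```python
from urllib.parse import quote, unquote

name = "Cl\u00e1ssico"  # precomposed á, U+00E1

# Percent-encode the ISO-8859-1 byte (0xE1) instead of the UTF-8 bytes.
legacy = quote(name, encoding="latin-1")
print(legacy)  # Cl%E1ssico

# Decoded as UTF-8 (the default), the lone 0xE1 byte is invalid and gets replaced.
print(unquote(legacy) == name)                      # False
# Decoded with the matching charset, the name round-trips.
print(unquote(legacy, encoding="latin-1") == name)  # True
```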

Rosarosabel answered 17/1, 2014 at 12:33 Comment(2)

Do you suggest declaring it in the header so the browser converts the characters on its own, or using some kind of script? I have the same setup as the OP (WordPress) – Hockenberry 30/1, 2017 at 20:33

I can't say anything about a specific setup, but in general I suggest that URLs are encoded in the code where they appear (in HTML or whatever) using the same convention that the web server uses. If you have any influence over which convention the web server uses, I suggest UTF-8 byte sequences written as %HH. Declaring it in the header? I'm not sure that would have any effect, and it is probably browser dependent. – Rosarosabel 1/2, 2017 at 10:36
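One way to encode a URL at the point where it is written out can be sketched as follows. This is only an illustration in Python: the helper name `encode_url_path` and the example.com URL are hypothetical, and the sketch assumes the server expects UTF-8 byte sequences written as %HH.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def encode_url_path(url: str) -> str:
    """Percent-encode non-ASCII characters in the path of `url` as UTF-8."""
    parts = urlsplit(url)
    # "%" is kept safe so that already-encoded sequences are not encoded twice.
    path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

# Hypothetical URL containing a precomposed á (U+00E1).
raw = "http://www.example.com/test/nonascii/Cebiche-Cl\u00e1ssico.jpg"
print(encode_url_path(raw))
# http://www.example.com/test/nonascii/Cebiche-Cl%C3%A1ssico.jpg
```

Keeping "%" in the safe set makes the helper idempotent, at the cost of not encoding a literal percent sign that happens to appear in a filename.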

@njlarsson has already explained what to do well:

The conversion you report from Firefox, Cla%CC%81ssico, did this slightly differently: it changed the á into a followed by U+0301, the COMBINING ACUTE ACCENT character. In UTF-8, U+0301 makes 0xCC 0x81.

More generally I wanted to know why and how that's correct, so here's my thinking.

Why might one be motivated to do this?

Beyond the original motivation, that a Spanish-speaking user should not need to know anything about encoding or decoding (unless they are an engineer or developer tasked with fixing a broken implementation), another reason can be found in the Google JavaScript style guide, whose advice applies regardless of programming language:

Tip: Never make your code less readable simply out of fear that some programs might not handle non-ASCII characters properly. If that happens, those programs are broken and they must be fixed.

At a high level, using percent-sign (%) encoding in the URL is consistent with IETF RFC 1738, Section 2.2. Note that the RFC does not say which character encoding the %-encoded bytes represent; by convention the web uses UTF-8, as Firefox and Chrome's correct behaviour back in 2013 shows.

Where this breaks down is that in PHP (and so in WordPress), it's likely that the file name string is not encoded in UTF-8. A natural question is which encoding it is in.

Encoding, decoding and re-encoding

The string could be provided as encoded initially in UTF-8, decoded to some internal format, perhaps UCS-2LE (which can make some string operations faster, but break for others, like emoji 😉 as they're encoded outside the basic multilingual plane), and then re-encoded for printing as UTF-8.
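The remark about emoji can be checked with a short sketch. Python is assumed here, and since it has no UCS-2 codec, UTF-16-LE stands in to show that the character needs two 16-bit code units (a surrogate pair), which a fixed-width UCS-2 representation cannot express as a single unit.

```python
wink = "\U0001F609"  # 😉, U+1F609, outside the Basic Multilingual Plane

utf8 = wink.encode("utf-8")
utf16 = wink.encode("utf-16-le")

print(utf8.hex())   # f09f9889  (four UTF-8 bytes)
print(utf16.hex())  # 3dd809de  (surrogate pair D83D DE09, little-endian)

# Number of 16-bit code units needed for this one character.
print(len(utf16) // 2)  # 2
```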

Continuing in PHP, for example using mb_convert_encoding, which may require that the php-cli or server has php-mbstring installed:

php > $encoded = "http://www.themediacouncil.com/test/nonascii/LA-MAR_Cebiche-Cla%CC%81ssico_foto-Henrique-Peron-470x120-1371827671.jpg";
php > $decoded = mb_convert_encoding($encoded, "UTF-8", "UCS-2LE");
php > $reencoded = mb_convert_encoding($decoded, "UCS-2LE", "UTF-8");
php > echo $reencoded;
http://www.themediacouncil.com/test/nonascii/LA-MAR_Cebiche-Cla%CC%81ssico_foto-Henrique-Peron-470x120-1371827671.jpg

Or the string might not initially be encoded in UTF-8 at all; that will depend on things like where it came from, which aren't provided here.

Aside: the $decoded string is likely to be nonsense if naively printed, which looks a bit like the Python 2 "mojibake" problem:

php > echo $decoded;  # UCS-2LE printed naively likely shows nonsense
瑨灴⼺眯睷琮敨敭楤捡畯据汩挮浯琯獥⽴潮慮捳楩䰯ⵁ䅍归敃楢档ⵥ汃╡䍃㠥猱楳潣晟瑯ⵯ效牮煩敵倭牥湯㐭〷ㅸ〲ㄭ㜳㠱㜲㜶⸱灪?
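The first characters of that output can be reproduced with a short sketch, assuming Python. Since mb_convert_encoding takes the target encoding before the source encoding, the call that produced $decoded reads the UTF-8 (here plain ASCII) bytes as if they were UCS-2LE; pairing the bytes of "http" as little-endian 16-bit units gives the same two characters.

```python
url_start = "http"

# Encode as UTF-8 (plain ASCII bytes here), then misread byte pairs as UTF-16-LE.
garbled = url_start.encode("utf-8").decode("utf-16-le")
print(garbled)  # 瑨灴  ("ht" -> U+7468, "tp" -> U+7074)

# Reversing the misreading recovers the original text.
print(garbled.encode("utf-16-le").decode("utf-8"))  # http
```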

How to perform the UTF-8 conversion?

The precise low-level details and mathematics, for anyone curious about how the computer physically represents the data in binary or hexadecimal, can be found elsewhere on Stack Overflow.
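For a quick impression of what that conversion involves, here is a minimal hand-written sketch in Python. The helper names `utf8_bytes` and `percent_encode` are hypothetical, and the sketch does not reject surrogates or out-of-range values as a real encoder would.

```python
def utf8_bytes(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (no validation)."""
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

def percent_encode(data: bytes) -> str:
    """Write each byte as %HH with uppercase hex digits."""
    return "".join(f"%{b:02X}" for b in data)

print(percent_encode(utf8_bytes(0x00E1)))  # %C3%A1  (precomposed á)
print(percent_encode(utf8_bytes(0x0301)))  # %CC%81  (combining acute accent)
```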

Columbary answered 30/3, 2023 at 3:4 Comment(0)
