What Character Encoding is best for multinational companies

If you had a website that was to be translated into every language in the world, and therefore had a database with all these translations, what character encoding would be best? UTF-128?

If so, do all browsers understand the chosen encoding? Is character encoding straightforward to implement, or are there hidden factors?

Thanks in advance.

Lavish asked 20/4, 2011 at 15:43 Comment(7)
The ease of implementation depends on programming language/platform. Sadly there are still some widely used programming languages without native support for Unicode.Emelda
If you want all (all) browsers to understand and represent your encoding well, it is safest to stick to ASCII / rendered images.Grussing
@kotlinski: what about my IBM mainframe text browser that only supports EBCDIC? On a serious note: if you reduce the set of browsers to "all sane browsers produced in the last few years" (which might even include IE 5.5 in this specific case), then UTF-8 and UTF-16 are equally valid.Ginsberg
@Joachim: gah! Kill it with fire!Papism
Joachim: The browser will understand it, but one problem is that many operating systems out there will not have representations for all characters.Grussing
@kotlinski: that's true, but you'll find that this is much less of a problem if the language you're using is the native language of your users: Users in countries that need special fonts usually do have those fonts (and the OS to support them).Ginsberg
@Joachim: It's a relatively small problem, but fortunately these days, a bigger problem than deciding which encoding to use :)Grussing

If you want to support a variety of languages for web content, you should use an encoding that covers the entire Unicode range. The best choice for this purpose is UTF-8. UTF-8 is the preferred encoding for the web; from the HTML5 draft standard:

Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]

Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]

UTF-8 and Windows-1252 are the only encodings required to be supported by browsers, and UTF-8 and UTF-16 are the only encodings required to be supported by XML parsers. UTF-8 is thus the only common encoding that everything is required to support.
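
As a practical aside (not something the spec text above says), you can check which encoding a server declares by looking at the Content-Type response header. A quick hedged sketch, using the same nhk.or.jp page as the transcoding example further down; the exact header text depends on the server, and some pages declare the charset only in a meta tag instead:

$ # Print only the Content-Type header and look for a charset parameter,
$ # e.g. "Content-Type: text/html; charset=UTF-8" (exact value depends on the server).
$ curl -sI 'http://www.nhk.or.jp/' | grep -i '^content-type'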


The following is more of an expanded response to the other answer below (which suggests UTF-16 for CJK-heavy content) than an answer on its own; it's a description of why UTF-8 is preferable to UTF-16 even for CJK content.

For characters in the ASCII range, UTF-8 is more compact (1 byte vs. 2) than UTF-16. For characters from U+0080 through U+07FF (which includes Latin Extended, Cyrillic, Greek, Arabic, and Hebrew), UTF-8 also uses two bytes per character, so it's a wash. For characters outside the Basic Multilingual Plane, both UTF-8 and UTF-16 use 4 bytes per character, so it's a wash there too.
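
To make those widths concrete, here is a quick check you can run in a shell (a sketch assuming bash, iconv, and a UTF-8 locale; UTF-16BE is used so that iconv doesn't prepend a byte order mark):

$ printf 'hello' | wc -c                                # ASCII: 5 bytes in UTF-8
5
$ printf 'hello' | iconv -f UTF-8 -t UTF-16BE | wc -c   # 10 bytes in UTF-16
10
$ printf 'Ελλάδα' | wc -c                               # Greek: 2 bytes per character in UTF-8...
12
$ printf 'Ελλάδα' | iconv -f UTF-8 -t UTF-16BE | wc -c  # ...and 2 bytes per character in UTF-16
12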

The only range in which UTF-16 is more efficient than UTF-8 is for characters from U+0800 to U+FFFF, which includes the Indic alphabets and CJK. Even for a lot of text in that range, UTF-8 winds up being comparable, because the markup of that text (HTML, XML, RTF, or what have you) is all in the ASCII range, for which UTF-8 is half the size of UTF-16.

For example, if I pick a random web page in Japanese, the home page of nhk.or.jp, it is encoded in UTF-8. If I transcode it to UTF-16, it grows to almost twice its original size:

$ curl -o nhk.html 'http://www.nhk.or.jp/'
$ iconv -f UTF-8 -t UTF-16 nhk.html > nhk.16.html
$ ls -al nhk*
-rw-r--r--  1 lambda  lambda  32416 Mar 13 13:06 nhk.16.html
-rw-r--r--  1 lambda  lambda  18337 Mar 13 13:04 nhk.html

UTF-8 is better in almost every way than UTF-16. Both of them are variable width encodings, and so have the complexity that entails. In UTF-16, however, 4 byte characters are fairly uncommon, so it's a lot easier to make fixed width assumptions and have everything work until you run into a corner case that you didn't catch. An example of this confusion can be seen in the encoding CESU-8, which is what you get if you convert UTF-16 text into UTF-8 by just encoding each half of a surrogate pair as a separate character (using 6 bytes per character; three bytes to encode each half of the surrogate pair in UTF-8), instead of decoding the pair to its codepoint and encoding that into UTF-8. This confusion is common enough that the mistaken encoding has actually been standardized so that at least broken programs can be made to interoperate.
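
To see a surrogate pair in the raw bytes, here is a small sketch (assuming bash 4.2+ for the \U escape, a UTF-8 locale, and iconv and xxd being available); the CESU-8 bytes in the last comment are worked out by hand, since iconv generally has no CESU-8 converter:

$ printf '\U0001F600' | xxd -p                              # U+1F600 in UTF-8: 4 bytes
f09f9880
$ printf '\U0001F600' | iconv -f UTF-8 -t UTF-16BE | xxd -p # surrogate pair D83D DE00: 4 bytes
d83dde00
$ # CESU-8 would instead encode each surrogate half as 3 bytes of UTF-8,
$ # giving ed a0 bd ed b8 80 -- 6 bytes, and not valid UTF-8.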

UTF-8 is much smaller than UTF-16 for the vast majority of content, and if you're concerned about size, compressing your text will always do better than just picking a different encoding. UTF-8 is compatible with APIs and data structures that use a null-terminated sequence of bytes to represent strings, so as long as your APIs and data structures either don't care about encoding or can already handle different encodings in their strings (such as most C and POSIX string handling APIs), UTF-8 works fine without needing a whole new set of APIs and data structures for wide characters. UTF-16, by contrast, forces you to deal with endianness: there are actually three related encodings, UTF-16, UTF-16BE, and UTF-16LE. Plain UTF-16 can be either big endian or little endian, and so requires a BOM to say which. UTF-16BE and UTF-16LE are the big- and little-endian versions with no BOM, so you need an out-of-band mechanism (such as a Content-Type HTTP header) to signal which one you're using, and out-of-band headers are notorious for being wrong or missing.
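
A byte-level illustration of the endianness point (again just a sketch with iconv and xxd; whether plain UTF-16 output carries a BOM, and in which byte order, depends on the iconv implementation):

$ printf 'A' | iconv -f UTF-8 -t UTF-16BE | xxd -p   # big endian: 00 41
0041
$ printf 'A' | iconv -f UTF-8 -t UTF-16LE | xxd -p   # little endian: 41 00
4100
$ # With plain "-t UTF-16" most iconv implementations prepend a BOM (feff or fffe)
$ # so the reader can tell which byte order follows.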

UTF-16 is basically an accident: it happened because people at first thought 16 bits would be enough to encode all of Unicode, and so started changing their representations and APIs to use wide (16-bit) characters. When they realized they would need more characters, they came up with a scheme for encoding the code points beyond U+FFFF as pairs of reserved 16-bit code units (surrogate pairs), so they could keep using the same data structures. This brought all of the disadvantages of a variable-width encoding like UTF-8, without most of the advantages.

Alphonsoalphonsus answered 21/4, 2011 at 16:9 Comment(2)
+100: VERY WELL SAID! I despise UTF‑16, although UCS‑2 makes me even madder. Dan Kogai says in the manpage for his Encode::Unicode Perl module: “To say the least, surrogate pairs were the biggest mistake of the Unicode Consortium. But according to the late Douglas Adams in The Hitchhiker’s Guide to the Galaxy Trilogy, In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move. Their mistake was not of this magnitude so let’s forgive them.”Wheatley
Very useful, UTF-8 all the way. ThanksLavish

UTF-8 is the de facto standard character encoding for Unicode.

Like UTF-16 and UTF-32, UTF-8 can represent every character in the Unicode character set. Unlike them, it has the advantage of being backward compatible with ASCII, and it avoids the complications of endianness and the resulting need for byte order marks (BOM). For these and other reasons, UTF-8 has become the dominant character encoding for the World Wide Web, accounting for more than half of all web pages.
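
The ASCII compatibility is easy to verify yourself: a plain ASCII file is already valid UTF-8, byte for byte. A quick sketch using iconv and cmp (the file name ascii.txt is just for the example):

$ printf 'plain ASCII text' > ascii.txt
$ iconv -f ASCII -t UTF-8 ascii.txt | cmp - ascii.txt && echo identical
identical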

There is no such thing as UTF-128.

Papism answered 20/4, 2011 at 15:44 Comment(2)
UTF-128 would be awesome. A 64-character string would weigh 1 KB!Emelda
I went for an interview recently and was asked about this encoding and rolled my eyes!Lavish

You need to take more into consideration when dealing with this. For instance, you can represent Chinese, Japanese and pretty much everything in UTF-8 -- but it will use a set of escape characters for each such "foreign" character -- and as such your data representation might take a lot of storage due to these extra markers. You could look at UTF-16 as well, which doesn't need escapes/markers for the likes of Chinese, Japanese and so on -- however, each character now takes 2 bytes to represent; so if you're dealing mainly with Latin charsets you've just doubled the size of your data storage for no benefit. There's also Shift-JIS, dedicated to Japanese, which represents that charset better than UTF-8 or UTF-16, but then you don't have support for Latin chars. I would say: if you know upfront you will have a lot of foreign characters, consider UTF-16; if you're mainly dealing with accents and Latin chars, use UTF-8; if you won't be using any Latin characters then consider Shift-JIS and the like.

Howling answered 20/4, 2011 at 15:49 Comment(10)
How much does storage matter these days? If your text content grows by 50% what would be the greatest effect? Chances are the Chinese and Japanese text will still be on par with the English in size.Gowk
@Mark: storage may not matter, but bandwidth does matter.Emelda
@Martinho Fernandes, when bandwidth matters use compression. I've never tested it, but I'm guessing UTF-8 and UTF-16 compress to nearly the same size.Gowk
As I said, it depends -- but you would be surprised how much doubling your storage can matter. I worked for a company in the past that was analysing web pages semantically, and as such we were storing basically the contents of tons of pages on the net. With storage in the terabytes, doubling that begins to matter!Howling
Martinho also has it right -- it's not just storage but the fact that you have to pump all these bytes back to the browser -- and doubling the bandwidth is expensive.Howling
@Mark: you're probably right. UTF-8 and UTF-16 have a 1-to-1 mapping, so I don't think it is unreasonable to expect them to compress to approximately the same size.Emelda
@Haraldo: This answer is incorrect. It is not true that UTF-16 “does not need escaping for the likes of chinese, japanese, and such.” UTF-16 is a variable-width encoding just like UTF-8. Good luck with U+2F967 CJK COMPATIBILITY IDEOGRAPH-2F967, eh! Plus the language about “escaping” is pretty lame. Matt Ball’s answer is the right one; this one is lame. You should switch the acceptance checkbox.Wheatley
This answer is incorrect and misleading. UTF-8 takes 1 byte for characters in the ASCII range, 2 bytes for characters from U+0080 through U+07FF, 3 bytes for characters from U+0800 through U+FFFF, and 4 bytes for characters from U+010000 through U+1FFFFF. UTF-16 takes 2 bytes for characters from U+0000 through U+FFFF and 4 bytes for characters U+010000 through U+1FFFFF. So, for the ASCII range UTF-8 is smaller, and for anything in the 2 byte range of Unicode (like Hebrew, Arabic, Greek, Russian), and for everything outside the BMP, UTF-8 takes the same amount of storage as UTF-16.Alphonsoalphonsus
Also, Shift JIS does support Latin characters, but Shift JIS is a more complicated standard, it doesn't cover all of Unicode, and there are multiple incompatible encodings that are frequently referred to as Shift JIS. It should only be used for legacy compatibility.Alphonsoalphonsus
Oh, and finally, for web content, being smaller for the ASCII range is generally more important than being smaller on CJK characters, even for CJK content. Most web content is stored with HTML or XML markup, all of which is in the ASCII range. Since UTF-16 is double the size of UTF-8 for characters in the ASCII range, while UTF-8 is only half again the size of UTF-16 for CJK characters, and since so much of the content actually is markup, transcoding web pages from UTF-8 to UTF-16 will almost always increase their size, even if they are all written in Japanese or Chinese.Alphonsoalphonsus
