What is the most common encoding of each language?

Asked 14/12, 2011 at 17:56 Answered 19/12, 2011 at 20:16

I am developing a plain-text reader application. Sometimes the app cannot determine a file's encoding automatically, so the user needs to select one from a list of encodings. If this list contains every supported encoding, it will be too long. I want to provide a simplified list that contains only the most common encodings for each language.

These are the mappings I already know:

Traditional Chinese: Big5
Simplified Chinese: GB18030
Japanese: Shift-JIS, EUC-JP
Russian: KOI8-R

If you know any other language's most common encoding, please tell me.

Widen answered 14/12, 2011 at 17:56 Comment(2)

Did you intentionally leave out the Unicode families? UTF-8, UTF-16 and UTF-32 are definitely used at least as much as the ones you named. – Inarch 14/12, 2011 at 20:19

@TomvanderWoerdt Yes, I need a list of regional encodings, excluding Unicode encodings. For example, GB18030 is a national standard of the PRC, so it is widely used in mainland China. – Widen 14/12, 2011 at 20:49

On the web, UTF-8 is by far the most common encoding for all languages.

That being said, here are the Windows XP locales grouped by default character encoding ("Language for non-Unicode programs"):

Big5: zh_HK, zh_MO, zh_TW
GBK (≈GB2312): zh_CN, zh_SG
Windows-31J (≈Shift_JIS): ja_JP
windows-874 (≈TIS-620, ISO-8859-11): th_TH
windows-949 (≈EUC-KR): ko_KR
windows-1250: bs_BA, cs_CZ, hr_BA, hr_HR, hu_HU, pl_PL, ro_RO, sk_SK, sl_SI, sq_AL, sr_BA, sr_SP
windows-1251: az_AZ, be_BY, bg_BG, kk_KZ, ky_KG, mk_MK, mn_MN, ru_RU, sr_BA, sr_SP, tt_RU, uk_UA, uz_UZ
windows-1252 (≈ISO-8859-1): af_ZA, arn_CL, ca_ES, cy_GB, da_DK, de_AT, de_CH, de_DE, de_LI, de_LU, en_AU, en_BZ, en_CA, en_CB, en_GB, en_IE, en_JM, en_NZ, en_PH, en_TT, en_US, en_ZA, en_ZW, es_AR, es_BO, es_CL, es_CO, es_CR, es_DO, es_EC, es_ES, es_GT, es_HN, es_MX, es_NI, es_PA, es_PE, es_PR, es_PY, es_SV, es_UY, es_VE, eu_ES, fi_FI, fil_PH, fo_FO, fr_BE, fr_CA, fr_CH, fr_FR, fr_LU, fr_MC, fy_NL, ga_IE, gl_ES, id_ID, is_IS, it_CH, it_IT, iu_CA, iv_IV, lb_LU, moh_CA, ms_BN, ms_MY, nb_NO, nl_BE, nl_NL, nn_NO, ns_ZA, pt_BR, pt_PT, qu_BO, qu_EC, qu_PE, rm_CH, se_FI, se_NO, se_SE, sv_FI, sv_SE, sw_KE, tn_ZA, xh_ZA, zu_ZA
windows-1253: el_GR
windows-1254 (≈ISO-8859-9): az_AZ, tr_TR, uz_UZ
windows-1255: he_IL
windows-1256: ar_AE, ar_BH, ar_DZ, ar_EG, ar_IQ, ar_JO, ar_KW, ar_LB, ar_LY, ar_MA, ar_OM, ar_QA, ar_SA, ar_SY, ar_TN, ar_YE, fa_IR, ps_AF, ur_PK
windows-1257: et_EE, lt_LT, lv_LV
windows-1258: vi_VN

and the most common encodings overall on the Web as of October 30th 2020:

UTF-8 95.7%
ISO-8859-1 1.8%
Windows-1251 1.0%
Windows-1252 0.4%
GB2312 0.3%
Shift JIS 0.2%
GBK 0.1%
EUC-KR 0.1%
ISO-8859-9 0.1%
Windows-1254 0.1%
EUC-JP 0.1%
Big5 0.1%
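
As a rough illustration of how the locale grouping earlier in this answer could drive a short list, here is a minimal Python sketch. The function name `shortlist_for_locale`, the abridged locale table, the Python codec aliases chosen for each Windows code page (for example `cp932` for Windows-31J), and the fallback to windows-1252 are all assumptions made for the example, not something the data above prescribes.

```python
import codecs

# Locale -> legacy default encodings, abridged from the locale grouping.
# Values are Python codec aliases for the Windows code pages (an assumption).
LEGACY_BY_LOCALE = {
    "zh_TW": ["big5"],
    "zh_CN": ["gbk"],
    "ja_JP": ["cp932"],
    "ko_KR": ["cp949"],
    "th_TH": ["cp874"],
    "pl_PL": ["cp1250"],
    "ru_RU": ["cp1251"],
    "en_US": ["cp1252"],
    "el_GR": ["cp1253"],
    "tr_TR": ["cp1254"],
    "he_IL": ["cp1255"],
    "ar_SA": ["cp1256"],
    "lt_LT": ["cp1257"],
    "vi_VN": ["cp1258"],
    # Locales listed under two code pages keep both entries.
    "sr_SP": ["cp1250", "cp1251"],
    "az_AZ": ["cp1254", "cp1251"],
}


def shortlist_for_locale(locale_name, fallback="cp1252"):
    """Return UTF-8 followed by the locale's legacy default encodings."""
    legacy = LEGACY_BY_LOCALE.get(locale_name, [fallback])
    usable = []
    for name in ["utf-8"] + legacy:
        try:
            codecs.lookup(name)  # skip codecs this Python build lacks
        except LookupError:
            continue
        usable.append(name)
    return usable
```

Putting UTF-8 first reflects the web statistics above; an application that only wants regional encodings could drop that entry.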

Samovar answered 16/12, 2011 at 1:39 Comment(0)

The HTML5 draft contains a table of default encodings for languages, reflecting what is regarded as common. However, note that it is supposed to be based on the user locale, i.e. the language of the browser or the operating system, not the language of the document—obviously because the latter is usually unknown, at least before you actually read the document, based on some assumption about the encoding.

I think you could in practice copy the list of encodings from a popular web browser. If it works well there, it probably works reasonably well in your application. Browsers do some clever things with the list and its order, but in practice, I think it would suffice to have a short list like utf-8, utf-16, windows-1252, and maybe a few others, followed by an option to get the full list. Note that although utf-16 is practically unused and useless for web pages, it is common in plain text files. It is important to name the encodings well, preferably with a common English (or other language) name together with the IANA “charset” name in parentheses, much like browsers do.
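
The naming advice could look roughly like the following Python sketch. The display names ("Unicode", "Western European"), the helper names `menu_label` and `build_menu`, and the "More encodings..." entry text are hypothetical choices for illustration only.

```python
# (display name, IANA charset name) pairs for the short list.
# The display names are illustrative, not taken from any browser.
COMMON = [
    ("Unicode", "UTF-8"),
    ("Unicode", "UTF-16"),
    ("Western European", "windows-1252"),
]


def menu_label(display_name, charset):
    """Format an entry as 'Readable name (charset)'."""
    return f"{display_name} ({charset})"


def build_menu(common=COMMON, more_label="More encodings..."):
    """Return the short list of labels followed by a 'full list' entry."""
    labels = [menu_label(name, charset) for name, charset in common]
    labels.append(more_label)
    return labels
```

Calling `build_menu()` yields the three labelled encodings and then the entry that would open the full list.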

User answered 19/12, 2011 at 20:16 Comment(0)

I would recommend a menu structure like the one used by browsers. For instance, in Firefox: View -> Character Encoding -> More Encodings -> East Asian -> Chinese/Japanese/Korean (it is easier if you just look). And in IE: View -> Encoding -> More.

It might seem too deep and clunky, but it is very familiar, and it does not drop useful encodings. (Why KOI8-R for Russian, for instance? And what happens if I use Windows-1251 and it is not in the list?)
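
A grouped menu of this kind can be modelled as a nested mapping. This is only a sketch in Python: the group labels, the selection of encodings (all taken from names mentioned on this page), and the helper `find_path` are assumptions, not a copy of any browser's actual menu.

```python
# Nested menu: group label -> list of charsets, or a sub-menu (dict).
ENCODING_MENU = {
    "Western": ["windows-1252", "ISO-8859-1"],
    "Cyrillic": ["windows-1251", "KOI8-R"],
    "East Asian": {
        "Chinese": ["Big5", "GB18030", "GBK"],
        "Japanese": ["Shift_JIS", "EUC-JP"],
        "Korean": ["EUC-KR"],
    },
}


def find_path(menu, charset, path=()):
    """Return the tuple of menu labels leading to charset, or None if absent."""
    for label, entry in menu.items():
        here = path + (label,)
        if isinstance(entry, dict):
            found = find_path(entry, charset, here)
            if found is not None:
                return found
        elif charset in entry:
            return here
    return None
```

Because each group holds a list, both KOI8-R and windows-1251 can sit under the same Cyrillic entry instead of one of them being dropped.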

Hierophant answered 15/12, 2011 at 12:3 Comment(0)
