Language codes for simplified Chinese and traditional Chinese?

We are creating multi-language subsites on our website.

I would like to use the 2-letter language codes. Spanish and French are easy. They will get URLs like:

mydomain.com/es
mydomain.com/fr

But I run into a problem with Traditional and Simplified Chinese. Are there standards for which two-letter codes to use for these languages?

mydomain.com/zh
mydomain.com/?
Intermit answered 3/2, 2011 at 22:16 Comment(2)
You say Spanish and French are easy, but the CLDR database lists 26 and 47 country-specific variants for them, respectively! It just depends on how much the resources you are providing depend on those differences. - Doggo
loc.gov/standards/iso639-2/faq.html#23 - Homemade

@dkarp gives an excellent general answer. I will add some additional specifics regarding Chinese:

There are several countries where Chinese is the main written language. The major difference between them is whether they use simplified or traditional characters, but there are also minor regional differences (in vocabulary, etc.). The standard way to distinguish these would be with a country code, e.g. zh_CN for mainland China, zh_SG for Singapore, zh_TW for Taiwan, or zh_HK for Hong Kong.

Mainland China and Singapore both use simplified characters, and the others use traditional characters. Since China and Taiwan are the two with the biggest populations, just zh_CN and zh_TW are often used to distinguish the simplified and traditional character versions of a website.

A more technically correct approach, though less common in practice, is to use zh_Hans for (generic) simplified Chinese characters and zh_Hant for traditional Chinese characters, adding a country code only in the rare cases where it is meaningful to distinguish different countries.
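
As a rough illustration of the convention described above, the script can be inferred from the region when a tag does not state it explicitly. This is only a sketch: the name `chinese_script_for` and the lookup table are made up for this example, and the table covers just the four regions mentioned in this answer.

```python
from typing import Optional

# Assumption for this sketch: only the four regions named in the answer.
# CN and SG use simplified characters; TW and HK use traditional ones.
REGION_TO_SCRIPT = {
    "CN": "Hans",
    "SG": "Hans",
    "TW": "Hant",
    "HK": "Hant",
}


def chinese_script_for(tag: str) -> Optional[str]:
    """Return 'Hans' or 'Hant' for a Chinese tag, or None if unknown."""
    # Accept both zh_CN and zh-CN separators.
    parts = tag.replace("_", "-").split("-")
    if parts[0].lower() != "zh":
        return None
    for part in parts[1:]:
        # An explicit script subtag (4 letters) wins over the region.
        if len(part) == 4 and part.title() in ("Hans", "Hant"):
            return part.title()
        # Otherwise fall back to what the region (2 letters) implies.
        if len(part) == 2 and part.isalpha():
            return REGION_TO_SCRIPT.get(part.upper())
    return None
```

With this, both `zh_CN` and `zh-Hans` resolve to the simplified resource set, while a bare `zh` is left undecided so the site can pick its own default.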

Griffis answered 4/2, 2011 at 4:40 Comment(6)
This is a great answer -- well-written and probably not something most people know. And it draws a nice line between what's more technically correct (zh_HANS) and what's actually out there in general use (zh_CN). You can do a Google search for the two terms -- it's about a 7-to-1 difference in favor of zh_CN, which is honestly less than I expected. - Dardan
Actually, the difference in URLs is as large as I expected. inurl:zh_CN gives 4.3M hits; inurl:zh_HANS gives 20K. Still, a really informative answer. - Dardan
The difference between HANS and HANT is much less useful than that between CN and TW, because the variants differ not only in characters but also in region-specific usage. E.g. subroutine is translated as 子程序 in mainland China, but as 子程式 in Taiwan. In this example, the characters are the same in Simplified and Traditional Chinese, but the translation should still be different. - Vaunting
I am trying to understand why it starts with zh, not ch. Hopefully not because of the Zhao family. - Actinozoan
@AlexBinZhao The language code "zh" comes from the Chinese name for Chinese, "Zhōngwén" (中文). You can find a list of all ISO 639-1 language codes here: en.wikipedia.org/wiki/List_of_ISO_639-1_codes - Griffis
@AlexBinZhao Todd Owen is right that zh is the code in an ISO standard, and that it comes from the Chinese word for the Chinese language. However, that ignores the fact that Korean and Japanese have words for their own languages that would be romanized differently from ko and ja, yet the ISO standardized on those. Further, the ISO is based in Switzerland, which has the country code CH from the Latin Confoederatio Helvetica, carefully using a dead language to show no preference for any of its four official languages. I think that back then ch for Chinese would have been too easily confused with CH for Switzerland. - Greasy

There is indeed a standard representation for this. As people have run into the exact same problem you are seeing -- same language, but different dialects or characters -- they've extended the two-letter language code with a two-letter region code. So you might have a universal French page at mydomain.com/fr, but internationalizing for French Canadian readers might leave you with a mydomain.com/fr_CA (Canada) and mydomain.com/fr_FR (France). Some platforms use a dash instead of an underscore to separate the language and region codes (hence fr-CA and fr-FR).

The standard locale for simplified Chinese is zh_CN. The standard locale for traditional Chinese is zh_TW.

I hesitate to point you towards the actual BCP 47 standards documents, as they're, uh, a little heavy on the detail and a little light on the readability. Just go with standard locale identifiers, like the ones used by Java, and you'll be fine.
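
If you need to accept both separator styles in your URLs, a small normalizer may be enough. The sketch below assumes a tag contains only language, script, and region subtags; `to_dash_style` is a hypothetical name, and the casing it applies (lowercase language, title-case script, uppercase region) is the conventional presentation for BCP 47 tags, which are themselves case-insensitive.

```python
def to_dash_style(tag: str) -> str:
    """Normalize fr_ca / FR-CA style tags to the dashed form fr-CA."""
    parts = tag.replace("_", "-").split("-")
    # The first subtag is the language code, conventionally lowercase.
    normalized = [parts[0].lower()]
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            # Four letters: a script such as Hans or Hant.
            normalized.append(part.title())
        elif len(part) == 2 and part.isalpha():
            # Two letters: a region such as CA or TW.
            normalized.append(part.upper())
        else:
            # Anything else is passed through untouched in this sketch.
            normalized.append(part)
    return "-".join(normalized)
```

A request for mydomain.com/zh_tw and one for mydomain.com/ZH-TW would then both map to the same `zh-TW` key.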

Dardan answered 4/2, 2011 at 4:1 Comment(1)
I feel like using an underscore is non-standard, while using a dash is standard: en.wikipedia.org/wiki/IETF_language_tag. The ISO norms define the valid codes (fr, zh, Hans, SG, CA); the IETF standard defines how to combine them (fr-CA, zh-Hans, zh-SG), using dashes. - Kristinkristina

I'm just going to leave this here.

CODE        LANG     FORM             REGION
zh          Chinese  -                -
zh_Hans     Chinese  Han Simplified   -
zh_Hans_CN  Chinese  Han Simplified   China
zh_Hans_HK  Chinese  Han Simplified   Hong Kong SAR China
zh_Hans_MO  Chinese  Han Simplified   Macau SAR China
zh_Hans_SG  Chinese  Han Simplified   Singapore
zh_Hant     Chinese  Han Traditional  -
zh_Hant_HK  Chinese  Han Traditional  Hong Kong SAR China
zh_Hant_MO  Chinese  Han Traditional  Macau SAR China
zh_Hant_TW  Chinese  Han Traditional  Taiwan
Inspiration answered 20/2, 2022 at 20:41 Comment(0)

Language is dependent upon where it is spoken (doh!), so language and locale codes reflect that reality. zh is the basic language code, and because Chinese has two major written forms, there are also zh_Hans and zh_Hant. These are still only language codes, not locales.

Location-specific

To fully specify which language is used in a particular location, the country code still has to be suffixed, giving zh_Hans_HK and zh_Hant_HK for simplified and traditional Chinese, respectively, as used in Hong Kong.

In reality, something more specific than a country code is often needed in many countries, but that would greatly increase the complexity and maintenance burden of databases like CLDR. In addition, the supporting infrastructure needed to feed into it, such as IP-to-location lookup, is not generally available or accurate enough.

Fixed text

Now, if the code just specifies which set of fixed strings to use in the user interface, or even which whole page sets on a site, a country suffix is not really necessary, unless there are more than a few places where the language varies significantly enough (location-based info) to bother creating a whole separate resource set.

The larger the resource set, the more likely that a language code based upon locale [in this context, just a language attribute, rather than a true locale, so you can call it what you like!] will be required, but at least you only have to do that when necessary.

On-the-fly values

However, if you want to format variable values such as dates, times, currencies and numbers on-the-fly, locales become important, because all the tools that support such functionality (like those based upon Unicode CLDR data) expect them. The locale for these needs to be a separate setting from the code that selects the in-house-generated UI language, unless you want to create a resource set for every known locale and maintain them ad nauseam!
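
To make the separation concrete, here is a minimal sketch that keeps the UI language and the formatting locale as two independent settings. Everything in it is hypothetical: `UserPreferences`, `render_today`, and both tables are invented for the example, and the date patterns are illustrative placeholders standing in for real locale data such as CLDR, not authoritative formats.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical UI resource sets, keyed by language code only (no region).
UI_STRINGS = {
    "zh-Hans": {"today": "今天"},
    "zh-Hant": {"today": "今天"},
    "en": {"today": "Today"},
}

# Illustrative stand-in for real locale data; these patterns are
# assumptions for the sketch, keyed by full locale (language + region).
DATE_PATTERNS = {
    "zh-CN": "{y}年{m}月{d}日",
    "zh-TW": "{y}年{m}月{d}日",
    "en-US": "{m}/{d}/{y}",
    "en-GB": "{d}/{m}/{y}",
}


@dataclass(frozen=True)
class UserPreferences:
    ui_language: str    # selects the set of fixed strings
    format_locale: str  # selects how variable values are formatted


def render_today(prefs: UserPreferences, today: date) -> str:
    """Combine a fixed UI string with a locale-formatted date."""
    label = UI_STRINGS[prefs.ui_language]["today"]
    pattern = DATE_PATTERNS[prefs.format_locale]
    formatted = pattern.format(y=today.year, m=today.month, d=today.day)
    return f"{label}: {formatted}"
```

Because the two settings are independent, one `en` resource set can serve both `en-US` and `en-GB` readers while their dates are still formatted differently.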

Browser language tools

Note that when you specify a locale for an editable part of a web page, such as an input box, and spellcheck has been enabled for that field, the browser's language tools will spellcheck the field according to that locale.

Criteria

You have to be clear about what the resource set is providing, so consider:

  • Fixed strings? Language only.
  • Formatting on-the-fly? Locale.
  • Spellchecking in the viewing environment? Locale.
  • Whole pages/subsite? Language only, else locale (as a language variant) if significantly different content is required.

Spreadsheet to minimise maintenance overhead

I use a spreadsheet to hold UI strings where each language code has a parent code, so that the cell for its version of a string has a formula that gets its string from the parent. To create a custom string for that language and string, I just overwrite the cell formula with the exact text. That minimises the amount of resource maintenance. I run a macro at the end that generates a complete resource file for each language.
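
The same parent-fallback idea can be sketched in code rather than spreadsheet formulas. This is not the actual spreadsheet or macro, just an assumed equivalent: the `PARENTS` and `OVERRIDES` tables, the parent chain itself, and the function names are all hypothetical, and the two sample translations are the ones quoted in a comment on another answer.

```python
# Hypothetical parent chain: each code inherits from its parent,
# and None marks the root of the chain.
PARENTS = {
    "en": None,
    "zh": "en",
    "zh_CN": "zh",
    "zh_TW": "zh",
    "zh_HK": "zh_TW",
}

# Only custom strings are stored per code, like the overwritten cells;
# everything else is inherited from the parent.
OVERRIDES = {
    "en": {"subroutine": "subroutine", "ok": "OK"},
    "zh_CN": {"subroutine": "子程序"},
    "zh_TW": {"subroutine": "子程式"},
}


def resolve(code: str, key: str) -> str:
    """Walk up the parent chain until some code defines the string."""
    current = code
    while current is not None:
        value = OVERRIDES.get(current, {}).get(key)
        if value is not None:
            return value
        current = PARENTS[current]
    raise KeyError(f"no string for {key!r} in {code!r} or its parents")


def build_resource_file(code: str) -> dict:
    """Generate the complete set of strings for one code (the macro step)."""
    keys = set()
    for table in OVERRIDES.values():
        keys.update(table)
    return {key: resolve(code, key) for key in sorted(keys)}
```

Here `zh_HK` stores nothing of its own and still gets a full resource file, picking up 子程式 from `zh_TW` and "OK" from the root.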

Doggo answered 7/1, 2017 at 4:9 Comment(1)
My thinking is that if your programming language (like Java) or language-matching framework can support the zh-hans_CN type of format, then go for it. If it doesn't, then the country implies the script: Hans is assumed for zh_CN and zh_SG, and Hant is assumed for zh_TW and zh_HK, so the script part can be left off. If your system doesn't have country matching at all, e.g. it has en/fr/de/es for most languages, then it might at least support the zh_hans/zh_hant type of format for certain languages (Drupal is mostly this way, so I allow my mobile apps to send up this info to match in my Drupal CMS API). - Tamaru
