How to mark all CJK text in a document?

Asked 7/5, 2012 at 13:23 Answered 19/5, 2012 at 17:4

unicode multilingual cjk character-properties

I have a file, file1.txt, containing text in English, Chinese, Japanese, and Korean. For use in ConTeXt, I need to mark each region of text within the file according to its language, except for English, and output a new file. For example, here is a sample line:

The 恐龙 ate 鱼.

As this contains Chinese characters, it should be marked like this:

The \language[cn]{恐龙} ate \language[cn]{鱼}.

The document is saved as UTF-8.
Text in Chinese should be marked \language[cn]{*}.
Text in Japanese should be marked \language[ja]{*}.
Text in Korean should be marked \language[ko]{*}.
The content never continues from one line to the next.
If the code is ever in doubt about whether something is Chinese, Japanese, or Korean, it is best if it defaults to Chinese.

How can I mark the text according to the language present?

Townswoman answered 7/5, 2012 at 13:23 Comment(6)

How will you determine if a particular character is Chinese or Japanese? They share many characters. – Darlenedarline 7/5, 2012 at 13:29

If the three languages do not in fact have distinct places within Unicode, then I will simplify my question to just marking everything CJK as \language[cn]{*}. – Townswoman 7/5, 2012 at 13:38

It's more complicated than that. The 3 languages share code points (the numeric codes), but not necessarily glyphs (the graphical representations of the characters). Have a look at the Unicode CJK FAQ unicode.org/faq/han_cjk.htm – Liable 7/5, 2012 at 13:41

This may prove useful: #2728304 – Darlenedarline 7/5, 2012 at 13:41

This could be a naive method, but it's probably better than nothing: just map the language based on the symbols' code ranges, which should be fixed in the Unicode tables. – Groos 7/5, 2012 at 13:47

en.wikipedia.org/wiki/CJK_Unified_Ideographs – Teamster 7/5, 2012 at 14:9

A crude algorithm:

use 5.014;
use utf8;
# Encode output as UTF-8 to avoid "Wide character" warnings.
binmode STDOUT, ':encoding(UTF-8)';
while (<DATA>) {
    s
        {(\p{Hangul}+)}
        {\\language[ko]{$1}}g;
    s
        {(\p{Hani}+)}
        {\\language[cn]{$1}}g;
    s
        {(\p{Hiragana}+|\p{Katakana}+)}
        {\\language[ja]{$1}}g;
    say;
}

__DATA__
The 恐龙 ate 鱼.
The 恐竜 ate 魚.
The キョウリュウ ate うお.
The 공룡 ate 물고기.

(Also see Detect chinese character using perl?)

There are problems with that. Daenyth comments that e.g. 恐竜 is misidentified as Chinese. I suspect that you are not really working with mixed English-CJK text and are just giving a bad example. Perform a lexical analysis first to differentiate Chinese from Japanese.

Antitank answered 7/5, 2012 at 14:8 Comment(0)

I'd like to provide a Python solution. Whatever the programming language, the approach is based on Unicode Script information (from the Unicode Character Database, aka UCD). Perl exposes rather more of the UCD than Python does.
Python does not expose Script information in its "unicodedata" module, but someone has added it here: https://gist.github.com/2204527 (tiny and useful). My implementation is based on it. BTW, it is not sensitive to spaces (no lexical analysis is needed).

    # coding=utf8
    import unicodedata2
    text=u"""The恐龙ate鱼.
    The 恐竜ate 魚.
    Theキョウリュウ ate うお.
    The공룡 ate 물고기. """

    langs = {
    'Han':'cn',
    'Katakana':'ja',
    'Hiragana':'ja',
    'Hangul':'ko'
    }

    alist = [(x,unicodedata2.script_cat(x)[0]) for x in text]
    # Append a sentinel so the final run is flushed by the loop below
    alist.append(("",""))
    newlist = []
    langlist = []
    prevlang = ""
    for raw, lang in alist:
        if prevlang in langs and prevlang != lang:
            newlist.append("\\language[%s]{" % langs[prevlang] +"".join(langlist) + "}")
            langlist = []

        if lang not in langs:
            newlist.append(raw)
        else:                      
            langlist.append(raw)
        prevlang = lang

    newtext = "".join(newlist)
    print(newtext)

The output is:

    $ python test.py 
    The\language[cn]{恐龙}ate\language[cn]{鱼}.
    The \language[cn]{恐竜}ate \language[cn]{魚}.
    The\language[ja]{キョウリュウ} ate \language[ja]{うお}.
    The\language[ko]{공룡} ate \language[ko]{물고기}.
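
For readers without the gist module, much the same result can be approximated in Python 3 with only the standard library. This is a minimal sketch, not the code above: it assumes the script can be inferred from the prefix of each character's Unicode name, which is cruder than the real Script property, and the prefix table and the names tag_for and mark are made up for the example.

```python
import unicodedata
from itertools import groupby

# Assumed mapping: prefix of the Unicode character name -> language tag.
NAME_PREFIXES = (
    ("CJK UNIFIED IDEOGRAPH", "cn"),
    ("CJK COMPATIBILITY IDEOGRAPH", "cn"),
    ("HIRAGANA", "ja"),
    ("KATAKANA", "ja"),
    ("HANGUL", "ko"),
)


def tag_for(ch):
    """Return 'cn', 'ja' or 'ko' for a CJK character, None for anything else."""
    name = unicodedata.name(ch, "")
    for prefix, tag in NAME_PREFIXES:
        if name.startswith(prefix):
            return tag
    return None


def mark(line):
    """Wrap each run of characters sharing a tag in a ConTeXt language command."""
    parts = []
    for tag, run in groupby(line, key=tag_for):
        text = "".join(run)
        if tag is None:
            parts.append(text)  # English, spaces, punctuation: left untouched
        else:
            parts.append("\\language[%s]{%s}" % (tag, text))
    return "".join(parts)


if __name__ == "__main__":
    print(mark("The 恐龙 ate 鱼."))
```

Because runs are grouped by tag rather than by script, adjacent hiragana and katakana end up in a single \language[ja]{...} group. Han characters are still always tagged cn, so 恐竜 is marked as Chinese here too.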

Fustigate answered 7/5, 2012 at 20:55 Comment(0)

While Korean doesn't use many sinograms [漢字/Kanji] anymore, they still pop up sometimes. Some Japanese sinograms are solely Japanese, like 竜, but many are identical to either Simplified or Traditional Chinese ones. So you're kind of stuck, and you need to look at a full sentence if you have some "Han" chars. If it has some hiragana/katakana plus kanji, the probability is very high that it's Japanese. Likewise, a bunch of hangul syllables and a couple of sinograms will tell you the sentence is in Korean.

Then, if it's all Han characters, i.e. Chinese, you can look at whether some of the chars are simplified: in the Unihan database, a char that has a kTraditionalVariant entry is a Simplified Chinese form (kZVariant only links variants of the same char that differ in glyph design). Oh, and kSpecializedSemanticVariant is very often used for Japanese-specific simplified chars. 内 and 內 may look the same to you, but the first is Japanese, the second Traditional Chinese and Korean (Korean uses Traditional Chinese as a standard).
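
As a rough illustration of the simplified-character check, the sketch below parses lines in the tab-separated format of the Unihan database's Unihan_Variants.txt and treats a character as simplified when it carries a kTraditionalVariant entry pointing at a different character. That choice of field, the function names, and the two hand-written sample lines are assumptions made for the example; a real run would read the full file instead of SAMPLE.

```python
def load_variants(lines):
    """Parse 'U+XXXX<TAB>field<TAB>values' lines into {char: {field: [chars]}}."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, field, value = line.split("\t", 2)
        char = chr(int(code[2:], 16))
        # Values are space-separated code points; some carry a "<source" suffix.
        targets = [chr(int(v.split("<")[0][2:], 16)) for v in value.split()]
        table.setdefault(char, {}).setdefault(field, []).extend(targets)
    return table


def looks_simplified(char, table):
    """True if char has a traditional counterpart other than itself."""
    variants = table.get(char, {}).get("kTraditionalVariant", [])
    return any(v != char for v in variants)


# Hand-written sample in the file's format (illustrative, not the full data).
SAMPLE = [
    "U+9F99\tkTraditionalVariant\tU+9F8D",
    "U+9F8D\tkSimplifiedVariant\tU+9F99",
]

if __name__ == "__main__":
    table = load_variants(SAMPLE)
    for ch in "龙龍":
        print(ch, looks_simplified(ch, table))
```

A line made only of Han characters in which several pass this check is then more likely Simplified Chinese than Japanese or Traditional Chinese.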

I have code somewhere that returns, for one codepoint, the script name. That could help. You go through a sentence, and see what's left at the end. I'll put up the code somewhere.

EDIT: the code

http://pastebin.com/e276zn6y

In response to the comment below:

This function above is built on data provided by Unicode.org... While not an expert per se, I contributed quite a bit to the Unihan database, and I happen to speak CJK. Yes, all 3. I do have some code that takes advantage of the kXXX properties in the Unihan database, but A/ I wasn't aware we were supposed to write code for the OP, and B/ it would require logistics that might go beyond what the OP is ready to implement. My advice stands. With the function above, loop through one full sentence. If all codepoints are "Han" (or "Han"+"Latin"), chances are high it's Chinese. If on the other hand the result is a mix of "Han"+"Hangul" (possibly +"Latin"), you can't go wrong with Korean. Likewise, with a mix of "Han" and "Katakana"/"Hiragana", you have Japanese.

A QUICK TEST

Some code to be used with the function I linked to before.

function guessLanguage(x) {
  var results={};
  var s='';
  var i,j=x.length;
  for(i=0;i<j;i++) {
    s=scriptName(x.substr(i,1));
    if(results.hasOwnProperty(s)) {
      results[s]+=1;
    } else {
      results[s]=1;
    }
  }
  console.log(results);
  // Declare these locally instead of leaking implicit globals.
  var mostCount=0;
  var mostName='';
  for(x in results) {
    if (results.hasOwnProperty(x)) {
      if(results[x]>mostCount) {
        mostCount=results[x];
        mostName=x;
      }
    }
  }
  return mostName;
}

Some tests:

r=guessLanguage("外人だけど、日本語をペラペラしゃべるよ！");
Object
  Common: 2
  Han: 5
  Hiragana: 9
  Katakana: 4
  __proto__: Object
"Hiragana"

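The logged object contains the number of occurrences of each script, and r holds the name of the most frequent one. Hiragana is the most frequent, and Hiragana and Katakana together account for 13 of the 20 characters, about 2/3 of the sentence.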

r=guessLanguage("我唔知道,佢講乜話.")
Object
  Common: 2
  Han: 8
  __proto__: Object
"Han"

An obvious case of Chinese (Cantonese in this case).

r=guessLanguage("中國이 韓國보다 훨씬 크지만, 꼭 아름다운 나라가 아니다...");
Object
  Common: 11
  Han: 4
  Hangul: 19
  __proto__: Object
"Hangul"

Some Han characters, and a whole lot of Hangul. A Korean sentence, assuredly.
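
The tests above report the most frequent script; turning that into one of the cn/ja/ko tags takes the rule of thumb stated earlier (kana means Japanese, hangul means Korean, Han alone means Chinese). Here is a minimal Python sketch of that rule, assuming script names can be approximated from Unicode character names. The names script_of and guess_language are made up for the example and are not the linked function.

```python
import unicodedata


def script_of(ch):
    """Rough script label derived from the Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "Han"
    for script in ("Hiragana", "Katakana", "Hangul"):
        if name.startswith(script.upper()):
            return script
    return "Other"


def guess_language(sentence):
    """Return 'ja', 'ko', 'cn', or None when no CJK script is present."""
    scripts = {script_of(ch) for ch in sentence}
    if scripts & {"Hiragana", "Katakana"}:
        return "ja"  # any kana outweighs the Han characters around it
    if "Hangul" in scripts:
        return "ko"
    if "Han" in scripts:
        return "cn"  # Han only: default to Chinese, as the question asks
    return None


if __name__ == "__main__":
    for s in ("外人だけど、日本語をペラペラしゃべるよ！",
              "我唔知道,佢講乜話.",
              "中國이 韓國보다 훨씬 크지만, 꼭 아름다운 나라가 아니다..."):
        print(guess_language(s), s)
```

The order of the checks is a design choice: a sentence that mixes kana and hangul would come out as ja here, and a kana-free Japanese phrase such as 恐竜 still falls through to cn.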

Enlarge answered 19/5, 2012 at 17:4 Comment(0)
