While Korean doesn't use many sinograms [漢字/Kanji] anymore, they still pop up sometimes. Some Japanese sinograms are solely Japanese, like 竜, but many are identical to either their Simplified or their Traditional Chinese counterparts, so a single character often can't tell you the language. If you have some "Han" chars, you need to look at the full sentence. If it has some hiragana/katakana + kanji, the probability is very high that it's Japanese. Likewise, a bunch of hangul syllables and a couple of sinograms will tell you the sentence is in Korean.
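Then, if it's all Han characters, i.e. Chinese, you can look at whether some of the chars are simplified: in the Unihan database, a char that carries a kTraditionalVariant entry is a simplified form (kSimplifiedVariant points the other way, while kZVariant links separately encoded variants of the same character). Oh, and kSpecializedSemanticVariant is very often used for Japan-specific simplified chars. 内 and 內 may look the same to you, but the first is the Japanese (and Simplified Chinese) form, the second the Traditional Chinese and Korean one (Korean uses the traditional forms as its standard).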
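To make the simplified-versus-traditional check concrete, here is a minimal sketch. The function name chineseVariantHint and the five sample pairs are assumptions made up for illustration; a real table would be generated from the Unihan variant data (kSimplifiedVariant / kTraditionalVariant) rather than typed in by hand.

```javascript
// Hypothetical helper, for illustration only.
// Each pair is [traditional form, simplified form]. This tiny list is a
// hand-picked sample, NOT the Unihan data.
var SAMPLE_PAIRS = [
    ["講", "讲"], ["話", "话"], ["們", "们"], ["這", "这"], ["說", "说"]
];

function chineseVariantHint(text) {
    var traditional = 0;
    var simplified = 0;
    var i;
    for (i = 0; i < SAMPLE_PAIRS.length; i++) {
        // Count each known form once if it appears anywhere in the text.
        if (text.indexOf(SAMPLE_PAIRS[i][0]) !== -1) { traditional += 1; }
        if (text.indexOf(SAMPLE_PAIRS[i][1]) !== -1) { simplified += 1; }
    }
    if (simplified > traditional) { return "Simplified"; }
    if (traditional > simplified) { return "Traditional"; }
    return "undetermined";
}
```

Keep in mind that many traditional forms are also used in Japanese and Korean, so this only makes sense once you already believe the sentence is Chinese.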
I have code somewhere that returns, for one codepoint, the script name. That could help. You go through a sentence, and see what's left at the end. I'll put up the code somewhere.
EDIT: the code
http://pastebin.com/e276zn6y
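In case the link goes away, here is a rough stand-in. It is not the pastebin code: it assumes an engine that supports Unicode property escapes in regular expressions (ES2018), and the name scriptNameSketch and the list of scripts it checks are my own choices for this sketch.

```javascript
// Hypothetical stand-in, NOT the linked function.
// Needs regex Unicode property escapes (ES2018 and later).
var SCRIPT_TESTS = [
    ["Han", /^\p{Script=Han}$/u],
    ["Hiragana", /^\p{Script=Hiragana}$/u],
    ["Katakana", /^\p{Script=Katakana}$/u],
    ["Hangul", /^\p{Script=Hangul}$/u],
    ["Latin", /^\p{Script=Latin}$/u],
    ["Common", /^\p{Script=Common}$/u]
];

// Takes a single character (as a string) and returns its script name,
// or "Other" for any script not listed above.
function scriptNameSketch(ch) {
    var i;
    for (i = 0; i < SCRIPT_TESTS.length; i++) {
        if (SCRIPT_TESTS[i][1].test(ch)) {
            return SCRIPT_TESTS[i][0];
        }
    }
    return "Other";
}
```

Punctuation and spaces come back as "Common", so they don't count towards any particular language.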
In response to the comment below:
The function above is built from data provided by Unicode.org. While not being an expert per se, I contributed quite a bit to the Unihan database, and I happen to speak CJK. Yes, all 3. I do have some code that takes advantage of the kXXX properties in the Unihan database, but A/ I wasn't aware we were supposed to write code for the OP, and B/ it would require logistics that might go beyond what the OP is ready to implement. My advice stands. With the function above, loop through one full sentence. If all codepoints are "Han" (or "Han"+"Latin"), chances are high it's Chinese. If on the other hand the result is a mix of "Han"+"Hangul" (+"Latin" possibly), you can't go wrong with Korean. Likewise, with a mix of "Han" and "Katakana"/"Hiragana", you have Japanese.
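The rule of thumb can be sketched directly, without counting anything. This is only an illustration: guessCjkLanguage is a made-up name, it relies on regex Unicode property escapes (ES2018) instead of the linked function, and it goes slightly beyond the advice by treating kana or hangul on their own as enough evidence.

```javascript
// Hypothetical sketch of the "which scripts are mixed in" rule.
function guessCjkLanguage(sentence) {
    var hasHan = /\p{Script=Han}/u.test(sentence);
    var hasKana = /[\p{Script=Hiragana}\p{Script=Katakana}]/u.test(sentence);
    var hasHangul = /\p{Script=Hangul}/u.test(sentence);

    // Kana and hangul in the same sentence: don't pretend to know.
    if (hasKana && hasHangul) { return "mixed"; }
    if (hasKana) { return "Japanese"; }
    if (hasHangul) { return "Korean"; }
    // Han with nothing but Latin or punctuation around it.
    if (hasHan) { return "Chinese"; }
    return "unknown";
}
```

Counting occurrences gives you more to work with when a sentence quotes another language, which this yes/no version can't handle.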
A QUICK TEST
Some code to be used with the function I linked to before.
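    function guessLanguage(x) {
        // Count how many characters of the input fall into each script.
        var results = {};
        var s = '';
        var i, j = x.length;
        for (i = 0; i < j; i++) {
            // scriptName() is the function linked to before. This walks UTF-16
            // code units, so chars outside the BMP are seen as two halves.
            s = scriptName(x.charAt(i));
            if (results.hasOwnProperty(s)) {
                results[s] += 1;
            } else {
                results[s] = 1;
            }
        }
        console.log(results);
        // Pick the script with the highest count.
        var mostCount = 0;
        var mostName = '';
        var name;
        for (name in results) {
            if (results.hasOwnProperty(name)) {
                if (results[name] > mostCount) {
                    mostCount = results[name];
                    mostName = name;
                }
            }
        }
        return mostName;
    }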
Some tests:
r=guessLanguage("外人だけど、日本語をペラペラしゃべるよ!");
Object
Common: 2
Han: 5
Hiragana: 9
Katakana: 4
__proto__: Object
"Hiragana"
The logged object contains the number of occurrences of each script, and r holds the name of the most frequent one. Hiragana is the most frequent, and Hiragana+Katakana --> about 2/3 of the sentence (13 chars out of 20).
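If you want that proportion as a number rather than eyeballing it, something like the following would do. scriptShare is a hypothetical helper, not part of the code above; it assumes a counts object shaped like the one guessLanguage logs.

```javascript
// Hypothetical helper: share of all counted characters that belong to the
// scripts listed in `names`.
function scriptShare(counts, names) {
    var total = 0;
    var wanted = 0;
    var key;
    for (key in counts) {
        if (Object.prototype.hasOwnProperty.call(counts, key)) {
            total += counts[key];
            if (names.indexOf(key) !== -1) {
                wanted += counts[key];
            }
        }
    }
    // Avoid dividing by zero on an empty counts object.
    return total === 0 ? 0 : wanted / total;
}

// Counts taken from the Japanese test sentence.
var kanaShare = scriptShare(
    { Common: 2, Han: 5, Hiragana: 9, Katakana: 4 },
    ["Hiragana", "Katakana"]
);
console.log(kanaShare); // 13 out of 20
```

A threshold on that share is probably a safer test than "most frequent script" when the sentence is short.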
r=guessLanguage("我唔知道,佢講乜話.")
Object
Common: 2
Han: 8
__proto__: Object
"Han"
An obvious case of Chinese (Cantonese in this case).
r=guessLanguage("中國이 韓國보다 훨씬 크지만, 꼭 아름다운 나라가 아니다...");
Object
Common: 11
Han: 4
Hangul: 19
__proto__: Object
"Hangul"
Some Han characters, and a whole lot of Hangul. A Korean sentence, assuredly.
\language[cn]{*}. – Townswoman