What is the proper way to test if the input is Korean or Chinese using JavaScript?
F

3

6

My application was relying on this function to test whether a string is Korean:

const isKoreanWord = (input) => {
  const match = input.match(/[\u3131-\uD79D]/g);
  return match ? match.length === input.length : false;
}

isKoreanWord('만두'); // true
isKoreanWord('mandu'); // false

until I started adding Chinese support, and now the function gives wrong results:

isKoreanWord('幹嘛'); // true

I believe this is caused by the fact that Korean characters and Chinese ones are intermingled into the same Unicode range.
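A quick way to check this hypothesis is to compare code points against the bounds of the original character class. This is only a sketch, and the helper name `inOriginalRange` is made up for illustration:

```javascript
// Bounds of the character class used in isKoreanWord: [\u3131-\uD79D]
const LOWER = 0x3131;
const UPPER = 0xD79D;

// Hypothetical helper: true if the first code point of `ch` falls inside that range.
const inOriginalRange = (ch) => {
  const cp = ch.codePointAt(0);
  return cp >= LOWER && cp <= UPPER;
};

// The CJK Unified Ideographs block (U+4E00–U+9FFF) sits inside the range,
// so Chinese characters pass the original test as well.
console.log(inOriginalRange('만')); // true
console.log(inOriginalRange('幹')); // true
console.log(inOriginalRange('m'));  // false
```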

How should I correct this function so that it returns true only if the input contains nothing but Korean characters?

Frequentative answered 25/10, 2018 at 12:30 Comment(6)
By "Korean characters" do you mean hangul? Chinese characters are also used in Korea. Asking to distinguish "Chinese Chinese characters" from "Korean Chinese characters" is like asking to distinguish English from French. - Factitive
@Factitive Yes, I meant hangul: how to distinguish between hangul and hanja. - Frequentative
@Factitive Also, I don't think your comparison holds: English and French derive from Latin, so yes, it is extremely hard to tell the two languages apart, while Korean uses Chinese as its base language and Chinese, well... uses Chinese as its own historical base language. - Frequentative
I'm talking purely about the writing system used. If you just look at the range of letters, English is indistinguishable from French. In the same way, seeing just a few Chinese characters, it's virtually impossible to tell whether it's a Chinese word or a word used in the context of Korean. - Factitive
"Korean characters" means hangul, there's no exception. - Grandfather
@Grandfather Yes, when you see hangul you know it's Korean, and when you see a Chinese character you know it's Chinese. Even in the context of Korean, a Chinese character is always Chinese at its core. Not sure why deceze was trying to argue about that. - Frequentative
E
16

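Here are the Unicode ranges you need for Hangul (taken from its Wikipedia page):

U+AC00–U+D7AF (Hangul Syllables)
U+1100–U+11FF (Hangul Jamo)
U+3130–U+318F (Hangul Compatibility Jamo)
U+A960–U+A97F (Hangul Jamo Extended-A)
U+D7B0–U+D7FF (Hangul Jamo Extended-B)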

So your regex .match should look like this:

const match = input.match(/[\uac00-\ud7af]|[\u1100-\u11ff]|[\u3130-\u318f]|[\ua960-\ua97f]|[\ud7b0-\ud7ff]/g);
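Plugged back into the function from the question, this might look like the sketch below. It keeps the question's `isKoreanWord` name; the `HANGUL` constant is introduced here for illustration, and the ranges are merged into a single character class, which should match the same characters as the alternation above.

```javascript
// Hangul blocks listed above, merged into one character class.
const HANGUL = /[\u1100-\u11ff\u3130-\u318f\ua960-\ua97f\uac00-\ud7af\ud7b0-\ud7ff]/g;

const isKoreanWord = (input) => {
  // With the g flag, String.prototype.match returns every match, or null if there is none.
  const match = input.match(HANGUL);
  // All of these ranges are in the Basic Multilingual Plane, so one match equals
  // one UTF-16 code unit and the counts can be compared with input.length.
  return match ? match.length === input.length : false;
};

console.log(isKoreanWord('만두'));  // true
console.log(isKoreanWord('mandu')); // false
console.log(isKoreanWord('幹嘛'));  // false
```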
Erwinery answered 25/10, 2018 at 12:37 Comment(2)
This fails on 매장 이름 - Fouts
@KarmaBlackshaw this only matches single characters, not sentences. You may need to adjust the regex to include spaces and other special characters. - Erwinery
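The adjustment mentioned in that last comment could be sketched as follows, assuming whitespace is the only extra thing to allow. The names `HANGUL_CLASS` and `isKoreanText` are hypothetical, and punctuation or digits would have to be added to the class explicitly.

```javascript
// Hangul ranges from the answer, kept as a string so they can be reused in two patterns.
const HANGUL_CLASS =
  '\\u1100-\\u11ff\\u3130-\\u318f\\ua960-\\ua97f\\uac00-\\ud7af\\ud7b0-\\ud7ff';

// The whole string may contain only Hangul and whitespace...
const ONLY_HANGUL_AND_SPACE = new RegExp(`^[${HANGUL_CLASS}\\s]+$`);
// ...and must contain at least one Hangul character, so blank input is rejected.
const HAS_HANGUL = new RegExp(`[${HANGUL_CLASS}]`);

const isKoreanText = (input) =>
  HAS_HANGUL.test(input) && ONLY_HANGUL_AND_SPACE.test(input);

console.log(isKoreanText('매장 이름'));  // true
console.log(isKoreanText('매장 name')); // false
console.log(isKoreanText('   '));       // false
```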
R
2

A shorter version that matches Korean characters:

const regexKorean = /[\u1100-\u11FF\u3130-\u318F\uA960-\uA97F\uAC00-\uD7AF\uD7B0-\uD7FF]/g
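One thing to keep in mind with this pattern: because of the `g` flag, calling `.test()` repeatedly on the same regex object advances `lastIndex`, so the result can flip between calls. A possible way around this for a whole-string check, using the hypothetical names `regexKoreanOnly` and `isHangulOnly`, is to drop the flag and anchor the pattern:

```javascript
const regexKorean = /[\u1100-\u11FF\u3130-\u318F\uA960-\uA97F\uAC00-\uD7AF\uD7B0-\uD7FF]/g;

// The g flag makes test() stateful: the second call resumes at lastIndex.
console.log(regexKorean.test('만')); // true
console.log(regexKorean.test('만')); // false, the search started past the only character

// Same ranges without g, anchored so the entire string must be Hangul.
const regexKoreanOnly = /^[\u1100-\u11FF\u3130-\u318F\uA960-\uA97F\uAC00-\uD7AF\uD7B0-\uD7FF]+$/;

const isHangulOnly = (input) => regexKoreanOnly.test(input);

console.log(isHangulOnly('만두')); // true
console.log(isHangulOnly('만두')); // true, no state is carried between calls
console.log(isHangulOnly('幹嘛')); // false
```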
Romola answered 27/3, 2022 at 3:7 Comment(0)
K
1

In modern browsers, you can use Unicode property escapes (here, the script property) directly:

const RE = /\p{sc=Hangul}/u

console.log(RE.test('만두')) // true
console.log(RE.test('mandu')) // false
console.log(RE.test('幹嘛')) // false
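Building on this, a whole-string check and a Hangul-versus-Han distinction could look like the sketch below. The names `ONLY_HANGUL`, `ONLY_HAN` and `detectScript` are made up for illustration. Note that `\p{sc=Han}` matches Chinese characters whether they appear in Chinese, Japanese, or Korean text, so it identifies the script, not the language.

```javascript
// Anchored so that every character of the input must belong to the script.
const ONLY_HANGUL = /^\p{sc=Hangul}+$/u;
const ONLY_HAN = /^\p{sc=Han}+$/u;

// Hypothetical helper: reports which single script the whole input is written in.
const detectScript = (input) => {
  if (ONLY_HANGUL.test(input)) return 'hangul';
  if (ONLY_HAN.test(input)) return 'han';
  return 'other'; // mixed scripts, Latin, empty string, and so on
};

console.log(detectScript('만두'));  // hangul
console.log(detectScript('幹嘛'));  // han
console.log(detectScript('mandu')); // other
```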
Kabyle answered 16/7 at 9:34 Comment(0)
