Regular expression with Chinese characters and full/half-width characters

I'm writing validation rules for a Java project, and one of the requirements I got is:

"The ID card address should contain no less than eight (≥8) Chinese characters (exclusive of full-width/half-width symbols)."

I can't get my head around how to solve this.

I have got to the point where I can validate for Chinese characters, but I am not able to exclude all the full-width/half-width symbols.

return Pattern.matches("^[\\p{IsHan}]{8,}$", address);

Results should be something like

  • 名字名字名字名字 = true
  • 名字名字名字名(字)= true
  • 名字名字名(字) = false
  • 名字名字名(字)= false

Does anyone have any advice?

Sfax answered 9/11, 2015 at 15:13 Comment(7)
I have come up with the regex "(?U)[\\p{IsHan}&&[^\\p{InHalfwidth_and_Fullwidth_Forms}]]{8,}", but it does not match 名字名字名字名(字). Should it really be matched? - Footprint
It seems that the requirement is actually to allow any character, but make sure that at least 8 of them are actual Chinese characters. And I interpret the sentence about the half/full width characters as saying "Note that half/full width characters are not considered Chinese characters". - Declination
From what I see, \p{IsHan} does not include half-width and full-width forms. - Footprint
I read that requirement as "must contain at least eight Chinese characters which are not full-width/half-width characters." With that in mind, I don't think a single regex test can do it. However, address.replaceAll("\\P{IsHan}", "").length() >= 8 will. - Finegan
Try (?U)^(?=(?:[^\\p{IsHan}]*\\p{IsHan}){8}).*$. - Footprint
The full-width/half-width requirement doesn't make sense to me. This concept applies only to the English alphabet and Katakana/Hangul. Chinese is always full-width, from what I can see. - Dickdicken
@BratAnon: I think my regex also worked for you. Glad to see nhahtdh gave a comprehensive answer. - Footprint

Assuming that you want to check that there are 8 or more Chinese characters in the string:

Pattern.compile("^(\\P{sc=Han}*\\p{sc=Han}){8}.*$", Pattern.DOTALL);
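As a rough sketch of how this pattern could be wired into a validation method: the class name IdCardAddressCheck and the method hasEightHan are made up for illustration, and treating null as invalid is an assumption, not something the requirement states. Only the regex itself comes from the line above.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper class; only the regex is taken from the answer.
final class IdCardAddressCheck {
    // Eight repetitions of "a run of non-Han characters, then one Han character",
    // followed by anything. DOTALL lets '.' also match line terminators.
    private static final Pattern AT_LEAST_EIGHT_HAN =
            Pattern.compile("^(\\P{sc=Han}*\\p{sc=Han}){8}.*$", Pattern.DOTALL);

    // Assumption: a null address is treated as invalid.
    static boolean hasEightHan(String address) {
        return address != null && AT_LEAST_EIGHT_HAN.matcher(address).matches();
    }

    public static void main(String[] args) {
        // 8 Han characters, no symbols
        System.out.println(hasEightHan("名字名字名字名字"));
        // 8 Han characters plus full-width parentheses
        System.out.println(hasEightHan("名字名字名字名（字）"));
        // 6 Han characters plus half-width parentheses
        System.out.println(hasEightHan("名字名字名(字)"));
    }
}
```

Compiling the pattern once into a static field avoids recompiling it on every validation call.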

Since it's unclear what you consider a Chinese character, I'm using the Han script as an approximation. According to Unicode 6.2.0, the Han script is defined to contain the following code points:

2E80..2E99    ; Han # So  [26] CJK RADICAL REPEAT..CJK RADICAL RAP
2E9B..2EF3    ; Han # So  [89] CJK RADICAL CHOKE..CJK RADICAL C-SIMPLIFIED TURTLE
2F00..2FD5    ; Han # So [214] KANGXI RADICAL ONE..KANGXI RADICAL FLUTE
3005          ; Han # Lm       IDEOGRAPHIC ITERATION MARK
3007          ; Han # Nl       IDEOGRAPHIC NUMBER ZERO
3021..3029    ; Han # Nl   [9] HANGZHOU NUMERAL ONE..HANGZHOU NUMERAL NINE
3038..303A    ; Han # Nl   [3] HANGZHOU NUMERAL TEN..HANGZHOU NUMERAL THIRTY
303B          ; Han # Lm       VERTICAL IDEOGRAPHIC ITERATION MARK
3400..4DB5    ; Han # Lo [6582] CJK UNIFIED IDEOGRAPH-3400..CJK UNIFIED IDEOGRAPH-4DB5
4E00..9FCC    ; Han # Lo [20941] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FCC
F900..FA6D    ; Han # Lo [366] CJK COMPATIBILITY IDEOGRAPH-F900..CJK COMPATIBILITY IDEOGRAPH-FA6D
FA70..FAD9    ; Han # Lo [106] CJK COMPATIBILITY IDEOGRAPH-FA70..CJK COMPATIBILITY IDEOGRAPH-FAD9
20000..2A6D6  ; Han # Lo [42711] CJK UNIFIED IDEOGRAPH-20000..CJK UNIFIED IDEOGRAPH-2A6D6
2A700..2B734  ; Han # Lo [4149] CJK UNIFIED IDEOGRAPH-2A700..CJK UNIFIED IDEOGRAPH-2B734
2B740..2B81D  ; Han # Lo [222] CJK UNIFIED IDEOGRAPH-2B740..CJK UNIFIED IDEOGRAPH-2B81D
2F800..2FA1D  ; Han # Lo [542] CJK COMPATIBILITY IDEOGRAPH-2F800..CJK COMPATIBILITY IDEOGRAPH-2FA1D

Java 8 uses Unicode 6.2.0, so \p{sc=Han} matches the code points listed above. However, the implementation also matches unassigned code points (in assigned blocks) and unassigned blocks, so remember to upgrade the JRE to the latest major version so that the program keeps running correctly as more characters are added to Unicode.

In particular, \p{sc=Han} in Oracle's implementation includes these ranges:

  • U+2E80 - U+2FEF: CJK Radicals Supplement (whole block), Kangxi Radicals (whole block) and 16 code points from unassigned block.
  • U+3005, U+3007, U+3021 - U+3029, U+3038 - U+303B: CJK Symbols and Punctuation (some characters in the block)
  • U+3400 - U+4DBF: CJK Unified Ideographs Extension A (whole block)
  • U+4E00 - U+9FFF: CJK Unified Ideographs (whole block)
  • U+F900 - U+FAFF: CJK Compatibility Ideographs (whole block)
  • U+20000 - U+E0000: CJK Unified Ideographs Extension B/C/D/E (whole blocks), CJK Compatibility Ideographs Supplement (whole block), and several unassigned Unicode planes, plus one reserved code point in the Tags block.
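
The last range lies outside the Basic Multilingual Plane, where each code point takes two char values in a Java String, so a count based on String.length() would count such an ideograph twice. As an alternative to the regex (not something the answer itself proposes), a minimal sketch that counts Han code points directly might look like this. HanCounter and countHan are hypothetical names, and the code assumes Java 8 or later for String.codePoints().

```java
// Hypothetical helper, not taken from the answer: counts Han code points directly.
final class HanCounter {
    // Iterates over code points (not UTF-16 chars), so a supplementary-plane
    // ideograph such as U+20000 is counted once.
    static long countHan(String text) {
        return text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                .count();
    }

    public static void main(String[] args) {
        String address = "名字名字名(字)";
        // Number of Han code points in the sample address
        System.out.println(countHan(address));
        // The requirement as a boolean check
        System.out.println(countHan(address) >= 8);
    }
}
```

Which code points count as Han here still depends on the Unicode data shipped with the running JRE, as described above.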
Dickdicken answered 10/11, 2015 at 5:15 Comment(3)
I had to add brackets to make IntelliJ happy: return Pattern.compile("^([\\P{IsHan}]*[\\p{IsHan}]){8}.*$", Pattern.DOTALL).matcher(address).find(); I'm not sure if that makes any difference, but my tests are green now. Thanks. - Sfax
@BratAnon: You can use address.matches("(?s)([\\P{IsHan}]*[\\p{IsHan}]){8}.*") if you don't care about extracting the matches. (^ and $ are removed in this example, since matches already makes sure the regex matches the whole string.) - Dickdicken
Thanks. Works perfectly. - Sfax
