How to capture Hebrew with regex in Java?

1

6

I'm trying to catch a section of Hebrew text (the origin is comments on a news site) using the following regex:

[\u0590-\u05FF \\p{Graph} \\s]+

It works for most comments but some comments are missed.

I've tried to debug this and it seems there's a Hebrew letter that doesn't match the pattern.

When I extract this letter and print its integer value, it seems to be correct, but the regex still doesn't catch it...

Ideas?

Cirenaica answered 24/1, 2012 at 12:52 Comment(3)
Do you use Pattern.UNICODE_CASE inside your Pattern.compile method? - Stephine
Try it: Pattern p = Pattern.compile("YOUR_REGEX", Pattern.UNICODE_CASE); - Stephine
Which Hebrew letter doesn't match the pattern? - Diffractive
1

It would be more semantically correct to use \p{InHebrew} instead of \u0590-\u05FF.

You also need to match punctuation, digits (at least the common ones) and the different kinds of spaces. I don't know what \p{Graph} is, or whether there are any Hebrew-specific punctuation symbols, but it seems you missed some parts.
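To find out which character is being missed, it may help to list the code points that the character class rejects. Below is a minimal sketch, not a definitive fix. The class name, the `unmatchedCodePoints` helper and the sample strings are made up for illustration. It relies on Java's default behaviour, where \p{Graph} and \s only cover ASCII, so a no-break space (U+00A0) or a Hebrew presentation form (U+FB1D to U+FB4F, outside the Hebrew block) would not match. Whether that is what happens in your comments is only a guess.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class HebrewRegexSketch {
    // Hebrew block (U+0590 to U+05FF), visible ASCII characters, and ASCII whitespace.
    public static final Pattern COMMENT =
            Pattern.compile("[\\p{InHebrew}\\p{Graph}\\s]+");

    // The same character class, but matching exactly one character.
    public static final Pattern ALLOWED =
            Pattern.compile("[\\p{InHebrew}\\p{Graph}\\s]");

    // Hypothetical helper: returns every code point in text that the
    // character class rejects, formatted as U+XXXX.
    public static List<String> unmatchedCodePoints(String text) {
        List<String> result = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (!ALLOWED.matcher(ch).matches()) {
                result.add(String.format("U+%04X", cp));
            }
        });
        return result;
    }

    public static void main(String[] args) {
        // Plain Hebrew letters, a space, digits and ASCII punctuation.
        String plain = "\u05E9\u05DC\u05D5\u05DD 123!";
        System.out.println(COMMENT.matcher(plain).matches());

        // Illustrative sample: Hebrew letters, then a no-break space,
        // then a presentation-form letter from outside the Hebrew block.
        String tricky = "\u05E9\u05DC\u05D5\u05DD\u00A0\uFB2A";
        System.out.println(unmatchedCodePoints(tricky));
    }
}
```

If the rejected code points turn out to be non-ASCII spaces or punctuation, compiling with Pattern.UNICODE_CHARACTER_CLASS (Java 7 and later) makes \s and \p{Graph} Unicode-aware. Pattern.UNICODE_CASE only affects case-insensitive matching, so it is unlikely to help with Hebrew.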

Crinkumcrankum answered 24/1, 2012 at 13:0 Comment(0)
