How to capture Hebrew with regex in Java?


Asked 24/1, 2012 at 12:52 Answered 24/1, 2012 at 13:0

I'm trying to catch a section of Hebrew text (the origin is comments on a news site) using the following regex:

[\u0590-\u05FF \\p{Graph} \\s]+

It works for most comments but some comments are missed.

I've tried to debug this and it seems there's a Hebrew letter that doesn't match the pattern.

When I extract this letter and print its integer value, it seems to be correct, but the regex still doesn't catch it.

Ideas?

Cirenaica answered 24/1, 2012 at 12:52 Comment(3)

Do you use Pattern.UNICODE_CASE inside your Pattern.compile method? – Stephine 24/1, 2012 at 12:55

Try it: Pattern p = Pattern.compile("YOUR_REGEX", Pattern.UNICODE_CASE); – Stephine 24/1, 2012 at 14:8

Which Hebrew letter doesn't match the pattern? – Diffractive 18/6, 2012 at 12:52

It would be more semantically correct to use \p{InHebrew} instead of \u0590-\u05FF.

You also need to match punctuation, digits (at least the commonly used ones), and the different kinds of whitespace. I don't know what \p{Graph} covers, or whether there are any Hebrew-specific punctuation symbols, but it looks like you have missed some of these.
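As a rough sketch of that idea (not a tested fix for your data): the snippet below swaps the explicit range for \p{InHebrew} and adds a small helper that reports the first character the class does not cover, which may help pin down the letter that fails. The class name HebrewCommentMatcher, the helper firstUnmatchedIndex and the sample strings are made up for illustration. Note that in java.util.regex, \p{Graph} and \s are ASCII-only by default, so characters such as a no-break space (U+00A0) or Hebrew presentation forms (U+FB1D to U+FB4F, outside the Hebrew block) will not match.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class HebrewCommentMatcher {
    // Hebrew block (U+0590..U+05FF) plus ASCII visible characters and ASCII whitespace.
    // Without extra flags, \p{Graph} and \s only cover ASCII in java.util.regex.
    static final Pattern ALLOWED = Pattern.compile("[\\p{InHebrew}\\p{Graph}\\s]+");

    // Returns the index of the first char the character class rejects, or -1 if all chars match.
    static int firstUnmatchedIndex(String text) {
        for (int i = 0; i < text.length(); i++) {
            String single = String.valueOf(text.charAt(i));
            if (!ALLOWED.matcher(single).matches()) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hebrew word "shalom" followed by ASCII text; written with escapes to avoid encoding issues.
        String plain = "\u05E9\u05DC\u05D5\u05DD, world 123";
        // Same word followed by a presentation-form letter (U+FB2A), which is outside the Hebrew block.
        String withPresentationForm = "\u05E9\u05DC\u05D5\u05DD\uFB2A";

        System.out.println(firstUnmatchedIndex(plain)); // prints -1

        int i = firstUnmatchedIndex(withPresentationForm);
        System.out.println(i); // prints 4
        // Print the code point of the offending character for debugging.
        System.out.printf("U+%04X%n", (int) withPresentationForm.charAt(i)); // prints U+FB2A
    }
}
```

If the reported code point turns out to be outside U+0590 to U+05FF, the character class needs to be widened to include it.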

Crinkumcrankum answered 24/1, 2012 at 13:0 Comment(0)
