How can I detect Hebrew characters, in both ISO-8859-8 and UTF-8, in a string using PHP?

Asked 7/11, 2009 at 20:43 Answered 21/5, 2012 at 20:54

I want to be able to detect (using regular expressions) whether a string contains Hebrew characters, in both UTF-8 and ISO-8859-8, in PHP. Thanks!

Alveraalverez answered 7/11, 2009 at 20:43 Comment(0)

In the ISO-8859-8 character set, the byte range E0 - FA holds the Hebrew letters (alef through tav). You could check for those bytes with a character class:

[\xE0-\xFA]

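For UTF-8, the Hebrew characters appear to fall in the Unicode code point range U+0591 to U+05F4. PCRE does not support the \u escape, so write the code points as \x{...} and add the u modifier:

[\x{0591}-\x{05F4}]

Here's an example of a regex match in PHP (the u modifier also requires $string to be valid UTF-8):

echo preg_match('/[\x{0591}-\x{05F4}]/u', $string);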
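
To make the two character classes concrete, here is a minimal sketch that wraps each one in a function. The function names are hypothetical (they are not from this answer), and the sample assumes the PHP file itself is saved as UTF-8. As the comments on this answer point out, the byte-range check is only meaningful when the input is already known to be ISO-8859-8.

```php
<?php
// Hypothetical helper names, chosen for illustration only.

// ISO-8859-8: the Hebrew letters are the single bytes 0xE0-0xFA.
function containsHebrewIso88598(string $bytes): bool
{
    // No /u modifier, so the pattern is matched byte by byte.
    return preg_match('/[\xE0-\xFA]/', $bytes) === 1;
}

// UTF-8: match code points U+0591-U+05F4; /u makes PCRE decode UTF-8.
function containsHebrewUtf8(string $text): bool
{
    // preg_match returns false on invalid UTF-8, which maps to false here.
    return preg_match('/[\x{0591}-\x{05F4}]/u', $text) === 1;
}

$utf8 = "שלום";             // assumes this source file is saved as UTF-8
$iso  = "\xF9\xEC\xE5\xED"; // the same word as ISO-8859-8 bytes

var_dump(containsHebrewUtf8($utf8));    // bool(true)
var_dump(containsHebrewIso88598($iso)); // bool(true)
var_dump(containsHebrewUtf8("hello"));  // bool(false)
```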

Reseta answered 7/11, 2009 at 21:4 Comment(2)

The problem is that E0-FA are also values that will occur in UTF-8, but not necessarily as Hebrew characters... – Band 7/11, 2009 at 21:45

@gnud: That's why you should not use the iso8859-8 regex on UTF-8 strings – Reseta 7/11, 2009 at 22:3

If your PHP file is encoded as UTF-8, as it should be when it contains Hebrew, you should use the following regex:

$string="אבהג";
echo preg_match("/\p{Hebrew}/u", $string);
// output: 1

Pomfret answered 17/5, 2012 at 14:50 Comment(0)

Here's a small function to check whether the first character in a UTF-8 string is a Hebrew letter:

function IsStringStartsWithHebrew($string)
{
    return (strlen($string) > 1 && // a Hebrew letter takes two bytes in UTF-8
        ord($string[0]) == 215 && // first byte is 0xD7 (binary 110-10111)
        ord($string[1]) >= 144 && ord($string[1]) <= 170 // second byte 0x90-0xAA covers the letters U+05D0-U+05EA
        );
}
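
For reference, the byte values above can be checked by dumping the UTF-8 encoding of the first and last Hebrew letters. The sketch below also shows a regex that should behave the same way as the byte comparison; the function name startsWithHebrewLetter is a hypothetical one, and the file is assumed to be saved as UTF-8.

```php
<?php
// Assumes this source file is saved as UTF-8.
$alef = "א"; // U+05D0, first Hebrew letter
$tav  = "ת"; // U+05EA, last Hebrew letter

echo bin2hex($alef), "\n"; // d790 -> bytes 215, 144
echo bin2hex($tav), "\n";  // d7aa -> bytes 215, 170

// Hypothetical regex counterpart of the byte check (illustrative name).
function startsWithHebrewLetter(string $s): bool
{
    // ^ anchors at the start; /u decodes the subject as UTF-8.
    return preg_match('/^[\x{05D0}-\x{05EA}]/u', $s) === 1;
}

var_dump(startsWithHebrewLetter("שלום world")); // bool(true)
var_dump(startsWithHebrewLetter("world שלום")); // bool(false)
```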

good luck :)

Mugger answered 12/4, 2010 at 20:42 Comment(0)

First, such a string would be completely useless - a mix of two different character sets?

Both the Hebrew characters in iso8859-8, and each byte of multibyte sequences in UTF-8, have a value ord($char) > 127. So what I would do is find all bytes with a value greater than 127, and then check if they make sense as iso8859-8, or if you think they would make more sense as a UTF-8 sequence...
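
The idea above could be sketched roughly as follows. This is a heuristic under stated assumptions, not a definitive detector: the function name and its return values are hypothetical, and it assumes that a string whose high bytes form valid UTF-8 sequences really is UTF-8. The empty pattern with the u modifier is used as a validity check, since preg_match returns false when the subject is not valid UTF-8.

```php
<?php
// Hypothetical helper; the name and return values are illustrative assumptions.
function detectHebrewEncoding(string $s): ?string
{
    // Pure ASCII: no byte above 127, so no Hebrew in either encoding.
    if (preg_match('/[\x80-\xFF]/', $s) !== 1) {
        return null;
    }
    // If the high bytes form valid UTF-8 sequences, treat the string as UTF-8.
    if (preg_match('//u', $s) === 1) {
        return preg_match('/\p{Hebrew}/u', $s) === 1 ? 'UTF-8' : null;
    }
    // Otherwise fall back to the ISO-8859-8 Hebrew letter range.
    return preg_match('/[\xE0-\xFA]/', $s) === 1 ? 'ISO-8859-8' : null;
}

var_dump(detectHebrewEncoding("שלום"));             // string(5) "UTF-8"
var_dump(detectHebrewEncoding("\xF9\xEC\xE5\xED")); // string(10) "ISO-8859-8"
var_dump(detectHebrewEncoding("hello"));            // NULL
```

Other single-byte encodings also use the bytes E0-FA, so the ISO-8859-8 branch can only say that the bytes are consistent with Hebrew, not that they are Hebrew.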

Band answered 7/11, 2009 at 20:59 Comment(3)

How can a character have ord($char) > 255 in ISO-8859-8? It's a single byte! – Ronaldronalda 7/11, 2009 at 21:13

Well well. I don't know why, but I completely fudged that - non-ascii are between 128 and 255 - fixed now. – Band 7/11, 2009 at 21:44

I figured that was what you meant in the meantime. You're lucky I waited before downvoting you ;-) – Ronaldronalda 7/11, 2009 at 22:37

function is_hebrew($string)
{
    return preg_match("/\p{Hebrew}/u", $string);
}

Steinke answered 21/5, 2012 at 20:54 Comment(0)
