How can I detect a palindrome in Hebrew?

Asked 20/6, 2014 at 14:6 Answered 21/6, 2014 at 14:31

python unicode internationalization hebrew palindrome

I am writing a series of tests for a palindrome solver. I came across an interesting palindrome in Hebrew:

טעם לפת תפל מעט

This is a palindrome, but the letter Mem has both a regular form (מ) and a "final form" (ם), which is how it appears as the last letter of a word. Short of hardcoding "0x5de => 0x5dd" in my program, I could not find a way to rely on Unicode, Python, or a library to treat the two as the same. Things I tried:

s = 'טעם לפת תפל מעט'
s.casefold() # Python 3.4
s.lower()
s.upper()
import unicodedata
unicodedata.normalize(...) # In case this functioned like a German Eszett

All yielded the same string. Other Hebrew letters that would cause this problem (in case someone searches for this later) would be Kaf, Nun, Peh, and Tsadeh. No, I am not a native speaker of Hebrew.
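A hedged illustration of why these attempts change nothing: Hebrew letters are caseless, and final Mem (U+05DD) has no Unicode decomposition, so neither case mapping nor any normalization form folds it into Mem (U+05DE). The sketch below uses only the standard library; the variable names are illustrative and not part of the original attempt.

```python
import unicodedata

# The phrase from above, written with escapes to avoid bidi copy/paste issues.
s = "\u05d8\u05e2\u05dd \u05dc\u05e4\u05ea \u05ea\u05e4\u05dc \u05de\u05e2\u05d8"

# Case operations are no-ops because Hebrew letters have no case.
case_unchanged = s.casefold() == s.lower() == s.upper() == s

# An empty decomposition means normalization has nothing to map final Mem to.
final_mem_decomposition = unicodedata.decomposition("\u05dd")

# Consequently all four normalization forms return the input unchanged.
normalization_unchanged = all(
    unicodedata.normalize(form, s) == s
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)

print(case_unchanged, repr(final_mem_decomposition), normalization_unchanged)
```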

Bik answered 20/6, 2014 at 14:6 Comment(3)

Are those the only 5 letters that would have this issue? – Mcnair 20/6, 2014 at 14:10

Why are you doing this? I'm just curious – Melodee 20/6, 2014 at 14:10

I'm a programming instructor, trying to make an exercise that has a simple solution (is a word a palindrome?), an intermediate solution (is this English phrase a palindrome?), and a challenging solution (is this arbitrary set of "letters" a palindrome?). – Bik 20/6, 2014 at 14:13

You can make a slightly more "rigorous" answer (one that's less likely to give false positives and false negatives) with a little more work. Note that Patrick Collin's answer could produce false positives: it treats unrelated characters as equal whenever the last word of their Unicode character names matches.

One thing you can do is take a stricter approach to converting final letters:

import unicodedata

# Note the added accents
phrase = 'טעם̀ לפת תפל מ̀עט'

def convert_final_characters(phrase):
    for character in phrase:
        try:
            name = unicodedata.name(character)
        except ValueError:
            yield character
            continue

        if "HEBREW" in name and " FINAL" in name:
            try:
                yield unicodedata.lookup(name.replace(" FINAL", ""))
            except KeyError:
                # Fails for HEBREW LETTER WIDE FINAL MEM "ﬦ",
                # which has no non-final counterpart
                #
                # No failure if you first normalize to
                # HEBREW LETTER FINAL MEM "ם"
                yield character
        else:
            yield character

phrase = "".join(convert_final_characters(phrase))

phrase
#>>> 'טעמ̀ לפת תפל מ̀עט'

This just looks for Hebrew characters where "FINAL" can be removed, and does that.
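The same name-based rule can also be used once, up front, to derive a translation table instead of being applied character by character. This is only a sketch: `build_final_form_table` and `FINAL_TO_BASE` are hypothetical names, and the scan is limited to the basic Hebrew block (U+0590 to U+05FF) by assumption, so presentation forms are not covered.

```python
import unicodedata

def build_final_form_table(first=0x0590, last=0x05FF):
    """Map each Hebrew final letter in [first, last] to its non-final form."""
    table = {}
    for code_point in range(first, last + 1):
        # Unassigned code points have no name; default to "" so they are skipped.
        name = unicodedata.name(chr(code_point), "")
        if " FINAL" not in name:
            continue
        try:
            base = unicodedata.lookup(name.replace(" FINAL", ""))
        except KeyError:
            # No character carries the name without "FINAL"; leave it unmapped.
            continue
        table[code_point] = ord(base)
    return table

FINAL_TO_BASE = build_final_form_table()

# Apply the table to the unaccented phrase from the question.
phrase = "\u05d8\u05e2\u05dd \u05dc\u05e4\u05ea \u05ea\u05e4\u05dc \u05de\u05e2\u05d8"
converted = phrase.translate(FINAL_TO_BASE)
print(converted == converted[::-1])
```

Building the table from character names keeps the five final letters (Kaf, Mem, Nun, Pe, Tsadi) out of the source code while still making the mapping easy to inspect.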

You can then also split the phrase into graphemes using the "new" regex module on PyPI (a third-party package, not the standard library's re).

import regex

# "\X" matches graphemes
graphemes = regex.findall(r"\X", phrase)  # raw string: "\X" is not a valid str escape
graphemes
#>>> ['ט', 'ע', 'מ̀', ' ', 'ל', 'פ', 'ת', ' ', 'ת', 'פ', 'ל', ' ', 'מ̀', 'ע', 'ט']

graphemes == graphemes[::-1]
#>>> True

This deals with accents and other combining characters.
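If installing regex is not an option, the two steps can be roughly combined with the standard library alone. The sketch below is an approximation, not a replacement for `\X`: it assumes a grapheme is a base character followed by any characters in a Unicode mark category (Mn, Mc, Me), which ignores other cluster rules such as emoji sequences. The function names are hypothetical.

```python
import unicodedata

def strip_final(character):
    """Return the non-final form of a Hebrew final letter, else the input."""
    name = unicodedata.name(character, "")
    if "HEBREW" in name and " FINAL" in name:
        try:
            return unicodedata.lookup(name.replace(" FINAL", ""))
        except KeyError:
            return character
    return character

def approximate_graphemes(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for character in text:
        is_mark = unicodedata.category(character).startswith("M")
        if clusters and is_mark:
            clusters[-1] += character
        else:
            clusters.append(character)
    return clusters

def is_palindrome(text):
    # NFC puts combining marks into canonical order before clustering.
    text = unicodedata.normalize("NFC", text)
    text = "".join(strip_final(character) for character in text)
    clusters = approximate_graphemes(text)
    return clusters == clusters[::-1]

# The accented phrase from above: a combining grave (U+0300) follows each Mem.
accented = (
    "\u05d8\u05e2\u05dd\u0300 \u05dc\u05e4\u05ea "
    "\u05ea\u05e4\u05dc \u05de\u0300\u05e2\u05d8"
)
print(is_palindrome(accented))
```

Reversing the string code point by code point would move each accent in front of its letter, which is why the comparison is done on clusters.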

Glaucous answered 21/6, 2014 at 14:31 Comment(0)

Here's an ugly solution that works for your current issue:

import unicodedata 

def make_map(ss):
    return [unicodedata.name(s).split(' ')[-1] for s in ss]

def is_palindrome(ss):
    return make_map(ss) == make_map(reversed(ss))

This relies on the formatting of Hebrew character names in Python's lookup table, though, so it might not generalize perfectly.

Specifically, you have:

In [29]: unicodedata.name(s[2])
Out[29]: 'HEBREW LETTER FINAL MEM'
...
In [31]: unicodedata.name(s[-3])
Out[31]: 'HEBREW LETTER MEM'

So stripping out all but the last word gives you:

In [35]: [unicodedata.name(s_).split(" ")[-1] for s_ in s]
Out[35]: ['TET', 'AYIN', 'MEM', 'SPACE', 'LAMED', 'PE', 'TAV', 'SPACE', 'TAV', 'PE', 'LAMED', 'SPACE', 'MEM', 'AYIN', 'TET']

with the same in reverse. Unicode is a big world, though, so I wouldn't be surprised if someone can construct an example that breaks this.
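One such example, sketched with illustrative helper names that are not part of the code above: two accented Latin letters whose names both end in "GRAVE" compare as equal under the last-word key. A narrower key that drops only the word "FINAL" from each name avoids that particular collision, though it is still a heuristic that depends on how Unicode names happen to be written.

```python
import unicodedata

def last_word_key(text):
    """Key used above: only the last word of each character name."""
    return [unicodedata.name(ch).split(" ")[-1] for ch in text]

def drop_final_key(text):
    """Narrower key: the full character name with the word FINAL removed."""
    return [
        " ".join(word for word in unicodedata.name(ch).split(" ") if word != "FINAL")
        for ch in text
    ]

def is_palindrome_by(key, text):
    keys = key(text)
    return keys == keys[::-1]

hebrew = "\u05d8\u05e2\u05dd \u05dc\u05e4\u05ea \u05ea\u05e4\u05dc \u05de\u05e2\u05d8"
latin = "\u00c0\u00c8"  # A WITH GRAVE followed by E WITH GRAVE: not a palindrome

print(is_palindrome_by(last_word_key, hebrew), is_palindrome_by(drop_final_key, hebrew))
print(is_palindrome_by(last_word_key, latin), is_palindrome_by(drop_final_key, latin))
```

Like the original, both keys raise ValueError on characters that have no Unicode name, such as control characters.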

Punctilio answered 20/6, 2014 at 14:28 Comment(4)

This is an interesting approach, but will fail on letters with accents, considering them all equal: "LATIN CAPITAL LETTER A WITH GRAVE", "LATIN CAPITAL LETTER E WITH GRAVE". – Bik 20/6, 2014 at 14:56

In this case, you could ignore "FINAL", which is the only difference in the character names... – Phyllode 20/6, 2014 at 15:55

@Bik I think it's likely that you can always find some strangely named Unicode character that breaks a particular approach. There are lots of unicode characters, and if you have to handle everything from "CEDILLA" to "RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK" to "VULGAR FRACTION THREE QUARTERS" to "LATIN SMALL LETTER O WITH OGONEK AND MACRON" to "INFORMATION DESK PERSON" to.... etc. I think you're SOL. – Punctilio 20/6, 2014 at 16:5

But if the OP describes some particular group of character sets that his students need to be able to handle, they can use this approach. – Punctilio 20/6, 2014 at 16:5
