ignoring hebrew vowels when comparing strings

Good evening, I hope you can help me with this problem, as I'm struggling to find a solution.

I have a word provider that gives me vowelled Hebrew words, for example:

Vowelled - בַּיִת not vowelled - בית

Vowelled - הַבַּיְתָה not vowelled - הביתה

Unlike my provider, my user can't normally enter Hebrew vowels (nor would I want them to). The user story is a user searching for a word among the provided words. The problem is the comparison between the vowelled and unvowelled words: since each is represented by a different byte array in memory, the equals method returns false.

I tried looking into how UTF-8 handles Hebrew vowels, and it seems they are just ordinary characters.

I do want to show the vowels to the user, so I want to keep the string as-is in memory, but ignore the vowels when comparing. Is there a simple way to solve this problem?

Frederic answered 6/10, 2012 at 20:17 Comment(3)
It may help to provide a little extra background on the subject of Hebrew vowels (many readers will be unfamiliar with the subject). Is it plausible that you can maintain a list of pairs of characters you wish to be considered equal? If so, the question simplifies to implementing a custom String comparison method that factors in these equivalent characters. Towardly
I would create a function that strips vowels from strings, and then use this function before comparing the strings with String.equals. (This could probably be done with String.replace and a char array of all Hebrew vowels.) Greeneyed
What information did you find missing? I don't want to re-implement String's equals, nor do I want to keep a mapping of all the vowels. I would rather get it from some external library... Frederic
C
6

You can use a Collator. I can't tell you exactly how it works, as it's new to me, but this appears to do the trick:

import java.text.Collator;
import java.util.Locale;

public class HebrewCompare {
    public static void main( String[] args ) {
        String withVowels = "בַּיִת";
        String withoutVowels = "בית";

        String withVowelsTwo = "הַבַּיְתָה";
        String withoutVowelsTwo = "הביתה";

        // Plain String.equals compares every character, vowel points included.
        System.out.println( "These two strings are " + (withVowels.equals( withoutVowels ) ? "" : "not ") + "equal" );
        System.out.println( "The second two strings are " + (withVowelsTwo.equals( withoutVowelsTwo ) ? "" : "not ") + "equal" );

        // A Hebrew collator at PRIMARY strength compares base letters only.
        Collator collator = Collator.getInstance( new Locale( "he" ) );
        collator.setStrength( Collator.PRIMARY );

        System.out.println( collator.equals( withVowels, withoutVowels ) );
        System.out.println( collator.equals( withVowelsTwo, withoutVowelsTwo ) );
    }
}

From that, I get the following output:

These two strings are not equal
The second two strings are not equal
true
true
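
If you'd rather not repeat the collator setup at every call site, it can probably be kept in one place. Here is a minimal sketch, assuming a hypothetical helper class `HebrewLookup` (the class and method names are mine, not part of any library). It relies on the same PRIMARY-strength behaviour shown in the output above, and uses `Locale.forLanguageTag( "he" )` instead of the `Locale` constructor only because newer JDKs deprecate the constructor:

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (an assumption, not a library class): keeps the collator in one place.
class HebrewLookup {
    private final Collator collator;

    HebrewLookup() {
        collator = Collator.getInstance( Locale.forLanguageTag( "he" ) );
        // PRIMARY strength compares base letters only, so vowel points are ignored.
        collator.setStrength( Collator.PRIMARY );
    }

    // True when the two words match once vowel points are disregarded.
    boolean sameWord( String a, String b ) {
        return collator.compare( a, b ) == 0;
    }

    // Returns every provided (vowelled) word that matches the user's query.
    List<String> find( String query, List<String> providedWords ) {
        List<String> matches = new ArrayList<>();
        for ( String word : providedWords ) {
            if ( sameWord( word, query ) ) {
                matches.add( word );
            }
        }
        return matches;
    }
}
```

The provided strings stay untouched, so the vowelled form is still available for display.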
Commutate answered 6/10, 2012 at 20:37 Comment(1)
Thanks. It didn't completely solve my problem, because I don't want to use a collator everywhere, but it is easy to continue from here. Thanks again :) Frederic
M
1

AFAIK there isn't. The vowel points are characters in their own right, and some letter-plus-dot combinations also exist as single characters. See the Wikipedia page.

http://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet

You can store the search key for each word using only characters in the 05Dx-05Ex range (the Hebrew letters, U+05D0 to U+05EA), and add another field that holds the word with its vowels.
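
A minimal sketch of deriving such a search key, assuming a hypothetical helper `NikkudKey.searchKey` (my naming). Note that it does not filter on the 05Dx-05Ex range directly; instead it strips Unicode nonspacing marks after NFD normalization, which also handles precomposed presentation forms and leaves spaces and non-Hebrew text alone:

```java
import java.text.Normalizer;

// Hypothetical helper (an assumption, not a library class): builds a vowel-free search key.
class NikkudKey {
    static String searchKey( String word ) {
        // NFD splits precomposed forms (e.g. U+FB31, bet with dagesh)
        // into a base letter followed by its combining mark.
        String decomposed = Normalizer.normalize( word, Normalizer.Form.NFD );
        // The Mn category (nonspacing marks) covers Hebrew points and cantillation marks.
        return decomposed.replaceAll( "\\p{Mn}", "" );
    }
}
```

The key would be stored next to the original word, and the user's input compared against the key with a plain `equals`.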

Of course, you should expect the following:

  • You will need to account for words that have different meanings depending on the nikkud.
  • You should take into account "misspellings" of י and ו, which are commonplace.
Marsden answered 6/10, 2012 at 20:39 Comment(1)
Well, thank you for your answer, but @Commutate already gave the solution I needed. As for your two user stories, I am aware of the first one. The second one is exactly like misspelling a word in English. Are you familiar with a simple solution for spell checking? Frederic
