Trie for Unicode character set
Asked Answered
B

1

7

I have to match an input string against a set of prefixes. The match should be the best possible, such that if there are both abcd* and abcde*, then abcdef should match abcde*. I am using Trie for this. The problem is the character in the input and in the set of prefixes can be any Unicode character. So, the children array that we have in a simple trie won't be possible(won't be sufficiently efficient, atleast, as the array size will be very large). Using map instead of array would still be inefficient. How should I go about tackling this?

Brisson answered 15/7, 2015 at 12:3 Comment(7)
I'm not sure if I understood the question properly; how does the usage of unicode as character set make the problem any harder?Wisner
Is the problem with supporting unicode (as opposed to ASCII) the storage space needed for the child array?Perilous
For a simple trie, we index the next node reference using that node's character.Brisson
@Simon, yes. Exactly.Brisson
One possibility is Bitwise Trie. But, I want to explore other options as well.Brisson
do you want to use an existing trie implementation or do you have the task to write a trie which supports unicode strings.Gasworks
@wero, existing would also do. I don't have any task as such.Brisson
P
10

To construct the trie, you could encode the Unicode strings as UTF-8, and then construct a trie with the encoded byte sequences. Or you could work with the code points, and use a hash map in your nodes. You'll have to benchmark your application to figure out which approach works best.

But the hard problem is how to decide when two string match.

Consider the word café

This can be represented as:
A = [U+0063 U+0061 U+0066 U+0065 U+0301] (ends with e and a combining accent)
But also as
B = [U+0063 U+0061 U+0066 U+00E9] (end with é, the combined form)

So:

  • Should the strings match the prefix cafe (no accent)? A starts with that prefix, B doesn't. However A and B should either both match the prefix, or not, as both represent the same word café.

  • And what if you have A in your trie, and you're trying to match B? It's the same word, so should it match?
    → You may have to convert your strings to the same normalized form when inserting in your trie and when matching.

  • There's other issues. In German a double s is normally written as ß. Should ß and ss match or not?

And it goes on. Deciding if two Unicode strings are equal is a non-trivial problem in itself. It's up to you to decide how complex you want the matching to be, it depends on your application.

Passport answered 16/7, 2015 at 6:45 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.