Limitations of and alternatives to tries in languages other than English?

The trie data structure is often a great way to store strings in English. It works by building a tree where each edge is labeled with a letter, and the path to a marked node in the tree spells out one of the words in the data structure.

This data structure works well in English because there are "only" 26 letters in the English alphabet (a "reasonable" branching factor), those characters have consecutive ASCII values (so the child pointers can be stored in an array keyed by the index of the letters used by each child), and there are many English words with common prefixes (so there is a lot of redundancy in the structure).
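
For concreteness, here's a minimal sketch of that array-indexed layout (Python, with illustrative names, lowercase ASCII only):

    class TrieNode:
        def __init__(self):
            self.children = [None] * 26   # one slot per letter, indexed by ord(ch) - ord('a')
            self.is_word = False          # marks the end of a stored word

    def insert(root, word):
        node = root
        for ch in word:
            i = ord(ch) - ord('a')
            if node.children[i] is None:
                node.children[i] = TrieNode()
            node = node.children[i]
        node.is_word = True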

I'm a native English speaker with only limited knowledge of other languages and alphabets, but it seems like many of these properties don't hold in other languages. I know that French, Spanish, German, and Hungarian, for example, frequently use accented characters that aren't stored contiguously with the unaccented letters in Unicode. Hebrew and Arabic have vowel markings that are usually written above or below each letter. Chinese uses a logographic system, and Korean Hangul writes each syllable as a block composed of two or three smaller letters.

Do tries still work well for data stored in these languages and alphabets? What changes, if any, are necessary to use tries for this sort of data? Are there any data structures that are particularly well-suited to strings in those languages and alphabets but that would not be useful or efficient for English?

Eglantine answered 4/12, 2014 at 21:29 Comment(0)

I've found that tries work well for Western European languages, as well as for Cyrillic and many other alphabetic languages. Come to think of it, the only languages I had trouble with were Chinese, Japanese, and other logographic writing systems. And for those, the trie was useless.

The sequential Unicode values of English characters aren't really a huge benefit, although they do suggest the simple node implementation:

CharNode
    char                     // the character this node represents
    array[26] of CharNode    // one child slot per letter 'a'..'z'

That structure isn't particularly helpful. It can make things faster, but at a fairly high memory cost. Even at the second level of a trie, that array is remarkably sparse. By the time you get to the fourth or fifth level, it's almost all dead space. I did an analysis of that at one point. I'll look around and see if I still have the numbers.

I found it nearly as fast to have a variable-length array in the node, with items ordered by frequency. Beyond the second or third level of the trie, the character I was looking for was almost always in the first or second position in that array. And the space savings were quite large. Rather than 26 references per node (104 bytes in my implementation), I had a one-byte count, and then five bytes per reference. So as long as there were fewer than 21 children for a particular node (which was most of the time), I saved space. There was a small runtime penalty, but not enough in my application to matter.
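
Roughly, the idea looks like this (a Python sketch, not my original code; here the frequency ordering is approximated by moving a found child to the front of its list):

    class Node:
        def __init__(self):
            self.children = []      # variable-length list of (char, child) pairs
            self.is_word = False

        def find_child(self, ch):
            # Linear scan; a hit is moved to the front so frequently
            # searched characters end up in the first position or two.
            for i, (c, child) in enumerate(self.children):
                if c == ch:
                    if i:
                        self.children.insert(0, self.children.pop(i))
                    return child
            return None

        def add_child(self, ch):
            child = Node()
            self.children.append((ch, child))
            return child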

That's the only modification I had to make to my trie structure to make it support all of the alphabetic languages I was working with. As I said, I was working mostly with Western European languages, and for those it worked beautifully. I know that it did work with Hebrew and Arabic, but I don't know how well it worked. It met our purposes, but whether it would have satisfied a native speaker is unknown.

The trie I built worked well enough for our purposes with any language whose characters fit within the Unicode Basic Multilingual Plane. There was a little weirdness when working with surrogate pairs, but we pretty much ignored those. Basically, we just treated the surrogate pair as two characters and let it go at that.

You do have to decide whether you want to treat accented characters as separate characters, or whether you want to map them. Consider, for example, the French word "garçon," which some people will spell "garcon," either because they don't know better or because they don't know how to type the character 'ç'. Depending on what you're using the trie for, you might find it useful to convert accented characters to their unaccented equivalents. But I suppose that's more of an input cleansing problem than a trie problem.
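
If you do decide to map them, one common approach (a sketch, not necessarily what we did) is to decompose the string and drop the combining marks:

    import unicodedata

    def fold_accents(s):
        # Decompose precomposed characters, then drop the combining marks,
        # so "garçon" follows the same trie path as "garcon".
        decomposed = unicodedata.normalize("NFD", s)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    assert fold_accents("garçon") == "garcon"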

That's my fairly long-winded way of saying that a standard trie should work well for any alphabetic language, without any language-specific modifications. I don't see any obvious way to use a trie for a logographic language. I know nothing about Korean Hangul, so I can't say whether a trie would be useful there.

Jacquline answered 4/12, 2014 at 22:7 Comment(3)
Along the lines of input cleansing, for the logographic writing systems it seems like using the romanizations might help. – Bohaty
@Nuclearman: I suppose the romanizations could help if you have a good dictionary. Never gave it much thought. Interesting idea. – Jacquline
Another approach is to note that each character can be generated via a specific combination of keys on a keyboard designed for that language. It should be possible to do a reverse lookup to find the specific combination. Although that requires a kind of dictionary as well. – Bohaty

As an addendum to @JimMischel's answer, I'd like to bring up the issue that in other languages the same text can often be written in multiple equivalent ways. Vietnamese (which uses the Latin script) is a particularly good example, since letters carrying two accents are common. For example, Ặ (U+1EB6) can also be encoded as the sequences Ă + dot, Ạ + breve, A + breve + dot, or A + dot + breve.

Unicode normalization solves this problem by converting a string to a standardized canonical form. There are four variants: NFC, NFKC, NFD, and NFKD. I won't go into too much detail here, but the first two are "composed forms", which tend to shorten the string by grouping base characters with their accents, while the last two are "decomposed forms", which do the opposite.
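
Normalizing every key before inserting it into (or looking it up in) the trie makes those equivalent spellings collide. A small Python illustration:

    import unicodedata

    precomposed = "\u1EB6"          # Ặ as a single code point
    decomposed = "A\u0323\u0306"    # A + combining dot below + combining breve

    assert precomposed != decomposed                                 # the raw strings differ
    assert unicodedata.normalize("NFC", decomposed) == precomposed   # both compose to U+1EB6
    assert unicodedata.normalize("NFD", precomposed) == decomposed   # both decompose to A + marks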

Hangul is an interesting case: it is an alphabet, but all the letters of a syllable are written together in a block, and both the individual letters (jamo) and the precomposed syllable blocks exist in Unicode. Normalization helps here too, though the number of distinct syllable blocks is quite large (Unicode encodes 11,172 of them), so the composed forms NFC/NFKC are probably not useful for a trie; decomposing syllables into their constituent letters with NFD/NFKD works better.
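
A quick Python check of that decomposition (the syllable 한 is just an example):

    import unicodedata

    syllable = "\uD55C"                            # 한, a precomposed syllable block
    jamo = unicodedata.normalize("NFD", syllable)
    # NFD splits the block into its three conjoining jamo: ᄒ + ᅡ + ᆫ
    assert [ord(c) for c in jamo] == [0x1112, 0x1161, 0x11AB]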

A few other unrelated points to consider:

  • In addition to the garçon/garcon point already brought up, there is the cote/coté/côte/côté problem: those are four distinct French words. Similarly, vowel marks in Hebrew and Arabic are not usually mandatory, which can occasionally cause ambiguities.
  • The alphabets[1] of South and Southeast Asia can get large compared to English, roughly twice the size.

  [1] Strictly speaking they are abugidas, where vowels are written as diacritics/accents, but this distinction can usually be ignored from a programming point of view.
Marcus answered 15/12, 2014 at 3:55 Comment(0)
