stringdata-structuresinternationalizationtrie

Limitations of and alternatives to tries in languages other than English?


The trie data structure is often a great way to store strings in English. It works by building a tree where each edge is labeled with a letter, and the path to a marked node in the tree spells out one of the words in the data structure.

This data structure works well in English because there are "only" 26 letters in the English alphabet (a "reasonable" branching factor), those characters have consecutive ASCII values (so the child pointers can be stored in an array keyed by the index of the letters used by each child), and there are many English words with common prefixes (so there is a lot of redundancy in the structure).

I'm a native English speaker with only limited knowledge of other languages and alphabets, but it seems like many of these properties don't hold in other languages. I know that French, Spanish, German, and Hungarian, for example, frequently use accented characters that aren't stored continuously with the remaining letters in Unicode space. Hebrew and Arabic have vowel markings that are usually indicated above or below each letter. Chinese uses a logogram system, and Korean Hangul characters consist of triples of smaller characters grouped together.

Do tries still work well for data stored in these languages and alphabets? What changes, if any, are necessary to use tries for this sort of data? Are there any data structures that work well for strings in those languages and alphabets that are particularly well-suited to them but would not be useful or efficient in English?


Solution

  • As an addendum to @JimMischel's answer, I'd like to bring up the issue that in other languages there are often multiple equivalent ways to write the same thing. Vietnamese (based on the Latin/English script) is a particularly good example where letters with two accents are common. For example, Ặ (U+1EB6) can technically also be written with the sequences Ă + dot, Ạ + breve, A + breve + dot, A + dot + breve.

    Unicode normalization can solve this problem by converting a string to a standardized canonical order. There are 4 different variations, NFC, NFKC, NFD, and NFKD. I won't go into too much detail here, but the first two are "composed forms" which tends to shorten the string, grouping base characters with its accents, while the last two are "decomposed forms", doing the opposite.

    Hangul is an interesting case: It is an alphabet, though all the letters of a syllable are written together in a block. Both the individual letters and the syllabic blocks exist in Unicode. Normalization can solve this, though the number of distinct syllables is quite large. Using NFC/NFKC might not be useful for a trie, but in this case, using NFD/NFKD to decompose syllables to the constituent letters would work.

    A few other unrelated points to consider:


    1. They are strictly termed abugidas, where vowels are written as diacritics/accents, but this distinction can usually be ignored from a programming point of view.