calgorithmnlpstate-machineiphone-keypad

alphabetic search from telephone numeric pad


I'm implementing an alphabetic search based on telephone keypad, like Phone keypad1

When user types , say 2, I get {A, B, C} in the combination. When user types 23, I get {AD, AE, AF, BD, BE, BF, CD, CE, CF} in the combinations, and so on. If I keep typing and make combinations I get thousands of combinations which make search process quite slow. So now I want to implement an algorithm which delete illogical combinations like CF BD CD, I mean logically no one's name starts with these combinations, perhaps two consonants without vowel. So this way I want to narrow down my search. Anyone knowing about such state machine, implemented in C?


Solution

  • Keep in mind that when it comes to linguistic data "illogical" is not a good proxy for "unlikely." This is particularly true when it comes to names. As an example, according to a standard definition of "consonant" in English, my last name starts with four consonants. If it were to be written after a German fashion, it would start with five. When thinking about such issues it is useful to keep in mind that:

    1. Sounds are not letters, and letters are not sounds: in most orthographic systems, the mapping of letters to sounds is not 1:1
    2. Many languages have unexpected syllabic nuclei: Tamazight Berber, for instance, allows syllables where the sound m plays the role of the syllabic nucleous, like the vowel generally do in English. So a Berber name can look like CCmC (where C stands for consonants) and be perfect in that language. It is not unlikely that a person of Berber origin would then use the similar orthography in English, which a naive system would rule out as "illogical"
    3. Finally, many systems for writing foreign names and words in English use di-graphs or tri-graphs (two letter and three letter combinations) for representing the sounds of the foreign language in English: this can create what looks like illicit consontant clusters. We know that English does that (sh represents one sound, see point 1), but this is particularly true when transcribing foreign words.

    So unless you know very well the orthographic rules for the names you are expecting, you are likely to rule out legitimate names using a naive system.