language-agnostic, string-comparison, homoglyph

Is there a function to compare two strings using a custom homoglyph list?


I need a function that compares two strings and outputs an edit distance like Levenshtein, but only when the differing characters are homoglyphs in cursive. I have a list of those homoglyphs, so I could feed a custom list to this function.

Example

homoglyphs = [["o","a"],["rn","m","nn"],...] // In cursive these look alike

compare("Mory", "Mary", homoglyphs) // Levenshtein gives 1
compare("Mory", "Tory", homoglyphs) // Levenshtein gives 1, but I want false, 99 or -1

compare("Morio", "Mario", homoglyphs) // I expect a distance of 1
compare("Morio", "Maria", homoglyphs) // I expect a distance of 2

Tory should give a false result, since there's no way someone would misread an M as a T. An A could be misread as an O, so it can count as 1.

The scoring could be different. I just need to know that Mory is probably Mary, not Tory, and that Morio is a little more likely to be Mario than Maria.

Does something like this exist?


Solution

  • The key to your problem can be thought of as an IQ-test word analogy.

      Sound        Glyph
    ---------  =  ---------
    Homophone     Homoglyph
    

    If you know that there is a way to find similar-sounding words (homophones), then the same technique can be applied with glyphs (homoglyphs) in place of sounds.

    The way to find similar-sounding words is via Soundex (Sound Index).

    So just do what Soundex does, but instead of a mapping between similar sounds, use a mapping between similar glyphs.

    Once you convert each input word (a sequence of glyphs) into a Glyphdex (Glyph Index), you can compute the Levenshtein distance between the two Glyphdexes. Words that differ only by homoglyphs from your list end up with the same Glyphdex.

    Make sense?


    If you are into cellular biology, then codon translation into amino acids (ref) might make more sense. Many amino acids are coded by more than one three-letter codon, just as one Glyphdex entry stands for more than one look-alike glyph.


    Note: Since the word Glyphdex was used before I wrote this, I cannot say I coined it. However, the usages I currently find via Google (search) are not in the same context as described here. So in the context of converting a sequence of glyphs into an index of similar glyph sequences, I will take credit.
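
The Glyphdex idea described above can be sketched in Python as follows. This is only an illustration under a few assumptions that are not in the original answer: the first entry of each homoglyph group is used as its canonical form, longer glyph sequences such as "rn" are matched before shorter ones, and matching is case-sensitive. The names `build_glyph_map`, `glyphdex`, `glyphdex_distance` and the gated `compare` are hypothetical. The `compare` function is one possible way to get the scoring asked for in the question, not part of the Soundex-style approach itself.

```python
def build_glyph_map(groups):
    """Map every homoglyph to the first member of its group (assumed canonical form)."""
    return {glyph: group[0] for group in groups for glyph in group}


def glyphdex(word, glyph_map):
    """Convert a word into a tuple of canonical glyph tokens, longest match first."""
    max_len = max(map(len, glyph_map), default=1)
    tokens, i = [], 0
    while i < len(word):
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in glyph_map:
                tokens.append(glyph_map[chunk])
                i += size
                break
        else:
            # No homoglyph group covers this character; keep it unchanged.
            tokens.append(word[i])
            i += 1
    return tuple(tokens)


def levenshtein(a, b):
    """Plain Levenshtein distance between two sequences (strings or tuples)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def glyphdex_distance(a, b, groups):
    """Levenshtein distance between the two Glyphdexes, as the answer suggests."""
    glyph_map = build_glyph_map(groups)
    return levenshtein(glyphdex(a, glyph_map), glyphdex(b, glyph_map))


def compare(a, b, groups):
    """Hypothetical scoring: -1 unless the words differ only by homoglyphs,
    otherwise the ordinary Levenshtein distance of the original strings."""
    if glyphdex_distance(a, b, groups) != 0:
        return -1
    return levenshtein(a, b)
```

With the two groups spelled out in the question (the elided rest of the list is left out here), the sketch behaves like this:

```python
homoglyphs = [["o", "a"], ["rn", "m", "nn"]]

glyphdex_distance("Mory", "Mary", homoglyphs)  # 0, same Glyphdex
glyphdex_distance("Mory", "Tory", homoglyphs)  # 1, M and T are not homoglyphs
compare("Mory", "Tory", homoglyphs)            # -1
compare("Morio", "Mario", homoglyphs)          # 1
compare("Morio", "Maria", homoglyphs)          # 2
```

Comparing the Glyphdexes alone cannot tell Mario from Maria, since both collapse to the same index. That is why the hypothetical `compare` uses the Glyphdex only as a gate and falls back to the plain distance for ranking.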