pythonrlevenshtein-distanceedit-distanceeye-tracking

Levenshtein distance with weight/penalty for adjacency


I am using the string-edit distance (Levenshtein-distance) to compare scan paths from an eye tracking experiment. (Right now I am using the stringdist package in R)

Basically the letters of the strings refer to (gaze) position in a 6x4 matrix. The matrix is configured as follows:

     [,1] [,2] [,3] [,4]
[1,]  'a'  'g'  'm'  's' 
[2,]  'b'  'h'  'n'  't'
[3,]  'c'  'i'  'o'  'u'
[4,]  'd'  'j'  'p'  'v'
[5,]  'e'  'k'  'q'  'w'
[6,]  'f'  'l'  'r'  'x'

If I use the basic Levenshtein distance to compare strings, the comparison of a and g in a string gives the same estimate as comparicon of a and x.

E.g.:

'abc' compared to 'agc' -> 1
'abc' compared to 'axc' -> 1

This means that the strings are equally (dis)similar

I would like to be able to put weights on the string comparison in a way that incorporates adjacency in the matrix. E.g. the distance between a and x should be weighted as larger then that between a and g.

One way could be to calculate the "walk" (horizontal and vertial steps) from one letter to the other in the matrix and divide by the max "walk"-distance (i.e. from a to x). E.g. the "walk"-distance from a to g would be 1 and from a to x it would be 8 resulting in a weight of 1/8 and 1 respectively.

Is there a way to implement this (in either R or python)?


Solution

  • You need a version of the Wagner-Fisher algorithm that uses non-unit cost in its inner loop. I.e. where the usual algorithm has +1, use +del_cost(a[i]), etc. and define del_cost, ins_cost and sub_cost as functions taking one or two symbols (probably just table lookups).