stringencodinghamming-distancegray-code

Encoding for strings (preferrably a value) such that closer values means more similar Strings?


I am looking for an encoding which can encode every string into a unique number such that ->

  1. Every two strings which are similar must have values close to each other.
  2. Every two values which are close to each other must represent similar strings.

Similarity of strings would mean that a few substitutions in one string can form another string. No additions or deletions are considered.

The string can only have characters A, C, T and G (only four possibilities)

Things I have tried ->

  1. Gray code -> It satisfies the second one but doesn't satisfy the first criteria. Two string which are similar need not means they have closer values in gray code.

  2. Hamming Distance from a reference string -> Clearly if the hamming distance is the same it does not mean at all that the strings are similar, just that they are equally far from the reference. So it does not satisfy the second criteria.

Please suggest a method if you know any for this particular problem.


Solution

  • I think what you're looking for is a space filling curve: A coloured Hilbert Curve

    Consider a string to be an N dimensional vector of chars, and have a corresponding point in N dimensional space. Any two strings will have a manhattan distance equal to the sum of the differences of their characters, so strings which are close together in this representation are similar strings.

    We transform our N dimensional vector into a number between 0 and n, where n is the highest possible valued string, using a Hilbert curve. In the image, we only have two dimensions, but Hillbert curves can be generalised to higher dimensions.

    If you look at the image, the line is continuous, and thus satisfies condition 2. Hillbert curves are essentially a generalised gray code.

    For the most part, condition 1 is also true. If you look at the image, the colour of the Hilbert curve changes slowly over it's length. The colour between adjacent areas of the Hilbert curve is normaly quite similar, the exception in this case would be halfway down the left side, where the colour changes from orange to blue. However, Hillbert curves will fill a small area before ever moving onto the next one, therefore most similar strings will have an integer representation close to one another. It's not perfect but it's fairly good.