arrays, utf-8, locality-sensitive-hash

Order-preserving mapping from UTF-8 to an array of bytes


I'm working with an algorithm that indexes arbitrarily large unsigned integers of a known, fixed size (e.g. 64 bits or 128 bits). I'd like to be able to apply it to UTF-8 strings as well, but to do so I need a reliable way to map a string of any length to a fixed-size array of unsigned bytes in such a way that the lexicographical order of at least a prefix of the string is preserved.

The naive approach would be to simply take the first X characters of the string and give each character a full four bytes, padding each value with leading zeros as needed. However, this takes X * 4 bytes, and I'm hoping there's a more space-efficient way to do it.

---- EDIT ----

Very importantly: it is acceptable to have collisions.

Using the naive approach described above and given the strings:

['Alabama', 'Alakazam', 'Alaska', 'Arkansas', 'Corduroy']

If we set X to be 3, 'Alabama', 'Alaska', and 'Alakazam' would collide -- only three unique 12-byte values would be produced from the mapping (the 4-byte-per-character representations of 'Ala', 'Ark' and 'Cor'). However, it would be very important that those three values maintain their lexicographical ordering.

We have to use 4 bytes per character because that is (I believe) the largest size a single character can occupy in UTF-8. To guarantee that the mapping produces a byte array of a fixed size (at least in this scheme), even ASCII characters, which would normally occupy only one byte, must occupy the full four bytes.

'A' => 01000001, padded with zeros: 00000000000000000000000001000001

'l' => 01101100, padded with zeros: 00000000000000000000000001101100

'a' => 01100001, padded with zeros: 00000000000000000000000001100001

Thus, in an example where X = 3, any string starting with 'Ala' would map to:

000000000000000000000000010000010000000000000000000000000110110000000000000000000000000001100001

When viewed as a 96-bit unsigned int, it would have a value less than that of the mappings of the other prefixes from our example ('Ark' and 'Cor') and thus would satisfy the requirement that the mapping preserves our lexicographical ordering.

This scheme works but inflates the size requirement for any string by as much as 4x. The hope is to find a mapping scheme that accomplishes UTF-8 prefix indexing with fewer than X * 4 bytes.
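For concreteness, here is a rough sketch of that naive scheme in Python (the function name, and the choice to zero-pad strings shorter than X, are mine rather than part of the question): each of the first X code points is written as a big-endian 4-byte value, so every string maps to exactly X * 4 bytes.

    def naive_prefix_key(s: str, x: int = 3) -> bytes:
        key = bytearray()
        for ch in s[:x]:
            key += ord(ch).to_bytes(4, "big")   # code point, zero-padded to 4 bytes
        key += bytes(4 * (x - min(len(s), x)))  # assumption: pad short strings with zero bytes
        return bytes(key)

    words = ['Alabama', 'Alakazam', 'Alaska', 'Arkansas', 'Corduroy']
    keys = [naive_prefix_key(w) for w in words]

    # 'Alabama', 'Alakazam' and 'Alaska' all collide on the key for 'Ala',
    # but sorting the keys still reproduces the strings' lexicographical order.
    assert sorted(keys) == [naive_prefix_key(w) for w in sorted(words)]

With X = 3 this reproduces the 12-byte values and the collisions described above.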


Solution

  • Happily, it turns out that UTF-8 encoded strings can be sorted lexicographically as-is.

    Sorting order: The chosen values of the leading bytes and the fact that the continuation bytes have the high-order bits first means that a list of UTF-8 strings can be sorted in code point order by sorting the corresponding byte sequences.

    By truncating the strings' byte sequences to a fixed-length prefix, you can achieve what was described in the question above; a short sketch follows.
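
    A minimal sketch of that idea (the helper name, the fixed width, and the zero padding for short strings are mine, not from the answer): encode the string as UTF-8, keep only the first N bytes, and zero-pad shorter results so every key has the same size. Byte-wise comparison of these keys then matches code point order, with shared prefixes colliding exactly as the question allows.

        def utf8_prefix_key(s: str, width: int = 12) -> bytes:
            prefix = s.encode("utf-8")[:width]   # truncate the raw UTF-8 bytes
            return prefix.ljust(width, b"\x00")  # zero-pad to a fixed-size key

        words = ['Alabama', 'Alakazam', 'Alaska', 'Arkansas', 'Corduroy']
        keys = [utf8_prefix_key(w) for w in words]

        # Sorting the truncated keys gives the same order as sorting the strings.
        assert sorted(keys) == [utf8_prefix_key(w) for w in sorted(words)]

    If the truncation happens to split a multi-byte character, ordering is still preserved; at worst it creates a few more collisions, which the question explicitly allows. And because ASCII characters occupy a single byte here rather than four, the key is far more space-efficient than the naive scheme.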