If I have some string to be searched in UTF-8 format and another to search for, also in UTF-8 format, are there any caveats to doing a straight up comparison search for the codepoint to pinpoint a matching character?
With the way UTF-8 works, is it ever possible to get a false positive?
I've read a lot of documentation about how great UTF-8 is but I'm having trouble forming a proof to answer this question.
If I search forward then I could skip along the length of a codepoint; but it's walking the string in reverse which worries me.
Instead of walking backwards until I hit the start of a codepoint and then doing a memory comparison from that address, is it safe to simply walk backwards along each byte until I get a full match against the search string?
Nope. There are no caveats here; this operation is perfectly safe in UTF-8.
Recall that UTF-8 represents characters using two general forms:
- ASCII characters (U+0000 through U+007F), which are all represented literally using a single byte in the range 0x00-0x7F.
- All other characters, which are represented by a sequence which includes:
  - A leading byte in the range 0xC2-0xF4, which encodes part of the character data as well as the length of the sequence to follow.
  - One or more continuation bytes in the range 0x80-0xBF, each of which encodes part of the remainder of the character.

Since there is no overlap between leading and continuation bytes, accidentally starting a search in the middle of a multi-byte character is fine. You won't find your match there, because the string you're searching for won't start with a continuation byte, but you won't find any false positives either. A match can't end partway through a character either, since the leading byte fixes how many continuation bytes follow. All of this assumes both strings are well-formed UTF-8.
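To make the argument concrete, here is a minimal sketch in Python of the byte-by-byte reverse walk described in the question. The names `is_continuation` and `rfind_bytes` are made up for this illustration (Python's built-in `bytes.rfind` already does the same job), and the sample strings are arbitrary. It assumes both inputs are valid UTF-8.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes (0x80-0xBF, bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def rfind_bytes(haystack: bytes, needle: bytes) -> int:
    """Walk backwards one byte at a time; return the last match offset, or -1.

    Hypothetical helper: no attempt is made to align on a character
    boundary before comparing.
    """
    for start in range(len(haystack) - len(needle), -1, -1):
        if haystack[start:start + len(needle)] == needle:
            return start
    return -1


# Escapes are used so the exact code points are unambiguous:
# U+00EF and U+00E9 are 2-byte sequences, U+65E5/U+672C/U+8A9E are 3-byte.
text = "na\u00efve caf\u00e9, \u65e5\u672c\u8a9e, caf\u00e9"
haystack = text.encode("utf-8")
needle = "\u00e9".encode("utf-8")  # bytes C3 A9

offset = rfind_bytes(haystack, needle)

# The match starts on a character boundary, never on a continuation byte.
assert offset != -1 and not is_continuation(haystack[offset])
# Everything before the match decodes cleanly, so no character was split.
assert haystack[:offset].decode("utf-8") == text[:-1]

# U+00A9 is C2 A9: it shares its last byte with U+00E9 (C3 A9),
# but the differing leading byte rules out a false positive.
assert rfind_bytes(haystack, "\u00a9".encode("utf-8")) == -1
```

The loop visits every byte offset, including ones that land inside a multi-byte character. Those offsets simply fail the comparison, because the first byte of a well-formed needle is never in the 0x80-0xBF range.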