c++ unicode encoding utf-8 unicode-string

Any caveats when searching for a UTF-8 code point in a string?


If I have some string to be searched in UTF-8 format and another to search for, also in UTF-8 format, are there any caveats to doing a straight up comparison search for the codepoint to pinpoint a matching character?

With the way UTF-8 works, is it ever possible to get a false positive?

I've read a lot of documentation about how great UTF-8 is but I'm having trouble forming a proof to answer this question.

If I search forward then I could skip along the length of a codepoint; but it's walking the string in reverse which worries me.

Instead of walking backwards until I hit the start of a codepoint and then doing a memory comparison from that address, is it safe to simply walk backwards along each byte until I get a full match against the search string?


Solution

  • Nope. There are no caveats here; this operation is perfectly safe in UTF-8.

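    Recall that UTF-8 represents characters using two general forms:

      • Single-byte characters, whose only byte has the form 0xxxxxxx.
      • Multi-byte characters, which start with a leading byte of the form 110xxxxx, 1110xxxx, or 11110xxx, followed by one to three continuation bytes of the form 10xxxxxx.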

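    Since there is no overlap between leading and continuation bytes, accidentally starting a comparison in the middle of a multi-byte character is fine. The comparison at that position simply fails, because the string you're searching for won't start with a continuation byte, and for the same reason you won't find any false positives: a byte-for-byte match can only begin on a character boundary. This assumes both strings are valid UTF-8 and the search string is made up of whole code points.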
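
To make the argument concrete, here is a minimal sketch of the byte-by-byte reverse search the question describes. The names `is_continuation_byte` and `find_last_utf8` are hypothetical helpers invented for this illustration, not part of any library, and the sketch assumes C++17, that both strings are valid UTF-8, and that the search string consists of whole code points.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.
inline bool is_continuation_byte(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Hypothetical helper: returns the byte offset of the last occurrence of
// `needle` in `haystack`, or std::string_view::npos if there is none.
// Assumption: both arguments are valid UTF-8 and `needle` holds whole
// code points, so it never starts with a continuation byte.
inline std::size_t find_last_utf8(std::string_view haystack,
                                  std::string_view needle) {
    assert(needle.empty() || !is_continuation_byte(needle.front()));
    if (needle.size() > haystack.size()) {
        return std::string_view::npos;
    }
    // Walk backwards one byte at a time, with no attempt to skip to
    // character boundaries. Positions that fall on a continuation byte
    // can never match, because the needle's first byte is not one.
    for (std::size_t i = haystack.size() - needle.size() + 1; i-- > 0;) {
        if (haystack.compare(i, needle.size(), needle) == 0) {
            return i;
        }
    }
    return std::string_view::npos;
}
```

For example, with the haystack bytes `"x\xC3\xA9y\xE2\x82\xACz\xC3\xA9"` (x, é, y, €, z, é), searching for `"\xC3\xA9"` (é) returns offset 8, while searching for `"\xC2\xA9"` (©) returns `npos` even though the byte `0xA9` occurs in the haystack. In practice `std::string::rfind` or `std::string_view::rfind` performs the same byte-wise search; the loop is spelled out only to mirror the question.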