
UTF-8-safe truncation of a Go string to at most N bytes


I sometimes find myself needing to truncate a string to fit within a specific number of bytes. The problem with doing that in Go is that if you do s[:1_000_000], given that s is a valid UTF-8 string, you might end up cutting right in the middle of a UTF-8 encoded code point, which may be 1 to 4 bytes long, leaving you with an invalid rune at the end.
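
To make the failure mode concrete, here is a minimal sketch. The sample string and the helper name `naiveTruncate` are made up for illustration and are not part of my real code.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// naiveTruncate (hypothetical name) slices by byte offset with no regard
// for rune boundaries.
func naiveTruncate(s string, n int) string {
	if n >= len(s) {
		return s
	}
	return s[:n]
}

func main() {
	s := "a世界" // 1 + 3 + 3 = 7 bytes

	// Cutting at byte 2 keeps only the first byte of "世".
	cut := naiveTruncate(s, 2)
	fmt.Println(len(cut), utf8.ValidString(cut)) // 2 false

	// Cutting at byte 4 happens to land on a rune boundary.
	cut = naiveTruncate(s, 4)
	fmt.Println(len(cut), utf8.ValidString(cut)) // 4 true
}
```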

Some people (and the LLMs trained on their ideas) would attempt to use utf8.ValidString or for i := range s to do this, as both of those would ensure the result ends on a valid rune. However, both scan the string from the beginning, so those people would be doing a constant-time task in linear time.
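
For comparison, the linear-time approach might look roughly like the sketch below. The name `truncateLinear` is hypothetical, and this is only one way such a loop could be written: it walks the rune boundaries from the start of the string and keeps the last one that does not exceed `n`, so its cost grows with `n`.

```go
package main

import "fmt"

// truncateLinear (hypothetical name) walks the string rune by rune and
// remembers the last rune boundary at or before byte offset n.
// It visits up to n bytes of s, so it runs in O(n) time.
func truncateLinear(s string, n int) string {
	if n >= len(s) {
		return s
	}
	end := 0
	// Ranging over a string yields the byte index at which each rune starts.
	for i := range s {
		if i > n {
			break
		}
		end = i
	}
	return s[:end]
}

func main() {
	fmt.Printf("%q\n", truncateLinear("a世界", 2)) // "a"
	fmt.Printf("%q\n", truncateLinear("a世界", 4)) // "a世"
}
```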

I wrote a constant-time safe-truncate function:

import "unicode/utf8"

// UTF8SafeTruncateNBytes truncates a **valid** UTF-8 string `s` to at most `n` bytes (not `n` UTF-8 characters),
// ensuring that the string is not cut in the middle of a UTF-8 character.
func UTF8SafeTruncateNBytes(s string, n int) string {
    if n >= len(s) {
        return s
    }
    for i := n; i >= n-3 && i >= 0; i-- {
        if utf8.RuneStart(s[i]) {
            return s[:i]
            // Edit: This was:
            //if r, size := utf8.DecodeRuneInString(s[i:]); r != utf8.RuneError {
            //  return s[:i+size]
            //}
            // but it was fixed in response to the accepted solution. This implementation is now correct
            // and, in fact, equivalent, except that it only checks one byte instead of backing up forever.
        }
    }

    // Fallback for the case where the caller lied and passed a string that is not valid UTF-8.
    // If this were a standard-library function, it would be wise to return an error or "" here
    // so that the caller can check for it.
    return s[:n]
}
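
A few spot checks make the intended behaviour concrete. The sketch below repeats the loop from the function above under the hypothetical name `safeTruncate`. The guard for non-positive `n` is an assumption added for this sketch and is not part of the function as posted; without it, a negative `n` would skip the loop and reach `s[:n]`, which panics.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// safeTruncate (hypothetical name) mirrors the loop above.
// The n <= 0 guard is an addition made for this sketch.
func safeTruncate(s string, n int) string {
	if n <= 0 {
		return ""
	}
	if n >= len(s) {
		return s
	}
	// A UTF-8 sequence is at most 4 bytes long, so in a valid string a
	// rune start lies at most 3 bytes before any given position.
	for i := n; i >= n-3 && i >= 0; i-- {
		if utf8.RuneStart(s[i]) {
			return s[:i]
		}
	}
	return s[:n] // only reachable when s is not valid UTF-8
}

func main() {
	cases := []struct {
		s    string
		n    int
		want string
	}{
		{"世", 1, ""},       // cut inside a 3-byte rune: drop it entirely
		{"世", 3, "世"},      // n >= len(s): unchanged
		{"a世界", 4, "a世"},   // n is already a rune boundary
		{"a世界", 6, "a世"},   // back up two continuation bytes
		{"hello", 2, "he"}, // ASCII: every byte is a rune start
		{"hello", -1, ""},  // handled by the added guard
	}
	for _, c := range cases {
		got := safeTruncate(c.s, c.n)
		fmt.Printf("%q, %d -> %q (ok=%v)\n", c.s, c.n, got, got == c.want)
	}
}
```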

The questions are as follows:

  1. Will this work or is there an edge case I missed?
  2. Is there a better, more elegant way to do this that I missed, or a standard library function that already does this?
  3. Why is this not a standard library function under "unicode/utf8"? It seems to have just the right level of use frequency and complexity to warrant a standard library function. Should I propose it on the Go issue tracker?

Solution

    1. Your original implementation (the version now commented out in the question), while very well-motivated, is not correct. It can return more than n bytes:
    UTF8SafeTruncateNBytes("世", 1) // "世" (len = 3)
    
    2. You should consider using an existing, optimized implementation, https://pkg.go.dev/tailscale.com/util/truncate:
    import "tailscale.com/util/truncate"
    truncate.String("世", 1) // "" (len = 0)
    
    3. A proposal to include this in the standard library wouldn't hurt, but be aware that it could be rejected for reasons similar to those given in https://github.com/golang/go/issues/56885.