I sometimes find myself needing to truncate a string to fit within a specific number of bytes. The problem with doing that in Go is that if you do s[:1_000_000], given that s is a valid UTF-8 string, you might end up cutting right in the middle of an encoded code point, which may be 1 to 4 bytes long, leaving you with an invalid rune.
Some people (and the LLMs trained on their ideas) would attempt to use utf8.ValidString, or for i := range s, to do this, as both of those would ensure a valid rune. However, those people would be doing a constant-time task in linear time.
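For comparison, the range-based approach might look roughly like the sketch below. truncateByRange is a hypothetical name, and this is only one way such code could be written. It gives a correct result for valid input, but it decodes every rune up to the cut point, so the work grows with n.

```go
package main

import "fmt"

// truncateByRange is a hypothetical linear-time version: it walks the
// rune start offsets and keeps the last one that is <= n.
func truncateByRange(s string, n int) string {
	if n >= len(s) {
		return s
	}
	end := 0
	for i := range s { // i is the byte offset where each rune starts
		if i > n {
			break
		}
		end = i
	}
	return s[:end]
}

func main() {
	fmt.Printf("%q\n", truncateByRange("a世b", 3)) // prints "a"
}
```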
I wrote a constant-time safe-truncate function:
import "unicode/utf8"

// UTF8SafeTruncateNBytes truncates a **valid** UTF-8 string `s` to `n` bytes (not n UTF-8 characters),
// ensuring that the string is not truncated in the middle of a UTF-8 character.
func UTF8SafeTruncateNBytes(s string, n int) string {
	if n >= len(s) {
		return s
	}
	for i := n; i >= n-3 && i >= 0; i-- {
		if utf8.RuneStart(s[i]) {
			return s[:i]
			// Edit: This was:
			//if r, size := utf8.DecodeRuneInString(s[i:]); r != utf8.RuneError {
			//	return s[:i+size]
			//}
			// but got fixed because of the picked solution. This implementation is now correct,
			// and, in fact, equivalent, except that it only checks one byte instead of backing up forever.
		}
	}
	// Fallback in the case that the user lied, and passed a string that is not a valid UTF-8 string.
	// It would be wise to return an error or "" here if this is a standard-library
	// function to allow the user to check for it.
	return s[:n]
}
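The fallback comment suggests signalling invalid input to the caller instead of returning s[:n]. A minimal sketch of that idea follows, assuming a boolean result is an acceptable way to report it. The name truncateChecked and the handling of negative n are my own assumptions, not part of the function above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateChecked is a hypothetical variant of UTF8SafeTruncateNBytes.
// It returns ok == false when n is negative or when no rune start is
// found within the 4 bytes ending at s[n], which cannot happen in
// valid UTF-8. Note that ok == true does not prove s is valid overall.
func truncateChecked(s string, n int) (string, bool) {
	if n < 0 {
		return "", false
	}
	if n >= len(s) {
		return s, true
	}
	// A rune has at most 3 continuation bytes, so the nearest
	// boundary at or before n lies in the range n-3..n.
	for i := n; i >= n-3 && i >= 0; i-- {
		if utf8.RuneStart(s[i]) {
			return s[:i], true
		}
	}
	return "", false
}

func main() {
	for _, n := range []int{0, 1, 3, 4, 5} {
		out, ok := truncateChecked("a世b", n)
		fmt.Printf("n=%d out=%q ok=%v\n", n, out, ok)
	}
}
```

With the input "a世b" (5 bytes), n=3 falls inside '世' and yields "a", while n=4 sits on a boundary and yields "a世".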
The questions are as follows:
Why is there no such function in "unicode/utf8"? It seems like it's just the right level of use frequency and complexity to warrant having a standard library function. Should I propose it on their issues page?
UTF8SafeTruncateNBytes("世", 1) // was "世" (len = 3) before the fix noted in the code; now "" (len = 0)
import "tailscale.com/util/truncate"
truncate.String("世", 1) // "" (len = 0)