string, go, utf-8, slice, rune

Splitting a rune correctly in golang


I'm wondering whether there is an easy way, such as well-known functions for handling code points/runes, to take a chunk out of the middle of a rune slice without corrupting it, or whether it all needs to be coded by hand to get the result down to a maximum number of bytes or fewer.

Specifically, I want to pass a string to a function, convert it to runes so that I can respect code points, and, if the string is longer than some maximum number of bytes, remove enough runes from the center to bring the byte length down to what's necessary.

This is simple math if the strings contain only single-byte characters, and can be handled something like this:

func shortenStringIDToMaxLength(in string, maxLen int) string {
    if len(in) > maxLen {
        excess := len(in) - maxLen
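        // Center the cut on the middle of the input; maxLen/2 - excess/2
        // would go negative whenever excess > maxLen.
        start := len(in)/2 - excess/2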
        return in[:start] + in[start+excess:]
    }
    return in
}

In a string with variable-width characters, though, either this takes a fair bit more code looping through the runes, or there are functions that make it easy. Does anyone have a code sample showing how best to handle this with runes?

The DB field the string will go into has a fixed maximum length in bytes, not code points, so the algorithm has to go from runes to a maximum number of bytes. Taking the characters from the middle of the string is just a requirement of this particular program.
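To make the bytes-versus-code-points distinction concrete, here is a minimal sketch. The helper name byteAndRuneLen is made up for illustration; len and utf8.RuneCountInString are the standard ways to get the two counts.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneLen (hypothetical helper) returns the length of s in bytes
// and in code points. A byte-limited DB field cares about the first value.
func byteAndRuneLen(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	b, r := byteAndRuneLen("héllo, 世界")
	fmt.Println(b, r) // 14 bytes, 9 code points
}
```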

Thanks!

EDIT:

Once I found out that the range operator respects runes on strings, which I learned from the great answers below, this became easy to do with just strings. I shouldn't have to worry about the string being well-formed UTF-8 in this case, but if I do, I now know about the unicode/utf8 package, thanks!
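As a small illustration of that range behavior (a sketch only; runeStarts is a made-up helper name), ranging over a string yields the byte offset at which each rune begins, and utf8.ValidString can confirm whether a slice landed on rune boundaries:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeStarts (hypothetical helper) returns the byte offset at which
// each rune of s begins.
func runeStarts(s string) []int {
	var starts []int
	for i := range s {
		starts = append(starts, i)
	}
	return starts
}

func main() {
	s := "a€b" // '€' occupies 3 bytes
	fmt.Println(runeStarts(s))           // [0 1 4]
	fmt.Println(utf8.ValidString(s[:4])) // true: ends on a rune boundary
	fmt.Println(utf8.ValidString(s[:2])) // false: cuts into '€'
}
```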

Here's what I ended up with:

package main

import (
    "fmt"
)

func ShortenStringIDToMaxLength(in string, maxLen int) string {
    if maxLen < 1 {
        // Panic/log whatever is your error system of choice.
    }
    bytes := len(in)
    if bytes > maxLen {
        excess := bytes - maxLen
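        // Byte offset where the cut would ideally start.
        target := bytes/2 - excess/2
        // Snap the start of the cut back to the nearest rune boundary.
        lPos := 0
        for pos := range in {
            if pos > target {
                break
            }
            lPos = pos
        }
        // rPos is the length of the cut, relative to lPos. Default to
        // cutting everything after lPos in case no later rune boundary
        // is at least excess bytes away.
        rPos := bytes - lPos
        for pos := range in[lPos:] {
            if pos >= excess {
                rPos = pos
                break
            }
        }
        return in[:lPos] + in[lPos+rPos:]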
    }
    return in
}

func main() {
    out := ShortenStringIDToMaxLength(`123456789 123456789`, 5)
    fmt.Println(out, len(out))
}

https://play.golang.org/p/YLGlj_17A-j


Solution

  • Here is an adaptation of your algorithm, which removes incomplete runes from the end of your prefix and the beginning of your suffix:

    import "unicode/utf8"
    
    func TrimLastIncompleteRune(s string) string {
        l := len(s)
    
        for i := 1; i <= l; i++ {
            suff := s[l-i : l]
            // repeatedly try to decode a rune from the last bytes in string
            r, cnt := utf8.DecodeRuneInString(suff)
            if r == utf8.RuneError {
                continue
            }
    
            // on success: return the substring that ends with
            // this successfully decoded rune
            lgth := l - i + cnt
            return s[:lgth]
        }
    
        return ""
    }
    
    func TrimFirstIncompleteRune(s string) string {
        // repeatedly try to decode a rune from the beginning
        for i := 0; i < len(s); i++ {
            if r, _ := utf8.DecodeRuneInString(s[i:]); r != utf8.RuneError {
                // if success : return
                return s[i:]
            }
        }
        return ""
    }
    
    func shortenStringIDToMaxLength(in string, maxLen int) string {
        if len(in) > maxLen {
            firstHalf := maxLen / 2
            secondHalf := len(in) - (maxLen - firstHalf)
    
            prefix := TrimLastIncompleteRune(in[:firstHalf])
            suffix := TrimFirstIncompleteRune(in[secondHalf:])
    
            return prefix + suffix
        }
        return in
    }
    



    This algorithm only tries to drop more bytes from the selected prefix and suffix.

    If it turns out that you need to drop 3 bytes from the suffix to have a valid rune, for example, it does not try to see if it can add 3 more bytes to the prefix, to have an end result closer to maxLen bytes.
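One way to get closer to maxLen, sketched below under assumptions, is a different approach from the code above: instead of slicing at fixed byte offsets and trimming, grow the kept prefix and suffix one whole rune at a time while the total still fits. The name shortenMiddle is made up for this sketch, and it assumes the input is valid UTF-8 (invalid bytes are treated as one-byte units by the utf8 decoders).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// shortenMiddle (hypothetical) keeps whole runes from the front and the
// back of in, alternating, for as long as they fit into maxLen bytes.
func shortenMiddle(in string, maxLen int) string {
	if len(in) <= maxLen {
		return in
	}
	lo, hi := 0, len(in) // the result is in[:lo] + in[hi:]
	for {
		progressed := false
		// Try to extend the prefix by one rune.
		if _, n := utf8.DecodeRuneInString(in[lo:hi]); n > 0 && lo+(len(in)-hi)+n <= maxLen {
			lo += n
			progressed = true
		}
		// Try to extend the suffix by one rune.
		if _, n := utf8.DecodeLastRuneInString(in[lo:hi]); n > 0 && lo+(len(in)-hi)+n <= maxLen {
			hi -= n
			progressed = true
		}
		if !progressed {
			break
		}
	}
	return in[:lo] + in[hi:]
}

func main() {
	out := shortenMiddle("€abcdef", 4)
	fmt.Println(out, len(out)) // €f 4
}
```

For "€abcdef" with a limit of 4 bytes, this keeps "€f" (4 bytes), whereas trimming fixed byte halves keeps only "ef" (2 bytes). The trade-off is that the prefix is tried first, so the kept parts are not always balanced around the middle.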