Tags: string, go, utf-8, slice, rune

Can I remove the trailing zeros in a string's representation ([]byte) to compare strings?


I need to compare strings in Go. The problem is that I want to compare an accented word (café) with its non-accented form (cafe). The first thing I do is convert the accented string to its non-accented form with this:

You can run the code here: https://play.golang.org/p/-eRUQeujZET

But every time I apply this transformation to a string, it adds extra runes at the end. The example above prints:

bytes: [99 97 102 101 0] string: cafe

Since I need to compare the string returned by this process with a counterpart that never had the 'é' in the first place, I would need to remove the last rune (0) from the []byte.

After running some tests, I noticed that the trailing 0s (sometimes more than one is added) don't seem to change the string representation.

Am I missing something? Can I just remove all the zeros at the end of the []byte?

Here is my code to remove the 0s and compare the strings:

https://play.golang.org/p/HoueAGI4uUx

As we can't work alone in this field, here are the articles I read to get to where I am now:

https://blog.golang.org/strings

https://blog.golang.org/normalization

https://unicode.org/reports/tr15/

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/


Solution

  • This is your custom Transform() function:

    func Transform(s string) ([]byte, error) {
        var t transform.Transformer
        t = transform.Chain(norm.NFD, runes.Remove(runes.In(unicode.Mn)), norm.NFC)
        dst := make([]byte, len(s))
        _, _, err := t.Transform(dst, []byte(s), true)
        if err != nil {
            return nil, err
        }
        return dst, nil
    }
    

    In it, you are using Transformer.Transform(), which also returns the number of bytes written to the destination, but you don't use that return value.

    So the simplest fix is to store the nDst return value and slice the destination with it, because it holds the number of "useful" bytes (the bytes beyond nDst remain 0, as handed to you by the preceding make() call):

    nDst, _, err := t.Transform(dst, []byte(s), true)
    if err != nil {
        return nil, err
    }
    return dst[:nDst], nil
    

    With this change, the returned slice will only contain the useful bytes without trailing zeros.

    The output will be (try it on the Go Playground):

    2009/11/10 23:00:00 bytes: [99 97 102 101] string: cafe
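
To see why the trailing zeros matter even though they are invisible when printed, here is a minimal standard-library sketch. The helper fill is a hypothetical stand-in for Transformer.Transform() (it only copies bytes and reports how many it wrote), and trimZeros is an illustrative name for the trimming approach from the question; neither comes from the original code.

```go
package main

import (
	"bytes"
	"fmt"
)

// fill is a hypothetical stand-in for Transformer.Transform: it writes src
// into dst and reports how many bytes were written.
func fill(dst, src []byte) int {
	return copy(dst, src)
}

// trimZeros removes trailing NUL bytes, as the question proposes. It would
// also strip zeros that legitimately belong to the data.
func trimZeros(b []byte) []byte {
	return bytes.TrimRight(b, "\x00")
}

func main() {
	// "café" is 5 bytes in UTF-8, so a buffer of len(s) has one byte to
	// spare once the accent is gone.
	dst := make([]byte, len("café"))
	n := fill(dst, []byte("cafe"))

	fmt.Println(string(dst) == "cafe")            // false: the NUL byte is part of the string
	fmt.Println(len(string(dst)))                 // 5
	fmt.Println(string(dst[:n]) == "cafe")        // true: slice by the count written
	fmt.Println(string(trimZeros(dst)) == "cafe") // true
}
```

Both approaches produce an equal string here, but slicing with the returned count does not depend on guessing which zeros are padding, so it is probably the safer default.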