java bencoding

Bencoded string length in Java


I am a bit confused with bencoding.

According to the specification, when I bencode a string I need to use the following format:

length:string

The string spam becomes 4:spam

My question: is 4 the number of characters in the string, or the number of UTF-8 bytes?

For instance, suppose I am going to bencode the string gâteau

What number should be specified as the length of this string?

I think I have to specify 7, and the final form should be 7:gâteau

This is because the character â takes 2 bytes in UTF-8, and each of the other characters in this string takes 1 byte.

I have also heard that it is not recommended to store bencoded data in a Java String instance.

In other words, when I bencode a data block, I should store it as a byte array and not convert it to a Java String, to avoid encoding issues.

Are my assumptions correct?


Solution

  • According to the specification, a bencoded string is a sequence of bytes, and the length prefix is the number of bytes in that sequence, not the number of characters.

    The specification also states: "All character string values are UTF-8 encoded".

    So for "gâteau" you should specify 7 as the length, because the character â takes 2 bytes in UTF-8 and each of the other five characters takes 1 byte. The bencoded form is 7:gâteau.
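
To make the byte count concrete, here is a minimal Java sketch. The class name BencodeStringSketch and the helper bencodeString are hypothetical names chosen for illustration, not part of any library. It assumes the text is encoded as UTF-8 and keeps the result as a byte array rather than a String.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class BencodeStringSketch {

    // Hypothetical helper: bencodes text as <byte length>:<UTF-8 bytes>.
    // The result stays a byte[] so no charset conversion can alter it later.
    static byte[] bencodeString(String text) {
        byte[] payload = text.getBytes(StandardCharsets.UTF_8);
        // The prefix is the payload length in bytes, written as ASCII digits.
        byte[] prefix = (payload.length + ":").getBytes(StandardCharsets.US_ASCII);

        ByteArrayOutputStream out =
                new ByteArrayOutputStream(prefix.length + payload.length);
        out.write(prefix, 0, prefix.length);
        out.write(payload, 0, payload.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // "gâteau", written with an escape so the source file encoding does not matter.
        String text = "g\u00e2teau";

        // Number of UTF-16 code units in the Java String: 6
        System.out.println(text.length());
        // Number of UTF-8 bytes, which is the bencode length prefix: 7
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);
        // Total encoded size: "7:" (2 bytes) plus 7 payload bytes = 9
        System.out.println(bencodeString(text).length);
    }
}
```

Keeping the encoded data as a byte array also matters because bencoded byte strings are not always text. In a torrent file, for example, the piece hashes are raw binary, and decoding such bytes into a Java String and encoding them back is not guaranteed to reproduce the original bytes.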