
How to write binary file with bit length not a multiple of 8 in Python?


I'm working on a tool that generates dummy binary files for a project. We have a spec that describes the real binary files, which are created from a stream of values with various bit lengths. I use input and spec files to create a list of values, and the bitstring library's BitArray class to convert the values and join them together.

The problem is that the values' lengths don't always add up to full bytes, and I need the file to contain the bits as-is. Normally I could use BitArray.tofile(), but that method automatically pads the file with zeroes at the end.

Is there another way to write the bits to a file?


Solution

  • A file, as most OSes see it, contains a stream of bytes, sometimes referred to as characters. Some systems distinguish between text and binary data storage (e.g. the PDP-1 uses 6-bit characters and 18-bit words), but the size of the file is still counted in those units. Some systems don't even store the size at that granularity; instead, an end-of-file character marks where the data ends in the last block (be it a sector, cluster or extent).

    You'll need to replicate one of these methods to store an arbitrary number of bits, for instance using 1-then-0s padding: append a single 1 bit, then 0 bits up to the next byte boundary. The downside of this padding method is that you have to reach the end of the data before you can tell whether a run of 0s (and the 1 before it) is padding rather than data.

    Another method is to store the number of bits first, either once for the whole file or separately for each written chunk. That requires an encoding in which the size of the size field is known, for instance one byte, which would imply chunks of no more than 255 bits. This length-prefix method is used e.g. in Pascal strings.
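A minimal sketch of the length-prefix idea, using only the standard library so it runs without bitstring. The names `pack_bits` and `unpack_bits` are hypothetical, the bits are represented as a plain string of '0' and '1' characters, and a 4-byte big-endian prefix is assumed instead of the one-byte field mentioned above:

```python
import struct

def pack_bits(bits: str) -> bytes:
    """Encode a '0'/'1' string as a 4-byte bit count followed by the padded payload."""
    if set(bits) - {"0", "1"}:
        raise ValueError("bits must contain only '0' and '1'")
    n = len(bits)
    padded = bits + "0" * (-n % 8)  # zero-fill to a whole number of bytes
    payload = int(padded, 2).to_bytes(len(padded) // 8, "big") if padded else b""
    return struct.pack(">I", n) + payload  # big-endian 32-bit length prefix

def unpack_bits(blob: bytes) -> str:
    """Recover the exact bit string, dropping the zero padding."""
    (n,) = struct.unpack(">I", blob[:4])
    bits = "".join(f"{byte:08b}" for byte in blob[4:])
    return bits[:n]

blob = pack_bits("10110")
restored = unpack_bits(blob)
```

Here the five bits occupy one payload byte, and the prefix tells the reader to ignore the last three bits of it. With bitstring, the same string could presumably be obtained from a BitArray through its `bin` property.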

    You may also want to consider an established file format in which bit sequences are stored, such as the Serial Vector Format. Most of these aren't very efficient, and they are designed for specific tasks (in this case, describing JTAG boundary-scan test vectors).

    Schemes such as these can also be built into the data encoding itself. Examples include length-prefixed strings, UTF-8 code points, BitTorrent's Bencoding, and Exponential-Golomb coding. The last one is relevant here because it allows values of arbitrary size and is supported by the bitstring module.
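To illustrate why Exponential-Golomb coding is self-delimiting, here is a rough pure-Python sketch of the unsigned variant. The function names `ue_encode` and `ue_decode` are made up for this example and the bits are plain '0'/'1' strings; bitstring offers the same code through its `ue` interpretation, which is not used here so the block stays self-contained:

```python
from typing import Tuple

def ue_encode(n: int) -> str:
    """Unsigned Exp-Golomb code word for n, as a '0'/'1' string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    body = bin(n + 1)[2:]  # n + 1 in binary, always starts with '1'
    return "0" * (len(body) - 1) + body  # leading zeros announce the body length

def ue_decode(bits: str, pos: int = 0) -> Tuple[int, int]:
    """Decode one code word starting at pos; return (value, next position)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2) - 1, end

# Three values packed back to back without any separators.
stream = "".join(ue_encode(v) for v in (3, 0, 7))
first, pos = ue_decode(stream)
second, pos = ue_decode(stream, pos)
third, pos = ue_decode(stream, pos)
```

Every code word contains a 1 bit after its leading zeros, so a run of zeros that reaches the end of the data can never be a complete value; a reader can treat it as padding. This sketch simply raises IndexError in that case.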

    One reasonably easy way in bitstring might be to add an (aligned) trailing byte to the file signifying how many bits in the penultimate byte were padding:

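    import bitstring

    def pad(data: bitstring.BitArray):
        # Zero-fill to the next byte boundary, then append the fill count as one byte.
        padding = -len(data) % 8
        data.append(bitstring.Bits(length=padding))
        data.append(bitstring.Bits(uint=padding, length=8))

    def unpad(data: bitstring.BitArray):
        # The trailing byte says how many padding bits precede it.
        padding = data[-8:].uint
        del data[-8 - padding:]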
    

    If you're reading the file piecemeal you'll have to take care to do this unpadding as you reach the last two bytes.

    Here's a 1-then-0 variation:

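    import bitstring

    def pad(data: bitstring.BitArray):
        # Append a single 1 bit, then zero-fill to the next byte boundary.
        data.append(bitstring.Bits(length=1, uint=1))
        data.append(bitstring.Bits(length=-len(data) % 8))

    def unpad(data: bitstring.BitArray):
        # Everything from the last 1 bit onward is padding.
        last1 = data.rfind(bitstring.Bits(length=1, uint=1))[0]
        del data[last1:]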
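For a quick end-to-end check of the 1-then-0s scheme through an actual file, the following standard-library sketch may help. It is a stand-in rather than bitstring code: `write_bits` and `read_bits` are hypothetical helpers, the bits are a plain '0'/'1' string, and the file name is arbitrary:

```python
import os
import tempfile

def write_bits(path: str, bits: str) -> None:
    """Write a '0'/'1' string to path using 1-then-0s padding."""
    padded = bits + "1"  # marker bit
    padded += "0" * (-len(padded) % 8)  # zero-fill to a byte boundary
    with open(path, "wb") as f:
        f.write(int(padded, 2).to_bytes(len(padded) // 8, "big"))

def read_bits(path: str) -> str:
    """Read the file back and strip the marker bit and the zeros after it."""
    with open(path, "rb") as f:
        data = f.read()
    bits = "".join(f"{byte:08b}" for byte in data)
    return bits[:bits.rindex("1")]  # the last 1 bit is always the marker

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dummy.bin")
    write_bits(path, "10110")
    size = os.path.getsize(path)
    restored = read_bits(path)
```

The five data bits plus the marker fit in a single byte. Note that data which already ends on a byte boundary costs one extra byte, since the marker bit always has to be written.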