python unpack

Python: how to unpack variable-length data from a byte string?


There's a byte-string like this:

[length1][sequence1][len2][seq2][len3][seq3][len1][seq1]...

where lengthX is the length of the sequenceX that immediately follows it. Please note there are no separators at all, and the "len-data" pairs come in groups of three (after seq3 immediately comes len1 of the next group).

I'm trying to extract all the sequences, but using struct.unpack() looks very cumbersome (or I don't know how to use it properly):

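 from struct import unpack

 while content:
   # read the one-byte length, then drop it from the byte string
   my_len = unpack("<B", content[:1])[0]
   content = content[1:]
   # get sequence1, then shift the byte string past it
   seq = content[:my_len]
   content = content[my_len:]
   # ..repeat two more times for seq2 and seq3...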

Is there any simpler way?

P.S. Each seqX is in fact a multi-byte string, if that matters.


Solution

  • This data structure is very useful when, for example, sending arbitrary data over a socket. Using separators can be problematic due to ambiguity: e.g., if you use STX/ETX there may be an issue if the [real] data contains the equivalent of either of those markers.

    Sending data as length/data pairs removes that ambiguity. All that needs to happen is that the client and server agree on the format of the length value being transmitted (its size, and whether it is native, little-endian or big-endian).

    This is best explained by example.

    We have a list of strings and we build a bytearray of length/data pairs. We'll agree on an unsigned int in native byte order for the preamble (format '=I'). With the '=' prefix the size is standard, so we know that the packed value always consists of 4 bytes.
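As a quick aside on the preamble, here is a minimal sketch of what "agreeing on the format" means in practice. The value 5 and the variable names are made up for illustration; only the byte-order prefix differs between the three calls.

```python
from struct import calcsize, pack

# The prefix character selects the byte order. With '<', '>' or '='
# the size of 'I' is the standard 4 bytes on every platform.
little = pack('<I', 5)   # least significant byte first
big = pack('>I', 5)      # most significant byte first
native = pack('=I', 5)   # whichever order this machine uses

print(little, big, native)
print(calcsize('=I'))
```

If the two ends of a connection could run on machines with different byte orders, fixing the prefix to '<' or '>' rather than '=' would probably be the safer agreement.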

    So...

    from struct import pack, unpack
    
    strings = [
        'To be, or not to be: that is the question',
        'All the world\'s a stage, and all the men and women merely players',
        'We are such stuff as dreams are made on',
        'The course of true love never did run smooth',
        'If music be the food of love, play on',
        'Friends, Romans, countrymen, lend me your ears',
        'A horse! a horse! my kingdom for a horse!',
        'Once more unto the breach, dear friends, once more',
        'To thine own self be true',
        'Parting is such sweet sorrow'
    ]
    
    FMT = '=I'  # unsigned int, native byte order, standard size
    FMTL = 4  # number of bytes in a packed '=I'
    
    b = bytearray()
    
    for string in strings:
        bs = string.encode()
        b += pack(FMT, len(bs)) + bs
    
    # at this point we have a bytearray comprised of length/data pairs
    # now let's unravel it
    
    while b:
        length, *_ = unpack(FMT, b[:FMTL])
        print(b[FMTL:length+FMTL].decode())
        b = b[length+FMTL:]
    

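    This code is easily adapted for any integer type by specifying FMT and FMTL appropriately. The one-byte length in OP's question corresponds to FMT = 'B' and FMTL = 1. Type 'c' would have to be dealt with in a slightly different manner, because unpack returns it as a bytes object of length 1 rather than an int.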
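For the layout in the question (a one-byte length before each sequence, sequences grouped in threes), the same idea might be sketched as below. The helper names read_sequences and read_groups and the sample bytes are hypothetical, and the sketch assumes the data is well formed. It uses struct.unpack_from with a running offset instead of re-slicing the byte string on every step.

```python
from struct import calcsize, unpack_from


def read_sequences(content, fmt='B'):
    """Yield each sequence from length-prefixed bytes (hypothetical helper)."""
    prefix_size = calcsize(fmt)
    offset = 0
    while offset < len(content):
        # unpack_from reads at an offset, so nothing has to be re-sliced
        (length,) = unpack_from(fmt, content, offset)
        offset += prefix_size
        yield bytes(content[offset:offset + length])
        offset += length


def read_groups(content, fmt='B', group_size=3):
    """Collect the sequences into tuples of group_size (three in the question)."""
    sequences = list(read_sequences(content, fmt))
    return [tuple(sequences[i:i + group_size])
            for i in range(0, len(sequences), group_size)]


# Made-up sample: two groups of three, each sequence prefixed by one length byte.
sample = b'\x02ab\x03cde\x01f' + b'\x01g\x02hi\x03jkl'
groups = read_groups(sample)
print(groups)
```

Each yielded item is a bytes object; if the sequences are encoded text they can be turned into str with .decode(). A truncated final sequence would come back short rather than raise, so real input may need a length check.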