
Pulling valid data from bytestring in Python 3


Given the following bytestring, how can I remove every byte that is displayed as a \x.. escape, and create a list object from what's left (by splitting wherever bytes were removed)?

b"\x07\x00\x00\x00~\x10\x00pts/5\x00\x00/5\x00\x00user\x00\x00"

Desired result:

["~", "pts/5", "/5", "user"]

The above string is just an example - I'd like to remove any \x.. (non-decoded) bytes.

I'm using Python 3.2.3, and would prefer to use standard libraries only.


Solution

  • >>> a = b"\x07\x00\x00\x00~\x10\x00pts/5\x00\x00/5\x00\x00user\x00\x00"
    >>> import re
    >>> re.findall(br"[^\x00-\x1f\x7f-\xff]+", a)
    [b'~', b'pts/5', b'/5', b'user']
    

    The results are still bytes objects. If you want the results to be strings:

    >>> [i.decode("ascii") for i in re.findall(br"[^\x00-\x1f\x7f-\xff]+", a)]
    ['~', 'pts/5', '/5', 'user']
    

    Explanation:

    [^\x00-\x1f\x7f-\xff]+ matches one or more (+) bytes that are not ([^...]) in the range 0 to 31 (\x00-\x1f) or in the range 127 to 255 (\x7f-\xff). What remains is the printable ASCII range, 32 to 126.

    Be aware that this approach only works if the "embedded texts" are ASCII. It will remove all extended alphabetic characters (like ä, é, etc.) from strings encoded in an 8-bit codepage like latin-1, because those are stored as single bytes in the range 128 to 255. It will also effectively destroy strings encoded in UTF-8 and other Unicode encodings: UTF-8 encodes every non-ASCII character as a sequence of bytes between 128 and 255, and UTF-16 and UTF-32 contain null bytes even for plain ASCII characters.

    Of course, you can always fine-tune the exact byte ranges you want to remove, following the pattern shown in this answer.
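
As a minimal sketch of that kind of fine-tuning, the snippet below compares the ASCII-only pattern with a variant that also keeps bytes 0xA0 to 0xFF, on the assumption that the embedded text is latin-1. The names `ASCII_RUNS`, `LATIN1_RUNS` and `extract_runs` are made up for this illustration, and the sample bytes are not from the question. The patterns use the `br` prefix because `rb` is only accepted from Python 3.3 onward.

```python
import re

# Printable ASCII only: the same character class as in the answer.
ASCII_RUNS = re.compile(br"[^\x00-\x1f\x7f-\xff]+")
# Assumed variant: additionally keep 0xA0-0xFF, which are printable in latin-1.
LATIN1_RUNS = re.compile(br"[^\x00-\x1f\x7f-\x9f]+")


def extract_runs(data, pattern=ASCII_RUNS, encoding="ascii"):
    """Return the runs of bytes in `data` matched by `pattern`, decoded to str."""
    return [run.decode(encoding) for run in pattern.findall(data)]


# Illustrative input: "café" in latin-1 is b'caf\xe9'.
sample = "café".encode("latin-1") + b"\x00\x00user\x00"

print(extract_runs(sample))                          # ['caf', 'user']
print(extract_runs(sample, LATIN1_RUNS, "latin-1"))  # ['café', 'user']
```

With the ASCII-only pattern the é is dropped and the word is cut short. The widened pattern keeps it, but only gives sensible results if the data really is latin-1.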