python bz2

How do I remove bytestrings left over from decompression from a string?


I have a bunch of strings which are sentences that look something like this:

Having two illnesses at the same time is known as \xe2\x80\x9ccomorbidity\xe2\x80\x9d and it can make treating each disorder more difficult.

I encoded the original string with .encode(), then compressed it with Python's bz2 library.

I then decompressed with bz2.decompress() and used .decode() to get it back.

Any ideas how I can conveniently remove these bytestrings from the text, or make sure characters like quotes get decoded properly?

Thanks!


Solution

  • I am guessing that you mistakenly assigned the byte string “sentence” above to an object of type str. Instead, it needs to be assigned to a bytes object and then decoded as a sequence of UTF-8 bytes. Compare:

    b = b'... known as \xe2\x80\x9ccomorbidity\xe2\x80\x9d and ...'
    s = b.decode('utf-8')
    print(b)
    # b'... known as \xe2\x80\x9ccomorbidity\xe2\x80\x9d and ...'
    print(s)
    # ... known as “comorbidity” and ...
    

    Either way, the issue is unrelated to compression: a round trip through a lossless compressor (such as bzip2) never changes the data:

    import bz2
    print(bz2.decompress(bz2.compress(b)).decode('utf-8'))
    # ... known as “comorbidity” and ...
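
If the strings you already have contain the escape sequences as literal text (a backslash followed by xe2 and so on), they may be repairable after the fact. The sketch below is only a guess at how that could have happened (calling str() on the bytes instead of .decode()); repair_escaped is a hypothetical helper, not part of bz2, and it assumes every other character in the text fits in Latin-1:

```python
import bz2

original = "known as \u201ccomorbidity\u201d and it can make treating harder"

# Correct round trip: str -> bytes -> compress -> decompress -> str.
raw = bz2.decompress(bz2.compress(original.encode("utf-8")))
assert raw.decode("utf-8") == original

# Assumed cause of the problem: str() on a bytes object returns its repr,
# so the non-ASCII bytes show up as literal backslash escapes.
broken = str(raw)[2:-1]  # drop the b'...' wrapper for illustration


def repair_escaped(text: str) -> str:
    """Hypothetical helper: turn literal \\xNN escapes back into UTF-8 text.

    Assumes every non-escaped character in `text` fits in Latin-1.
    """
    # Interpret the escapes, which yields one code point per original byte,
    # then map those code points back to bytes and decode them as UTF-8.
    as_bytes = text.encode("latin-1").decode("unicode_escape").encode("latin-1")
    return as_bytes.decode("utf-8")


repaired = repair_escaped(broken)
print(repaired)
```

If the escapes are not literal text and the string simply prints oddly, check first whether you are looking at the repr of a bytes object; in that case calling .decode('utf-8') on the bytes is all that is needed.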