I have a bunch of strings which are sentences that look something like this:
Having two illnesses at the same time is known as \xe2\x80\x9ccomorbidity\xe2\x80\x9d and it can make treating each disorder more difficult.
I encoded the original string with .encode(), then compressed it with Python's bz2 library. I then decompressed it with bz2.decompress() and used .decode() to get it back.
Any ideas how I can conveniently remove these byte escape sequences from the text, or make sure characters like quotes get decoded properly?
Thanks!
I am guessing that you mistakenly assigned the above byte string “sentence” to an object of type str. Instead, it needs to be assigned to a bytes object and interpreted as a sequence of UTF-8 bytes. Compare:
b = b'... known as \xe2\x80\x9ccomorbidity\xe2\x80\x9d and ...'
s = b.decode('utf-8')
print(b)
# b'... known as \xe2\x80\x9ccomorbidity\xe2\x80\x9d and ...'
print(s)
# ... known as “comorbidity” and ...
Either way, the issue is unrelated to compression: a lossless compression (such as bzip2) roundtrip never changes the data:
import bz2

print(bz2.decompress(bz2.compress(b)).decode('utf-8'))
# ... known as “comorbidity” and ...
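If the escape sequences have already ended up as literal characters inside a str (one way this can happen is calling str() on a bytes object instead of .decode()), they can usually be converted back. The following is only a minimal sketch under that assumption. The helper name repair_escaped_text is hypothetical, and it assumes the text contains nothing beyond ASCII plus the \xNN escapes; any genuine backslashes in the text would be interpreted as escapes too.

```python
import bz2


def repair_escaped_text(text: str) -> str:
    """Turn literal '\\xNN' escapes in a str back into the UTF-8 text they spell.

    Hypothetical helper: assumes `text` is ASCII plus backslash escapes.
    """
    # Step 1: interpret the backslash escapes, giving one character per byte value.
    per_byte = text.encode("latin-1").decode("unicode_escape")
    # Step 2: map those characters back to raw bytes and decode them as UTF-8.
    return per_byte.encode("latin-1").decode("utf-8")


original = "known as \u201ccomorbidity\u201d and"
roundtrip = bz2.decompress(bz2.compress(original.encode("utf-8")))

# One way the problem can arise: str() on bytes yields its repr, escapes included.
broken = str(roundtrip)[2:-1]  # strip the b'...' wrapper

fixed = repair_escaped_text(broken)
print(fixed)
```

Fixing the place where the bytes were turned into a str, by calling .decode('utf-8') there, is the cleaner solution; a repair step like this is better kept for data that has already been stored in the broken form.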