
Reading and Writing special characters in Python


Python ver. 3.11.5 on Windows 10

I have a directory filled with .gz text archives. To scan these archives, I use the following Python code:

    with gzip.open(logDir+"\\"+fileName, mode="rb") as archive:
        for filename in archive:
            print(filename.decode().strip())

This used to work. However, the new system adds lines similar to this:

:§f Press [§bJ§f]

Python gives me this error:

    File "C:\Users\Me\Documents\Python\ConvertLog.py", line 16, in readZIP
      print(filename.decode().strip())
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa7 in position 49: invalid start byte

Anyone know a way of dealing with strange characters that pop up? I can't just ignore the line. This happens to be one of the few lines I need to strip out and write to a condensed report.

I tried other modes, besides "rb". I really have no idea what else to try.


Solution

  • decode() takes an errors argument that controls what happens when it meets bytes it cannot decode. The Python documentation for bytes.decode() and the codecs module describes the available options.

    In decode(), you can specify errors='strict', errors='ignore', or errors='replace', among others. If unspecified, 'strict' is the default, and it raises a UnicodeDecodeError in a situation like yours. 'ignore' silently drops the bytes that cannot be decoded, so the rest of the line survives but the § characters disappear. 'replace' substitutes the replacement character U+FFFD for each undecodable byte.

    One way to implement this:

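    import gzip

    # logDir and fileName are the variables from the question's script.
    with gzip.open(logDir + "\\" + fileName, mode="rb") as archive:
        for line in archive:
            # errors='ignore' drops bytes that are not valid UTF-8 instead of raising.
            decoded_line = line.decode('utf-8', errors='ignore').strip()
            print(decoded_line)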
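If the § characters themselves need to survive, it may be worth decoding with the encoding the log was actually written in rather than discarding bytes. The byte 0xa7 is "§" in both cp1252 and Latin-1, which suggests (but does not prove) that the new system writes one of those encodings. The sketch below is only an illustration under that assumption: it compares the error handlers on the sample line from the question, then reads a throwaway .gz file in text mode so that gzip does the decoding. The temporary file and the name sample.log.gz are made up for the demo and stand in for a real archive.

```python
import gzip
import os
import tempfile

# Sample line from the question, encoded as cp1252 (an assumption about the
# real logs): "§" becomes the single byte 0xA7, which is not valid UTF-8.
raw = ":§f Press [§bJ§f]\n".encode("cp1252")

# The same bytes under three different decoding choices.
ignored = raw.decode("utf-8", errors="ignore").strip()    # bad bytes dropped
replaced = raw.decode("utf-8", errors="replace").strip()  # bad bytes -> U+FFFD
recovered = raw.decode("cp1252").strip()                  # original text back

# ascii() escapes non-ASCII characters, so this prints safely on any console.
print(ascii(ignored))
print(ascii(replaced))
print(ascii(recovered))

# The same idea applied to a .gz file. The temporary file is a stand-in
# for one of the real archives.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.log.gz")
    with gzip.open(path, mode="wb") as archive:
        archive.write(raw)

    # mode="rt" opens the archive as text; errors="replace" covers the few
    # byte values that cp1252 leaves undefined.
    with gzip.open(path, mode="rt", encoding="cp1252", errors="replace") as archive:
        lines = [line.strip() for line in archive]

print(ascii(lines))
```

If the archives turn out to mix encodings, encoding="latin-1" is a looser fallback: it maps every possible byte to a character, so it never raises, though bytes outside the real encoding may come out as the wrong characters.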