Tags: python, python-3.x, csv, bz2

TypeError: a bytes-like object is required, not 'str' when trying to write to a csv.writer that uses a bz2.BZ2File object


Background:

I need to write a CSV file that is compressed before it hits the disk. I'm running about 96 processes simultaneously on an SMP machine, and otherwise they fill up the tiny amount of disk space I have before I can offload the output elsewhere (no, it's not my system, so don't ask me how a 104 CPU / 0.25TB RAM / 8 Tesla server has only 2TB of shared storage for all users, which is 90+% full). I need to use as many processors as I can, since a single CPU would take almost 4 years and using 96 drops that to about 2 weeks.

All of the answers to similar questions state that you should use bz2.open() with mode 'wt'; however, I have not found any that address using the bz2 file-like object with a csv.writer() object and it just does not seem to work. I've even written a script to test all the possible write mode permutations (see below) that reproduces the problem faithfully.

Note: I cannot simply do ','.join(row), which would otherwise work with a mode='wt' bz2 object, because many of the text fields need escaping: they contain line breaks, embedded commas, embedded '\x00' chars, etc.
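To illustrate the note above, here is a minimal sketch of why a plain join is not enough. The sample fields are made up for the example, and everything stays in memory via io.StringIO:

```python
import csv
import io

# Hypothetical sample fields containing the kinds of characters that need escaping.
fields = ["plain", "has,comma", "has\nnewline", 'has "quotes"']

# Naive approach: no quoting, so commas and newlines corrupt the record.
naive = ",".join(fields)

# csv.writer quotes and escapes fields as needed.
buf = io.StringIO(newline="")
csv.writer(buf).writerow(fields)
quoted = buf.getvalue()

# Parse both strings back to see which one survives the round trip.
parsed_naive = list(csv.reader(io.StringIO(naive, newline="")))
parsed_quoted = list(csv.reader(io.StringIO(quoted, newline="")))

print(parsed_naive == [fields])   # the joined text no longer matches the input
print(parsed_quoted == [fields])  # the csv.writer output does
```

The joined string parses back into a different set of records, while the csv.writer output restores the original fields.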

Steps to reproduce:

/tmp/test.py:

import os
import bz2
import csv
import traceback


tfile = '/tmp/test.csv.bz'
row = ['bc22jtr', 118324, None, 'contran', None, 11.5, 9.23, ]


def perr(err, bmode, fmode=None):
    """Func for printing exception info in a less noisy manner."""
    print(
        f"EXCEPTION: wt.writerow(row) → {type(err).__name__}:"
        f" {err}; bmode='{bmode}', fmode='{fmode}'"
    )
    print((''.join(traceback.format_exception(err)[-2:-1])).strip())
    return True


for fmode in ["w", "wt", "wb"]:
    for bmode in ["w", "wb"]:
        had_err = False
        if os.path.exists(tfile):
            os.remove(tfile)
        fh = open(tfile, fmode)
        try:
            bh = bz2.BZ2File(fh, mode=bmode, compresslevel=9)
        except ValueError as err:
            had_err = perr(err, bmode, fmode)
        wt = csv.writer(fh)
        try:
            wt.writerow(row)
        except TypeError as err:
            had_err = perr(err, bmode, fmode)
        try:
            bh.close()
        except TypeError as err:
            had_err = perr(err, bmode, fmode)
        if not had_err:
            print(f"WAS OK: bmode={bmode}, fmode={fmode}")
for bmode in ["w", "wb", "wt"]:
    if os.path.exists(tfile):
        os.remove(tfile)
    bh = bz2.open(fh, mode=bmode, compresslevel=9)
    wt = csv.writer(fh)
    had_err = False
    try:
        wt.writerow(row)
    except TypeError as err:
        had_err = perr(err, bmode)
    try:
        bh.close()
    except TypeError as err:
        had_err = perr(err, bmode)
    if not had_err:
        print(f"WAS OK: bmode={bmode}")
if os.path.exists(tfile):
    os.remove(tfile)

Output:

> python3 /tmp/test.py
EXCEPTION: wt.writerow(row) → TypeError: write() argument must be str, not bytes; bmode='w', fmode='w'
File "/usr/lib/python3.10/bz2.py", line 109, in close
    self._fp.write(self._compressor.flush())
EXCEPTION: wt.writerow(row) → TypeError: write() argument must be str, not bytes; bmode='wb', fmode='w'
File "/usr/lib/python3.10/bz2.py", line 109, in close
    self._fp.write(self._compressor.flush())
EXCEPTION: wt.writerow(row) → TypeError: write() argument must be str, not bytes; bmode='w', fmode='wt'
File "/usr/lib/python3.10/bz2.py", line 109, in close
    self._fp.write(self._compressor.flush())
EXCEPTION: wt.writerow(row) → TypeError: write() argument must be str, not bytes; bmode='wb', fmode='wt'
File "/usr/lib/python3.10/bz2.py", line 109, in close
    self._fp.write(self._compressor.flush())
EXCEPTION: wt.writerow(row) → TypeError: a bytes-like object is required, not 'str'; bmode='w', fmode='wb'
File "/tmp/test.py", line 33, in <module>
    wt.writerow(row)
EXCEPTION: wt.writerow(row) → TypeError: a bytes-like object is required, not 'str'; bmode='wb', fmode='wb'
File "/tmp/test.py", line 33, in <module>
    wt.writerow(row)
EXCEPTION: wt.writerow(row) → TypeError: a bytes-like object is required, not 'str'; bmode='w', fmode='None'
File "/tmp/test.py", line 49, in <module>
    wt.writerow(row)
EXCEPTION: wt.writerow(row) → TypeError: a bytes-like object is required, not 'str'; bmode='wb', fmode='None'
File "/tmp/test.py", line 49, in <module>
    wt.writerow(row)
EXCEPTION: wt.writerow(row) → TypeError: a bytes-like object is required, not 'str'; bmode='wt', fmode='None'
File "/tmp/test.py", line 49, in <module>
    wt.writerow(row)

Note: bmode='wt' is not tested in the first loop since bz2.BZ2File(fh, mode='wt') will always raise a ValueError: Invalid mode: 'wt' exception.
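The difference between the two entry points can be checked in isolation. The sketch below uses an in-memory io.BytesIO instead of a real file (an assumption made only to keep it self-contained) and shows that bz2.BZ2File rejects 'wt' while bz2.open accepts it and hands back a text wrapper:

```python
import bz2
import io

results = {}

# BZ2File only understands binary modes, so a text mode is rejected up front.
try:
    bz2.BZ2File(io.BytesIO(), mode="wt")
    results["BZ2File"] = "ok"
except ValueError as err:
    results["BZ2File"] = f"ValueError: {err}"

# bz2.open() accepts text modes and wraps the BZ2File in an io.TextIOWrapper.
buf = io.BytesIO()
with bz2.open(buf, mode="wt", encoding="utf-8", newline="") as fh:
    results["bz2.open"] = type(fh).__name__
    fh.write("a,b\n")

# The BytesIO stays open after the wrapper is closed, so the data can be checked.
results["roundtrip"] = bz2.decompress(buf.getvalue()).decode("utf-8")

for name, value in results.items():
    print(f"{name}: {value!r}")
```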

Question

How can I write a compressed CSV with proper escaping and encoding on the fly?


Solution

  • Found the answer: you need several layers of wrapping. Because I need to use a special open function, I have to keep the fh = open(tfile, mode='wb') line. If you don't need that ability, you can skip it and pass the path directly: bz2.BZ2File(tfile, mode='wb', compresslevel=9).

    In essence, because the underlying file handle and bz2.BZ2File() have to be opened in binary mode, you then have to wrap the result in an io.TextIOWrapper() object, which gives csv.writer() the text-mode file it expects.

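    import os
    import bz2
    import csv
    import io

    tfile = '/tmp/test.csv.bz'
    row = ['bc22jtr', 118324, None, 'contran', None, 11.5, 9.23, ]

    if os.path.exists(tfile):
        os.remove(tfile)
    # Layers, innermost first: raw binary file -> bz2 compressor -> text encoder.
    # The with blocks close each layer so the compressed stream is fully flushed.
    with open(tfile, mode='wb') as fh:
        with bz2.BZ2File(fh, mode='wb', compresslevel=9) as bz:
            # newline='' lets the csv module control line endings itself.
            with io.TextIOWrapper(bz, encoding='utf-8', newline='') as it:
                wt = csv.writer(it)
                wt.writerow(row)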
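As a sanity check on the layering described above, the following sketch writes a couple of rows and reads them back. The helper names write_rows and read_rows, the second sample row, and the temporary path are illustrative assumptions, not part of the original setup. Every layer is closed explicitly through with blocks, since an unflushed compressor can leave the file truncated.

```python
import bz2
import csv
import io
import os
import tempfile


def write_rows(path, rows):
    """Write rows as bz2-compressed CSV: binary file -> BZ2File -> TextIOWrapper."""
    with open(path, mode="wb") as fh:
        with bz2.BZ2File(fh, mode="wb", compresslevel=9) as bz:
            with io.TextIOWrapper(bz, encoding="utf-8", newline="") as text:
                csv.writer(text).writerows(rows)


def read_rows(path):
    """Read the rows back; bz2.open() in 'rt' mode does the text wrapping itself."""
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as text:
        return list(csv.reader(text))


rows = [
    ["bc22jtr", 118324, None, "contran", None, 11.5, 9.23],
    ["multi\nline, with comma", 'say "hi"'],  # hypothetical awkward fields
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.csv.bz2")
    write_rows(path, rows)
    restored = read_rows(path)

print(restored)
```

Note that CSV carries no type information: numbers come back as strings and None comes back as an empty string, while the embedded comma, newline, and quotes survive intact.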