python google-bigquery gzip

python open() vs gzip.open() and file mode


Why does the mode attribute of the file object differ between open() and gzip.open() from the standard gzip module?

Python 2.7 on Linux.

The same thing happens when using GzipFile on an already open file handle.

I expected gzip.open() to be a transparent drop-in, so why do I see numeric modes instead of rb / wb?
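
For reference, a minimal sketch of that second case. It uses an in-memory io.BytesIO buffer as a stand-in for the already open file handle; the buffer and the variable names are only for illustration. On the Python 2.7 setup described above it reports the same numeric modes, so I have left the printed values out.

```python
import gzip
import io

# In-memory buffer standing in for an already open binary file handle.
buf = io.BytesIO()

# Wrap the open handle for writing and record the mode GzipFile reports.
with gzip.GzipFile(fileobj=buf, mode="wb") as gz_out:
    write_mode = gz_out.mode
    gz_out.write(b"hello\n")

# Rewind and wrap the same handle for reading.
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as gz_in:
    read_mode = gz_in.mode
    data = gz_in.read()

print("write mode: %r, read mode: %r" % (write_mode, read_mode))
```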

Test script

#!/usr/bin/env python
"""
Write one file to another, with optional gzip on both sides.

Usage:
    gzipcat.py <input file> <output file>

Examples:
    gzipcat.py /etc/passwd passwd.bak.gz
    gzipcat.py passwd.bak.gz passwd.bak
"""
import gzip
import sys

if len(sys.argv) < 3:
    sys.exit(__doc__)

ifn = sys.argv[1]
if ifn.endswith(".gz"):
    ifd = gzip.open(ifn, "rb")
else:
    ifd = open(ifn, "rb")

ofn = sys.argv[2]
if ofn.endswith(".gz"):
    ofd = gzip.open(ofn, "wb")
else:
    ofd = open(ofn, "wb")

ifm = getattr(ifd, "mode", None)
ofm = getattr(ofd, "mode", None)

# str.format() instead of an f-string, which Python 2.7 does not support.
print("input file mode: {}, output file mode: {}".format(ifm, ofm))

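for ifl in ifd:
    ofd.write(ifl)

# Close explicitly so that buffered data and the gzip trailer are written out.
ifd.close()
ofd.close()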

Test script output

$ python gzipcat.py /etc/passwd passwd.bak
input file mode: rb, output file mode: wb
$ python gzipcat.py /etc/passwd passwd.bak.gz
input file mode: rb, output file mode: 2
$ python gzipcat.py passwd.bak.gz passwd.txt
input file mode: 1, output file mode: wb
$ python gzipcat.py passwd.bak.gz passwd.txt.gz
input file mode: 1, output file mode: 2

Secondary question: is there a good reason behind this, or is it just an omission / unhandled case in the gzip module?

Background

My actual use case is the Google BigQuery loader, which requires the mode to be rb before it will accept the file as a data source. The traceback is below, but I prepared the minimal test case above to make this question more readable.

# python -c 'import etl; etl.job001()'
Starting job001.
Processing table: reviews.
Extracting reviews, time range [2018-04-07 17:01:38.172129+00:00, 2018-04-07 18:09:50.763283)
Extracted 24 rows to reviews.tmp.gz in 2 s (8 rows/s).
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "etl.py", line 920, in wf_dimension_tables
    ts_end=ts_end)
  File "etl.py", line 680, in map_table_delta
    rewrite=True
  File "etl.py", line 624, in bq_load_csv
    job_config=job_config)
  File "/usr/lib/python2.7/site-packages/google/cloud/bigquery/client.py", line 797, in load_table_from_file
    _check_mode(file_obj)
  File "/usr/lib/python2.7/site-packages/google/cloud/bigquery/client.py", line 1419, in _check_mode
    "Cannot upload files opened in text mode:  use "
ValueError: Cannot upload files opened in text mode:  use open(filename, mode='rb') or open(filename, mode='r+b')

And here is the bigquery API call which uses the filehandle:

def bq_load_csv(dataset_id, table_id, fileobj):
    client = bigquery.Client()
    dataset_ref = client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)
    job_config = bigquery.LoadJobConfig()
    job_config.source_format = 'text/csv'
    job_config.field_delimiter = ','
    job_config.skip_leading_rows = 0
    job_config.allow_quoted_newlines = True
    job_config.max_bad_records = 0
    job = client.load_table_from_file(
        fileobj,
        table_ref,
        job_config=job_config)
    res = job.result()  # Waits for job to complete
    return res

Update

This problem was fixed in the Python BigQuery client 1.5.0. Thanks to @a-queue, who filed a bug report, and thanks to the Google devs who actually fixed it.


Solution

  • The proper way to deal with this is to raise an issue in the respective issue trackers of both Python and the Google Cloud Client Library for Python.

    Workaround

    You could replace the _check_mode function in google.cloud.bigquery.client with a version that also accepts 1 and 2, as I did below. I have run this code and it works:

    import gzip
    from google.cloud import bigquery
    
    def _check_mode(stream):
        mode = getattr(stream, 'mode', None)
    
        if mode is not None and mode not in ('rb', 'r+b', 'rb+', 1, 2):
            raise ValueError(
                "Cannot upload files opened in text mode:  use "
                "open(filename, mode='rb') or open(filename, mode='r+b')")
    
    
    bigquery.client._check_mode = _check_mode
    
    #...
    
    def bq_load_csv(dataset_id, table_id, fileobj):
        #...
    

    Explanation

    google-cloud-python

    The traceback shows that the call that failed was the function _check_mode in google/cloud/bigquery/client.py:

    if mode is not None and mode not in ('rb', 'r+b', 'rb+'):
        raise ValueError(
            "Cannot upload files opened in text mode:  use "
            "open(filename, mode='rb') or open(filename, mode='r+b')")
    

    gzip.py

    In the gzip library, the __init__ function of the class GzipFile receives the variable mode, but does NOT assign it to self.mode. Instead, it uses mode to choose an integer constant:

    READ, WRITE = 1, 2 #line 18
    ...
    class GzipFile(_compression.BaseStream):
    ...
    def __init__(self, filename=None, mode=None,
        ...
        elif mode.startswith(('w', 'a', 'x')): #line 179
            self.mode = WRITE
    

    According to git blame, line 18 was last changed 21 years ago and line 180 (self.mode = WRITE) 20 years ago.
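
If patching the client's private _check_mode is not an option, roughly the same effect can be sketched from the caller's side: wrap the GzipFile in a small proxy that reports a string mode. This is only a sketch under assumptions. BinaryModeProxy and mode_as_string are hypothetical names, not part of gzip or the BigQuery client, and I have not run this against load_table_from_file. It assumes the client only needs the mode attribute plus ordinary file methods such as read, seek and tell.

```python
import gzip
import io

# Integer modes used by GzipFile (READ, WRITE = 1, 2 in gzip.py, as quoted above).
_GZIP_MODE_NAMES = {1: "rb", 2: "wb"}


def mode_as_string(fileobj):
    """Return the file object's mode, translating gzip's integers to strings."""
    mode = getattr(fileobj, "mode", None)
    # Anything that is not 1 or 2 (a string such as 'rb', or None) passes through.
    return _GZIP_MODE_NAMES.get(mode, mode)


class BinaryModeProxy(object):
    """Hypothetical wrapper: forwards to the wrapped file but exposes a string mode."""

    def __init__(self, fileobj):
        self._fileobj = fileobj
        self.mode = mode_as_string(fileobj)

    def __getattr__(self, name):
        # Only called for attributes the proxy itself lacks (read, seek, tell, ...).
        return getattr(self._fileobj, name)

    def __iter__(self):
        return iter(self._fileobj)


# Usage: build a small gzip stream in memory, then read it back through the proxy.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(b"a,b\n1,2\n")
buf.seek(0)

wrapped = BinaryModeProxy(gzip.GzipFile(fileobj=buf, mode="rb"))
payload = wrapped.read()
print("mode reported by proxy: %r" % wrapped.mode)
```

The proxy leaves the library untouched, at the cost of one extra object per file. Anything that checks isinstance against a real file type would still see the proxy, so the monkeypatch above stays the more direct workaround.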