python-2.7google-bigquerygoogle-cloud-platformgoogle-python-api

Byte field truncated in bigquery python API


I am trying to query a table in BigQuery with python with the sample code given in the documentation:

query_job = bq_client.run_async_query(str(uuid.uuid4()), sql_query)
query_job.use_legacy_sql = False
query_job.begin()
query_job.result()
destination_table = query_job.destination
destination_table.reload()
query_result = destination_table.fetch_data()
for row in query_result:
    print row

I get results looking like:

('?S\x9e\xbe\x9b\xadB\xd8\x92**5\xcck\xee]', 15, 28, 28, datetime.date(2017, 9, 5))
('%SwP\xe6xMK\x99T\xa1\x7f\xbbk>\xff', 15, 45, 19, datetime.date(2017, 9, 21))
('\xd7\x99>\x05(\x94M\xd8\x92\x0e\xe77\x8b\xcc\x08\xf0', 15, 18, 18, datetime.date(2017, 9, 4))
('\x0f+\xe7\xe0\xba5Ei\xa3\xb0\xfd\xd8\x1a\xa2wy', 15, 16, 16, datetime.date(2017, 9, 21))
('\xa3C\x0b\xdb\xebqM\xa1\xb3f\r\x8f#\x85\x93<', 15, 20, 16, datetime.date(2017, 9, 21))
(':\x1cZ?\x13\xf1A\xf5\x8e\xba\xfeYL.|v', 15, 15, 15, datetime.date(2017, 9, 6))
('\x1bP\x88\xd7\x1a\xfbGC\xb5$\x10\x97gx<\xb2', 15, 15, 15, datetime.date(2017, 9, 19))
('(B\xbc\xb7\xe9\xc0D\x89\xb0\x82jfW;,\x1e', 15, 18, 14, datetime.date(2017, 9, 19))
('\xd8\xbaw\x88\x89<Oh\x81]v\xa8!-\x83\x7f', 15, 17, 13, datetime.date(2017, 9, 6))
("\x94\x1f'\xf1\xd1$C\x9b\xb4o\x81H\x17\xf4\xa5S", 10, 14, 12, datetime.date(2017, 9, 5))
('\x949\x17\xbf\x90\xd7L\x04\x98\xe9+5\x9d\x1a\xb4\xe4', 15, 12, 12, datetime.date(2017, 9, 21))

The first field is a byte type that's supposed to be a b64 uuid. The actual result in the table, the UI, with the java wrapper, and in command line (with the same query) is:

P1OevputQtiSKio1zGvuXQ==    15  28  28  2017-09-05
JVN3UOZ4TUuZVKF/u2s+/w==    15  45  19  2017-09-21
15k+BSiUTdiSDuc3i8wI8A==    15  18  18  2017-09-04
o0ML2+txTaGzZg2PI4WTPA==    15  20  16  2017-09-21
Dyvn4Lo1RWmjsP3YGqJ3eQ==    15  16  16  2017-09-21
G1CI1xr7R0O1JBCXZ3g8sg==    15  15  15  2017-09-19
OhxaPxPxQfWOuv5ZTC58dg==    15  15  15  2017-09-06
KEK8t+nARImwgmpmVzssHg==    15  18  14  2017-09-19
2Lp3iIk8T2iBXXaoIS2Dfw==    15  17  13  2017-09-06
lB8n8dEkQ5u0b4FIF/SlUw==    10  14  12  2017-09-05
FJZXBYCAR8mQQwEeuuKKhQ==    15  12  12  2017-09-19
APVxsNTHSrCSU0z6QWdXSw==    15  12  12  2017-09-05
lDkXv5DXTASY6Ss1nRq05A==    15  12  12  2017-09-21

I can't seem to have the full byte array in the python answer, whereas I get it with every other wrapper. Here's the query:

SELECT *
                FROM (SELECT c.rich_log.user_id,
                      w.topic_id, count(*) as event_count,
                      count(CASE WHEN c.header.timestamp > (UNIX_MILLIS(CURRENT_TIMESTAMP()) - 2506000000)
                                  then 1 else null end) as event_count_in_lookback_window,
                      date(TIMESTAMP_MILLIS(max(c.header.timestamp))) as last_event
                FROM `table1_*`  as c
                join `table2` as w on w.id = c.rich_log.website_id
                where w.topic_id in {}
                and c.header.timestamp > UNIX_MILLIS(CURRENT_TIMESTAMP()) - 3456000000
                group by c.rich_log.user_id, w.topic_id)
                WHERE event_count_in_lookback_window > 0
                ORDER BY event_count_in_lookback_window DESC
                LIMIT 100

Does anyone see why this is happening? my byte field seems to be truncated.

Thanks,

Anton


Solution

  • Nothing is truncated here. Instead, python is showing the raw binary data, while the other interfaces are base64 encoding the data for you. To show the result is equivalent in the two formats:

    $ printf '?S\x9e\xbe\x9b\xadB\xd8\x92**5\xcck\xee]' | base64
    P1OevputQtiSKio1zGvuXQ==
    

    This is happening as your UUID is being stored in a BYTES field. This is the most efficient and correct way to store such data, but does mean different systems will by default convert to strings in different way.

    In python, there is a base64 module that will covert to base64 string if you want.

    Python also has a uuid module that you can use to change the bytes directly into a UUID object:

    >>> import uuid
    >>> print uuid.UUID(bytes=b'?S\x9e\xbe\x9b\xadB\xd8\x92**5\xcck\xee]')
    3f539ebe-9bad-42d8-922a-2a35cc6bee5d