pythonsqlgoogle-bigquery

Python Google BigQuery Paramaterized SELECT


I am running into an issue with BigQuery parameterization. I am passing a start date and end date to the function, along with an array of potential fields present in the database. Start and end date formatted as "yyyymmdd".

The goal is to be able to pass a set of dates and a set of fields and collect the data relating to the array of fields between the two dates.

Date operation works as intended.

The fields array is passed like so: ["user_pseudo_id", "event_name", "event_timestamp"] as an example (other entries in array possible)

Effectively, I want to further parameterize the query to look similar to the below, where the @search_params replaces individual variables in the SELECT portion of the query. The intent is for the fields array to be more scalable, from a single entry to multiple.

From my searching, I believe ArrayQueryParameter (in place of ScalarQueryParameter) would solve the issue, but I haven't found much documentation for the usage.

query_job = client.query("""
    SELECT @search_params, _TABLE_SUFFIX AS suffix
    FROM `analytics_180354243.events_*` 
    WHERE REGEXP_EXTRACT(_TABLE_SUFFIX, r'[0-9]+')
    BETWEEN @start_date AND @end_date
    """, job_config=job_config)

FULL FUNCTION BELOW

def query_awe(start_date, end_date, fields):
    credentials = service_account.Credentials.from_service_account_file('auth.json')

    project_id = 'my-project-id'

    client = bigquery.Client(credentials=credentials, project=project_id)

    search_params = ""

    for i in fields:
        search_params += i + ", "
    search_params = search_params[:-2]

    query_params = [
        bigquery.ScalarQueryParameter('start_date', 'STRING', start_date),
        bigquery.ScalarQueryParameter('end_date', 'STRING', end_date),
        bigquery.ScalarQueryParameter('search_params', 'STRING', search_params),

    ]
    bigquery.ArrayQueryParameter

    job_config = bigquery.QueryJobConfig()
    job_config.use_legacy_sql = False
    job_config.query_parameters = query_params

    query_job = client.query("""
        SELECT user_pseudo_id, event_name, _TABLE_SUFFIX AS suffix
        FROM `analytics_180354243.events_*` #Each day saved as events_yyyymmdd
        WHERE REGEXP_EXTRACT(_TABLE_SUFFIX, r'[0-9]+')
        BETWEEN @start_date AND @end_date
        ORDER BY user_pseudo_id DESC
        """, job_config=job_config)

    results = query_job.result()  # Waits for job to complete.

    for row in results:
        print(row)

Solution

  • What about just using string formatting?

    def query_awe(start_date, end_date, fields):
        credentials = service_account.Credentials.from_service_account_file('auth.json')
    
        project_id = 'my-project-id'
    
        client = bigquery.Client(credentials=credentials, project=project_id)
    
        job_config = bigquery.QueryJobConfig()
        job_config.use_legacy_sql = False
    
        my_query = """
            SELECT {0}, _TABLE_SUFFIX AS suffix
            FROM `analytics_180354243.events_*` #Each day saved as events_yyyymmdd
            WHERE REGEXP_EXTRACT(_TABLE_SUFFIX, r'[0-9]+')
            BETWEEN {1} AND {2}
            ORDER BY user_pseudo_id DESC
            """
    
        my_query = my_query.format(', '.join(fields), start_date, end_date)
        query_job = client.query(my_query, job_config=job_config)
        results = query_job.result()  # Waits for job to complete.
    
        for row in results:
            print(row)