google-bigquery

BigQuery apply function over many columns in groupby


I come from an R background where the apply family of functions makes it incredibly easy to apply a function over multiple columns at once. In the updated example below, I have a table with multiple "pctile" columns, with numbers ranging from 1 - 100 in these columns:

WITH 
  wide_stats AS (
    SELECT 
      'joe' name, 'bills' team, false as is_o,
      7 as stat1, 95 as stat1_pctile, 
      9 as stat2, 82 as stat2_pctile, 
      3 as stat3, 67 as stat3_pctile
    UNION ALL
    SELECT 
      'joe', 'bills', true as is_o,
      1 as stat1, 37 as stat1_pctile,
      5 as stat2, 54 as stat2_pctile,
      2 as stat3, 17 as stat3_pctile
    UNION ALL
    SELECT 
      'tim' name, 'jets' team, false as is_o,
      17 as stat1, 65 as stat1_pctile,
      12 as stat2, 85 as stat2_pctile,
      13 as stat3, 69 as stat3_pctile
    UNION ALL
    SELECT
      'tim' name, 'jets' team, true as is_o,
      15 as stat1, 32 as stat1_pctile,
      14 as stat2, 56 as stat2_pctile,
      12 as stat3, 16 as stat3_pctile
  )
      
 SELECT * FROM wide_stats

To each column that ends in _pctile, I would like to run the function if(is_o, 100 - col_val, col_val), that simply (a) checks if the is_o column is true, and (b) if true, then do 100 - col_val on whatever value is already in the column, otherwise leave the column unchanged. I would like to achieve this without having to manually write this code out for each column ending in _pctile, as I have 100+ of these in my actual, very wide data.

Although this is BQ instead of R, this type of apply functionality would help me significantly with my task.


Solution

  • Unfortunately, you are paying the price for the non-optimal schema you've chosen to use.

    The only option I see here is to flatten your data such that each key(stat1, stat2, stat3, etc) is on separate row as in example below (BigQuery Standard SQL)

    #standardSQL
    SELECT * REPLACE(IF(is_o, 100 - value_pctile, value_pctile) AS value_pctile)
    FROM (
      SELECT name, team, is_o, 
        REPLACE(SPLIT(kv, '":')[OFFSET(0)], '_pctile', '') key, 
        CAST(REGEXP_EXTRACT(TO_JSON_STRING(t), CONCAT(r'"', REPLACE(SPLIT(kv, '":')[OFFSET(0)], '_pctile', ''), r'":(\d+),')) AS INT64) value,
        SPLIT(kv, '":')[OFFSET(0)] key_pctile, 
        CAST(SPLIT(kv, '":')[OFFSET(1)] AS INT64) value_pctile
      FROM wide_stats t, UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"(\w+_pctile":\d+),')) kv
    )
    

    if to apply to sample data from your question - result is

    Row name    team    is_o    key     value   key_pctile      value_pctile     
    1   joe     bills   false   stat1   7       stat1_pctile    95   
    2   joe     bills   false   stat2   9       stat2_pctile    82   
    3   joe     bills   true    stat1   1       stat1_pctile    63   
    4   joe     bills   true    stat2   5       stat2_pctile    46   
    5   tim     jets    false   stat1   17      stat1_pctile    65   
    6   tim     jets    false   stat2   12      stat2_pctile    85   
    7   tim     jets    true    stat1   15      stat1_pctile    68   
    8   tim     jets    true    stat2   14      stat2_pctile    44  
    

    As you can see, if you would have your table flattened to start with - the query would be as simple as

    #standardSQL
    SELECT * REPLACE(IF(is_o, 100 - value_pctile, value_pctile) AS value_pctile)
    FROM flattened_table  
    

    Note: for all practical purposes - you anyway will need to flatten whatever result you need otherwise you constantly be facing same very issue like this one!