[SOLVED] Array Aggregation in Bigquery with Run Length Encoded Arrays

Array Aggregation in Bigquery with Run Length Encoded Arrays

I've got a BigQuery table that looks something like this:

--------------------------------------------------------
|  col1  | cols 2-10 |              col 11             |
--------------------------------------------------------
| event1 | metatdata | [0,0,0,0,0,0,0...1, 30, 50, 43] |
--------------------------------------------------------

The last column is an array of int64s that represent event data (the place in the array is important as well as any leading zeros). Normally ARRAY_AGG would work to aggregate the counts in that last column (which is the end goal of all this), but in my case the data is actually run length encoded (RLE), so it might look like this:

[-20, 1, 4, 5, 10, 20]
[-10, 2, 4, 10, 3, 8]

Where any negative number represents that number of zeroes.

So [-10, 1] is: [0,0,0,0,0,0,0,0,0,0,1]

I'm trying to think of a way to replace the negative numbers at query time with the correct length of array 0's so that I can then feed it into ARRAY_AGG.

An ideal solution would be able to replace any number of negative numbers in the array with the appropriate number of zeros at query time to allow for the aggregation to accurately compute. A less ideal solution would allow for a fixed number of replacements (or even 1).

So, ideal: [-4, 3, -2, 1, -3, 9] -> [0,0,0,0,3,0,0,1,0,0,0,9]

My attempts so far have used UNNEST and IF statements but I've not gotten very far.

Thanks for any input.

Solution

Consider below approach

select 
  format('%t', col11) col11, 
  format('%t', encoded_col11) encoded_col11, 
  array(select cast(el as int64) from t.encoded_col11 as el)
from (
  select col11, split((
    select string_agg(if(num < 0,repeat('0', -num), '' || num), '' order by offset)
    from t.col11 num with offset
  ), '') encoded_col11
  from your_table t
) t

if applied to sample data in your question - output is