amazon-redshiftamazon-redshift-spectrum

Items in the FROM clause of aggregation subqueries have to refer to nested tables of higher level FROM clauses


I have data being written into AWS S3 through kafka and hence the message can be written more than once. I'm using Spectrum Redshift to query it.

Since the data is nested, I've faced some limitations explained in the documentation, but I can't seem to work a way to deduplicate it.

I've first attempted the following to get rid of the duplicates using group by and then doing my usual aggregation

with rid_of_duplicates as (
select table.field_1, table.struct_1.field_2, table.struct_1.field_3
from table
group by 1,2,3 
)

select field_1 || field_2, count(field_3)
from rid_of_duplicates
group by field_1, field_2

But I get the following error

ERROR: Spectrum nested query error Detail: ----------------------------------------------- error: Spectrum nested query error code: 8001 context: Items in the FROM clause of aggregation subqueries have to refer to nested tables of higher level FROM clauses. query: 0 location: nested_query_rewriter.cpp:814 process: padbmaster [pid=6630] ----------

If I get rid of the GROUP BY in the subquery, the query works fine, but its output is wrong.

Reading the docs, it seems that I wasn't supposed to select specific fields in the subquery. But as I'm using nested data, I can't use SELECT *.

I've also attempted creating a view with no schema binding, but that failed as well. SELECT Distinct doesn't work as well

If possible, I'd not like to create a table for that, since the data is being stored in S3. Also, count(distinct field_3) doesn't work due to business logic


Solution

  • The solution I found for this is to set json_serialization to TRUE for the sessions, this way Redshift sets all data to behave as json and then you can work on it as if it were a regular table. Docs

    SET json_serialization_enable TO true;
    

    In my case, my data was a pure struct and not an array of structs (what would enable the join mentioned in the use cases for neted data)