I am attempting to flatten a list of values into a comma separated list in the last step of a SQL transformation in Workato. I have been advised that this is backed by Apache Datafusion. I've previously done a lot of work on assorted versions of MS SQL Server. In that world I know to use the stuff/XML Path combo to accomplish this goal. Currently I have this that "works":
key_value
,array_to_string(array_agg(DISTINCT FlatColumn1)
,array_to_string(array_agg(DISTINCT FlatColumn2)
,array_to_string(array_agg(DISTINCT FlatColumn3)
I then use a window function, Row_number(), to get rid of extra rows.
However I'm losing some values that aren't being pulled into the array in the first place and wind up on a second row and get thrown out of my final dataset. I don't see array_agg() in the documentation for datafusion
https://datafusion.apache.org/user-guide/sql/scalar_functions.html#array-functions
I don't see similar questions online, in blog posts, etc... I suspect without knowing that most people accomplish this with some sort of python scripting but that is not an option for me here due to external constraints. Is what I'm trying to accomplish here possible? If so how? If I need to put more info in here please let me know and I'm happy to add more as needed.
Got this sorted.
with CoreData as(
select distinct key_value, FlatColumn1, FlatColumn2, FlatColumn3
from dataSet1),
Flatset1 as(
select distinct key_value,
array_to_string(array_agg(DISTINCT FlatColumn1) as FlatColumn1,
FlatColumn2, FlatColumn3
from dataSet1
Group by FlatColumn2, FlatColumn3)
[repeat for as many columns as you want to flatten with a select statement at the end that merges the CTE blocks together into your final return set.]
What I don't know is why this is the case. My suspicion is that in DataFusion that the array_agg() does not gracefully handle the race conditions created when you have multiple instances of it acting on different parts of the same return set and that by breaking them apart and isolating them this is obviated.