sqldatabricks

Databricks SQL string_agg


Hopefully a quick one;

Migrating some on-premise SQL views to Databricks and struggling to find conversions for some functions. The main one is the string_agg function.

string_agg(field_name, ', ')

Anyone know how to convert that to Databricks SQL?

Thanks in advance.


Solution

  • The rough equivalent would be using collect_set and array_join but note you have lost the order:

    %sql
    SELECT col1, array_join(collect_set(col2), ',') j
    FROM tmp
    GROUP BY col1
    

    I do not think STRING_AGG guarantees order (unless you specify the WITHIN GROUP...ORDER BY clause) but you should expect the order not to match. Hopefully the order does not matter to your process but you should double-check it does not have any implications for your process. As per the official documentation:

    [collect_list] is non-deterministic because the order of collected results depends on the order of the rows which may be non-deterministic after a shuffle.

    They have recently added the ordinal argument to STRING_AGG to Azure SQL DB, Managed Instance and Synapse, but presumably you don't yet have that feature on-premises anyway.