sql arrays postgresql sorting aggregate-functions

Create two arrays for two fields, keeping sort order of arrays in sync (without subquery)


There is no rhyme or reason for this question other than I was curious about how one would go about doing this.

Platform: while I was hoping for a SQL-standard solution, my main focus is PostgreSQL 8.4+. (I know 9.0+ has some array sorting functions.)

SELECT    id, group, dt
FROM      foo
ORDER BY  id;
  id   | group |    dt
-------+-------+-----------
   1   |  foo  | 2012-01-01
   1   |  bar  | 2012-01-03
   1   |  baz  | 2012-01-02
   2   |  foo  | 2012-01-01
   3   |  bar  | 2012-01-01
   4   |  bar  | 2012-01-01
   4   |  baz  | 2012-01-01

I know the following query is wrong, but the result is similar to what I'm after; a way to tie the two fields (sorting of group should also sort dt):

SELECT    id, sort_array(array_agg(group)), array_agg(dt)
FROM      foo
GROUP BY  id;
  id   |     group      |                dt
-------+----------------+------------------------------------
   1   |  {bar,baz,foo} | {2012-01-03,2012-01-02,2012-01-01}
   2   |  {foo}         | {2012-01-01}
   3   |  {bar}         | {2012-01-01}
   4   |  {bar,baz}     | {2012-01-01,2012-01-01}

Is there an easy way to tie the fields together for sorting, without using a subquery? Perhaps build an array of arrays and then unnest?


Solution

  • I understand your question like this:

    Get the two arrays sorted in identical sort order so that the same element position corresponds to the same row in both arrays.

    Use a subquery or CTE and order the rows before you aggregate.

    SELECT id, array_agg(grp) AS grp, array_agg(dt) AS dt
    FROM  (
        SELECT *
        FROM   tbl
        ORDER  BY id, grp, dt
        ) x
    GROUP  BY id;
    

    This is typically faster than putting an ORDER BY inside each aggregate function call.

    I changed your column name group to grp because group is a reserved word in Postgres and in every SQL standard, so it shouldn't be used as an identifier.
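
    To try the subquery version, here is a minimal, self-contained sketch that loads the sample rows from the question into a temporary table. The table name tbl follows the query above; the column types (int, text, date) are assumptions, since the question does not state them.

    ```sql
    -- Assumed schema: the question does not give the column types.
    CREATE TEMP TABLE tbl (id int, grp text, dt date);

    -- Sample rows copied from the question.
    INSERT INTO tbl (id, grp, dt) VALUES
      (1, 'foo', '2012-01-01')
    , (1, 'bar', '2012-01-03')
    , (1, 'baz', '2012-01-02')
    , (2, 'foo', '2012-01-01')
    , (3, 'bar', '2012-01-01')
    , (4, 'bar', '2012-01-01')
    , (4, 'baz', '2012-01-01');

    -- Order the rows first, then aggregate them per id.
    SELECT id, array_agg(grp) AS grp, array_agg(dt) AS dt
    FROM  (
        SELECT *
        FROM   tbl
        ORDER  BY id, grp, dt
        ) x
    GROUP  BY id;
    ```

    For id = 1 this should give {bar,baz,foo} and {2012-01-03,2012-01-02,2012-01-01}, which is the pairing asked for in the question.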

    Is ORDER BY in a subquery safe?

    The manual:

    The aggregate functions array_agg, json_agg, [...] as well as similar user-defined aggregate functions, produce meaningfully different result values depending on the order of the input values. This ordering is unspecified by default, but can be controlled by writing an ORDER BY clause within the aggregate call, as shown in Section 4.2.7. Alternatively, supplying the input values from a sorted subquery will usually work. For example:

    SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;
    

    Beware that this approach can fail if the outer query level contains additional processing, such as a join, because that might cause the subquery's output to be reordered before the aggregate is computed.

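    So yes, it's safe in the example: the outer query only aggregates the sorted subquery, with no join or other processing that could reorder the rows first.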

    Without subquery

    If you really need a solution without a subquery, you can put an ORDER BY inside each aggregate call (requires PostgreSQL 9.0 or later):

    SELECT id
         , array_agg(grp ORDER BY grp)
         , array_agg(dt  ORDER BY grp, dt)
    FROM   tbl
    GROUP  BY id;
    

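    I also sort by dt to break ties and make the sort order unambiguous. That isn't necessary for the grp array, though: rows that tie on grp contribute identical values, so their relative order doesn't change the result.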
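
    One way to check that the two arrays really line up is to unnest them in parallel and compare the pairs with the original rows. This is only a sketch: it assumes PostgreSQL 9.4 or later (for unnest() with several arrays in the FROM list), and the inline sample data, the CTE name agg and the aliases grps, dts and u are made up for illustration.

    ```sql
    WITH tbl (id, grp, dt) AS (       -- inline subset of the question's sample rows
       VALUES (1, 'foo', date '2012-01-01')
            , (1, 'bar', date '2012-01-03')
            , (1, 'baz', date '2012-01-02')
            , (4, 'bar', date '2012-01-01')
            , (4, 'baz', date '2012-01-01')
       )
    , agg AS (                        -- the aggregate query from above
       SELECT id
            , array_agg(grp ORDER BY grp)     AS grps
            , array_agg(dt  ORDER BY grp, dt) AS dts
       FROM   tbl
       GROUP  BY id
       )
    SELECT a.id, u.grp, u.dt          -- pair up elements at the same position
    FROM   agg a
         , unnest(a.grps, a.dts) AS u(grp, dt)
    ORDER  BY a.id, u.grp, u.dt;
    ```

    If the arrays are in sync, every (id, grp, dt) row returned here should match one of the input rows.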

    There is also a completely different way with window functions:

    SELECT DISTINCT ON (id)
           id
         , array_agg(grp) OVER w AS grp
         , array_agg(dt)  OVER w AS dt
    FROM   tbl
    WINDOW w AS (PARTITION BY id ORDER BY grp, dt
                 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    ORDER  BY id;
    

    Note the use of DISTINCT ON (id) instead of plain DISTINCT. Both produce the same result, but DISTINCT ON (id) performs faster by an order of magnitude because it does not need an extra sort.
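
    The explicit frame in the window definition matters. With an ORDER BY and no frame clause, the default frame ends at the current row, so array_agg() would return a running array instead of the whole partition. A small sketch to illustrate, using inline sample data for id = 1; the column aliases running and whole_partition are made up:

    ```sql
    WITH tbl (id, grp, dt) AS (       -- rows for id = 1 from the question
       VALUES (1, 'foo', date '2012-01-01')
            , (1, 'bar', date '2012-01-03')
            , (1, 'baz', date '2012-01-02')
       )
    SELECT id, grp
           -- default frame: start of partition up to the current row
         , array_agg(grp) OVER (PARTITION BY id ORDER BY grp, dt) AS running
           -- explicit frame: the whole partition for every row
         , array_agg(grp) OVER (PARTITION BY id ORDER BY grp, dt
                                ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS whole_partition
    FROM   tbl
    ORDER  BY id, grp, dt;
    ```

    Here running grows row by row ({bar}, {bar,baz}, {bar,baz,foo}), while whole_partition is {bar,baz,foo} on every row. That is why any one row per id can be kept.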

    I ran some tests, and the window-function version is almost as fast as the other two solutions. As expected, the subquery version was still fastest. Test with EXPLAIN ANALYZE to see for yourself.
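
    A rough sketch of such a comparison follows. The temporary table bench, its size and its random data distribution are assumptions for illustration only; results depend on your data, version and settings, so none are shown.

    ```sql
    -- Hypothetical test data: 100000 rows spread over roughly 10000 ids.
    CREATE TEMP TABLE bench AS
    SELECT (random() * 10000)::int                   AS id
         , md5(random()::text)                       AS grp
         , date '2012-01-01' + (random() * 365)::int AS dt
    FROM   generate_series(1, 100000);

    ANALYZE bench;  -- collect statistics for the planner

    -- Sorted subquery, then aggregate.
    EXPLAIN ANALYZE
    SELECT id, array_agg(grp) AS grp, array_agg(dt) AS dt
    FROM  (SELECT * FROM bench ORDER BY id, grp, dt) x
    GROUP  BY id;

    -- ORDER BY inside each aggregate call.
    EXPLAIN ANALYZE
    SELECT id
         , array_agg(grp ORDER BY grp)
         , array_agg(dt  ORDER BY grp, dt)
    FROM   bench
    GROUP  BY id;
    ```

    Run each statement a few times and compare the reported execution times rather than relying on a single run.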