pythonduckdb

For every identifier select only rows with largest order column


Consider the following table source

id: LONG
name: STRING
ornum: LONG

The goal is to create a new table target which for each distinct id in source selects the row with largest ornum.

The following works fine for reasonable number of distinct ids:

CREATE OR REPLACE TABLE target AS SELECT DISTINCT ON(id) * FROM test ORDER BY ornum DESC;

Is there a way to handle large number (e.g 1B) distinct values with limited memory (say 4GB)?


Solution

  • I'm not sure how your SQL solves the problem given it has no condition on ornum - you may just be getting lucky.

    In DuckDB, you can use the arg_max function just as @jqurious suggested. This will scale with the number of id values.

    The traditional SQL windowing version from @Barmar will have to materialise a sorted version of the entire data set and then pull out only one row. You can do this in DuckDB:

    SELECT columns(*)
    FROM test
    WINDOW w AS (PARTITION BY id ORDER BY ornum DESC)
    QUALIFY row_number() over w = 1
    

    but it will be a lot slower and consume a lot more RAM/Disk space.