I have a system basically like this:
CREATE TABLE entries (
id SERIAL PRIMARY KEY,
slug VARCHAR(50) NOT NULL
);
CREATE TABLE pronunciations (
id SERIAL PRIMARY KEY,
entry_id INTEGER NOT NULL,
position INTEGER NOT NULL,
text VARCHAR(50) NOT NULL,
FOREIGN KEY (entry_id) REFERENCES entries (id) ON DELETE CASCADE
);
Entries can have many pronunciations, and they are ordered per entry by position
.
How do I select 100 pronunciations (for 100 entries), even if there are say 1-5 pronunciations per entry? I want to basically get 1 pronunciation per entry in the final result, and only want the result to contain pronunciations. I want to pick the one with the lowest position, presumably position == 0
. I will then (at the application layer), assign them to the entry record (which has other stuff I left out which is beside the point).
I found examples that essentially say to do something with WITH
, along these lines:
WITH ranked_pronunciations AS (
SELECT
p.id,
p.entry_id,
p.position,
p.text,
ROW_NUMBER() OVER (PARTITION BY p.entry_id ORDER BY p.position) AS rn
FROM
pronunciations p
)
SELECT
rp.id,
rp.entry_id,
rp.position,
rp.text
FROM
ranked_pronunciations rp
WHERE
rp.rn = 1
ORDER BY
rp.entry_id
LIMIT 100;
That seems awfully complex. Is there a way to do this in a simple way? I guess I could fetch all pronunciations and just filter out the first ones at the application layer, but that is a ways of data transfer.
I am using Kysely, a PostgreSQL Node.js query builder, so I'm not quite sure how to handle such a complex WITH
statement in that, and wondering (a) if that's the required approach (in which case I'll figure out the Kysely syntax), or (b) if there's a simpler approach.
You are right, there is a much simpler way with DISTINCT ON
:
SELECT DISTINCT ON (entry_id) *
FROM pronunciations
ORDER BY entry_id, position -- !
LIMIT 100;
Typically faster, too, for your data distribution ("1-5 pronunciations per entry"). See:
Like in your original query, the table entries
is not needed. But this:
I will then (at the application layer), assign them to the entry record [...]
... sounds like it would be a lot faster integrated in a single query.