I have a postgres table with a 100K± records representing edges of graph (will grows even to 1M) defined as (I left the relevant parts only):
CREATE TABLE public.collection_lineages (
id int8 NOT NULL GENERATED BY DEFAULT AS IDENTITY( INCREMENT BY 1 MINVALUE 1 MAXVALUE 9223372036854775807 START 1 CACHE 1 NO CYCLE),
upstream_id int8 NOT NULL,
downstream_id int8 NOT NULL,
);
worth mentioning:
After researching about postgres graph traversal I found postgres CYCLE DETECTION and I wrote this query:
WITH RECURSIVE
"recursive_table"("id",
"upstream_id",
"downstream_id",
"depth") AS (
SELECT
"working_table"."id",
"working_table"."upstream_id",
"working_table"."downstream_id",
1
FROM
"collection_lineages" AS "working_table"
WHERE
"working_table"."upstream_id" = $1
UNION ALL
SELECT
"working_table"."id",
"working_table"."upstream_id",
"working_table"."downstream_id",
"recursive_table"."depth" + 1
FROM
"collection_lineages" AS "working_table",
"recursive_table"
WHERE
"working_table"."upstream_id" = "recursive_table"."downstream_id") CYCLE "id" SET "is_cycle" USING PATH
-- filter out cycles
SELECT
"recursive_table"."id"
FROM
"recursive_table"
WHERE
NOT "recursive_table"."is_cycle";
when querying it with some collection_id as $1
, CTE traverse the table but revisiting edges. simple example will be:
B
/ \
A D-E-F...-Z
\ /
C
query with A
will produce:
A->B, A->C, B->D, C->D, D->E, D->E...
notice path D->Z returned twice due to the nature of recursive CTEs. This isn't efficient in scale.
My question is: is there a better way query the graph using CTE avoiding visited edges?
I tried adding: WHERE "working_table"."id" NOT IN (SELECT id from recursive_table)
but I'm getting: recursive reference to query "recursive_table" must not appear within a subquery
.
My workaround is to SELECT DISTINCT "recursive_table"."id"
but the issue is still there and query takes unreasonable amount of time.
I already tried writing a simple BFS script for that, maintaining visited_ids
table, iterating layer after layer. It shows better performance in general.
I also looked at age extension but this works with Cypher and force me to create the graph via them and I'd like to use my existing table.
I ended up with the BFS script, it's way faster gave me 42K rows for 28 depth graph in just 2.4s. Sorry folks recursive CTEs produce repetitive records - it's just doesn't scale.