postgresqlcommon-table-expressiongraph-traversal

Postgres slow recursive CTE graph traversal due to visited edges


I have a postgres table with a 100K± records representing edges of graph (will grows even to 1M) defined as (I left the relevant parts only):

CREATE TABLE public.collection_lineages (
    id int8 NOT NULL GENERATED BY DEFAULT AS IDENTITY( INCREMENT BY 1 MINVALUE 1 MAXVALUE 9223372036854775807 START 1 CACHE 1 NO CYCLE),
    upstream_id int8 NOT NULL,
    downstream_id int8 NOT NULL,
);

worth mentioning:

After researching about postgres graph traversal I found postgres CYCLE DETECTION and I wrote this query:

WITH RECURSIVE
"recursive_table"("id",
"upstream_id",
"downstream_id",
"depth") AS (
SELECT
    "working_table"."id",
    "working_table"."upstream_id",
    "working_table"."downstream_id",
    1
FROM
    "collection_lineages" AS "working_table"
WHERE
    "working_table"."upstream_id" = $1
UNION ALL
SELECT
    "working_table"."id",
    "working_table"."upstream_id",
    "working_table"."downstream_id",
    "recursive_table"."depth" + 1
FROM
    "collection_lineages" AS "working_table",
    "recursive_table"
WHERE
    "working_table"."upstream_id" = "recursive_table"."downstream_id") CYCLE "id" SET "is_cycle" USING PATH

-- filter out cycles
SELECT
    "recursive_table"."id"
FROM
    "recursive_table"
WHERE
    NOT "recursive_table"."is_cycle";

when querying it with some collection_id as $1, CTE traverse the table but revisiting edges. simple example will be:

  B 
 / \
A   D-E-F...-Z
 \ /
  C

query with A will produce:

A->B, A->C, B->D, C->D, D->E, D->E...

notice path D->Z returned twice due to the nature of recursive CTEs. This isn't efficient in scale.

My question is: is there a better way query the graph using CTE avoiding visited edges?

I tried adding: WHERE "working_table"."id" NOT IN (SELECT id from recursive_table) but I'm getting: recursive reference to query "recursive_table" must not appear within a subquery.

My workaround is to SELECT DISTINCT "recursive_table"."id" but the issue is still there and query takes unreasonable amount of time.

I already tried writing a simple BFS script for that, maintaining visited_ids table, iterating layer after layer. It shows better performance in general.

I also looked at age extension but this works with Cypher and force me to create the graph via them and I'd like to use my existing table.


Solution

  • I ended up with the BFS script, it's way faster gave me 42K rows for 28 depth graph in just 2.4s. Sorry folks recursive CTEs produce repetitive records - it's just doesn't scale.