I'm seeking advice on best practices for implementing partitioned tables in a data warehouse using PostgreSQL. My understanding is that partitioning a fact table by a timestamp column time makes sense, since analytic queries often filter by time intervals. Semantically, it would be correct to have event_id as the primary key. However, PostgreSQL requires the partitioning column to be included in the primary key and in any unique constraint, which leads to this error:
ERROR: unique constraint on partitioned table must include all partitioning columns
DETAIL: PRIMARY KEY constraint on table "earthquake_events" lacks column "time" which is part of the partition key.
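For context, here is a minimal DDL sketch that reproduces the error; apart from event_id and "time", the columns are illustrative assumptions:

-- Partitioned fact table; PRIMARY KEY (event_id) alone is rejected
-- because it does not include the partition key column "time".
CREATE TABLE earthquake_events (
    event_id  bigint      NOT NULL,
    "time"    timestamptz NOT NULL,
    magnitude numeric,                -- assumed example measure column
    PRIMARY KEY (event_id)
) PARTITION BY RANGE ("time");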
If I opt to create a composite primary key including event_id and time, I face a new issue: dimension tables cannot reference just event_id in the fact table, because there is no unique constraint that matches the given keys:
ERROR: there is no unique constraint matching given keys for referenced table "earthquake_events"
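Concretely, with the composite key in place, a reference on event_id alone is rejected; the event_details table below is a hypothetical referencing table used only for illustration:

-- Composite primary key satisfies the partitioning requirement.
CREATE TABLE earthquake_events (
    event_id bigint      NOT NULL,
    "time"   timestamptz NOT NULL,
    PRIMARY KEY (event_id, "time")
) PARTITION BY RANGE ("time");

-- Referencing event_id by itself fails, because no unique constraint
-- covers that column alone.
CREATE TABLE event_details (
    event_id bigint REFERENCES earthquake_events (event_id),
    note     text
);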
This constraint would force me to create composite foreign keys that include both event_id and time in numerous dimension tables, which seems semantically incorrect and overly complex.
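The workaround I would rather avoid looks roughly like this (again, event_details is a hypothetical referencing table):

-- Every referencing table has to carry the partition column as well,
-- solely to satisfy the composite key on the fact table.
CREATE TABLE event_details (
    event_id bigint      NOT NULL,
    "time"   timestamptz NOT NULL,
    note     text,
    FOREIGN KEY (event_id, "time")
        REFERENCES earthquake_events (event_id, "time")
);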
What is the recommended approach to designing primary keys and referencing foreign keys in partitioned fact tables in a DWH using PostgreSQL? Is there a better way to maintain the semantic integrity of the primary key while also adhering to PostgreSQL's partitioning requirements?
Thank you for your insights.
The correct way to proceed in a data warehouse is not to create foreign keys at all. The data originate in a transactional database that already guarantees referential integrity, and the ETL process shouldn't mess that up. Foreign keys not only work poorly with partitioned tables; they also slow down data loading somewhat.
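A minimal sketch of that approach, reusing the table names from the question; the event_details table and its columns are assumptions:

-- Fact table: the composite primary key satisfies PostgreSQL's
-- partitioning requirement.
CREATE TABLE earthquake_events (
    event_id bigint      NOT NULL,
    "time"   timestamptz NOT NULL,
    PRIMARY KEY (event_id, "time")
) PARTITION BY RANGE ("time");

-- Related tables simply store event_id without a FOREIGN KEY clause;
-- referential integrity is enforced by the ETL process, not the database.
CREATE TABLE event_details (
    event_id bigint NOT NULL,
    note     text
);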