postgresqlprimary-keyuuidbigintcitus

Primary Keys (UUID, Sequence) strategy on Citus


What is the best approach for primary keys on Citus?

UUID: No lock required, as opposed to the Identity/Serial. But expensive in storage and eventually on queries + causes fragmentations.

Sequence - Identity Causes a bottleneck while creating an entity. Less expensive in storage and queries will be faster + no fragmentations.

If we want to be scale-ready project, will it better to work with UUID?

According to this post: https://www.cybertec-postgresql.com/en/uuid-serial-or-identity-columns-for-postgresql-auto-generated-primary-keys/

For shards it is recommended to work with UUID eventually.

How well it perform on Citus?

I'll give a schema example:

User
UserId uuid/bigint?

Device
Device Id uuid/bigint?
UserId (here for the distribution key)

In the above example, we want to distribute the user data according to the UserId, for example his Devices. What should be the primary key ids types? If UUID is the answer, should we be afraid of fragmentations in the nodes?


Solution

  • Disclaimer: I am the Software Engineer working on Citus engine that opened a PR for supporting UDFs in column defaults.

    In the post you shared in the question gen_random_uuid() UDF is used as a column default. This is not yet supported on Citus.

    I suggest you use bigserial.

    I also want to note that some of the statements in the blog are not correct for Citus. For example:

    So if you use sharding, where you distribute your data across several databases, you cannot use a sequence.

    In Citus, you can create distributed sequence objects where each worker node will use a disjoint subset of the allowed values of the sequence. The application sees a single sequence on the coordinator node, while the shards have their own sequences with different start values.

    (You could use sequences defined with an INCREMENT greater than 1 and different START values, but that might lead to problems when you add additional shards.)

    Citus shifts the start of a bigserial by node_group_id * 2^48 and it means that you are limited to a maximum of 2^18 shards that is practically unlimited. You will run into bigger issues when you have petabytes of data in a single table, or a quarter of a million shards, and this limitations here will not really affect your cluster.

    PS: My work on the UDF column defaults is on hold now, as some recent code changes will make the solution cleaner, and easier to maintain. We did not yet ship the changes in a released version. https://github.com/citusdata/citus/pull/5149