postgresqlpsqlcitus

Sharding by default shardId in Citus


i have Citus that i run from docker-compose, i took yaml from official site and ran it initially from 2 nodes

 docker-compose -p citus up --scale worker=2 -d

Afterwards i add the table

CREATE TABLE IF NOT EXISTS messages (
                        text TEXT NOT NULL,
                       "from" CHARACTER VARYING NOT NULL,
                       "to" CHARACTER VARYING NOT NULL,
                       "shardId" CHARACTER VARYING NOT NULL
                  );

I call then for a SELECT create_distributed_table('messages', 'shardId'); shardId is a hash value from 'from' and 'to' variables that in range from 0 to 31, so far everything is ok i see that each chat has its own shard.

But then when i tried to reshard my cluster

set POSTGRES_PASSWORD=pass && docker-compose -p citus up --scale worker=5 -d

and rebalance data through nodes

SELECT citus_rebalance_start();

I got the error that i cannot rebalance the table without a pk, so i added a pk(id) and then i got error that i cannot put create_distributed_table a value that its not being a pk or under unique constraint

  SELECT create_distributed_table('messages', 'shardId');

ERROR: cannot create constraint on "messages" DETAIL: Distributed relations cannot have UNIQUE, EXCLUDE, or PRIMARY KEY constraints that do not include the partition column (with an equality operator if EXCLUDE).

So i in dead end, what can i do so i could manipulate the key in create_distributed_table method?


Solution

  • Assuming you did something like this:
    demo at db<>fiddle

    alter table messages 
      add column id int generated always as identity
     ,add constraint pk primary key(id);
    

    Your situation resembles that of a partitioning attempt - it has the same requirement

    create table messages_partitioned(like messages including all)
      partition by list("shardId");
    

    It even words that requirement similarly:

    ERROR:  unique constraint on partitioned table must include all partitioning columns
    DETAIL:  PRIMARY KEY constraint on table "messages_partitioned" lacks column "shardId" which is part of the partition key.
    

    Since it looks like you only added id to satisfy citus rebalancing requirements, you can just swap out the primary key on it for a composite primary key - give it two fields and make sure shardId is one of them:

    alter table messages 
      drop constraint pk
     ,add constraint pk primary key(id,"shardId");
    

    Now both sharding and partitioning works fine. Here's more on that, from Chapter 5. Data Definition: 5.12. Table Partitioning, 5.12.2.3. Limitations:

    • To create a unique or primary key constraint on a partitioned table, the partition keys must not include any expressions or function calls and the constraint's columns must include all of the partition key columns. This limitation exists because the individual indexes making up the constraint can only directly enforce uniqueness within their own partitions; therefore, the partition structure itself must guarantee that there are not duplicates in different partitions.

    • Similarly an exclusion constraint must include all the partition key columns. Furthermore the constraint must compare those columns for equality (not e.g. &&). Again, this limitation stems from not being able to enforce cross-partition restrictions. The constraint may include additional columns that aren't part of the partition key, and it may compare those with any operators you like.

    There's also an entry in Citus' FAQ:

    Citus is able to enforce a primary key or uniqueness constraint only when the constrained columns contain the distribution column. In particular this means that if a single column constitutes the primary key then it has to be the distribution column as well.

    This restriction allows Citus to localize a uniqueness check to a single shard and let PostgreSQL on the worker node do the check efficiently.

    Citus also mentions the composite PK above in their doc on cluster management:

    Add a primary key to the table. If the desired key happens to be the distribution column, then it’s quite easy, just add the constraint. Otherwise, a primary key with a non-distribution column must be composite and contain the distribution column too.