postgresqlcitus

measure tpc-ds benchmark on citus


I'm trying to do some measurement on citus (extension of postgres). for that mission I'm running tpc-ds query on citus. the citus that I'm using is a containers of master, workers and manager that taken from here: https://github.com/citusdata/docker I can add workers by adding their containers. so far so good but I'm having troubles by doing the measurement and need some answers:

  1. to use all worker I need to run select_distributed_table/select_reference _table. is that copy the all data to all workers (for example 1TB of data became 16 TB for 16 workers)?
  2. if I not using select_distributed_table but adding worker is there any benefit to that action?
  3. If I already run select_distributed_table and later added worker, do it get the data distributed or I need to run again select_distributed_table?

Solution

    1. to use all worker I need to run select_distributed_table/select_reference _table. is that copy the all data to all workers (for example 1TB of data became 16 TB for 16 workers)?

    Reference tables are copied across the whole cluster and distributed tables are sharded across the worker nodes.

    If you ran the following queries on a Citus cluster with 16 workers for tables with 16 GB of data

    SELECT create_reference_table('ref_table');
    SELECT create_distributed_table('dist_table','partition_column_name');
    

    Then you each worker node will have a total of ~1 GB of the data in dist_table and the whole 16 GB of ref_table.

    1. if I not using select_distributed_table but adding worker is there any benefit to that action?

    If you do not do a rebalance operation, or move shards to the new nodes manually, adding new nodes does not usually help you. The new nodes will contain the all the distributed objects in the cluster (users, functions, schemas, types, etc.) and copies of reference tables. The only queries that will hit these new worker nodes will be the ones that access only reference tables.

    1. If I already run select_distributed_table and later added worker, do it get the data distributed or I need to run again select_distributed_table?

    If you run SELECT create_distributed_table('events','id') than you will create shards on the current worker nodes. If you add some new nodes later on, you will not see any shards for events table unless you do a rebalancing.

    However, if you run SELECT create_reference_table('customers') than you will see the copies of all the data in customers in all the nodes in the cluster.