databasepartitioncrate

Set the right partitions for Crate Database


I am modelling for the Database CrateDB.
I have an avg. of 400 customers and the produce different amounts of time-series data every day. (Between 5K and 500K; avg. ~15K)

Later I should be able to query per customer_year_month and per customer_year_calendar_week. That means that I will only query for the intervals:

Now I'am asking myself how to partition this table?

I would partion per customer and year.
Does this make sense?
Or would it be better to partion by customer, year and month?


Solution

  • so the question of partitioning a table is quite complex and should consider a lot of things. Among others:

    Essentially, each partition also creates overhead by multiplying the shard count (a partition can be considered a "sub-table" based on a column value), which - if chosen improperly - can hinder performance a lot. So in your case 15k inserts a day is not too much, however the distribution of inserts might cause problems, a customer's partition that grows with 500k inserts a day will run into performance problems earlier than the 5k person. As a consequence I would use weekly partitioning only.

    create table "customer-logging" (
        customer_id long, 
        log string, 
        ts timestamp, 
        week as date_trunc('week', ts)
    ) partitioned by (week) into 8 shards
    

    Please only use 8 shards if you have an appropriate amount of CPU cores ;)

    Docs: date_trunc(), partitioned tables

    Ideally you try out a few different combinations and find what works best for you. Insights into shard sizes and locations are provided by our sys tables, so you can see if there's a particularly fat shard that overloads a node ;)

    Cheers, Claus