My application stores performance time series data in CrateDB, and I have a few questions about getting the setup right, since it's going to be around 72M records a day and it should scale properly :). My goal is to visualize the resulting data with Grafana, and currently I have the following structure in mind:
CREATE TABLE metrics (
ts TIMESTAMP,
hostname STRING,
servicename STRING,
perfdata OBJECT(DYNAMIC)
)
// for example
{
"hostname": "localhost",
"servicename": "ping",
"timestamp": 1483699527,
"perfdata": {
"rta": {
"current": 0.5,
"unit": "ms",
"warn": 100,
"critical": 200
},
"pl": {
"current": 0,
"unit": "%",
"warn": 10,
"crit": 20
}
}
}
The important bits are the host-/servicename, the metric's name and values, and the timestamp. That leads to this alternative schema:
CREATE TABLE metrics (
ts TIMESTAMP,
hostname STRING,
servicename STRING,
metric OBJECT(DYNAMIC) AS (
unit STRING,
name STRING,
value DOUBLE
)
)
So which one would be the preferred way to store the data? Do I also need partitioning? My aggregations usually show the last 24h and rarely the last month...
Thanks!
In general, I would suggest going with the second table schema: it's simpler, it captures the raw data (instead of something pre-assembled), and it lets you do the aggregation for exactly what you need at query time.
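For example, a Grafana panel showing the last 24 hours could be driven by a query along these lines (a sketch against the second schema; the 5-minute bucketing and the millisecond interval are just illustrative, and exact date/time syntax varies between CrateDB versions):

```sql
-- average value per host and metric over the last 24 hours,
-- bucketed per minute for graphing
SELECT date_trunc('minute', ts) AS minute,
       hostname,
       metric['name'] AS metric_name,
       avg(metric['value']) AS avg_value
FROM metrics
-- CrateDB timestamps are milliseconds since epoch,
-- so 24h = 86400000 ms
WHERE ts > CURRENT_TIMESTAMP - 86400000
GROUP BY minute, hostname, metric_name
ORDER BY minute;
```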
That said, the answer depends a lot on your actual requirements. The combined schema makes updates tricky, because objects can only be replaced as a whole, not updated in place, so all of a record's metrics have to be sent together (and presumably collected at the same pace?).
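With the second schema, by contrast, each measurement is an independent row, so metrics for the same host/service can arrive at different rates without touching existing records. Roughly (using CrateDB's object literal syntax; names and values taken from the example above):

```sql
-- one row per metric; no need to assemble all metrics up front
INSERT INTO metrics (ts, hostname, servicename, metric)
VALUES (1483699527000, 'localhost', 'ping',
        {name = 'rta', value = 0.5, unit = 'ms'});

INSERT INTO metrics (ts, hostname, servicename, metric)
VALUES (1483699527000, 'localhost', 'ping',
        {name = 'pl', value = 0.0, unit = '%'});
```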
On top of that, setting up partitioning may be useful. It can speed up queries and lets you change the number of shards for future partitions, which helps you scale. Common practice is to partition by month or week, in some cases also by day. Take care, though, to keep the number of shards from exploding, since shards also consume system resources.
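A sketch of what a monthly-partitioned version of the second schema might look like (the generated `month` column and the shard count are just examples to adjust to your cluster):

```sql
-- a generated column derives the partition key from the timestamp,
-- so every month becomes its own partition with its own shards
CREATE TABLE metrics (
  ts TIMESTAMP,
  month TIMESTAMP GENERATED ALWAYS AS date_trunc('month', ts),
  hostname STRING,
  servicename STRING,
  metric OBJECT(DYNAMIC) AS (
    unit STRING,
    name STRING,
    value DOUBLE
  )
) PARTITIONED BY (month)
  CLUSTERED INTO 4 SHARDS;  -- per partition; tune to your node count
```

Queries that filter on `ts` (like the last-24h aggregations) will then only hit the relevant partitions, and old partitions can be dropped cheaply.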
CrateDB works very well with Grafana, thanks to the datasource plugin :)