[SOLVED] Archival strategy for Apache Druid: keeping only the latest data on Historicals

I want to configure Apache Druid to keep only the latest 3 months of data on Historical nodes, where it will be queryable. Deep storage (S3) can continue to keep all data forever.

For data older than 3 months, I plan to have a separate low-cost system that queries S3 to serve the data.

I couldn't find any configuration that lets me restrict Historical nodes to storing only the latest data, up to a certain time period.

Please let me know if this is possible. If not, what archival strategy can we use for a production setup with Apache Druid?

Solution

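Which segments are prefetched onto Historical services is configured per datasource, using that datasource's retention rules (load and drop rules). A load rule covering the last 3 months, followed by a rule that catches everything older, keeps only recent data on Historicals. The segment files themselves stay in deep storage unless a kill task deletes them.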

https://druid.apache.org/docs/latest/operations/rule-configuration
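As a rough sketch of what such a rule set could look like for the 3-month case in the question, the Python below builds the rule list and a request for the Coordinator's rules endpoint (`/druid/coordinator/v1/rules/{dataSource}`). The helper names, the `wikipedia` datasource, the Coordinator address and the replica count are made up for illustration. Rules are evaluated top to bottom, so the catch-all goes last. The `queryable_from_deep_storage` option reflects my reading of the deep storage docs, where a load rule with empty `tieredReplicants` and `useDefaultTieredReplicants` set to false keeps segments known to Druid without loading them; please check it against your Druid version.

```python
import json
import urllib.request


def build_retention_rules(hot_period="P3M", replicas=2, tier="_default_tier",
                          queryable_from_deep_storage=False):
    """Build retention rules; Druid applies the first rule that matches a segment."""
    rules = [
        {
            # Keep segments from the most recent `hot_period` on Historicals.
            "type": "loadByPeriod",
            "period": hot_period,
            "includeFuture": True,
            "tieredReplicants": {tier: replicas},
        }
    ]
    if queryable_from_deep_storage:
        # Assumption: zero replicas keeps older segments "used" but unloaded,
        # so they can still be queried from deep storage.
        rules.append({
            "type": "loadForever",
            "tieredReplicants": {},
            "useDefaultTieredReplicants": False,
        })
    else:
        # Older segments are marked unused; the files remain in deep storage.
        rules.append({"type": "dropForever"})
    return rules


def build_rules_request(coordinator_url, datasource, rules):
    """Prepare (but do not send) the POST that sets rules for one datasource."""
    url = f"{coordinator_url.rstrip('/')}/druid/coordinator/v1/rules/{datasource}"
    return urllib.request.Request(
        url,
        data=json.dumps(rules).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical usage; uncomment the last line to actually send the request.
request = build_rules_request("http://localhost:8081", "wikipedia",
                              build_retention_rules())
# urllib.request.urlopen(request)
```

With `dropForever`, data older than 3 months is no longer visible to Druid queries, which fits the plan of reading S3 from a separate system. If you would rather keep using Druid for the old data, the zero-replica variant is the one to look at.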

You can also query data in deep storage directly, without prefetching it onto Historicals, using the multi-stage query (MSQ) task engine.

https://druid.apache.org/docs/latest/querying/query-deep-storage
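To give an idea of the shape of such a query, here is a minimal Python sketch that prepares an asynchronous request against the SQL statements endpoint (`/druid/v2/sql/statements`) with `executionMode` set to `ASYNC`, plus the URLs you would poll afterwards. The helper names, the Router address and the `wikipedia` datasource are assumptions for illustration, and nothing is sent unless you uncomment the final lines.

```python
import json
import urllib.request


def build_deep_storage_query(router_url, sql):
    """Prepare (but do not send) an async query that can read from deep storage."""
    url = f"{router_url.rstrip('/')}/druid/v2/sql/statements"
    body = {"query": sql, "context": {"executionMode": "ASYNC"}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def status_url(router_url, query_id):
    """URL to poll for the state of a submitted statement."""
    return f"{router_url.rstrip('/')}/druid/v2/sql/statements/{query_id}"


def results_url(router_url, query_id):
    """URL to fetch results from once the statement has succeeded."""
    return f"{status_url(router_url, query_id)}/results"


# Hypothetical usage against a local Router.
request = build_deep_storage_query(
    "http://localhost:8888",
    "SELECT COUNT(*) FROM wikipedia "
    "WHERE __time < CURRENT_TIMESTAMP - INTERVAL '3' MONTH",
)
# with urllib.request.urlopen(request) as response:
#     query_id = json.load(response)["queryId"]
```

The query runs as a task rather than on Historicals, so expect higher latency than a normal interactive query. Poll the status URL until the statement finishes, then read the results URL.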

For a run-through of this, check out the learn-druid Python notebooks on the topic.

https://github.com/implydata/learn-druid/blob/main/notebooks/03-query/14-sync-async-queries.ipynb