My use case is the following: I have continuously produced time-series data, plus one year of history. I want to index it into Elasticsearch in such a way that data is deleted after one year (according to the @timestamp field).
Data streams seem to be the perfect solution for the newly produced time-series data: documents get indexed as soon as they are created, and ILM will delete the associated backing indices at the right moment, one year later.
However, I'm stuck with the historical data. How can I index it so that it is deleted at the right time? Since rollover is based on the index age and not on the documents' @timestamp field, all the associated backing indices will also be deleted in one year, even if they contain older data. In my use case, this typically means that the oldest historical data would remain in the cluster for two years, which is not the expected behaviour.
Do you have any ideas on how to fix this?
You can override this behavior by providing your own index.lifecycle.origination_date:
If specified, this is the timestamp used to calculate the index age for its phase transitions. Use this setting if you create a new index that contains old data and want to use the original creation date to calculate the index age. Specified as a Unix epoch value in milliseconds.
So you can index your old data into your data stream and then, for each backing index, set the timestamp that corresponds to the date the index would have been created if that historical data had been indexed back then.
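Note that a data stream only accepts create operations, so the historical documents have to be appended with the bulk create action (or op_type=create). A minimal sketch, assuming a hypothetical data stream named my-data-stream:

POST my-data-stream/_bulk
{ "create": {} }
{ "@timestamp": "2020-03-15T12:00:00Z", "message": "some historical event" }

If you index the history in chronological batches and call POST my-data-stream/_rollover between batches, each backing index will cover a bounded time range. Then set the origination date on each backing index. The value must be a Unix epoch in milliseconds, as the documentation above states (1577836800000 is 2020-01-01T00:00:00Z):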
PUT .ds-index-xxx/_settings
{
  "index.lifecycle.origination_date": 1577836800000
}
You can find the max @timestamp to use for each backing index with a terms aggregation on _index, run against the data stream:
POST my-data-stream/_search
{
  "size": 0,
  "aggs": {
    "index": {
      "terms": {
        "field": "_index"
      },
      "aggs": {
        "date": {
          "max": {
            "field": "@timestamp"
          }
        }
      }
    }
  }
}
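Each bucket's date.value in the response is the max @timestamp of that backing index, already expressed in epoch milliseconds, so it can be plugged straight into the setting. For illustration only, with a hypothetical backing index name:

PUT .ds-my-data-stream-2020.01.01-000001/_settings
{
  "index.lifecycle.origination_date": 1577836800000
}

This way, each backing index enters the delete phase one year after its newest document, rather than one year after the date it was actually created.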