How to apply a new pipeline to already ingested logs


I have created a Datadog pipeline with a single grok parser. The purpose of the pipeline is to parse logs that are not emitted in JSON format.

Rule1 %{date("yyyy-MM-dd'T'HH:mm:ss.SSSSSSZZ"):date}\s+%{notSpace:token_1}\s+tomcataccesslogs\s+%{ipv4:network.client.ip}\s+-\s+-\s+\[%{date("dd/MMM/yyyy:HH:mm:ss Z"):date_1}\]\s+"%{word:http.method}\s+%{notSpace:http.url}\s+HTTP%{notSpace:http.url_1}"\s+%{integer:http.status_code}\s+%{integer:token_2}\s+ms\s+-
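
For reference, here is a hypothetical log line (all values made up) that this rule matches:

2023-05-04T12:01:02.123456+00:00 some-token tomcataccesslogs 192.0.2.10 - - [04/May/2023:12:01:02 +0000] "GET /health HTTP/1.1" 200 12 ms -

From this line it extracts http.method=GET, http.url=/health, http.status_code=200, network.client.ip=192.0.2.10, and so on.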

Logs created after I created the pipeline get parsed correctly.

However, logs created before the pipeline existed do not get parsed.

How do I get the pipeline to parse the historical logs?


Solution

  • Not possible

    The short answer is: you can't.

    Log pipelines run at ingest; logs that have already been ingested are read-only and have no way to go through this process again. See the summary below, taken from the log management overview:

    [Image: log ingestion overview]

    Log pipelines are part of the 'parse and enrich' step; logs you can query via the UI are read-only and part of an index.

  • But you can rehydrate

    If you absolutely need to do this, there is a way: rehydrating logs from your archives.

    Assuming these logs have been archived, they can be rehydrated from the archive; that is, they can be ingested again.

    This won't do anything to the already-indexed logs; be aware that it will result in both the original and rehydrated logs existing at the same time, until each hits its own retention limit and ages out.

    Fair warning, though: rehydrating is not free, and for those unfamiliar with the process it is an easy way to burn money. The cost to rehydrate is primarily the cost to scan the archive, not the number of logs you want to index or how long you want to index them for. At the time of writing, rehydration costs $0.10 per GB scanned, so scanning 1 TB of archives will cost you $100 whether you find any logs or not :).
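
    As a back-of-the-envelope check, here is a minimal Python sketch of that cost model, assuming the $0.10-per-GB-scanned figure quoted above (verify against current Datadog pricing before relying on it):

        # Rehydration cost model: you pay for the volume of archive data
        # scanned, not for the number of logs that end up being indexed.
        COST_PER_GB_SCANNED = 0.10  # USD; figure quoted above, may change

        def rehydration_cost_usd(scanned_gb: float) -> float:
            return scanned_gb * COST_PER_GB_SCANNED

        # Scanning 1 TB (1000 GB) of archives costs $100, whether or not
        # any logs match the rehydration query.
        print(rehydration_cost_usd(1000))  # 100.0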