Tags: linux, bash, awk, unix-text-processing

awk: split logfile, starting from 1st line matching minimum value up until last line before maximum value is exceeded


I have a logfile from a service that was never rotated. Now I want to split this logfile into separate files, one for each month. Most lines start with a Unix timestamp enclosed in brackets, but there are log messages spanning multiple lines (output from dig) which need to be grabbed too. Additionally, the next line with a timestamp after a multi-line message is not necessarily from the same month, as in the example below.

1700653509 = Wed 22 Nov 12:45:09 CET 2023
1700798246 = Fri 24 Nov 04:57:26 CET 2023
1701385200 = Fri  1 Dec 00:00:00 CET 2023
[1700653509] unbound[499:0] debug: module config: "subnetcache validator iterator"
[1700798246] unbound[1506:0] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 0
;; flags: qr aa ; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
chat.cdn.whatsapp.net.  IN      A

;; ANSWER SECTION:
chat.cdn.whatsapp.net.  60      IN      A       157.240.252.61

;; AUTHORITY SECTION:

;; ADDITIONAL SECTION:
;; MSG SIZE  rcvd: 55

[1701385200] unbound[1506:0] debug: iter_handle processing q with state QUERY RESPONSE STATE

My first approach was to define minimum and maximum values (the first and last second of a month) and check whether the timestamp in each line falls in that range. If yes, write the line to the new logfile and move on. I need this range check because not every first or last second of a month is present in the logfile.

Like this:

for YEAR in {2023..2024}; do
  for MONTH in {1..12}; do

    # Calculate first and last second of each month
    FIRST_SECOND="$(date -d "$YEAR/$MONTH/01" "+%s")"
    LAST_SECOND="$(date -d "$YEAR/$MONTH/01 + 1 month - 1 second" "+%s")"

    awk -F'[][]' -v MIN="$FIRST_SECOND" -v MAX="$LAST_SECOND" \
      '{if($2 >= MIN && $2 <= MAX) print}' unbound.log >> "unbound-$YEAR-$MONTH.log"
  done
done
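As an aside, the month-boundary arithmetic can be checked in isolation. The sketch below assumes GNU date (for the "+ 1 month - 1 second" relative syntax) and pins TZ to UTC so the epoch values are reproducible; the log above uses CET, so the real script would run with the local timezone instead:

```shell
#!/bin/sh
# First and last second of November 2023 (GNU date, UTC for reproducibility).
export TZ=UTC
FIRST_SECOND="$(date -d "2023/11/01" "+%s")"
LAST_SECOND="$(date -d "2023/11/01 + 1 month - 1 second" "+%s")"
echo "$FIRST_SECOND $LAST_SECOND"   # 1698796800 1701388799
```

The "+ 1 month - 1 second" trick avoids hard-coding month lengths and handles leap years for free.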

Then I encountered the multi-line messages and hit a roadblock.

Basically what I need now is some kind of "grab all matching and non-matching lines until you hit the first value bigger than MAX". I thought of getting the first and last matching line numbers and simply using those, but then I have the same problem with the multi-line messages.

Any ideas?

EDIT: Based on the accepted answer I ended up with this. I changed the filename to unbound-YYYY-MM instead of MM-YYYY and also gzip each file after it has been closed.

awk '
$1 ~ /^\[[0-9]+]$/ {
  f = "unbound-" strftime("%Y-%m", substr($1, 2, length($1)-2)) ".log"
  if (f != prev) {
    if (prev) {
      close(prev)
      system("gzip " prev)
    }
    prev = f
  }
}
{
  print > f
}
END {
  if (prev) {
    close(prev)
    system("gzip " prev)
  }
}' unbound.log

Solution

  • With GNU awk (for strftime):

    awk '
    $1 ~ /^\[[0-9]+]$/ {
      f = "unbound-" strftime("%Y-%m", substr($1, 2, length($1)-2)) ".log"
      if (f != prev) { close(prev); prev = f }
    }
    {
      print > f
    }' unbound.log
    
    

    For each line whose first field is a [timestamp] (that is, matches the regexp ^\[[0-9]+]$), we use substr and length to extract the timestamp, strftime to convert it to a YYYY-mm string, and assign "unbound-YYYY-mm.log" to the variable f. The second block applies to all lines and prints the current line to file f. Note: contrary to shell redirections, awk's print > FILE truncates FILE only when it is first opened and appends on every subsequent print within the same invocation.

    Edit: as suggested by Ed Morton, closing each file when we are done with it should significantly improve performance if the total number of files is large; if (f != prev) { close(prev); prev = f } added. Ed also noted that escaping the final ] in the regex is useless (and undefined behavior per POSIX), so that backslash was removed.
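That append behaviour is easy to verify with a one-liner (file name out.txt is arbitrary): both prints go through the same open file handle, so the second one does not clobber the first, unlike two successive shell redirections with >.

```shell
#!/bin/sh
# In awk, ">" truncates the file only when first opened within the program;
# later prints to the same name append because the handle stays open.
awk 'BEGIN { print "one" > "out.txt"; print "two" > "out.txt" }'
cat out.txt   # both lines are present; a shell loop with ">" would keep only "two"
```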