I have a logfile from a service which was never rotated. Now I want to split this logfile into separate files, one for each month.
Most lines start with the unix timestamp enclosed in brackets, but there are log messages spanning multiple lines (output from dig) which need to be grabbed too.
Additionally the next line with a timestamp after a multi-line message is not necessarily from the same month. Like in the example below.
1700653509 = Wed 22 Nov 12:45:09 CET 2023
1700798246 = Fri 24 Nov 04:57:26 CET 2023
1701385200 = Fri 1 Dec 00:00:00 CET 2023
[1700653509] unbound[499:0] debug: module config: "subnetcache validator iterator"
[1700798246] unbound[1506:0] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 0
;; flags: qr aa ; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
chat.cdn.whatsapp.net. IN A
;; ANSWER SECTION:
chat.cdn.whatsapp.net. 60 IN A 157.240.252.61
;; AUTHORITY SECTION:
;; ADDITIONAL SECTION:
;; MSG SIZE rcvd: 55
[1701385200] unbound[1506:0] debug: iter_handle processing q with state QUERY RESPONSE STATE
My first approach was to define minimum and maximum values (the first and last second of a month) and check whether the timestamp in each line falls within that range. If it does, write the line to the new logfile and move on. I need this approach because not every first or last second of a month is present in the logfile.
Like this:
for YEAR in {2023..2024}; do
    for MONTH in {1..12}; do
        # Calculate first and last second of each month
        FIRST_SECOND="$(date -d "$YEAR/$MONTH/01" "+%s")"
        LAST_SECOND="$(date -d "$YEAR/$MONTH/01 + 1 month - 1 second" "+%s")"
        awk -F'[\\[\\]]' -v MIN="${FIRST_SECOND}" -v MAX="${LAST_SECOND}" '{if($2 >= MIN && $2 <= MAX) print}' unbound.log >> "unbound-$YEAR-$MONTH.log"
    done
done
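The boundary arithmetic itself can be sanity-checked for a single month (a sketch assuming GNU date; TZ is pinned to Europe/Berlin to match the CET timestamps above):

```shell
# Sketch: verify the first/last second computation for November 2023.
export TZ=Europe/Berlin
FIRST_SECOND=$(date -d "2023/11/01" +%s)
LAST_SECOND=$(date -d "2023/11/01 + 1 month - 1 second" +%s)
echo "$FIRST_SECOND $LAST_SECOND"   # 1698793200 1701385199
```

The last second is exactly one below 1701385200, the Dec 1 00:00:00 CET timestamp from the example above.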
Then I encountered the multi-line messages and hit a roadblock.
Basically what I need now is some kind of "grab all matching and non-matching lines until you hit the first value bigger than MAX". I thought of getting the first and last matching line numbers and simply using those, but then I have the same problem with the multi-line messages.
Any ideas?
EDIT: Based on the accepted answer I ended up with this. I changed the filename to unbound-YYYY-MM instead of MM-YYYY and also gzip each file after it has been closed.
awk '
$1 ~ /^\[[0-9]+]$/ {
    f = "unbound-" strftime("%Y-%m", substr($1, 2, length($1)-2)) ".log"
    if (f != prev) {
        if (prev) {
            close(prev)             # flush the file before compressing it
            system("gzip " prev)
        }
        prev = f
    }
}
{
    print > f
}
END {
    if (prev) {
        close(prev)
        system("gzip " prev)
    }
}' unbound.log
With GNU awk (for strftime):
awk '
$1 ~ /^\[[0-9]+]$/ {
    f = "unbound-" strftime("%Y-%m", substr($1, 2, length($1)-2)) ".log"
    if (f != prev) close(prev); prev = f
}
{
    print > f
}' unbound.log
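To see that the multi-line dig output lands in the right file, the script can be run against a shortened version of the sample from the question (a sketch; assumes GNU awk for strftime, with TZ pinned to match the question's CET timestamps):

```shell
export TZ=Europe/Berlin   # the sample timestamps are CET

# Shortened sample log: two November records (one with dig continuation
# lines) followed by one December record.
cat > unbound.log <<'EOF'
[1700653509] unbound[499:0] debug: module config: "subnetcache validator iterator"
[1700798246] unbound[1506:0] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 0
;; QUESTION SECTION:
chat.cdn.whatsapp.net. IN A
;; MSG SIZE rcvd: 55
[1701385200] unbound[1506:0] debug: iter_handle processing q with state QUERY RESPONSE STATE
EOF

awk '
$1 ~ /^\[[0-9]+]$/ {
    f = "unbound-" strftime("%Y-%m", substr($1, 2, length($1)-2)) ".log"
    if (f != prev) close(prev); prev = f
}
{ print > f }' unbound.log

# The dig continuation lines stay with their November record; the
# [1701385200] record opens unbound-2023-12.log.
wc -l unbound-2023-11.log unbound-2023-12.log
```

The November file should end up with five lines (both timestamped records plus the three continuation lines) and the December file with one.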
For each line whose first field is a [timestamp] (that is, matches the regexp ^\[[0-9]+]$), we use substr and length to extract the timestamp, strftime to convert it to a YYYY-mm string, and assign "unbound-YYYY-mm.log" to the variable f. The second block, which applies to all lines, prints the current line to file f. Note: contrary to shell redirections, in awk, print > FILE appends to FILE.
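That append behaviour is easy to demonstrate in isolation (a minimal sketch):

```shell
# Within one awk run, print > "file" opens the file once and keeps appending.
printf '1\n2\n3\n' | awk '{ print > "out.txt" }'
wc -l < out.txt   # 3

# A fresh awk run re-opens the file with truncation, like the shell > operator.
printf '4\n' | awk '{ print > "out.txt" }'
wc -l < out.txt   # 1
```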
Edit: as suggested by Ed Morton, closing each file when we are done with it should significantly improve performance if the total number of files is large; if (f != prev) close(prev); prev = f was added. Ed also noted that escaping the final ] in the regex is useless (and undefined behavior per POSIX), so the backslash was removed.
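The unescaped ] works because a ] outside a bracket expression has no special meaning; a quick check (a sketch, works with any POSIX awk):

```shell
# Only the line whose first field is a bracketed timestamp should match.
printf '[1700653509] foo\n;; flags: qr aa\nbar\n' \
  | awk '$1 ~ /^\[[0-9]+]$/ { print "record: " $1 }'
# record: [1700653509]
```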