I have a program that writes to a text file during its operation. Each line in this file consists of four parts.
Example log lines look like this:
127894 2020-07-30 22:04:30.234124 foobar caught an unknown exception
127895 2020-07-30 22:05:30.424134 foobar clearing the programs cache
127896 2020-07-30 22:06:30.424134 foobar recalibrating dankness
The logs are printed in chronological order, and I would like to know how to find the periods with the highest logging frequency. For example, I want to know at what minute or second of the day the program has the highest congestion.
Ideally I'd like output that tells me, for example: "The highest logging frequency is between 22:04:00 and 22:05:00, with 10 log lines printed in this timeframe."
Let's consider this test file:
$ cat file.log
127894 2020-07-30 22:04:30.234124 foobar caught an unknown exception
127895 2020-07-30 22:05:20.424134 foobar clearing the programs cache
127895 2020-07-30 22:05:30.424134 foobar clearing the programs cache
127895 2020-07-30 22:05:40.424134 foobar clearing the programs cache
127896 2020-07-30 22:06:30.424134 foobar recalibrating dankness
127896 2020-07-30 22:06:40.424134 foobar recalibrating dankness
To get the most congested minutes, ranked in order:
$ awk '{sub(/:[^:]*$/, "", $3); a[$2" "$3]++} END{for (d in a)print a[d], d}' file.log | sort -nr
3 2020-07-30 22:05
2 2020-07-30 22:06
1 2020-07-30 22:04
22:05 appeared three times in the log file and is, thus, the most congested, followed by 22:06.
To get only the most congested minutes, add head. For example:
$ awk '{sub(/:[^:]*$/, "", $3); a[$2" "$3]++} END{for (d in a)print a[d], d}' file.log | sort -nr | head -1
3 2020-07-30 22:05
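If you want the sentence-style report from the question, one option is to pipe that single line through a second awk to format it; the exact wording below is just an illustration:
$ awk '{sub(/:[^:]*$/, "", $3); a[$2" "$3]++} END{for (d in a)print a[d], d}' file.log |
    sort -nr | head -1 |
    awk '{printf "The most congested minute starts at %s %s:00, with %d log lines printed in that timeframe.\n", $2, $3, $1}'
The most congested minute starts at 2020-07-30 22:05:00, with 3 log lines printed in that timeframe.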
Note that we select here based on the second and third fields, so the presence of dates or times in the text of log messages will not confuse this code.
sub(/:[^:]*$/, "", $3) removes everything after the minutes in the third field.
a[$2" "$3]++
counts the number of times that date and time (up to minutes) appeared.
After the whole file has been read, for (d in a) print a[d], d prints out the count and timestamp for every timestamp observed. (awk does not guarantee any particular order for for (d in a), which is why we sort next.)
sort -nr sorts the output numerically with the highest count at the top. (Alternatively, we could have awk do the sorting, but sort -nr is simple and portable.)
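For the curious, here is a sketch of that awk-only alternative. It assumes GNU awk (gawk 4.0 or later), where PROCINFO["sorted_in"] controls the traversal order of for (d in a); it will not work in other awk implementations:
$ gawk '{sub(/:[^:]*$/, "", $3); a[$2" "$3]++}
       END{PROCINFO["sorted_in"] = "@val_num_desc"; for (d in a) print a[d], d}' file.log
3 2020-07-30 22:05
2 2020-07-30 22:06
1 2020-07-30 22:04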
Instead of minute resolution, we can get second resolution by stripping the fractional part of the timestamp instead:
$ awk '{sub(/\.[^.]*$/, "", $3); a[$2" "$3]++} END{for (d in a)print a[d], d}' file.log | sort -nr
1 2020-07-30 22:06:40
1 2020-07-30 22:06:30
1 2020-07-30 22:05:40
1 2020-07-30 22:05:30
1 2020-07-30 22:05:20
1 2020-07-30 22:04:30
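Going the other direction, hour resolution is the same command with a different regex; stripping from the first colon onward leaves only the hour:
$ awk '{sub(/:.*/, "", $3); a[$2" "$3]++} END{for (d in a)print a[d], d}' file.log | sort -nr
6 2020-07-30 22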