linux bash grep zgrep

How to get frequency of logging using bash if each line contains a timestamp?


I have a program that writes to a text file during its operation. Each line in this text file consists of five parts:

  1. Thread ID (a number)
  2. A date in the format yyyy-mm-dd
  3. A timestamp in the format 12:34:56.123456
  4. A function name
  5. Some useful comments printed out by the program

Here is an example of what a few log lines look like:

127894 2020-07-30 22:04:30.234124 foobar caught an unknown exception
127895 2020-07-30 22:05:30.424134 foobar clearing the programs cache
127896 2020-07-30 22:06:30.424134 foobar recalibrating dankness

The logs are printed in chronological order, and I would like to know how to find the highest logging frequency. For example, I want to know at which minute or second of the day the program has the highest congestion.

Ideally I'd like output that could tell me, for example: "The highest logging frequency is between 22:04:00 and 22:05:00, with 10 log lines printed in this timeframe."


Solution

  • Let's consider this test file:

    $ cat file.log 
    127894 2020-07-30 22:04:30.234124 foobar caught an unknown exception
    127895 2020-07-30 22:05:20.424134 foobar clearing the programs cache
    127895 2020-07-30 22:05:30.424134 foobar clearing the programs cache
    127895 2020-07-30 22:05:40.424134 foobar clearing the programs cache
    127896 2020-07-30 22:06:30.424134 foobar recalibrating dankness
    127896 2020-07-30 22:06:40.424134 foobar recalibrating dankness
    

    To get the most congested minutes, ranked in order:

    $ awk '{sub(/:[^:]*$/, "", $3); a[$2" "$3]++} END{for (d in a)print a[d], d}' file.log | sort -nr
    3 2020-07-30 22:05
    2 2020-07-30 22:06
    1 2020-07-30 22:04
    

    22:05 appeared three times in the log file and is thus the most congested minute, followed by 22:06.

    To get only the single most congested minute, add head. For example:

    $ awk '{sub(/:[^:]*$/, "", $3); a[$2" "$3]++} END{for (d in a)print a[d], d}' file.log | sort -nr | head -1
    3 2020-07-30 22:05
    

    Note that we select here based only on the second and third fields. The presence of dates or times in the text of log messages will not confuse this code.
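
    As a quick sanity check, here is a made-up log line whose message itself contains a date and a time. Because the counting key is built only from fields two and three, the extra timestamp in the message text is ignored:

    $ printf '%s\n' '127897 2020-07-30 22:04:35.000000 foobar retrying job from 2020-01-01 00:00:00' |
          awk '{sub(/:[^:]*$/, "", $3); a[$2" "$3]++} END{for (d in a)print a[d], d}'
    1 2020-07-30 22:04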

    How it works

    sub(/:[^:]*$/, "", $3) removes everything after minutes in the third field.

    a[$2" "$3]++ counts the number of times that date and time (up to minutes) appeared.

    After the whole file has been read, for (d in a)print a[d], d prints out the count and date for every date observed.

    sort -nr sorts the output numerically with the highest count at the top. (Alternatively, we could have awk do the sorting, but sort -nr is simple and portable.)
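
    For the record, here is a sketch of that awk-internal sorting alternative. It relies on GNU awk's PROCINFO["sorted_in"] extension, so unlike sort -nr it is not portable across awk implementations:

    $ gawk '{sub(/:[^:]*$/, "", $3); a[$2" "$3]++}
           END{PROCINFO["sorted_in"] = "@val_num_desc"; for (d in a) print a[d], d}' file.log
    3 2020-07-30 22:05
    2 2020-07-30 22:06
    1 2020-07-30 22:04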

    To count down to the second

    Instead of minute resolution, we can count at second resolution:

    $ awk '{sub(/\.[^.]*$/, "", $3); a[$2" "$3]++} END{for (d in a)print a[d], d}' file.log | sort -nr
    1 2020-07-30 22:06:40
    1 2020-07-30 22:06:30
    1 2020-07-30 22:05:40
    1 2020-07-30 22:05:30
    1 2020-07-30 22:05:20
    1 2020-07-30 22:04:30
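
    Putting it together

    Finally, to turn the top result into the kind of sentence the question asks for, here is one possible sketch. It assumes GNU date (used to compute the end of the busiest minute via an epoch timestamp); the variable names are arbitrary:

    $ top=$(awk '{sub(/:[^:]*$/, "", $3); a[$2" "$3]++} END{for (d in a)print a[d], d}' file.log | sort -nr | head -1)
    $ read count day minute <<< "$top"
    $ end=$(date -d "@$(( $(date -d "$day $minute:00" +%s) + 60 ))" '+%H:%M:%S')
    $ printf 'The highest logging frequency is between %s:00 and %s with %s log lines printed in this timeframe.\n' "$minute" "$end" "$count"
    The highest logging frequency is between 22:05:00 and 22:06:00 with 3 log lines printed in this timeframe.

    Here date -d "@N" formats an epoch timestamp, so adding 60 seconds yields the end of the busiest minute without any string arithmetic on the time itself.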