
How to extract text from an access log?


I am very new to this. I am trying to extract some text from my access log into a new file.
My log file is like this:

111.111.111.111 - - [02/Jul/2021:18:35:19 +0000] "GET /api/items HTTP/2.0" 304 0 "https://example.com/some/text/call-log?roomNo=5003" "Mozilla etc etc etc etc"
111.111.111.111 - - [02/Jul/2021:20:35:19 +0000] "GET /api/items HTTP/2.0" 304 0 "https://example.com/some/text/resevation-log?roomNo=4003" "Mozilla etc etc etc etc"

I want to extract it into a new file in the format below.

02/Jul/2021:18:35:19 +0000, call-log, 5003
02/Jul/2021:20:35:19 +0000, resevation-log, 4003

So far I have managed to write this basic awk command:

awk '{print $4,$5,",",$11}' < /file.log

Which gives me the below output:

[02/Jul/2021:18:35:19 +0000] , "https://example.com/some/text/call-log?roomNo=5003"

Solution

  • $ cat tst.awk
    BEGIN {
        FS="[[:space:]]*[][\"][[:space:]]*"
        OFS = ", "
    }
    {
        n = split($6,f,"[/?=]")
        print $2, f[n-2], f[n]
    }
    

    $ awk -f tst.awk file
    02/Jul/2021:18:35:19 +0000, call-log, 5003
    02/Jul/2021:20:35:19 +0000, resevation-log, 4003
    

    The above relies on the following way of splitting the input in your question into fields, which works in any POSIX awk:

    $ cat tst.awk
    BEGIN {
        FS="[[:space:]]*[][\"][[:space:]]*"
        OFS = ","
    }
    {
        print
        for (i=1; i<=NF; i++) {
            print "\t" i, "<" $i ">"
        }
        print "-----"
    }
    

    $ awk -f tst.awk file
    111.111.111.111 - - [02/Jul/2021:18:35:19 +0000] "GET /api/items HTTP/2.0" 304 0 "https://example.com/some/text/call-log?roomNo=5003" "Mozilla etc etc etc etc"
            1,<111.111.111.111 - ->
            2,<02/Jul/2021:18:35:19 +0000>
            3,<>
            4,<GET /api/items HTTP/2.0>
            5,<304 0>
            6,<https://example.com/some/text/call-log?roomNo=5003>
            7,<>
            8,<Mozilla etc etc etc etc>
            9,<>
    -----
    111.111.111.111 - - [02/Jul/2021:20:35:19 +0000] "GET /api/items HTTP/2.0" 304 0 "https://example.com/some/text/resevation-log?roomNo=4003" "Mozilla etc etc etc etc"
            1,<111.111.111.111 - ->
            2,<02/Jul/2021:20:35:19 +0000>
            3,<>
            4,<GET /api/items HTTP/2.0>
            5,<304 0>
            6,<https://example.com/some/text/resevation-log?roomNo=4003>
            7,<>
            8,<Mozilla etc etc etc etc>
            9,<>
    -----
    

    That would fail if any of your quoted fields can contain [, ], or an escaped ". None of those occur in your example, but if they can occur in your real data, include them in the example in your question.
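If the quoted fields can contain [, ], or escaped quotes, one possible alternative is to stop splitting on those characters and instead pick out the two pieces with match(). This is a different technique from the field splitting above, shown only as a minimal sketch. It assumes the timestamp is always the first [...] pair on the line, that the query parameter is always named roomNo and is numeric, and that no earlier part of the line (such as the request path) contains ?roomNo=. The file names access.log and extracted.txt are hypothetical. The redirection at the end writes the result into a new file, as the question asks.

```shell
# Recreate the two sample lines from the question (access.log is a hypothetical name).
cat > access.log <<'EOF'
111.111.111.111 - - [02/Jul/2021:18:35:19 +0000] "GET /api/items HTTP/2.0" 304 0 "https://example.com/some/text/call-log?roomNo=5003" "Mozilla etc etc etc etc"
111.111.111.111 - - [02/Jul/2021:20:35:19 +0000] "GET /api/items HTTP/2.0" 304 0 "https://example.com/some/text/resevation-log?roomNo=4003" "Mozilla etc etc etc etc"
EOF

awk '
BEGIN { OFS = ", " }
{
    # Timestamp: the text inside the first [...] pair on the line (assumption).
    if (!match($0, "[[][^]]*[]]")) next
    ts = substr($0, RSTART + 1, RLENGTH - 2)

    # Last URL path segment followed by ?roomNo=<digits> (assumed parameter name).
    if (!match($0, "[^/?\" ]+[?]roomNo=[0-9]+")) next
    n = split(substr($0, RSTART, RLENGTH), f, "[?=]")

    # f[1] is the path segment, f[n] is the room number.
    print ts, f[1], f[n]
}' access.log > extracted.txt

cat extracted.txt
```

Lines that lack either pattern are skipped silently, so it may be worth comparing the line counts of the input and output files when running this against real data.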