parsing awk logparser

Performant comparisons in awk?


I've got a Python script that runs through some logs, and I figured it would be instructive to benchmark it against a few other approaches before deploying it. For awk, I want to minimize overhead so that it gets a 'fair' shake at beating the somewhat optimized Python variant.

My log entries look like:

--------
SomeField=SomeValue
OptionallyAppearingField=WhoKnowsWhat
AnotherField=AnotherValue
ExtraStuff=OneBonusKey=1,SecondBonusKey=2,ThirdBonusKey=3,...
--------

I want the value of AnotherField whenever ThirdBonusKey exists in the record and has a certain value (in practice, just the number 1).

The 'stupid' way here is to set our RS to '--------' and then just apply a regex to $0 twice, first to see if ThirdBonusKey=1 is in the record, and then to extract AnotherField=(desired_value).

But that seems like an unfair comparison, given it's just throwing a regex at the problem (twice!). Without a guaranteed ordering of fields to leverage awk's cool FS skills, is there a quicker or more appropriate approach here? It's possible that the answer is just "this is not a job for awk", and that's okay too, I guess.

Cyrus has kindly pointed out that the sketch of code I gave above is not technically code, and he's technically correct, so here's a reasonably stupid implementation:

awk 'BEGIN{RS="--------"} { if ($0 ~ /ThirdBonusKey=1/) { for(i=1;i<=NF;i++) {if ($i ~ /^AnotherField=/) { print $i }}}}'

Given input

--------
SomeField=SomeValue
OptionallyAppearingField=WhoKnowsWhat
AnotherField=DesiredValue1
ExtraStuff=OneBonusKey=1,SecondBonusKey=2,ThirdBonusKey=1,...
--------
SomeField=SomeValue
OptionallyAppearingField=WhoKnowsWhat
AnotherField=DesiredValue2
ExtraStuff=OneBonusKey=1,SecondBonusKey=2,ThirdBonusKey=0,...
--------
SomeField=
ExtraStuff=
--------

we'd expect output

AnotherField=DesiredValue1

Solution

  • Most efficiently, I expect (this relies on AnotherField appearing before ExtraStuff within each record):

    $ awk '/^AnotherField=/{val=$0; next} /[=,]ThirdBonusKey=1(,|$)/{print val}' file
    AnotherField=DesiredValue1
    

    but this is more robust and easier to extend to whatever else you want later:

    $ cat tst.awk
    BEGIN { FS="[,=[:space:]]"; OFS="=" }
    /^-+$/ {
        if ( f["ExtraStuff_ThirdBonusKey"] == 1 ) {
            print "AnotherField", f["AnotherField"]
        }
        delete f
        next
    }
    {
        if ( $1 == "ExtraStuff" ) {
            pfx = $1
            sub(/[^=]+=/,"")
            f[pfx] = $0
            pfx = pfx "_"
        }
        else {
            pfx = ""
        }
        for (i=1; i<NF; i+=2) {
            f[pfx $i] = $(i+1)
        }
    }
    
    $ awk -f tst.awk file
    AnotherField=DesiredValue1
    

    That second script first stores all of the values in an array f[] so you can access them by name. Here's what the contents of that array look like:

    $ cat tst.awk
    BEGIN { FS="[,=[:space:]]"; OFS="=" }
    /^-+$/ {
        for (i in f) printf "> f[%s]=%s\n", i, f[i]
        if ( f["ExtraStuff_ThirdBonusKey"] == 1 ) {
            print "AnotherField", f["AnotherField"]
        }
        print "----"
        delete f
        next
    }
    {
        if ( $1 == "ExtraStuff" ) {
            pfx = $1
            sub(/[^=]+=/,"")
            f[pfx] = $0
            pfx = pfx "_"
        }
        else {
            pfx = ""
        }
        for (i=1; i<NF; i+=2) {
            f[pfx $i] = $(i+1)
        }
    }
    

    $ awk -f tst.awk file
    ----
    > f[OptionallyAppearingField]=WhoKnowsWhat
    > f[AnotherField]=DesiredValue1
    > f[ExtraStuff_SecondBonusKey]=2
    > f[ExtraStuff_ThirdBonusKey]=1
    > f[ExtraStuff_OneBonusKey]=1
    > f[SomeField]=SomeValue
    > f[ExtraStuff]=OneBonusKey=1,SecondBonusKey=2,ThirdBonusKey=1,...
    AnotherField=DesiredValue1
    ----
    > f[OptionallyAppearingField]=WhoKnowsWhat
    > f[AnotherField]=DesiredValue2
    > f[ExtraStuff_SecondBonusKey]=2
    > f[ExtraStuff_ThirdBonusKey]=0
    > f[ExtraStuff_OneBonusKey]=1
    > f[SomeField]=SomeValue
    > f[ExtraStuff]=OneBonusKey=1,SecondBonusKey=2,ThirdBonusKey=0,...
    ----
    > f[SomeField]=
    > f[ExtraStuff]=
    ----
    

    Given that array, you can test whatever conditions and print whatever combinations of fields you want, regardless of the order the fields arrive in and in any output order.
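
As a rough sketch of that flexibility, the snippet below reuses the parsing logic from tst.awk unchanged and only swaps the test in the separator block. Everything specific to it is made up for illustration: the temporary sample file, the condition (ThirdBonusKey equal to 0 and SecondBonusKey equal to 2), the choice of printed fields, and the second record being reordered so that ExtraStuff comes before AnotherField. It is one possible variation, not the only way to do this.

```shell
# Hypothetical sample log; the second record deliberately lists ExtraStuff first.
log=$(mktemp)
cat > "$log" <<'EOF'
--------
SomeField=SomeValue
OptionallyAppearingField=WhoKnowsWhat
AnotherField=DesiredValue1
ExtraStuff=OneBonusKey=1,SecondBonusKey=2,ThirdBonusKey=1,...
--------
ExtraStuff=OneBonusKey=1,SecondBonusKey=2,ThirdBonusKey=0,...
SomeField=SomeValue
AnotherField=DesiredValue2
--------
SomeField=
ExtraStuff=
--------
EOF

result=$(awk '
BEGIN { FS="[,=[:space:]]" }
/^-+$/ {
    # Hypothetical condition: key present, ThirdBonusKey is 0, SecondBonusKey is 2.
    if ( ("ExtraStuff_ThirdBonusKey" in f) &&
         f["ExtraStuff_ThirdBonusKey"] == 0 &&
         f["ExtraStuff_SecondBonusKey"] == 2 ) {
        # Fields are printed in an order unrelated to their input order.
        printf "AnotherField=%s SomeField=%s\n", f["AnotherField"], f["SomeField"]
    }
    delete f
    next
}
{
    # Same name/value storage as tst.awk.
    if ( $1 == "ExtraStuff" ) {
        pfx = $1
        sub(/[^=]+=/,"")
        f[pfx] = $0
        pfx = pfx "_"
    }
    else {
        pfx = ""
    }
    for (i=1; i<NF; i+=2) {
        f[pfx $i] = $(i+1)
    }
}
' "$log")

printf '%s\n' "$result"
rm -f "$log"
```

With this sample, only the second record satisfies the condition, so the output is the single line `AnotherField=DesiredValue2 SomeField=SomeValue`. The `in` test is there because an unset array element compares equal to 0 in awk, so a bare `== 0` check on ThirdBonusKey alone would also be true for records that lack the key.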