I've got a python script that runs through some logs and figured it'd be instructive to do a few benchmarks against some other approaches before deploying this out. When looking at awk
, I'm hoping to minimize overhead to get a 'fair' shake at beating the somewhat optimized python variant.
My log entries look like:
--------
SomeField=SomeValue
OptionallyAppearingField=WhoKnowsWhat
AnotherField=AnotherValue
ExtraStuff=OneBonusKey=1,SecondBonusKey=2,ThirdBonusKey=3,...
--------
And I'm keen to get the value of AnotherField when one of our ThirdBonusKey
s exists and has a certain value (actually just the number 1).
The 'stupid' way here is to set our RS
to '--------'
and then just apply a regex to $0
twice, first to see if ThirdBonusKey=1
is in the record, and then to extract AnotherField=(desired_value)
.
But that seems like an unfair comparison, given it's just throwing a regex at the problem (twice!). Without a guaranteed ordering of fields to leverage awk
's cool FS skills, is there a quicker or more appropriate approach here? It's possible that the answer is just "this is not a job for awk
", and that's okay too, I guess.
Cyrus has kindly pointed out that the sketch of code I gave above is not technically code, and he's technically correct, so here's a reasonably stupid implementation:
awk 'BEGIN{RS="--------"} { if ($0 ~ /ThirdBonusKey=1/) { for(i=1;i<NF;i++) {if ($i ~ "AnotherField=") { print $i }}}}'
Given input
--------
SomeField=SomeValue
OptionallyAppearingField=WhoKnowsWhat
AnotherField=DesiredValue1
ExtraStuff=OneBonusKey=1,SecondBonusKey=2,ThirdBonusKey=1,...
--------
SomeField=SomeValue
OptionallyAppearingField=WhoKnowsWhat
AnotherField=DesiredValue2
ExtraStuff=OneBonusKey=1,SecondBonusKey=2,ThirdBonusKey=0,...
--------
SomeField=
ExtraStuff=
--------
we'd expect output
AnotherField=DesiredValue1
Most efficiently I expect:
$ awk '/^AnotherField=/{val=$0; next} /[=,]ThirdBonusKey=1(,|$)/{print val}' file
AnotherField=DesiredValue1
but more robustly and easier to enhance to do anything else you want later:
$ cat tst.awk
BEGIN { FS="[,=[:space:]]"; OFS="=" }
/^-+$/ {
if ( f["ExtraStuff_ThirdBonusKey"] == 1 ) {
print "AnotherField", f["AnotherField"]
}
delete f
next
}
{
if ( $1 == "ExtraStuff" ) {
pfx = $1
sub(/[^=]+=/,"")
f[pfx] = $0
pfx = pfx "_"
}
else {
pfx = ""
}
for (i=1; i<NF; i+=2) {
f[pfx $i] = $(i+1)
}
}
$ awk -f tst.awk file
AnotherField=DesiredValue1
That second script first stores all of the values in an array f[]
so you can access the values by their names, here's what the contents of that array look like:
$ cat tst.awk
BEGIN { FS="[,=[:space:]]"; OFS="=" }
/^-+$/ {
for (i in f) printf "> f[%s]=%s\n", i, f[i]
if ( f["ExtraStuff_ThirdBonusKey"] == 1 ) {
print "AnotherField", f["AnotherField"]
}
print "----"
delete f
next
}
{
if ( $1 == "ExtraStuff" ) {
pfx = $1
sub(/[^=]+=/,"")
f[pfx] = $0
pfx = pfx "_"
}
else {
pfx = ""
}
for (i=1; i<NF; i+=2) {
f[pfx $i] = $(i+1)
}
}
.
$ awk -f tst.awk file
----
> f[OptionallyAppearingField]=WhoKnowsWhat
> f[AnotherField]=DesiredValue1
> f[ExtraStuff_SecondBonusKey]=2
> f[ExtraStuff_ThirdBonusKey]=1
> f[ExtraStuff_OneBonusKey]=1
> f[SomeField]=SomeValue
> f[ExtraStuff]=OneBonusKey=1,SecondBonusKey=2,ThirdBonusKey=1,...
AnotherField=DesiredValue1
----
> f[OptionallyAppearingField]=WhoKnowsWhat
> f[AnotherField]=DesiredValue2
> f[ExtraStuff_SecondBonusKey]=2
> f[ExtraStuff_ThirdBonusKey]=0
> f[ExtraStuff_OneBonusKey]=1
> f[SomeField]=SomeValue
> f[ExtraStuff]=OneBonusKey=1,SecondBonusKey=2,ThirdBonusKey=0,...
----
> f[SomeField]=
> f[ExtraStuff]=
----
Given that you can create whatever conditions and/or print whatever combinations of fields you want in any input or output order.