I've been trying to create a PCRE regex (for grep -P) that returns matches where the content inside a match doesn't match the expected format.
The use case is grepping a directory of Markdown files, finding all (geo:) links, and returning any where the coordinate link does not match the expected format. But I can't figure out how to do the not match inside a match...
This regex works for the expected format:
[-+]?([1-8]?\d(\.\d+)?|90(\.0+)?),[-+]?(180(\.0+)?|((1[0-7]\d)|([1-9]?\d))(\.\d+)?)\)
And finding the (geo: string is easy.
But how can I do the not match inside the match?
Examples of good format (that I don't want to return):
(https://url.com/something) - Only looking for (geo:)
(geo:12.34567,-1234567)
(geo:50.5,25.5)
Invalid formats that I want to find:
(geo: 50.5,25.5) - Space after :
(geo:Ooops typed text) - Text instead of coordinates
(geo:12.34567 -12.34567) - Space instead of comma
(geo:12.34567, -12.34567) - Space after comma
I've tried variations of negative lookaheads on the desired matching format, but I am not sure if that is a valid approach, or if it's even going to work.
(?!(?:([-+]?([1-8]?\d(\.\d+)?|90(\.0+)?),[-+]?(180(\.0+)?|((1[0-7]\d)|([1-9]?\d))(\.\d+)?))*?\)))
This question and answer got me close, but I couldn't quite adapt it: "Regex to match local markdown links".
Suppose you have these files:
$ head case[0-9].txt
==> case1.txt <==
Random text.
Something else
Blah bla (https://url.com/something) GOOD, Only looking for ( geo:) blah
Random text.
Something else
==> case2.txt <==
Random text.
Something else
Blah bla (geo:12.34567,-1234567) GOOD PATTERN blah
Random text.
Something else
==> case3.txt <==
Random text.
Something else
Blah bla (geo:50.5,25.5) GOOD PATTERN blah
Random text.
Something else
==> case4.txt <==
Random text.
Something else
Blah bla (geo: 50.5,25.5) BAD - Space after : blah
Random text.
Something else
==> case5.txt <==
Random text.
Something else
Blah bla (geo:Ooops typed text) BAD - Text instead of coordinates blah
Random text.
Something else
==> case6.txt <==
Random text.
Something else
Blah bla (geo:12.34567 -12.34567) BAD - Space instead of comma blah
Random text.
Something else
==> case7.txt <==
Random text.
Something else
Blah bla (geo:12.34567, -12.34567) BAD - Space after comma blah
Random text.
Something else
(So case4.txt - case7.txt are 'bad')
Rather than come up with some wizard-level single PCRE, it is usually easier to filter out / validate what you WANT; anything left over is then what you DON'T WANT.
Here is a Ruby script that demonstrates the concept:
ruby -e 'ARGV.each{|f|
  paren_lines=File.open(f).each_line.
    select{|line| line=~/\(|\)/}.   # select only if it has a parenthesis
    reject{|line|                   # reject lines that are OK
      line=~/\(geo:[+-]?\d+\.?\d*,[+-]?\d+\.?\d*\)/ || line=~/\((?!geo:)/
    }
  puts "#{f} errors:\n\t#{paren_lines.join("\t")}\n" if paren_lines.length>0
}' case[0-9].txt
Prints:
case4.txt errors:
Blah bla (geo: 50.5,25.5) BAD - Space after : blah
case5.txt errors:
Blah bla (geo:Ooops typed text) BAD - Text instead of coordinates blah
case6.txt errors:
Blah bla (geo:12.34567 -12.34567) BAD - Space instead of comma blah
case7.txt errors:
Blah bla (geo:12.34567, -12.34567) BAD - Space after comma blah
You could do the same in Bash by testing several greps, but why? It is so much easier with awk / Perl / Python / Ruby, and awk is in every Unix.
Here is a more basic version that works in any awk:
awk '/[()]/{
  if (/\(geo:[+-]?[0-9]+\.?[0-9]*,[+-]?[0-9]+\.?[0-9]*\)/) next   # well-formed geo link
  if (!/\(geo:/) next                                             # no geo link on this line
  print FILENAME ": " $0
}
' case[0-9].txt
Prints:
case4.txt: Blah bla (geo: 50.5,25.5) BAD - Space after : blah
case5.txt: Blah bla (geo:Ooops typed text) BAD - Text instead of coordinates blah
case6.txt: Blah bla (geo:12.34567 -12.34567) BAD - Space instead of comma blah
case7.txt: Blah bla (geo:12.34567, -12.34567) BAD - Space after comma blah
Either the Ruby or the awk version could also be extended to validate that an acceptable pattern a) has balanced parentheses (which is hard with a regex alone) and b) is numerically within an acceptable range (again, challenging and error-prone in a single regex).
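As a rough sketch of point b), the check could be done per link rather than per line: pull out each (geo:...) body, test its shape, then compare the numbers against the latitude and longitude limits. The names GEO_LINK, COORDS and bad_geo_links are made up for this illustration, and the limits (-90..90 and -180..180) are taken from the regex in the question. Balanced-parenthesis checking is not attempted here.

```ruby
# Illustrative names only; none of these come from the scripts above.
GEO_LINK = /\(geo:([^)]*)\)/   # capture everything between "(geo:" and ")"
COORDS   = /\A([+-]?\d+(?:\.\d+)?),([+-]?\d+(?:\.\d+)?)\z/

# Return the bodies of all geo links in `line` that are malformed
# or whose numbers fall outside the assumed lat/lon limits.
def bad_geo_links(line)
  line.scan(GEO_LINK).flatten.reject do |body|
    m = COORDS.match(body)
    # A link is OK only if the shape matches AND both values are in range.
    m && m[1].to_f.abs <= 90 && m[2].to_f.abs <= 180
  end
end

# Report file, line number and offending link for every file given.
ARGV.each do |f|
  File.foreach(f).with_index(1) do |line, n|
    bad_geo_links(line).each { |body| puts "#{f}:#{n}: (geo:#{body})" }
  end
end
```

Saved under a hypothetical name such as check_geo.rb, it would be run as ruby check_geo.rb case[0-9].txt. Note that with the range check in place, the (geo:12.34567,-1234567) link in case2.txt is reported as well, because its longitude is far outside -180..180, even though the format-only patterns above accept it.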
The other advantage of this approach is that it is easy to modify your conditions for what you want vs. what you don't.
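For comparison, the single-pattern route the question asks about can be sketched with a negative lookahead placed right after (geo:. This is only an illustration in Ruby, reusing the loose format-only coordinate pattern from the scripts above; VALID_BODY, BAD_GEO and bad_links are names invented here. The (?!...) syntax is shared with PCRE, so the pattern should carry over to grep -P, but that is untested here.

```ruby
# Loose, format-only coordinate pattern (same shape as in the scripts above).
VALID_BODY = /[+-]?\d+\.?\d*,[+-]?\d+\.?\d*/

# Match "(geo:" only when it is NOT followed by "<valid coords>)",
# then consume the malformed body up to the next ")".
BAD_GEO = /\(geo:(?!#{VALID_BODY}\))[^)]*\)/

# Hypothetical helper: every malformed geo link found in a line.
def bad_links(line)
  line.scan(BAD_GEO)
end

samples = [
  "Blah bla (geo:50.5,25.5) GOOD PATTERN blah",
  "Blah bla (geo: 50.5,25.5) BAD - Space after : blah",
  "Blah bla (geo:12.34567, -12.34567) BAD - Space after comma blah",
]
# Prints only the malformed links from the sample lines.
samples.each { |line| bad_links(line).each { |link| puts link } }
```

Any change to the accepted format has to be made inside the lookahead, which is where the filter-then-report scripts stay easier to adjust.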