regex pcre

Regex not match inside of match (validating content in Markdown links)


I've been trying to create a PCRE regex (for grep -P) that returns matches where the content inside a match doesn't match the expected format.

The use case is grepping a directory of Markdown files, finding all (geo:) links, and returning any whose coordinates do not match the expected format. But I can't figure out how to express "does not match" inside a match...

This regex works for the expected format:

[-+]?([1-8]?\d(\.\d+)?|90(\.0+)?),[-+]?(180(\.0+)?|((1[0-7]\d)|([1-9]?\d))(\.\d+)?)\)

And finding the (geo: string is easy.

But how can I express the "does not match" part inside the match?

Examples good format (that I don't want to return):

Invalid formats that I want to find:

I've tried variations of negative lookaheads on the desired matching format, but I am not sure whether that is a valid approach, or whether it's even going to work.

(?!(?:([-+]?([1-8]?\d(\.\d+)?|90(\.0+)?),[-+]?(180(\.0+)?|((1[0-7]\d)|([1-9]?\d))(\.\d+)?))*?\)))

This question and answer got me close, but I couldn't quite adapt it: "Regex to match local markdown links".


Solution

  • Suppose you have these files:

    $ head case[0-9].txt
    ==> case1.txt <==
    Random text.
    
    Something else
    Blah bla (https://url.com/something) GOOD, Only looking for ( geo:) blah
    Random text.
    Something else
    
    ==> case2.txt <==
    Random text.
    
    Something else
    Blah bla (geo:12.34567,-1234567) GOOD PATTERN blah
    Random text.
    Something else
    
    ==> case3.txt <==
    Random text.
    
    Something else
    Blah bla (geo:50.5,25.5) GOOD PATTERN blah
    Random text.
    Something else
    
    ==> case4.txt <==
    Random text.
    
    Something else
    Blah bla (geo: 50.5,25.5) BAD - Space after : blah
    Random text.
    Something else
    
    ==> case5.txt <==
    Random text.
    
    Something else
    Blah bla (geo:Ooops typed text) BAD - Text instead of coordinates blah
    Random text.
    Something else
    
    ==> case6.txt <==
    Random text.
    
    Something else
    Blah bla (geo:12.34567 -12.34567) BAD - Space instead of comma blah
    Random text.
    Something else
    
    ==> case7.txt <==
    Random text.
    
    Something else
    Blah bla (geo:12.34567, -12.34567) BAD - Space after comma blah
    Random text.
    Something else
    

    (So case4.txt - case7.txt are 'bad')

    Rather than come up with some wizard-level single PCRE, it is usually easier to filter out (validate) what you WANT; anything left over is then what you DON'T WANT.

    Here is a Ruby script that demonstrates the concept:

    ruby -e 'ARGV.each{|f| 
        paren_lines=File.foreach(f).
            select{|line| line=~/\(|\)/}.    # keep only lines containing a parenthesis
            reject{|line|                    # reject lines that are OK
                line=~/\(geo:[+-]?\d+\.?\d*,[+-]?\d+\.?\d*\)/ || line=~/\((?!geo:)/
            }
        puts "#{f} errors:\n\t#{paren_lines.join("\t")}\n" if paren_lines.length>0
    }' case[0-9].txt
    

    Prints:

    case4.txt errors:
        Blah bla (geo: 50.5,25.5) BAD - Space after : blah
    
    case5.txt errors:
        Blah bla (geo:Ooops typed text) BAD - Text instead of coordinates blah
    
    case6.txt errors:
        Blah bla (geo:12.34567 -12.34567) BAD - Space instead of comma blah
    
    case7.txt errors:
        Blah bla (geo:12.34567, -12.34567) BAD - Space after comma blah
    

    You could do the same in Bash by chaining several greps, but why? It is much easier with awk / Perl / Python / Ruby, and awk is available on every Unix.

    Here is a more basic version that works in any awk:

    awk '/[()]/{
        if (/\(geo:[+-]?[0-9]+\.?[0-9]*,[+-]?[0-9]+\.?[0-9]*\)/) next
        if (!/\(geo:/) next
    
        print FILENAME ": ", $0
        }
    ' case[0-9].txt
    

    Prints:

    case4.txt:  Blah bla (geo: 50.5,25.5) BAD - Space after : blah
    case5.txt:  Blah bla (geo:Ooops typed text) BAD - Text instead of coordinates blah
    case6.txt:  Blah bla (geo:12.34567 -12.34567) BAD - Space instead of comma blah
    case7.txt:  Blah bla (geo:12.34567, -12.34567) BAD - Space after comma blah
    

    Either the Ruby or the awk version could also further validate that the acceptable pattern a) has balanced parentheses (which is hard with regex alone) and b) is numerically in an acceptable range (again, challenging and error-prone in a single regex).
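
    As a rough sketch of the numeric range check, assuming latitude must lie in -90..90 and longitude in -180..180: the helper names `valid_geo?` and `bad_geo_links` are hypothetical, and the shape pattern is slightly stricter than the one-liner above (it does not accept a trailing dot such as `12.`). It inspects each `(geo:...)` link separately rather than the whole line, and it does not catch a link with a missing closing parenthesis.

    ```ruby
    # Hypothetical helpers, not part of the original answer.
    GEO_LINK = /\(geo:([^)]*)\)/   # captures whatever sits between "(geo:" and ")"
    COORDS   = /\A([+-]?\d+(?:\.\d+)?),([+-]?\d+(?:\.\d+)?)\z/

    # True when the link body is "LAT,LON" and both numbers are in range.
    def valid_geo?(body)
      m = COORDS.match(body)
      return false unless m                     # shape check failed
      lat = m[1].to_f
      lon = m[2].to_f
      # Range check done as arithmetic instead of inside the regex.
      # The limits are an assumption about what "acceptable" means.
      lat.between?(-90, 90) && lon.between?(-180, 180)
    end

    # Returns the bodies of all geo links on a line that fail validation.
    def bad_geo_links(line)
      line.scan(GEO_LINK).flatten.reject { |body| valid_geo?(body) }
    end
    ```

    With this check, the longitude `-1234567` in case2.txt would be reported as well, because it passes the format-only pattern but is far outside the assumed range.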

    The other advantage of this approach is that it is easy to modify your conditions of what you want vs. what you don't.
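
    One way to keep those conditions easy to change, sketched here under the assumption that each "this line is OK" condition is a single regex: put them in a list and flag a line only when none of them match. The names `ACCEPT` and `suspicious?` are hypothetical; the two patterns are the ones from the Ruby one-liner.

    ```ruby
    # Hypothetical structure: every entry describes a line that is acceptable.
    ACCEPT = [
      /\(geo:[+-]?\d+\.?\d*,[+-]?\d+\.?\d*\)/,  # a well-formed geo link
      /\((?!geo:)/,                             # a parenthesis that does not open a geo link
    ]

    # A line is suspicious when it contains a parenthesis
    # and no acceptance pattern matches it.
    def suspicious?(line, accept = ACCEPT)
      line.match?(/[()]/) && accept.none? { |re| line.match?(re) }
    end
    ```

    Relaxing a rule, for example tolerating one space after the comma, then means passing `ACCEPT + [extra_pattern]` rather than rewriting one large expression. As with the one-liner, the check is per line, so a line that mixes an ordinary parenthesis with a bad geo link would still be accepted.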