
Regular Expressions Negative Lookahead not working as expected


I am trying to match only the a tags that are not followed by an img tag in some HTML code.

I tried the pattern <a.*?>(?!<img.*?>) on this HTML code:

<a href="">Text</a>
<a href=""><img src=""/></a>

But it doesn't work as expected: it matches both a tags, and it even captures the img tag.


Solution

  • You said it captures the img tag, but you are likely seeing these two matches:

    <a href="">
    <a href=""><img src=""/>
    

    This makes sense, because your regex <a.*?>(?!<img.*?>) reads as: match <a, followed by any characters (non-greedy), followed by a > that is not immediately followed by an img tag. Here an img tag is defined as <img, followed by any characters (non-greedy), followed by a >.

    The first <a href=""> is not immediately followed by an img tag (Text comes first), so it matches.

    The second <a href=""> is immediately followed by an img tag, so the lookahead rejects that attempt. The engine then backtracks and lets the non-greedy .*? consume more characters, up to the > that closes the img tag. Taken together, <a href=""><img src=""/> also satisfies <a.*?>, and what follows it is </a>, not an img tag, so it matches.
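
To see this behaviour concretely, here is a minimal sketch that runs the pattern from the question against the two sample lines. The question doesn't name a regex engine, so Python's re module is an assumption, and the variable names are purely illustrative.

```python
import re

# The two sample lines from the question.
html = '<a href="">Text</a>\n<a href=""><img src=""/></a>'

# The pattern from the question, unchanged.
original = re.compile(r'<a.*?>(?!<img.*?>)')

original_matches = original.findall(html)
for match in original_matches:
    print(match)

# Prints:
# <a href="">
# <a href=""><img src=""/>
```

The second match shows the non-greedy .*? stretching past the first > until the lookahead is satisfied.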

    Your definition is vague, so it's hard to write an expression that does what you want. When is an anchor tag not followed by an image tag? Should there be no image tag anywhere in the rest of the document? Or do you mean that no image tag should occur before the anchor tag closes?

    If you're just looking to match anchor tags and their contents, unless they contain an image tag, something like this would be better:

    <a\b[^>]*>(?:(?!<img\b).)*?</a>
    

    Depending on the regex engine, you may need to escape the / with a \ (typically where / also delimits the pattern). You may also want to make the match case-insensitive, since HTML tag names can appear in either case.
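
As a rough check of the suggested expression, the sketch below applies it with case-insensitive matching. Python's re module is again an assumption, and the upper-case sample lines are made up for illustration; they are not from the question. In Python the / needs no escaping.

```python
import re

# First line is from the question; the upper-case lines are
# hypothetical additions to exercise case-insensitive matching.
html = (
    '<a href="">Text</a>\n'
    '<A HREF=""><IMG SRC=""/></A>\n'
    '<A HREF="">More</A>'
)

# (?:(?!<img\b).)*? consumes one character at a time, but only
# while the text at that position does not start an img tag.
pattern = re.compile(r'<a\b[^>]*>(?:(?!<img\b).)*?</a>', re.IGNORECASE)

anchor_matches = pattern.findall(html)
for match in anchor_matches:
    print(match)

# Prints:
# <a href="">Text</a>
# <A HREF="">More</A>
```

Note that . does not match a newline by default in Python, so an anchor whose contents span several lines would need the re.DOTALL flag as well.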

    Consider using a site like https://regex101.com/ to test your regex - it offers a detailed explanation of the regex and shows what it would match. There are offline tools available as well.