
Regex to match all <img> tags and extract the "src" attribute


I want to use a regex to find all img tags in an HTML document and extract the content of the src attribute.

This is my regex (see it online at https://regex101.com/r/EE08dw/1):

<img(?<prepend>[^>]+?)src=('|")?(?<src>[^\2>]+)[\2]?(?<append>[^>]*)>

On a test string:

<img src="aaa.jpg">

the output is:

Full match    `<img src="aaa.jpg">`
Group prepend ` `
Group 2.      "
Group src     `aaa.jpg"`
Group append  ``

but the expected output is:

Full match    `<img src="aaa.jpg">`
Group prepend ` `
Group 2.      "
Group src     `aaa.jpg`
Group append  ``

The problem is that group src also matches the closing " character:

Output:   Group src `aaa.jpg"`
Expected: Group src `aaa.jpg`

How can I fix it?

Side note: using regex is safe in my context.


Solution

  • Since you specified in the comments below your question that using regex in your case is safe...

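    You can't put a backreference inside a set (character class). Inside a set the escape is read as a literal character, so in your case \2 matches the character with octal index 2 (U+0002) literally, not the quote captured by group 2. Because [^\2>] therefore still accepts the quote, the src group consumes it. Use a tempered greedy token instead.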

    See regex in use here

    <img(?<prepend>[^>]+?)src=(['"])?(?<src>(?:(?!\2)[^>])+)\2?(?<append>[^>]*)>
                              ^^^^^^        ^^^^^^^^^^^^^^  ^^
                              1             2               3
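    1: Uses a set; alternation with | works as well, but a set performs better
    2: Tempered greedy token: (?!\2) checks, before each character is consumed, that it is not the quote captured by group 2
    3: Backreference moved out of the set, so \2 refers to group 2 again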
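
For anyone who wants to try this outside regex101, here is a minimal sketch in Python. It rests on a few assumptions: Python's `re` module spells named groups `(?P<name>...)` rather than `(?<name>...)`, so both patterns are ported to that syntax, and the names `BROKEN`, `FIXED`, and `extract_srcs` are made up for illustration. It reproduces the unwanted trailing quote with the original pattern and then shows the tempered greedy token version returning only the attribute value.

```python
import re
from typing import List

# Original pattern, ported to Python's (?P<name>...) syntax.
# Inside [...], \2 is an octal escape for U+0002, not a backreference,
# so the src group is free to swallow the closing quote.
BROKEN = re.compile(
    r"""<img(?P<prepend>[^>]+?)src=('|")?(?P<src>[^\2>]+)[\2]?(?P<append>[^>]*)>"""
)

# Fixed pattern: the tempered greedy token (?:(?!\2)[^>])+ stops at the
# quote captured by group 2, and \2 now sits outside any set.
FIXED = re.compile(
    r"""<img(?P<prepend>[^>]+?)src=(['"])?(?P<src>(?:(?!\2)[^>])+)\2?(?P<append>[^>]*)>"""
)


def extract_srcs(html: str) -> List[str]:
    """Return the src value of every <img> tag that FIXED matches."""
    return [m.group("src") for m in FIXED.finditer(html)]


html = '<img src="aaa.jpg"> <img class="x" src=\'b.png\' alt="y">'

# The set really does contain U+0002 rather than a backreference.
print(re.fullmatch(r"[\2]", "\x02") is not None)  # True

# Original behaviour: the closing quote ends up in the src group.
print(BROKEN.search('<img src="aaa.jpg">').group("src"))  # aaa.jpg"

# Fixed behaviour: only the attribute value is captured.
print(extract_srcs(html))  # ['aaa.jpg', 'b.png']
```

The named groups still count toward numbering, which is why `\2` keeps pointing at the quote group in both patterns.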