I want to use a regex to find all img tags in an HTML document and extract the content of the src attribute.
This is my regex (see it online at https://regex101.com/r/EE08dw/1):
<img(?<prepend>[^>]+?)src=('|")?(?<src>[^\2>]+)[\2]?(?<append>[^>]*)>
On a test string:
<img src="aaa.jpg">
the output is:
Full match `<img src="aaa.jpg">`
Group prepend ` `
Group 2. "
Group src `aaa.jpg"`
Group append ``
but the expected output is:
Full match `<img src="aaa.jpg">`
Group prepend ` `
Group 2. "
Group src `aaa.jpg`
Group append ``
The problem is in the src group, which also matches the closing " character:
Output: Group src `aaa.jpg"`
Expected: Group src `aaa.jpg`
How can I fix it?
side note: regex is safe in my context
Since you specified in the comments below your question that using regex in your case is safe...
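You can't put a backreference inside a set (character class); the characters in a set are interpreted literally. In your case, \2 inside [...] is read as an octal escape, so it matches the character with index 2 (octal) rather than the quote captured in group 2. Use a tempered greedy token instead.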
<img(?<prepend>[^>]+?)src=(['"])?(?<src>(?:(?!\2)[^>])+)\2?(?<append>[^>]*)>
                          ^^^^^^        ^^^^^^^^^^^^^^  ^^
                          1             2               3
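1: Uses a set. You could use an alternation (|) as well, but a set performs better.
2: Tempered greedy token. At each position, (?!\2) checks that the next character is not the quote captured in group 2 before [^>] consumes it.
3: The backreference is taken out of the set, so it really refers to group 2.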
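
To see the difference outside regex101, here is a minimal sketch in Python. It is only an illustration under a few assumptions: both patterns are ported to Python's (?P<name>...) named-group syntax (the original uses (?<name>...)), and the names BROKEN, FIXED and extract_src are hypothetical helpers, not part of the question. Python's re module also treats \2 inside a set as an octal escape, so the original behaviour should reproduce there.

```python
import re

# The question's pattern, ported to Python's (?P<name>...) group syntax.
# Inside [...] the \2 is an octal escape for the character with code 2,
# not a backreference, so the set never excludes the quote.
BROKEN = re.compile(
    r"""<img(?P<prepend>[^>]+?)src=('|")?(?P<src>[^\2>]+)[\2]?(?P<append>[^>]*)>"""
)

# The corrected pattern: the backreference sits in a negative lookahead,
# outside any set, so the src group stops at the captured quote.
FIXED = re.compile(
    r"""<img(?P<prepend>[^>]+?)src=(['"])?(?P<src>(?:(?!\2)[^>])+)\2?(?P<append>[^>]*)>"""
)


def extract_src(pattern, html):
    """Return the src capture of every <img> tag that the pattern matches."""
    return [m.group("src") for m in pattern.finditer(html)]


if __name__ == "__main__":
    sample = '<img src="aaa.jpg">'
    print(extract_src(BROKEN, sample))  # src keeps the trailing quote
    print(extract_src(FIXED, sample))   # src is just the file name
```

With the test string from the question, the first call returns the value with the trailing " and the second returns aaa.jpg on its own. Other regex flavours may handle a backreference to an unset group differently, so unquoted src values are worth testing in whichever engine you actually use.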