In C#, this pattern matches img tags and captures the src URL in the src group:
<img.*?src=.*?\"(?<src>([^\"]*?))\".*?>
But in some documents the match fails because the src is enclosed in single quotes.
Example:
<div class="tableauPlaceholder" id="viz1749842670060" style="position: relative"><noscript><a href='#'><img alt='Dashboard 1 ' src='https://public.tableau.com/static/images/Pu/PuntualitneipagamentiB2Bdifferenzeperdimensioneesettore/Dashboard1/1_rss.png' style='border: none'></a></noscript><object class="tableauViz" style="display:none;">
</div>
I really don't understand why the src group contains tableauViz, the contents of the class attribute.
Is there a way to edit the pattern so that it correctly matches the src of the image tag even when the value is between single quotes?
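Your current regex only matches double-quoted values ("). When the src is enclosed in single quotes, it either fails or captures the wrong thing. In your example, the lazy .*? after src= runs forward to the first double quote it can find, which is the one in class="tableauViz" on the following <object> tag, so tableauViz ends up in the src group.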
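To see the wrong capture in isolation, here is a minimal sketch that runs the original pattern against a shortened copy of the HTML from the question. The variable names are only illustrative, and it assumes a project that allows top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Shortened copy of the HTML from the question: the img uses single quotes,
// the following object tag uses double quotes.
string html =
    "<noscript><a href='#'><img alt='Dashboard 1 ' " +
    "src='https://public.tableau.com/static/images/Pu/PuntualitneipagamentiB2Bdifferenzeperdimensioneesettore/Dashboard1/1_rss.png' " +
    "style='border: none'></a></noscript>" +
    "<object class=\"tableauViz\" style=\"display:none;\">";

// The original pattern from the question (double quotes only).
string original = "<img.*?src=.*?\"(?<src>([^\"]*?))\".*?>";

// After src=, the lazy .*? skips ahead to the first double quote,
// which belongs to class="tableauViz" on the object tag.
Match m = Regex.Match(html, original);
string captured = m.Groups["src"].Value;

Console.WriteLine(captured); // prints: tableauViz
```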
Try:
<img[^>]*?\s+src\s*=\s*['"](?<src>[^'"]+)['"][^>]*?>
<img[^>]*?
Matches <img and then, non-greedily, any characters other than > up to the src attribute, so the match cannot run past the end of the img tag.
\s+src\s*=\s*
Matches the src= part, requiring whitespace before src and allowing optional whitespace around the equals sign.
['"](?<src>[^'"]+)['"]
Matches the value of src whether it is in single (') or double (") quotes, and stores it in the src group.
[^>]*?>
Matches the rest of the tag up to the closing >.
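The pattern above can be tried out with a short sketch like the following. The helper names ExtractSrc and ExtractSrcStrict are hypothetical, and the code assumes top-level statements (C# 9 or later). The second pattern is not part of the suggestion above: it is a stricter variation that uses a backreference so the closing quote must be the same character as the opening one, which helps when the value itself contains the other kind of quote.

```csharp
using System;
using System.Text.RegularExpressions;

// Shortened copy of the HTML from the question.
string html =
    "<noscript><a href='#'><img alt='Dashboard 1 ' " +
    "src='https://public.tableau.com/static/images/Pu/PuntualitneipagamentiB2Bdifferenzeperdimensioneesettore/Dashboard1/1_rss.png' " +
    "style='border: none'></a></noscript>" +
    "<object class=\"tableauViz\" style=\"display:none;\">";

// Pattern from the answer. In a verbatim string (@"..."), a literal
// double quote is written as "".
string pattern = @"<img[^>]*?\s+src\s*=\s*['""](?<src>[^'""]+)['""][^>]*?>";

// Assumed variation: capture the opening quote in group q and require
// the same character again with \k<q>.
string strictPattern = @"<img[^>]*?\s+src\s*=\s*(?<q>['""])(?<src>.*?)\k<q>[^>]*?>";

// Hypothetical helper: returns the src value, or an empty string if no img tag matches.
string ExtractSrc(string input)
{
    Match m = Regex.Match(input, pattern, RegexOptions.IgnoreCase);
    return m.Success ? m.Groups["src"].Value : string.Empty;
}

// Hypothetical helper using the stricter, backreference-based pattern.
string ExtractSrcStrict(string input)
{
    Match m = Regex.Match(input, strictPattern, RegexOptions.IgnoreCase);
    return m.Success ? m.Groups["src"].Value : string.Empty;
}

Console.WriteLine(ExtractSrc(html));                       // the 1_rss.png URL
Console.WriteLine(ExtractSrc("<img src=\"a.png\">"));      // prints: a.png
Console.WriteLine(ExtractSrcStrict("<img src=\"it's.png\">")); // prints: it's.png
```

With the first pattern, a value such as it's.png inside double quotes would be cut off at the apostrophe, because [^'"]+ stops at either quote character. If such values can occur in your documents, the backreference version is the safer of the two.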