I need to match complete blogger.googleusercontent.com image link URLs that include the /img/a/ subdirectories. The URLs are for images, and the file names don't have file extensions, but that may not matter.
These are two sample URLs from a large text file dump of HTML. There is a lot of HTML markup, but there are spaces before href and after the closing " of the URLs.
href="https://blogger.googleusercontent.com/img/a/AVvXsEhb-vB59M2NTWyWvDlMemRdgT0XMKdjB4NMH02iP4Nb7HbzHwq5ZObxjEC1_oLne6xpUhIkrkpyWEdMX9ck-aU5h1JXdpSw-GhbV90QBEi2xigGLQdoSswWuQtPNNCyWMRJiT2XnEadx170jUDbtL-AQKKzYyarCoj8=s1727"
src="https://blogger.googleusercontent.com/img/a/AVvXsEhb-vB59M2NTWyWvDlMemRdgT0XMKdjB4NMH02iP4Nb7HbzHwq5ZObxjEC1_oLne6xpUhIkrkpyWEdMX9ck-aU5h1JXdpSw-GhbV90QBEi2xigGLQdoSswWuQtPNNCyWMRJiT2XnEadx170jUDbtL-AQKKzYyarCoj8=w400-h183"
What I am using is this:
\/img\/a\/[^\/]
And that matches
/img/a/A
I don't need to match the capital A, and I also need to change the regex so that it finds images in /img/b/ as well.
But I do need to expand the match to find the entire URL, from https to the end ".
Fiddle: https://regex101.com/r/txLWcO/1
You could use:
https://blogger\.googleusercontent\.com/img/a/[^/\s'"]+
The pattern matches:
https://blogger\.googleusercontent\.com/img/a/ matches https://blogger.googleusercontent.com/img/a/ literally, with the dots escaped so they are not treated as wildcards.
[^/\s'"]+ matches one or more characters other than /, whitespace, " and ', so the match stops at the closing quote of the attribute.
(Or use * instead of + to match zero or more occurrences.)
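As a rough check of the pattern above, here is a minimal Python sketch. The URL is the first sample from the question; the surrounding anchor markup and the name IMG_A are made up for illustration, and Python's re module is only one of several engines the pattern would work in.

```python
import re

# The pattern from the answer. A triple-quoted raw string lets the
# character class contain both ' and " without extra escaping.
IMG_A = re.compile(r"""https://blogger\.googleusercontent\.com/img/a/[^/\s'"]+""")

# First sample URL from the question.
url = "https://blogger.googleusercontent.com/img/a/AVvXsEhb-vB59M2NTWyWvDlMemRdgT0XMKdjB4NMH02iP4Nb7HbzHwq5ZObxjEC1_oLne6xpUhIkrkpyWEdMX9ck-aU5h1JXdpSw-GhbV90QBEi2xigGLQdoSswWuQtPNNCyWMRJiT2XnEadx170jUDbtL-AQKKzYyarCoj8=s1727"

# Hypothetical markup around the URL, with spaces as described in the question.
html = f'<a href="{url}" target="_blank">'

# findall returns every non-overlapping match; each one runs from https
# up to, but not including, the closing double quote.
matches = IMG_A.findall(html)
print(matches == [url])
```

Because the closing quote is in the excluded set, the =s1727 size suffix is part of the match while the quote itself is not.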
If you want to match either /a/ or /b/, you can use a character class:
https://blogger\.googleusercontent\.com/img/[ab]/[^/\s'"]+
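A short Python sketch of the [ab] variant follows. Everything in the test markup is hypothetical: EXAMPLE_ID, the extra /s1600/photo path segments, and the names IMG_AB and IMG_AB_TO_QUOTE are placeholders, not taken from the question. The second pattern is not from the answer; it is an assumed variant for the case where a URL has further slashes after the ID, since the original class excludes / and would stop there.

```python
import re

# The [ab] variant from the answer: either /img/a/ or /img/b/.
IMG_AB = re.compile(r"""https://blogger\.googleusercontent\.com/img/[ab]/[^/\s'"]+""")

# Assumed variant: "/" dropped from the excluded set, so the match runs
# to the closing quote even when the path has more segments.
IMG_AB_TO_QUOTE = re.compile(r"""https://blogger\.googleusercontent\.com/img/[ab]/[^\s'"]+""")

base = "https://blogger.googleusercontent.com/img"

# Hypothetical markup; the IDs and extra path segments are placeholders.
html = (
    f'<img src="{base}/a/EXAMPLE_ID=w400-h183" /> '
    f'<a href="{base}/b/EXAMPLE_ID/s1600/photo">'
)

# The answer's pattern stops at the first "/" after the ID.
print(IMG_AB.findall(html))
# The assumed variant continues up to the closing quote.
print(IMG_AB_TO_QUOTE.findall(html))
```

Which of the two is appropriate depends on what the /img/b/ URLs in the dump actually look like: if they never contain a slash after the ID, both patterns return the same matches.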