regex bbedit

How can I expand a regex to find the entire URL in these cases?


I need to match complete blogger.googleusercontent.com image link URLs that include the /img/a/ subdirectories. The URLs are for images, and the file names don't have file extensions, but that may not matter.

These are two sample URLs from a large text file dump of HTML. There is a lot of HTML markup, but there are spaces before href and after the closing " of the URLs.

href="https://blogger.googleusercontent.com/img/a/AVvXsEhb-vB59M2NTWyWvDlMemRdgT0XMKdjB4NMH02iP4Nb7HbzHwq5ZObxjEC1_oLne6xpUhIkrkpyWEdMX9ck-aU5h1JXdpSw-GhbV90QBEi2xigGLQdoSswWuQtPNNCyWMRJiT2XnEadx170jUDbtL-AQKKzYyarCoj8=s1727"

src="https://blogger.googleusercontent.com/img/a/AVvXsEhb-vB59M2NTWyWvDlMemRdgT0XMKdjB4NMH02iP4Nb7HbzHwq5ZObxjEC1_oLne6xpUhIkrkpyWEdMX9ck-aU5h1JXdpSw-GhbV90QBEi2xigGLQdoSswWuQtPNNCyWMRJiT2XnEadx170jUDbtL-AQKKzYyarCoj8=w400-h183"

What I am using is this:

\/img\/a\/[^\/]

And that matches

/img/a/A

I don't need to match the capital A specifically, and I also need to change the regex so that it finds images in /img/b/ as well.

But I do need to expand the match to find the entire URL, from https to the end ".

Fiddle: https://regex101.com/r/txLWcO/1


Solution

  • You could use:

    https://blogger\.googleusercontent\.com/img/a/[^/\s'"]+
    

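    The pattern matches:

      • https://blogger\.googleusercontent\.com/img/a/ matches the URL prefix literally; each dot is escaped so that it matches only a real dot
      • [^/\s'"]+ matches one or more characters other than /, a whitespace character, ' or ", so the match stops just before the closing quote

    (Or use * instead of + to match zero or more occurrences.)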


    If you want to match either /a/ or /b/, you can use a character class:

    https://blogger\.googleusercontent\.com/img/[ab]/[^/\s'"]+
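To check the pattern outside the editor, here is a minimal sketch that applies it with Python's re module. The question is about BBEdit, so Python is only an assumption made for illustration, as are the helper name find_image_urls and the shortened sample file names (the /img/b/ URL in particular is made up).

```python
import re

# Pattern from the answer; [ab] accepts either /img/a/ or /img/b/.
# The negated class stops the match at a slash, whitespace, or a quote.
URL_PATTERN = re.compile(
    r"""https://blogger\.googleusercontent\.com/img/[ab]/[^/\s'"]+"""
)


def find_image_urls(html: str) -> list[str]:
    """Return every matching image URL in the text, in order of appearance."""
    return URL_PATTERN.findall(html)


# Hypothetical, shortened sample markup (not the real file names).
sample = (
    '<a href="https://blogger.googleusercontent.com/img/a/AVvXsEhb-vB59=s1727" >'
    '<img src="https://blogger.googleusercontent.com/img/b/AVvXsEhb-vB59=w400-h183" />'
)

for url in find_image_urls(sample):
    print(url)
```

The surrounding quotes are not part of the match, because " is excluded by the character class, so each result is the bare URL from https to the last character before the closing ".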