I have a text file of several thousand URLs that I need to truncate or trim with regex. I am using BBEdit as a text editor as it has a great regex find/replace function.
This is an example of one of the URLs:
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUk2LfEXvKMZ48tpWUR607L5y_TRn-lXyajH_tJBOeWPqNFmfU1UV7pKginB78MHnuGS-luzq-RCIj1Z6rJ2y8VE3P93gIGeN_ZMjFii1Vnb2wZMnbyLTH241UTuu8kcvMZHFii1Vnb2wZMnbyLTH241gaZGDlgWTfx4EVdAlNFncc2XZJNz0fE0-JK1iDP7WgLEJWNg/w640-h196/Oscar.png
I need to truncate/trim the longest subdirectory path, which is this:
/AVvXsEhUk2LfEXvKMZ48tpWUR607L5y_TRn-lXyajH_tJBOeWPqNFmfU1UV7pKginB78MHnuGS-luzq-RCIj1Z6rJ2y8VE3P93gIGeN_ZMjFii1Vnb2wZMnbyLTH241UTuu8kcvMZHFii1Vnb2wZMnbyLTH241gaZGDlgWTfx4EVdAlNFncc2XZJNz0fE0-JK1iDP7WgLEJWNg/
What I need to do is trim that one subdirectory path down to the leading /AVvXsE plus the next 20 characters to the right.
That is, this is the result I need:
/AVvXsEhUk2LfEXvKMZ48tpWUR6/
So the resulting full URL is this:
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUk2LfEXvKMZ48tpWUR6/w640-h196/Oscar.png
The first six characters of that subdirectory, AVvXsE (after the leading /), are the same in all the URLs I need to truncate/trim. I need the next 20 characters to the right of /AVvXsE to keep the paths unique, because I can see that the other subdirectories for the image files, e.g. w640-h196, are used for many other images.
How can I do this with Regex? Or is Regex not the best way to do this? What about sed?
Regex Fiddle: https://regex101.com/r/W2t82Z/1
You can use a pattern that includes (\/AVvXsE\S{20})[^\/]*, such as:
(?i)(https?:\/\/blogger.googleusercontent.com\/img\/.*)(\/AVvXsE\S{20})[^\/]*
This assumes that you only want to match https://blogger.googleusercontent.com/img/ URLs. Replace each match with the two capture groups, \1\2, as in this Python example:
import re
s = """https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEEXvKMZ48tpWUR607L5y_TRn-lXyajH_tJBOeWPqNFmfU1UV7pKginB78MHnuGS-luzq-RCIj1Z6rJ2y8VE3P93gIGeN_ZMjFii1Vnb2wZMnbyLTH241UTuu8kcvMZHFii1Vnb2wZMnbyLTH241gaZGDlgWTfx4EVdAlNFncc2XZJNz0fE0-JK1iDP7WgLEJWNg/w640-h196/Oscar.png"""
p = r'(?i)(https?:\/\/blogger.googleusercontent.com\/img\/.*)(\/AVvXsE\S{20})[^\/]*'
print(re.sub(p, r'\1\2', s))
Output:
https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEEXvKMZ48tpWUR607L5y_/w640-h196/Oscar.png
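One caveat: \S also matches /, so if a segment ever had fewer than 20 characters after AVvXsE, \S{20} could run on into the next path segment. Here is a minimal sketch of a stricter variant. It assumes the segment to shorten always starts with /AVvXsE and that shorter segments should be left untouched. The name truncate_segment is just a hypothetical helper for illustration.

```python
import re

# "/AVvXsE" plus exactly 20 non-slash characters is kept (group 1);
# the remainder of that same path segment is matched and dropped.
SEGMENT = re.compile(r'(/AVvXsE[^/]{20})[^/]*')

def truncate_segment(url: str) -> str:
    """Shorten the first /AVvXsE... segment; URLs without one are returned unchanged."""
    return SEGMENT.sub(r'\1', url, count=1)

# The example URL from the question.
url = (
    "https://blogger.googleusercontent.com/img/b/R29vZ2xl/"
    "AVvXsEhUk2LfEXvKMZ48tpWUR607L5y_TRn-lXyajH_tJBOeWPqNFmfU1UV7pKginB78MHnuGS-"
    "luzq-RCIj1Z6rJ2y8VE3P93gIGeN_ZMjFii1Vnb2wZMnbyLTH241UTuu8kcvMZHFii1Vnb2wZMn"
    "byLTH241gaZGDlgWTfx4EVdAlNFncc2XZJNz0fE0-JK1iDP7WgLEJWNg"
    "/w640-h196/Oscar.png"
)
print(truncate_segment(url))
```

Because [^/] cannot cross a slash, a segment with fewer than 20 characters after AVvXsE simply does not match and the URL comes back as it was.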
(?i): case-insensitive flag (allows any combination of lowercase and uppercase, e.g. HTTPS://, https://, etc.).
(https?:\/\/blogger.googleusercontent.com\/img\/.*): this capture group limits the pattern to specific URLs.
(\/AVvXsE\S{20}): this capture group is the part you want to keep.
[^\/]*: this is the part you want to get rid of.
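Since the goal is to run this over several thousand URLs and keep the paths unique, it may be worth checking that no two different URLs collapse to the same shortened one. The following is only a sketch under a few assumptions: the sample URLs are made up for the demonstration, the function names are hypothetical, and it uses a slash-free variant of the pattern ([^/]{20} instead of \S{20}).

```python
import re

# Keep "/AVvXsE" plus the next 20 non-slash characters, drop the rest of the segment.
SEGMENT = re.compile(r'(/AVvXsE[^/]{20})[^/]*')

def truncate_lines(lines):
    """Return every URL with its /AVvXsE... segment shortened (one URL per line)."""
    return [SEGMENT.sub(r'\1', line.rstrip('\n'), count=1) for line in lines]

def find_collisions(originals, shortened):
    """Map each shortened URL to its originals when more than one distinct original produced it."""
    sources = {}
    for orig, short in zip(originals, shortened):
        sources.setdefault(short, set()).add(orig)
    return {short: origs for short, origs in sources.items() if len(origs) > 1}

# Made-up sample data: the first two URLs share the same 20 characters after AVvXsE.
base = "https://blogger.googleusercontent.com/img/b/R29vZ2xl/"
sample = [
    base + "AVvXsE" + "a" * 20 + "x" * 40 + "/w640-h196/Oscar.png",
    base + "AVvXsE" + "a" * 20 + "y" * 40 + "/w640-h196/Oscar.png",
    base + "AVvXsE" + "b" * 20 + "z" * 40 + "/w640-h196/Other.png",
]

shortened = truncate_lines(sample)
collisions = find_collisions(sample, shortened)
print(len(shortened), "URLs,", len(collisions), "shortened URL(s) with collisions")

# For a real file (hypothetical name "urls.txt"), the same functions could be used as:
#   with open("urls.txt") as f:
#       originals = f.read().splitlines()
#   shortened = truncate_lines(originals)
```

If find_collisions returns anything, 20 characters is not enough to keep those particular URLs distinct, and a longer prefix would be needed for them.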