I have the following URLs:
www.google.com?utm_source=site_corriere&utm_medium=video&utm_content=box
www.google.com?utm_source=site_rep&utm_medium=display&utm_content=box
www.google.com?utm_source=site_fattoquotidiano&utm_medium=social&utm_content=box
www.google.com?utm_source=site_inter&utm_medium=video&utm_content=box
www.google.com?utm_source=site_foglio&utm_medium=video&utm_content=box
Using the stringr package, I want to extract only the values between "utm_source=" and "&".
So I expect to have:
site_corriere
site_rep
site_fattoquotidiano
site_inter
site_foglio
I am using this regex
(?<=utm_source=)(.*)(?=&)
but it is not working correctly: the match does not stop at the first & and still includes the parameters that follow (e.g. &utm_medium=video).
Could you please help me?
Thanks
If you slightly change your regex pattern to the following it should work:
(?<=\butm_source=)[^&]+
R script:
library(stringr)
x <- c("www.google.com?utm_source=site_corriere&utm_medium=video&utm_content=box",
"www.google.com?utm_source=site_rep&utm_medium=display&utm_content=box",
"www.google.com?utm_source=site_fattoquotidiano&utm_medium=social&utm_content=box",
"www.google.com?utm_source=site_inter&utm_medium=video&utm_content=box",
"www.google.com?utm_source=site_foglio&utm_medium=video&utm_content=box")
output <- str_extract(x, "(?<=\\butm_source=)[^&]+")
output
[1] "site_corriere" "site_rep" "site_fattoquotidiano"
[4] "site_inter" "site_foglio"
The logical change to your regex pattern is to express the query parameter value of utm_source as [^&]+, which matches one or more characters up to, but not including, the next &. Your original (.*) is greedy, so it runs on to the last & in the string. Because [^&]+ does not require a trailing &, it also matches when utm_source is the last query parameter in the URL.
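To make the difference concrete, here is a small sketch comparing the greedy pattern from the question, a lazy variant, and the negated character class. The variables url, last_param and prefixed are made up for illustration (the last two are not among the URLs in the question), and the lazy variant is only shown for comparison:

```r
library(stringr)

url <- "www.google.com?utm_source=site_corriere&utm_medium=video&utm_content=box"

# Greedy: .* backtracks only as far as the LAST "&" in the string
greedy <- str_extract(url, "(?<=utm_source=)(.*)(?=&)")
greedy
# [1] "site_corriere&utm_medium=video"

# Lazy: .*? stops at the FIRST "&", but still requires one to be present
lazy <- str_extract(url, "(?<=utm_source=).*?(?=&)")
lazy
# [1] "site_corriere"

# Hypothetical URL where utm_source is the last parameter (no trailing "&")
last_param <- "www.google.com?utm_medium=video&utm_source=site_rep"
lazy_last <- str_extract(last_param, "(?<=utm_source=).*?(?=&)")
lazy_last
# [1] NA
negated_last <- str_extract(last_param, "(?<=\\butm_source=)[^&]+")
negated_last
# [1] "site_rep"

# Hypothetical URL with a parameter whose name merely ends in utm_source:
# \b keeps the lookbehind from matching inside "old_utm_source="
prefixed <- "www.google.com?old_utm_source=a&utm_source=b"
with_boundary <- str_extract(prefixed, "(?<=\\butm_source=)[^&]+")
with_boundary
# [1] "b"
```

The lazy version gives the same result on the URLs in the question, but it returns NA as soon as utm_source is not followed by another parameter, which is why the negated class is the safer choice here.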