r stringr data-wrangling

Regex to extract part of a URL using the stringr R package


I have the following URLs:

www.google.com?utm_source=site_corriere&utm_medium=video&utm_content=box

www.google.com?utm_source=site_rep&utm_medium=display&utm_content=box

www.google.com?utm_source=site_fattoquotidiano&utm_medium=social&utm_content=box

www.google.com?utm_source=site_inter&utm_medium=video&utm_content=box

www.google.com?utm_source=site_foglio&utm_medium=video&utm_content=box

Using the stringr package, I want to extract only the values between "utm_source=" and "&".

So I expect to have:

site_corriere

site_rep

site_fattoquotidiano

site_inter

site_foglio

I am using this regex

(?<=utm_source=)(.*)(?=&)

but it does not work correctly, because it does not exclude this part: &utm_medium=video&utm_content=box

Could you please help me?

Thanks


Solution

  • If you slightly change your regex pattern to the following it should work:

    (?<=\butm_source=)[^&]+
    

    R script:

    library(stringr)
    
    x <- c("www.google.com?utm_source=site_corriere&utm_medium=video&utm_content=box",
           "www.google.com?utm_source=site_rep&utm_medium=display&utm_content=box",
           "www.google.com?utm_source=site_fattoquotidiano&utm_medium=social&utm_content=box",
           "www.google.com?utm_source=site_inter&utm_medium=video&utm_content=box",
           "www.google.com?utm_source=site_foglio&utm_medium=video&utm_content=box")
    output <- str_extract(x, "(?<=\\butm_source=)[^&]+")
    output
    
    [1] "site_corriere"        "site_rep"             "site_fattoquotidiano"
    [4] "site_inter"           "site_foglio"
    

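    The key change is to match the value of the utm_source query parameter with [^&]+ instead of .*. Because .* is greedy, your original pattern runs on to the last & in the URL, whereas [^&]+ stops at the first & and so matches only the single value that follows utm_source=. Since the new pattern does not require a trailing &, it also matches when utm_source is the last query parameter in the URL. The \b word boundary keeps the lookbehind from matching a longer parameter name that merely ends in utm_source.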
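
To see the difference side by side, here is a minimal sketch that runs both patterns on two sample URLs. The second URL, with utm_source as the last parameter, is a made-up example and not taken from the question, and the variable names urls, greedy and fixed are only illustrative.

```r
library(stringr)

# Sample inputs: the first URL is from the question; the second is a
# hypothetical variant where utm_source is the last query parameter.
urls <- c("www.google.com?utm_source=site_rep&utm_medium=display&utm_content=box",
          "www.google.com?utm_medium=display&utm_source=site_rep")

# Original pattern: the greedy .* extends to the last "&" in the string,
# and the lookahead (?=&) requires that an "&" follows the match.
greedy <- str_extract(urls, "(?<=utm_source=)(.*)(?=&)")

# Suggested pattern: [^&]+ consumes characters only up to the first "&"
# (or the end of the string), so no lookahead is needed.
fixed <- str_extract(urls, "(?<=\\butm_source=)[^&]+")

print(greedy)
print(fixed)
```

With these inputs, the greedy pattern should return "site_rep&utm_medium=display" for the first URL and NA for the second (no & follows the value there), while the suggested pattern should return "site_rep" for both.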