r, regex, stringr

Extract text in two columns from a string


I have a table where one column has data like this:

table$test_string<- "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)"

1.) I am trying to extract the first part of this string within the square brackets in one column, i.e.

table$project_name <- "projectname"

using the regex:

project_name <- "^\\[|(?:[a-zA-Z]|[0-9])+|\\]$"
table$project_name <- str_extract(table$test_string, project_name)

If I test the regex on a single value (one row of the table), it works when I use str_extract_all(table$test_string, project_name)[[1]][2].

However, I get NA when I apply the regex pattern to the whole table and an error if I use str_extract_all.

2.) Second part of the string, which is a URL in another column,

table$url_link <- "https://somewebsite.com/projectname/Abc/xyz-09"

I am using the following regex expression for URL:

url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

table$url_link <- str_extract(table$test_string, url_pattern)

This works on the whole table, but the extracted URL still ends with the closing parenthesis ')'.

What am I missing here? Why does the first regex work on an individual value but not on the whole table? And for the URL, how do I drop the trailing parenthesis?


Solution

  • You could simplify things considerably by using parentheses as capture groups and referring back to each group in the replacement. For example:

    test_string<- "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)"
    
    regex <- "\\[(.*)\\]\\((.*)\\)"
    
    gsub(regex, "\\1", test_string)
    #> [1] "projectname"
    
    gsub(regex, "\\2", test_string)
    #> [1] "https://somewebsite.com/projectname/Abc/xyz-09"
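
To apply the same idea to a whole column rather than a single string, something like the following sketch might work. It uses only base R; the data frame `df`, its extra rows, and the name `link_regex` are hypothetical stand-ins for the asker's table, and the pattern is a slightly stricter variant of the one above (an assumption, not the only way to write it).

```r
# Hypothetical data frame standing in for the asker's `table`
df <- data.frame(
  test_string = c(
    "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)",
    "[other](https://example.com/other/path)",
    "no markdown link here"
  ),
  stringsAsFactors = FALSE
)

# "[label](url)": group 1 is the label, group 2 is the URL.
# Negated character classes stop at the first "]" and ")" instead of
# relying on a greedy ".*".
link_regex <- "^\\[([^]]*)\\]\\(([^)]*)\\)$"

# sub() leaves non-matching strings unchanged, so rows that do not match
# are set to NA explicitly.
is_link <- grepl(link_regex, df$test_string)
df$project_name <- ifelse(is_link, sub(link_regex, "\\1", df$test_string), NA_character_)
df$url_link     <- ifelse(is_link, sub(link_regex, "\\2", df$test_string), NA_character_)
```

Because `sub()` and `grepl()` are vectorised, no loop over rows is needed. As for the stray ")" in the question: the original URL pattern lists `\\)` inside one of its character classes, so the closing parenthesis is a permitted URL character there and gets consumed along with the rest of the match.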