Delimiting string in R


I have this challenge:

I want to be able to extract one portion of a string in the following manner:

  1. The string may contain no dot, one dot, or several.
  2. I want to extract the part of the string before the first dot. If there is no dot, I want the whole string.
  3. I want to use a regex to achieve this.

    test <- c("This_This-This.Not This",
              "This_This-This.not_.this",
              "This_This-This",
              "this",
              "this.Not This")

Since I need to use a regex, I have been trying to use this expression:

    str_match(test,"(^[a-zA-Z].+)[\\.\\b]?")[,2]

but what I get is:

    > str_match(test,"(^[a-zA-Z].+)[\\.\\b]?")[,2]
    [1] "This_This-This.Not This"  "This_This-This.not_.this"
    [3] "This_This-This"           "this"
    [5] "this.Not This"

My desired output is:

    "This_This-This"
    "This_This-This"
    "This_This-This"
    "this"
    "this"

This is my thought process behind the regex:

    str_match(test,"(^[a-zA-Z].+)[\\.\\b]?")[,2]

(^[a-zA-Z].+) = captures the group before the dot. The string always starts with a letter, upper or lower case, followed by any other characters, hence the .+

[\\.\\b]? = a dot or a word boundary, which may or may not be present, hence the ?

This is not giving what I want. Where is my mistake?


Solution

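  • My regex reads as "match anything up to either a dot or the end of the line". The lazy .*? stops at the first position where a dot or the end of the line follows, so only the text before the first dot is captured.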

    library(stringr)
    str_match(test, "^(.*?)(\\.|$)")[, 2]
    

    Result:

    [1] "This_This-This" "This_This-This" "This_This-This" "this" "this"
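
For comparison, here is a minimal sketch that is not part of the original answer. It assumes the same test vector as the question, and the variable names greedy, by_extract and by_sub are only illustrative. The first call uses a simplified version of the question's pattern to show why a greedy .+ followed by an optional dot returns the whole string. The other two are alternative ways to keep only the text before the first dot.

```r
library(stringr)

test <- c("This_This-This.Not This",
          "This_This-This.not_.this",
          "This_This-This",
          "this",
          "this.Not This")

# Greedy .+ runs to the end of the string, and the optional \\.? is then
# satisfied by matching nothing, so the capture group holds the whole string.
greedy <- str_match(test, "^(.+)\\.?")[, 2]

# A negated character class cannot cross a dot, so the match ends
# just before the first dot (or at the end if there is no dot).
by_extract <- str_extract(test, "^[^.]+")

# Base R alternative: delete everything from the first dot to the end.
by_sub <- sub("\\..*$", "", test)

print(greedy)
print(by_extract)
print(by_sub)
```

Both by_extract and by_sub should match the desired output in the question, while greedy reproduces the unwanted result. Note that "^[^.]+" returns NA for a string that begins with a dot; use "^[^.]*" if an empty string is preferred in that case.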