r string qdapregex

Extracting string between words using logical operators in rm_between function


I am trying to extract the text between two words. Consider this example:

x <-  "There are 2.3 million species in the world"

The string may also take this form:

x <-  "There are 2.3 billion species in the world"

I need the text between 'There' and either 'million' or 'billion', including those words. Whether the sentence contains 'million' or 'billion' is only known at run time, not beforehand. So the output I need from this sentence is

[1] There are 2.3 million OR
[2] There are 2.3 billion

I am using the rm_between function from the qdapRegex package for this. With the following command I can extract only one of them at a time:

library(qdapRegex)
rm_between(x, 'There', 'million', extract=TRUE, include.markers = TRUE) 

Otherwise I have to use

rm_between(x, 'There', 'billion', extract=TRUE, include.markers = TRUE)

How can I write a single command that checks for either 'million' or 'billion' in the same sentence? Something like this:

rm_between(x, 'There', 'billion' || 'million', extract=TRUE, include.markers = TRUE)

I hope this is clear. Any help would be appreciated.


Solution

  • The left and right arguments of rm_between each take a vector of character/numeric symbols, so you can pass vectors of equal length to both, pairing 'There' with each candidate end word.

     library(qdapRegex)
     unlist(rm_between(x, rep('There',2), c('million', 'billion'),
                             extract=TRUE, include.markers=TRUE))
     #[1] "There are 2.3 million" "There are 2.3 billion"
     unlist(rm_between(x1, rep('There',2), c('million', 'billion'),
                             extract=TRUE, include.markers=TRUE))
     #[1] "There are 2.3 million"
    
     unlist(rm_between(x2, rep('There',2), c('million', 'billion'),
                             extract=TRUE, include.markers=TRUE))
     #[1] "There are 2.3 billion"
    

    Or, since the unwanted text starts at 'species' in both strings, remove it with sub:

      sub('\\s*species.*', '', x)
    

    data

     x <-  c("There are 2.3 million species in the world", 
       "There are 2.3 billion species in the world")
     x1 <- "There are 2.3 million species in the world"
     x2 <- "There are 2.3 billion species in the world"
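
As a further option, the "million or billion" choice can be expressed as a regex alternation in base R. This is only a minimal sketch, not part of the answer above: it assumes every string contains 'There' before the end word, and the helper name extract_span is hypothetical, introduced here for illustration.

```r
# Same data as above, repeated so the block runs on its own
x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

# 'There', then as little text as possible, then either end word
pattern <- "There.*?(million|billion)"

# regexpr() finds the first match per string; regmatches() extracts it
m <- regexpr(pattern, x, perl = TRUE)
res <- regmatches(x, m)
print(res)

# Hypothetical helper: keeps one result per input and returns NA
# where the pattern does not match (regmatches() alone drops those)
extract_span <- function(text, pattern) {
  m <- regexpr(pattern, text, perl = TRUE)
  out <- rep(NA_character_, length(text))
  out[m != -1] <- regmatches(text, m)
  out
}

res2 <- extract_span(c(x, "No count is given here"), pattern)
print(res2)
```

The lazy quantifier `.*?` stops at the first 'million' or 'billion', so the result does not depend on the word 'species' following the number, unlike the sub() approach.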