I am trying to extract strings between words. Consider this example -
x <- "There are 2.3 million species in the world"
This may also take another form which is
x <- "There are 2.3 billion species in the world"
I need the text between There and either million or billion, including those words. Whether million or billion appears is only known at run time; it is not decided beforehand. So the output I need from this sentence is
[1] There are 2.3 million
OR
[2] There are 2.3 billion
I am using the rm_between function from the qdapRegex package for this. With this command I can extract only one of them at a time.
library(qdapRegex)
rm_between(x, 'There', 'million', extract=TRUE, include.markers = TRUE)
OR I have to use
rm_between(x, 'There', 'billion', extract=TRUE, include.markers = TRUE)
How can I write a command that checks for the presence of either million or billion in the same sentence? Something like this -
rm_between(x, 'There', 'billion' || 'million', extract=TRUE, include.markers = TRUE)
I hope this is clear. Any help would be appreciated.
The left and right arguments in rm_between take a vector of character/numeric symbols. So you can pass vectors of equal length to both the left and right arguments.
library(qdapRegex)
# x, x1 and x2 are defined in the data at the end of this answer
unlist(rm_between(x, rep('There',2), c('million', 'billion'),
                  extract=TRUE, include.markers=TRUE))
#[1] "There are 2.3 million" "There are 2.3 billion"
unlist(rm_between(x1, rep('There',2), c('million', 'billion'),
                  extract=TRUE, include.markers=TRUE))
#[1] "There are 2.3 million"
unlist(rm_between(x2, rep('There',2), c('million', 'billion'),
                  extract=TRUE, include.markers=TRUE))
#[1] "There are 2.3 billion"
Or, using base R, remove everything from 'species' onward:
sub('\\s*species.*', '', x)
x <- c("There are 2.3 million species in the world",
"There are 2.3 billion species in the world")
x1 <- "There are 2.3 million species in the world"
x2 <- "There are 2.3 billion species in the world"
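The "million or billion" choice from the question can also be written as a regex alternation in base R, without qdapRegex. This is only a sketch of that alternative, not part of the rm_between approach above; the names pattern and out are introduced here for illustration, and it assumes each sentence contains 'There' followed somewhere by one of the two words.

```r
# Sample data, same shape as in the question
x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

# "There", then as few characters as possible, then either unit word
pattern <- "There.*?(million|billion)"

# regexpr() locates the first match in each element;
# regmatches() extracts the matched text
out <- regmatches(x, regexpr(pattern, x, perl = TRUE))
print(out)
#[1] "There are 2.3 million" "There are 2.3 billion"
```

Note that regmatches drops elements with no match, so the result can be shorter than x when a sentence contains neither word.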