r, string, text, text-mining

Efficiently break up a string based on the nth occurrence of a substring using R


Introduction

Given a string in R, is it possible to get a vectorized solution (i.e. no loops) that breaks the string into blocks, where each block ends at the nth occurrence of a substring in the string?

Work done with Reproducible Example

Suppose we have several paragraphs of the famous Lorem Ipsum text.

library(strex)
# devtools::install_github("aakosm/lipsum")
library(lipsum)

my.string = capture.output(lipsum(5))
my.string = paste(my.string, collapse = " ")

> my.string # (partial output)
# [1] "Lorem ipsum dolor ... id est laborum. "

We would like to break this text into segments at every 3rd occurrence of the word " in" (a space is included in order to distinguish it from words which contain "in" as part of them, such as "min").

I have the following solution with a while loop:

# We wish to break up the string at every 
# 3rd occurrence of the word "in"

break.character = " in"
break.occurrence = 3
string.list = list()
i = 1

# initialize string to send into the loop
current.string = my.string

while(length(current.string) > 0){

  # Store the segment that occurs BEFORE the nth occurrence of the break substring
  string.list[[i]] = str_before_nth(current.string, break.character, break.occurrence)

  # Update the next string to examine.
  # The next string is the current string AFTER the nth occurrence of the break substring
  current.string = str_after_nth(current.string, break.character, break.occurrence)

  i = i + 1
}

We are able to get the desired output in a list, with a warning (warning not shown).

> string.list (#partial output shown)
[[1]]
[1] "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit"

[[2]]
[1] " voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.  Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor"
...

[[6]]
[1] " voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.  Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor"

Goal

Is it possible to improve this solution by vectorizing (i.e. using apply(), lapply(), mapply() etc.)? Also, my current solution cuts off the last occurrence of the substring in a block.

The current solution may not work well on extremely long strings (such as DNA sequences where we are looking for blocks with the nth occurrence of a substring of nucleotides).


Solution

  • Try with this:

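    # `text` is the input string (my.string in the question)
    text_split = strsplit(text, " in ")[[1]]

    l = length(text_split)
    n = floor(l / 3)                     # number of complete blocks of 3 pieces
    Seq = seq(1, by = 3, length.out = n) # index of the first piece of each block

    # Rejoin each run of 3 pieces and append the separator that closed the block
    L = sapply(Seq, function(x) {
      paste0(paste(text_split[x:(x + 2)], collapse = " in "), " in ")
    })
    # Append the leftover pieces when l is not a multiple of 3
    if (l > (n * 3)) {
      L = c(L, paste(text_split[(n * 3 + 1):l], collapse = " in "))
    }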
    

    The last conditional handles the case where the number of pieces produced by the split is not divisible by 3. Also, the trailing " in " pasted inside sapply() is there because I don't know what you want to do with the " in " that separates your blocks.
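
For very long strings (the DNA case mentioned in the question), it may be cheaper to locate the match positions once and cut with substring(), rather than splitting and re-pasting. The following is only a minimal sketch in base R, not part of the answer above: split_every_nth() is a hypothetical helper name, it assumes a fixed (non-regex) substring, and it counts non-overlapping matches only. Each block keeps the nth occurrence at its end, so nothing is cut off.

```r
# Hypothetical helper (an illustration, not from the original answer):
# split `x` right after every n-th occurrence of the fixed substring `pattern`.
split_every_nth <- function(x, pattern, n) {
  # Start positions of all non-overlapping matches
  hits <- gregexpr(pattern, x, fixed = TRUE)[[1]]
  if (hits[1] == -1L) return(x)                    # no match: nothing to split

  # Position of the last character of each match
  ends <- hits + attr(hits, "match.length") - 1L
  # Keep only every n-th match as a cut point
  cuts <- ends[seq_along(ends) %% n == 0L]

  # substring() is vectorized over the start and stop positions
  starts <- c(1L, cuts + 1L)
  stops  <- c(cuts, nchar(x))
  blocks <- substring(x, starts, stops)

  blocks[nzchar(blocks)]                           # drop an empty trailing block
}

x <- "a in b in c in d in e in f in g in h"
split_every_nth(x, " in", 3)
# [1] "a in b in c in"  " d in e in f in" " g in h"
```

Because the blocks are cut by position, pasting them back together with paste(blocks, collapse = "") reproduces the original string. If the separator should start the next block instead of ending the current one, cutting at `hits - 1L` rather than `ends` would be the corresponding change.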