
Efficient way to split a huge string in R


I have a huge string (> 500 MB); it is actually an entire book collection in a single string. I have some meta information in a separate data frame, e.g. page numbers, (different) authors and titles. I want to detect the title strings in my huge string and split it by title. I assume titles are unique.

The data looks like this:

mystring <- "Lorem ipsum dolor sit amet, sollicitudin duis maecenas habitasse ultrices aenean tempus"

# a dataframe of meta data, e.g. page numbers and titles
mydf <- data.frame(page = c(1, 2),
                   title = c( "Lorem", "maecenas"))
mydf

  page    title
1    1    Lorem
2    2 maecenas

mygoal <- mydf  # text that comes after the title
mygoal$text <- c("ipsum dolor sit amet, sollicitudin duis", "habitasse ultrices aenean tempus")
mygoal 

  page    title                                    text
1    1    Lorem ipsum dolor sit amet, sollicitudin duis
2    2 maecenas        habitasse ultrices aenean tempus

How can I split the string, in the most efficient way, such that everything between the first and second titles becomes the first text element, everything between the second and third titles becomes the second text element, and so on?


Solution

  • If you want to do this in a piped tidyverse way, you could use stringr::str_extract with a regex:

    library(dplyr)
    library(stringr)
    library(glue)
    
    mydf |>  
      mutate(next_title = lead(title, default = "$")) |> 
      mutate(text = str_extract(mystring, glue::glue("(?<={title}\\s?)(.*)(?={next_title})"))) |> 
      select(-next_title)
    

    Yielding:

      page    title                                      text
    1    1    Lorem  ipsum dolor sit amet, sollicitudin duis 
    2    2 maecenas          habitasse ultrices aenean tempus
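
    To see why the lookbehind/lookahead pair returns only the text between two titles, it can help to print the pattern built for each row. The following is a minimal sketch on the sample data, with the patterns assembled by paste0 instead of glue. Two assumptions that the sample does not exercise: titles containing regex metacharacters would need escaping first, and since . does not match line breaks by default, multi-line text would probably need regex(..., dotall = TRUE).

    ```r
    library(stringr)

    mystring <- "Lorem ipsum dolor sit amet, sollicitudin duis maecenas habitasse ultrices aenean tempus"
    title      <- c("Lorem", "maecenas")
    next_title <- c("maecenas", "$")  # what lead(title, default = "$") produces

    # One pattern per row: lookbehind for the title, lookahead for the next title
    # (or for the end of the string, "$", in the last row)
    patterns <- paste0("(?<=", title, "\\s?)(.*)(?=", next_title, ")")
    print(patterns)

    # str_extract is vectorised over the pattern, so one string yields one match per row
    extracted <- str_extract(mystring, patterns)

    # The matches keep surrounding whitespace; trim it if that matters
    extracted <- str_trim(extracted)
    print(extracted)
    ```

    Because the titles are only looked at and never consumed, neither title ends up in the extracted text.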
    

    If performance is a concern, a similar approach with data.table would be:

    library(data.table)
    library(stringr)
    library(glue)
    
    mydt <- setDT(mydf)
    
    mydt[, next_title := shift(title, fill = "$", type = "lead")][
      , text := str_extract(..mystring, glue_data(.SD, "(?<={title}\\s?)(.*)(?={next_title})"))][
      , !("next_title")]
    

    Resulting in:

       page    title                                      text
    1:    1    Lorem  ipsum dolor sit amet, sollicitudin duis 
    2:    2 maecenas          habitasse ultrices aenean tempus
    

    EDIT

    Added for better performance options:

    Generally, str_split or str_split_fixed is faster than str_extract.

    The problem with str_split is that a regex with many alternations (|) also slows the process down, so another option is to first replace every title in the string with a fixed marker string and then split on that marker. You can also speed up the splitting by using str_split_fixed and specifying in advance how many pieces to return.

    # create a named character vector for str_replace_all():
    # names are the patterns (titles), values are the replacement marker
    split_at <- rep("@@", nrow(mydf))
    names(split_at) <- mydf$title
    mystring <- str_replace_all(mystring, split_at)
    
    # use fixed() in str_split so the marker is matched literally
    mydf$text <- str_split(mystring, fixed("@@ "))[[1]][-1]
    
    # Alternative (maybe faster): define the number of pieces up front via nrow
    mydf$text <- str_split_fixed(mystring, fixed("@@ "), n = nrow(mydf) + 1)[, -1]
    
    
    ## using str_split_fixed in data.table
    mydt <- setDT(mydf)
    mydt[, text := 
           str_split_fixed(mystring, fixed("@@ "), nrow(mydt) + 1)[, -1]]
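
    The replace-then-split idea can also be sketched without any regex: look up where each title starts with a literal search and cut the string at those positions. This is a base R variant and not part of the approach above. It has not been benchmarked on a 500 MB string, and it assumes every title occurs in the string and that the rows of mydf are in the order in which the titles appear.

    ```r
    mystring <- "Lorem ipsum dolor sit amet, sollicitudin duis maecenas habitasse ultrices aenean tempus"
    mydf <- data.frame(page = c(1, 2),
                       title = c("Lorem", "maecenas"),
                       stringsAsFactors = FALSE)

    # Start position of the first literal occurrence of each title (-1 if absent)
    starts <- vapply(mydf$title,
                     function(t) as.integer(regexpr(t, mystring, fixed = TRUE)),
                     integer(1), USE.NAMES = FALSE)
    stopifnot(all(starts > 0))    # assumption: every title is present
    stopifnot(all(diff(starts) > 0))  # assumption: rows follow the order of appearance

    # Each text runs from just after its title to just before the next title;
    # the last text runs to the end of the string
    text_from <- starts + nchar(mydf$title)
    text_to   <- c(starts[-1] - 1L, nchar(mystring))

    mydf$text <- trimws(substring(mystring, text_from, text_to))
    print(mydf)
    ```

    Since fixed = TRUE treats each title as a literal string, punctuation in a title needs no escaping here.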