
Automatically extracting sections (and section titles) from a file


I need to extract all subsections (for further text analysis) and their titles from an .Rmd file (e.g. 01-tidy-text.Rmd from the tidy-text-mining book: https://raw.githubusercontent.com/dgrtwo/tidy-text-mining/master/01-tidy-text.Rmd).

All I know is that a section starts with a ## sign and runs until either the next # or ## sign or the end of the file.

The entire text has already been extracted (using dt <- readtext("01-tidy-text.Rmd"); strEntireText <- dt[1,1]) and is stored in the variable strEntireText.

I would like to use stringr (or stringi) for this, something along the lines of:

 strAllSections <- str_extract(strEntireText , pattern="...")
 strAllSectionsTitles <- str_extract(strEntireText , pattern="...")

Please suggest your solution. Thank you

The final objective of this exercise is to automatically create a data.frame from an .Rmd file, where each row corresponds to one section (or subsection) and the columns contain the section title, the section label, the section text itself, and some other section-specific details that will be extracted later.


Solution

  • Here is an example using a tidyverse approach. This will not necessarily work well with whatever file you have -- if you are working with markdown, you should probably try to find a proper markdown parsing library, as Spacedman mentions in his comment.

    library(tidyverse)
    
    ## A data frame where each row is one line of the .Rmd file.
    raw <- tibble(
      text = read_lines("https://raw.githubusercontent.com/dgrtwo/tidy-text-mining/master/01-tidy-text.Rmd")
    )
    
    ## We don't want to mark R comments as sections.
    detect_codeblocks <- function(text) {
      blocks <- text %>%
        str_detect("```") %>%
        cumsum()
    
      blocks %% 2 != 0
    }
    
    ## Here is an example of how you can extract information, such as
    ## headers, using regex patterns.
    df <-
      raw %>%
      mutate(
        code_block = detect_codeblocks(text),
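        ## str_extract() returns a character vector (NA where a line
        ## is not a header); str_match() would return a matrix.
        section = text %>%
          str_extract("^# .*") %>%
          str_remove("^#+ +"),
        section = ifelse(code_block, NA, section),
        subsection = text %>%
          str_extract("^## .*") %>%
          str_remove("^#+ +"),
        subsection = ifelse(code_block, NA, subsection)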
        ) %>%
      fill(section, subsection)
    
    ## If you wish to glue the text together within sections/subsections,
    ## then just group by them and flatten the text.
    df %>%
      group_by(section, subsection) %>%
      slice(-1) %>%                           # remove the header
      summarize(
        text = text %>%
          str_flatten(" ") %>%
          str_trim()
      ) %>%
      ungroup()
    
    #> # A tibble: 7 x 3
    #>   section                          subsection  text                       
    #>   <chr>                            <chr>       <chr>                      
    #> 1 The tidy text format {#tidytext} Contrastin… "As we stated above, we de…
    #> 2 The tidy text format {#tidytext} Summary     In this chapter, we explor…
    #> 3 The tidy text format {#tidytext} The `unnes… "Emily Dickinson wrote som…
    #> 4 The tidy text format {#tidytext} The gutenb… "Now that we've used the j…
    #> 5 The tidy text format {#tidytext} Tidying th… "Let's use the text of Jan…
    #> 6 The tidy text format {#tidytext} Word frequ… "A common task in text min…
    #> 7 The tidy text format {#tidytext} <NA>        "```{r echo = FALSE} libra…
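
The question also asks for a separate section label column, and in the output above the label is still attached to the title (e.g. "The tidy text format {#tidytext}"). Here is a minimal sketch of one way to split the two with stringr, assuming labels always appear as a trailing {#label} on the header line. The `headers` vector is a hypothetical stand-in for the `section`/`subsection` values, and the "{#word-freq}" label is made up for illustration.

```r
library(stringr)

# Hypothetical header strings, shaped like the section/subsection values
# above. The "{#word-freq}" label is invented for this example.
headers <- c(
  "The tidy text format {#tidytext}",
  "Summary",
  "Word frequencies {#word-freq}"
)

# Group 1: the title (lazy, so it stops before any trailing label).
# Group 2: the label inside an optional trailing {#...}.
parts <- str_match(headers, "^(.*?)\\s*(?:\\{#([^\\}]+)\\})?\\s*$")

sections <- data.frame(
  title = parts[, 2],
  label = parts[, 3],  # NA when the header carries no {#label}
  stringsAsFactors = FALSE
)

sections
```

The same pattern could be applied inside `mutate()` on the `section` and `subsection` columns. Headers that use other attribute syntax (for example several attributes inside one pair of braces) would need a different pattern.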