r regex unnest tidytext

How can I unnest phrases between brackets?


I have text that I am trying to organize for some text mining, and I am using the tidytext package. I have tried setting the token to a regex and supplying a custom pattern, but it ends up returning just the brackets (or nothing) and not the content of the brackets.

library(tidytext)
library(stringr)

df <- data.frame("text" = c("[instruction] [Mortgage][Show if Q1A5]Mortgage Loans","[checkboxes] [min 1] [max OFF] [Show if Q29A2] Please indicate the reason(s) you would not purchase this check package."), "line" = c(1,2))

un <- unnest_regex(df,elements,text,pattern = "\\[(.*?)\\]")

head(un)
  line                                                                  elements
1    1                                                                          
2    1                                                            mortgage loans
3    2                                                                          
4    2                                                                          
5    2                                                                          
6    2  please indicate the reason(s) you would not purchase this check package.

un2 <- unnest_regex(df,elements,text,pattern = "(?<=\\[).+?(?=\\])")

head(un2)
  line        elements
1    1               [
2    1             ] [
3    1              ][
4    1 ]mortgage loans
5    2               [
6    2             ] [

My ultimate goal is to get this:

  line             elements
1    1        [instruction]
2    1           [Mortgage]
3    1       [Show if Q1A5]
4    2         [checkboxes]
5    2              [min 1]
6    2            [max OFF]

Is this possible?


Solution

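  • This should work, if a bit hacky. unnest_regex() splits the text on the pattern, so it returns what lies between the matches rather than the matches themselves. The idea here is instead to extract everything in brackets using stringr and then "explode" the output into one row per item. Since the bracketed items aren't space-delimited, split on the closing bracket and add it back afterwards.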

    library(dplyr)
    library(stringr)
    library(tidyr)
    
    df <- data.frame("text" = c("[instruction] [Mortgage][Show if Q1A5]Mortgage Loans","[checkboxes] [min 1] [max OFF] [Show if Q29A2] Please indicate the reason(s) you would not purchase this check package."), "line" = c(1,2))
    
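    df <- df %>%
        dplyr::mutate(
            # Greedy match: [^()] excludes only parentheses, so on this sample data each
            # row yields one string running from the first "[" to the last "]"
            text_in_brackets = stringr::str_extract_all(text, "\\[[^()]+\\]")
        ) %>%
        # split that string into one row per bracketed item
        tidyr::separate_rows(text_in_brackets, sep = "]") %>%
        dplyr::filter(text_in_brackets != "") %>%
        dplyr::mutate( # some cleaning
            text_in_brackets = paste0(text_in_brackets, "]"), # add back "]"
            text_in_brackets = stringr::str_trim(text_in_brackets) # remove leading/trailing spaces
        ) %>%
        dplyr::select(line, text_in_brackets) # keep only the columns shown in the output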
    

    Output

    # A tibble: 7 × 2
       line text_in_brackets
      <dbl> <chr>           
    1     1 [instruction]   
    2     1 [Mortgage]      
    3     1 [Show if Q1A5]  
    4     2 [checkboxes]    
    5     2 [min 1]         
    6     2 [max OFF]       
    7     2 [Show if Q29A2]
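
A more direct variant, offered as a sketch rather than as the answer's own method: match each bracketed group separately with a pattern that stops at the first closing bracket, then let tidyr::unnest() expand the resulting list column. The column name `elements` and the object name `bracketed` are chosen here for illustration only.

```r
library(dplyr)
library(stringr)
library(tidyr)

df <- data.frame(
  text = c(
    "[instruction] [Mortgage][Show if Q1A5]Mortgage Loans",
    "[checkboxes] [min 1] [max OFF] [Show if Q29A2] Please indicate the reason(s) you would not purchase this check package."
  ),
  line = c(1, 2)
)

bracketed <- df %>%
  # "\\[[^\\]]*\\]" matches "[", then any run of characters that are not "]", then "]",
  # so every bracketed group becomes its own match (a list column of character vectors)
  mutate(elements = str_extract_all(text, "\\[[^\\]]*\\]")) %>%
  select(line, elements) %>%
  # one row per extracted group; rows without any match are dropped by default
  unnest(elements)

print(bracketed)
```

On the sample data this should give the same seven rows as the output above, without splitting on "]" and pasting it back.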