I have text that I am trying to organize for some text mining, and I am using the tidytext
library. I have tried setting the token to a regex and setting a custom pattern, but it ends up returning just the brackets (or nothing) and not the content of the brackets.
library(tidytext)
library(stringr)
df <- data.frame("text" = c("[instruction] [Mortgage][Show if Q1A5]Mortgage Loans","[checkboxes] [min 1] [max OFF] [Show if Q29A2] Please indicate the reason(s) you would not purchase this check package."), "line" = c(1,2))
un <- unnest_regex(df,elements,text,pattern = "\\[(.*?)\\]")
head(un)
  line elements
1    1
2    1 mortgage loans
3    2
4    2
5    2
6    2 please indicate the reason(s) you would not purchase this check package.
un2 <- unnest_regex(df,elements,text,pattern = "(?<=\\[).+?(?=\\])")
head(un2)
  line elements
1    1 [
2    1 ] [
3    1 ][
4    1 ]mortgage loans
5    2 [
6    2 ] [
My ultimate goal is to get this:
  line elements
1    1 [instruction]
2    1 [Mortgage]
3    1 [Show if Q1A5]
4    2 [checkboxes]
5    2 [min 1]
6    2 [max OFF]
Is this possible?
This should work, if a bit hacky. The idea is to extract everything in brackets as a single string using stringr, and then "explode" that string into one row per element. Since the bracketed items aren't consistently space-delimited, explode on the closing bracket, and then just add it back afterward.
library(dplyr)
library(stringr)
library(tidyr)
df <- data.frame("text" = c("[instruction] [Mortgage][Show if Q1A5]Mortgage Loans","[checkboxes] [min 1] [max OFF] [Show if Q29A2] Please indicate the reason(s) you would not purchase this check package."), "line" = c(1,2))
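df <- df %>%
  dplyr::mutate(
    # greedy match from the first "[" to the last "]" before any parenthesis,
    # so each row yields one string holding all of its bracketed items
    text_in_brackets = stringr::str_extract_all(text, "\\[[^()]+\\]")
  ) %>%
  tidyr::separate_rows(text_in_brackets, sep = "]") %>% # one row per item
  dplyr::filter(text_in_brackets != "") %>% # drop the empty piece after the last "]"
  dplyr::mutate( # some cleaning
    text_in_brackets = paste0(text_in_brackets, "]"), # add back "]"
    text_in_brackets = stringr::str_trim(text_in_brackets) # remove leading/trailing spaces
  ) %>%
  dplyr::select(line, text_in_brackets) # keep only the columns shown in the output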
Output
# A tibble: 7 × 2
   line text_in_brackets
  <dbl> <chr>
1     1 [instruction]
2     1 [Mortgage]
3     1 [Show if Q1A5]
4     2 [checkboxes]
5     2 [min 1]
6     2 [max OFF]
7     2 [Show if Q29A2]
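As far as I can tell, the attempts in the question fail because unnest_regex() uses the pattern as the place to split the text, so what comes back is the text between the matches rather than the matches themselves (and it lowercases by default, hence "mortgage loans"). If the split-and-paste step above feels too hacky, a possible alternative is to match each bracketed item on its own and unnest the resulting list column. This is only a sketch: the column name elements and the result name out are my own choices, and it assumes brackets are never nested.

```r
library(dplyr)
library(stringr)
library(tidyr)

df <- data.frame(
  text = c(
    "[instruction] [Mortgage][Show if Q1A5]Mortgage Loans",
    "[checkboxes] [min 1] [max OFF] [Show if Q29A2] Please indicate the reason(s) you would not purchase this check package."
  ),
  line = c(1, 2)
)

out <- df %>%
  # "[", then any run of characters that are not "]", then "]":
  # one match per bracketed item, returned as a list column
  mutate(elements = str_extract_all(text, "\\[[^\\]]*\\]")) %>%
  select(line, elements) %>%
  # expand the list column so each match gets its own row
  unnest(elements)

print(out)
```

Because each item is matched separately, there is no need to add the closing bracket back or trim whitespace, and parentheses elsewhere in the text don't affect the match.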