
Indexing a PDF as list of data frames based on regex pattern match


When extracting information from a PDF with tabulizer and pdftools, I sometimes want to index a large list of data frames based on a regex pattern match.

a <- data.frame(yes=c("pension"))
b <- data.frame(no=c("other"))
my_list <- list(a,b)

I would like to use str_detect to return the index of each data frame in the list that matches the pattern "pension".

The desired output would be:

index <- 1  # found with which() and str_detect()
new_list <- my_list[[index]]
new_list
#       yes
# 1 pension

I have struggled to detect the pattern inside each data frame and then return the index with which. Previous discussions use loops and if-then statements, but I would prefer a solution using purrr.


Solution

  • We may use

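    library(purrr)  # provides %>%, map() and map_lgl()
    # TRUE for each data frame in l with a match for pattern in any column
    getIdx <- function(pattern, l)
      l %>% map_lgl(~ any(unlist(map(.x, grepl, pattern = pattern))))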
    
    getIdx("pension", my_list)
    # [1]  TRUE FALSE
    
    my_list[getIdx("pension", my_list)]
    # [[1]]
    #       yes
    # 1 pension
    

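    Because the result is a logical vector, this handles multiple matching data frames, and there is no real need for which. If integer positions are required, wrap the call: which(getIdx("pension", my_list)).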

    getIdx iterates over the data frames in l; within each data frame it iterates over the columns and applies grepl to each one. If any column contains a match, TRUE is returned for that data frame.
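Since the question asks specifically for str_detect and which, the same idea can be sketched with stringr. This is only an illustration, not part of the answer above: the helper name has_match and the third data frame c_df are made up for the example, and the sketch assumes every column can be sensibly converted to character.

```r
library(purrr)
library(stringr)

a <- data.frame(yes = c("pension"))
b <- data.frame(no = c("other"))
# Hypothetical extra data frame, added only to show multiple matches
c_df <- data.frame(note = c("state pension", "n/a"))
my_list <- list(a, b, c_df)

# Hypothetical helper: TRUE if any cell of df matches pattern.
# All columns are flattened into one character vector first.
has_match <- function(df, pattern) {
  cells <- unlist(lapply(df, as.character))
  any(str_detect(cells, pattern), na.rm = TRUE)
}

# Integer positions of the data frames that contain a match
idx <- which(map_lgl(my_list, has_match, pattern = "pension"))
idx
# [1] 1 3

# First matching data frame, as in the desired output of the question
my_list[[idx[1]]]
```

Using my_list[idx] instead of my_list[[idx[1]]] keeps every matching data frame rather than only the first.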