
Create Dataframe from pdf to csv based on string


I would like to split the information in a PDF document based on the presence of a colon. A sample is here: [image]

An updated PDF with four pages can be downloaded from this link.

This is what I have tried so far. After reading the PDF, I try to split its text at each colon.

library(textreadr)
dat <- '~Here is the thing1.pdf' %>%
    textreadr::read_pdf()
dat
Source: local data frame [26 x 3]

   page_id element_id                                     text
1        1          1                       Here is the thing.
2        1          2                                Case ID 1
3        1          3 Exploring Angels: It is a long establish
4        1          4 page when looking at its layout. The poi
5        1          5 distribution of letters, as opposed to u
6        1          6 English. Many desktop publishing package
7        1          7 model text, and a search for 'lorem ipsu
8        1          8 versions have evolved over the years, so
9        1          9                           and the like).
10       1         10 New agency: Lorem Ipsum is simply dummy 
..     ...        ...                                      ... 

OR

library(pdftools)
dat <- pdf_text("~Here is the thing1.pdf")
dat1 <- strsplit(dat[[1]], "\n")[[1]]
head(dat1)
[1] "Here is the thing.\r"                                                                                           
[2] "Case ID 1\r"                                                                                                    
[3] "Exploring Angels: It is a long established fact that a reader will be distracted by the readable content of a\r"
[4] "page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal\r"         
[5] "distribution of letters, as opposed to using 'Content here, content here', making it look like readable\r"      
[6] "English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default\r"

library(stringr)  # provides str_split() and re-exports %>%
dat2 <- dat1 %>%
  str_split(pattern = "\r")
head(dat2)

[[1]]
[1] "Here is the thing." ""                  

[[2]]
[1] "Case ID 1" ""         

[[3]]
[1] "Exploring Angels: It is a long established fact that a reader will be distracted by the readable content of a"
[2] ""                                                                                                             

[[4]]
[1] "page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal"
[2] ""                                                                                                    

[[5]]
[1] "distribution of letters, as opposed to using 'Content here, content here', making it look like readable"
[2] ""                                                                                                       

[[6]]
[1] "English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default"
[2] ""
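
Splitting on "\r" as above leaves an empty trailing string in every element. As a minimal sketch, the carriage returns can instead be consumed by the line split itself, or stripped afterwards. The `page` string below is a hypothetical stand-in for one element of the pdf_text() output, not the real PDF, and only base R is used:

```r
# Hypothetical stand-in for one page as returned by pdftools::pdf_text()
page <- "Here is the thing.\r\nCase ID 1\r\nExploring Angels: It is a long\r\nestablished fact"

# Option 1: split on a newline preceded by an optional carriage return
lines <- strsplit(page, "\r?\n")[[1]]

# Option 2: split on the newline only, then strip a trailing carriage return
lines2 <- sub("\r$", "", strsplit(page, "\n")[[1]])

print(lines)
```

Either way the result is a plain character vector with one clean line per element, which is easier to search for "Case ID" and ": " than a list of two-element vectors.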

I want to get my data sorted into a table like this:

  Case.ID                             Exploring.Angels                        New.agency New.Factor New.Factor2 Creative.One
1       1 It is a long established fact that a reader  Lorem Ipsum is simply dummy text         ABC         BNM         <NA>
2       2               Various versions have evolved     It has survived not only five         ABC        <NA>          DFZ

Solution

  • Here's how I would do it using the tidyverse:

    library(tidyverse)
    
    # read in the file, separate by line, convert to tibble
    pdftools::pdf_text("../_xlam/Here is the thing1.pdf") %>%
      str_split("(\\r\\n)") %>%
      unlist() %>%
      as_tibble() %>%
      # separate cases and mark lines containing colon
      mutate(case = cumsum(str_detect(value, "Case ID")),
             tag_line = str_detect(value, ": ")) %>%
      # drop lines with Case ID, separate tag from text, move text into one column, fill the tags
      filter(!str_detect(value, "Case ID")) %>%
      separate(value, into = c("key", "text"), sep = ": ", fill = "right", extra = "merge") %>%
      mutate(text = ifelse(is.na(text), key, text),
             key = ifelse(tag_line, key, NA)) %>%
      fill(key) %>%
      # summarize text by concatenation
      group_by(case, key) %>%
      summarise(text = paste(text, collapse = " ")) %>%
      # filter away the `Here is the thing` line
      drop_na(key) %>%
      # move values to columns
      spread(key = key, value = text)
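
The same pipeline can be tried without the PDF. The following is a sketch under stated assumptions, not a drop-in replacement: the `lines` vector is made-up sample text standing in for the extracted lines, `pivot_wider()` is swapped in for `spread()`, the "Case ID" pattern is anchored to the start of the line, and `fill()` is run per case so that a key never carries over from one case to the next.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

# Hypothetical sample standing in for the lines extracted from the PDF
lines <- c(
  "Here is the thing.",
  "Case ID 1",
  "Exploring Angels: It is a long",
  "established fact",
  "New agency: Lorem Ipsum",
  "Case ID 2",
  "Exploring Angels: Various versions",
  "Creative One: DFZ"
)

result <- tibble(value = lines) %>%
  # number the cases and mark lines that start a new "key: text" entry
  mutate(
    case = cumsum(str_detect(value, "^Case ID")),
    tag_line = str_detect(value, ": ")
  ) %>%
  # the Case ID lines themselves are no longer needed
  filter(!str_detect(value, "^Case ID")) %>%
  # split at the first ": " only; continuation lines get text = NA
  separate(value, into = c("key", "text"), sep = ": ",
           fill = "right", extra = "merge") %>%
  # continuation lines: move the content into text and blank out the key
  mutate(
    text = if_else(is.na(text), key, text),
    key = if_else(tag_line, key, NA_character_)
  ) %>%
  # carry each key down over its continuation lines, within a case
  group_by(case) %>%
  fill(key) %>%
  # glue the pieces of each entry back together
  group_by(case, key) %>%
  summarise(text = paste(text, collapse = " "), .groups = "drop") %>%
  # drop lines that never belonged to a key (the title line)
  drop_na(key) %>%
  # one row per case, one column per key
  pivot_wider(names_from = key, values_from = text)

print(result)
```

On this sample the result has one row per case, with NA wherever a case has no entry for a given key. `write.csv(result, "cases.csv", row.names = FALSE)` would then save it as CSV (the file name is only an example).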