r, pdf, pdf-scraping, pdftools, tabulizer

Scraping PDF in R with Nested Information


I am trying to scrape a rather difficult PDF in R using both pdftools::pdf_text and tabulizer::extract_tables. Neither is much help on its own because of how the PDF is laid out: it contains "nested" information, as shown in the picture.

What is the best way to approach this? Splitting on whitespace with stringr::str_split_fixed and n = 3 gave me a matrix, but writing a regular expression that picks out only the information I want (the values that follow Description and Incident Date/Time) within each column seems too difficult.


Solution

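  • I think a regular-expression approach isn't that complicated. Split each page into lines, find the header lines (those starting with Case, Desc and Addr), and take the line right after each one: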

    library(pdftools)
    library(tidyverse)
    library(magrittr)  # for add()
    mylog <- "https://www.lsu.edu/police/files/crime-log/2021/jan2021.pdf"
    pdf.text <- pdf_text(mylog)  # one character string per page
    map_dfr(pdf.text, ~ {
      # Split the page into individual lines
      vectors <- str_split(.x, "\\n") %>% unlist()
      # Each row of values sits on the line right after its header line
      cases        <- vectors %>% str_detect("^Case") %>% which() %>% add(1)
      descriptions <- vectors %>% str_detect("^Desc") %>% which() %>% add(1)
      addresses    <- vectors %>% str_detect("^Addr") %>% which() %>% add(1)
      # Split on 2+ whitespace, on a space before a d/m-style date, or on whitespace after AM/PM
      cases <- vectors[cases] %>%
        str_split("(\\s{2,}|\\s(?=[0-9]{1,2}/)|(?<=[AP]M)\\s+)") %>%
        map_dfr(~ setNames(., c("Case.Number", "Date.Report", "Date.Incident", "Case.Status")[seq_along(.)]))
      # Description rows only need a split on runs of 2+ whitespace
      descriptions <- vectors[descriptions] %>%
        str_split("\\s{2,}") %>%
        map_dfr(~ setNames(., c("Description", "Date.Incident.End")[seq_along(.)]))
      bind_cols(cases, descriptions, data.frame(Address = vectors[addresses]))
    })
    # A tibble: 155 x 7
       Case.Number  Date.Report     Date.Incident     Case.Status Description                   Date.Incident.End   Address             
       <chr>        <chr>           <chr>             <chr>       <chr>                         <chr>               <chr>               
     1 20210101-001 January 01, 20… 1/1/2021 10:28:0… Inactive    COMPLAINT ANIMAL              1/1/2021 10:28:00AM UREC FIELDS         
     2 20210101-002 January 01, 20… 1/1/2021 2:48:00… Inactive    911 HNGUP/OP - 911 HANG-UP/O… 1/1/2021 2:48:00PM  PMAC                
     3 20210101-003 January 01, 20… 1/1/2021 3:27:00… Pending     UNAUTHORIZED ENTRY OF A PLAC… 1/1/2021 3:27:00PM  COMPANION ANIMAL AL…
     4 20210102-001 January 02, 20… 1/2/2021 5:12:00… Inactive    SUSPICIOUS INCIDENT           1/2/2021 5:12:00PM  TIGER STADIUM       
     5 20210103-001 January 03, 20… 12/23/2020 12:00… Pending     HIT AND RUN                   1/3/2021 9:15:00AM  BROUSSARD HALL TRAF…
     6 20210103-002 January 03, 20… 1/3/2021 9:28:46… Inactive    DISTURBANCE                   1/3/2021 9:28:00PM  VET SCHOOL          
     7 20210104-001 January 04, 20… 11/23/2018 11:00… Inactive    NONCRIMINAL INFORMATION ONLY  11/23/2018 11:00:0… Oaks Lot            
     8 20210104-002 January 04, 20… 1/4/2021 7:26:00… Inactive    SUSPICIOUS INCIDENT           1/4/2021 7:26:00AM  ECE                 
     9 20210104-003 January 04, 20… 8/1/2017 12:00:0… Pending     INVESTIGATN - INVESTIGATION   1/2/2021 3:00:00PM  EAST CAMPUS APARTME…
    10 20210104-004 January 04, 20… 1/4/2021 12:30:0… Pending     HIT AND RUN                   1/4/2021 12:30:00PM HIGHLAND ROAD @ STU…
    # … with 145 more rows
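To see what the regular expressions are doing without downloading the PDF, here is a minimal sketch on a single made-up record. The `page` text, including the header wording and the spacing, is an assumption modeled on the output above and not copied from the actual file. The helper `value_after` is a hypothetical name. It uses base R's `strsplit` with `perl = TRUE` in place of `stringr::str_split` so that it runs without extra packages.

```r
# Hypothetical page text: header wording and spacing are assumptions,
# chosen to resemble the layout that the answer's patterns imply.
page <- paste(
  "Case Number      Date Reported      Incident Date/Time      Status",
  "20210101-001     January 01, 2021 1/1/2021 10:28:00AM Inactive",
  "Description                         Incident End",
  "COMPLAINT ANIMAL                    1/1/2021 10:28:00AM",
  "Address",
  "UREC FIELDS",
  sep = "\n"
)

# Split the page into lines
lines <- strsplit(page, "\n", fixed = TRUE)[[1]]

# Return the line(s) directly below every line matching `header`
value_after <- function(lines, header) lines[grep(header, lines) + 1]

# Same three alternatives as in the answer: 2+ whitespace,
# a space before a date such as 1/1/2021, or whitespace after AM/PM
case_pattern <- "\\s{2,}|\\s(?=[0-9]{1,2}/)|(?<=[AP]M)\\s+"

case_fields <- strsplit(value_after(lines, "^Case"), case_pattern, perl = TRUE)[[1]]
desc_fields <- strsplit(value_after(lines, "^Desc"), "\\s{2,}", perl = TRUE)[[1]]
address     <- value_after(lines, "^Addr")

# Assemble one row with the column names used in the answer
record <- data.frame(
  Case.Number       = case_fields[1],
  Date.Report       = case_fields[2],
  Date.Incident     = case_fields[3],
  Case.Status       = case_fields[4],
  Description       = desc_fields[1],
  Date.Incident.End = desc_fields[2],
  Address           = address,
  stringsAsFactors  = FALSE
)
print(record)
```

The lookahead `\\s(?=[0-9]{1,2}/)` keeps "January 01, 2021" in one piece, because the spaces inside it are not followed by one or two digits and a slash. The lookbehind `(?<=[AP]M)\\s+` separates the time from the status even when only one space sits between them.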