I am attempting to scrape a rather difficult PDF in R using both pdftools::pdf_text and tabulizer::extract_tables. However, neither seems very helpful here because of how the PDF is laid out. The PDF contains "nested" information, as shown in the picture.
What is the best way to approach this? Splitting by white space using stringr::str_split_fixed with n=3 gave me a matrix, but it seems too difficult to write a regular expression that detects the information I want (only what comes after Description and Incident Date/Time) within each column.
I think a regular expressions approach isn't that complicated:
library(pdftools)
library(tidyverse)
library(magrittr)  # for add()
mylog <- "https://www.lsu.edu/police/files/crime-log/2021/jan2021.pdf"
pdf.text <- pdf_text(mylog)  # one character string per page
map_dfr(pdf.text, ~ {
  # Split the page into individual lines
  str_split(.x, "\\n") %>% unlist() -> vectors
  # The values sit on the line right after each header line, hence add(1)
  vectors %>% str_detect("^Case") %>% which %>% add(1) -> cases
  vectors %>% str_detect("^Desc") %>% which %>% add(1) -> descriptions
  vectors %>% str_detect("^Addr") %>% which %>% add(1) -> addresses
  # Split on 2+ whitespace characters, a space before a date, or whitespace after AM/PM
  vectors[cases] %>% str_split("(\\s{2,}|\\s(?=[0-9]{1,2}/)|(?<=[AP]M)\\s+)") %>%
    map_dfr(~ setNames(., c("Case.Number", "Date.Report", "Date.Incident", "Case.Status")[seq_along(.)])) -> cases
  # Description lines only need the 2+ whitespace split
  vectors[descriptions] %>% str_split("\\s{2,}") %>%
    map_dfr(~ setNames(., c("Description", "Date.Incident.End")[seq_along(.)])) -> descriptions
  # Combine the three pieces into one row per case
  bind_cols(cases, descriptions, data.frame(Address = vectors[addresses]))
})
# A tibble: 155 x 7
Case.Number Date.Report Date.Incident Case.Status Description Date.Incident.End Address
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 20210101-001 January 01, 20… 1/1/2021 10:28:0… Inactive COMPLAINT ANIMAL 1/1/2021 10:28:00AM UREC FIELDS
2 20210101-002 January 01, 20… 1/1/2021 2:48:00… Inactive 911 HNGUP/OP - 911 HANG-UP/O… 1/1/2021 2:48:00PM PMAC
3 20210101-003 January 01, 20… 1/1/2021 3:27:00… Pending UNAUTHORIZED ENTRY OF A PLAC… 1/1/2021 3:27:00PM COMPANION ANIMAL AL…
4 20210102-001 January 02, 20… 1/2/2021 5:12:00… Inactive SUSPICIOUS INCIDENT 1/2/2021 5:12:00PM TIGER STADIUM
5 20210103-001 January 03, 20… 12/23/2020 12:00… Pending HIT AND RUN 1/3/2021 9:15:00AM BROUSSARD HALL TRAF…
6 20210103-002 January 03, 20… 1/3/2021 9:28:46… Inactive DISTURBANCE 1/3/2021 9:28:00PM VET SCHOOL
7 20210104-001 January 04, 20… 11/23/2018 11:00… Inactive NONCRIMINAL INFORMATION ONLY 11/23/2018 11:00:0… Oaks Lot
8 20210104-002 January 04, 20… 1/4/2021 7:26:00… Inactive SUSPICIOUS INCIDENT 1/4/2021 7:26:00AM ECE
9 20210104-003 January 04, 20… 8/1/2017 12:00:0… Pending INVESTIGATN - INVESTIGATION 1/2/2021 3:00:00PM EAST CAMPUS APARTME…
10 20210104-004 January 04, 20… 1/4/2021 12:30:0… Pending HIT AND RUN 1/4/2021 12:30:00PM HIGHLAND ROAD @ STU…
# … with 145 more rows
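
To see what the two steps in the code do (pick the line after each header, then split it with the regex), here is a minimal sketch on a made-up page. The header wording and the spacing in `page` are assumptions modelled on the printed result, not copied from the PDF, so the real text may need a slightly different pattern.

```r
library(stringr)

# Hypothetical page text: every header line is followed by its value line.
# Header names and spacing are assumptions for illustration only.
page <- paste(
  "Case Number   Date Reported   Incident Date/Time   Status",
  "20210101-001  January 01, 2021  1/1/2021 10:28:00AM Inactive",
  "Description   Incident End",
  "COMPLAINT ANIMAL  1/1/2021 10:28:00AM",
  "Address",
  "UREC FIELDS",
  sep = "\n"
)
lines <- str_split(page, "\\n")[[1]]

# The values sit one line below each header, hence the + 1
case_idx <- which(str_detect(lines, "^Case")) + 1
desc_idx <- which(str_detect(lines, "^Desc")) + 1
addr_idx <- which(str_detect(lines, "^Addr")) + 1

# Split on: 2+ whitespace characters, a single space followed by a date ("1/"),
# or whitespace that directly follows AM/PM
pattern <- "(\\s{2,}|\\s(?=[0-9]{1,2}/)|(?<=[AP]M)\\s+)"
case_fields <- str_split(lines[case_idx], pattern)[[1]]
desc_fields <- str_split(lines[desc_idx], "\\s{2,}")[[1]]
address <- lines[addr_idx]

print(case_fields)
print(desc_fields)
print(address)
```

The third alternative in the pattern is what separates the status from the incident time when only one space follows AM or PM. The single spaces inside "January 01, 2021" are left alone because none of them is followed by one or two digits and a slash.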