Let’s consider this PDF, imported in R as follows:
library(pdftools)
library(tidyverse)
mylink <- "https://www.probioqual.com/12_PDF/02_EEQ/Modele_Rapport_EEQ.pdf"
mypdf <- pdf_data(mylink)
The pdf_data function returns a list of 35 elements (one tibble per page), each with n rows and 6 columns, including the x and y coordinates.
Let’s now consider several PDFs in a folder, imported using:
mypdfs_list <- list.files(pattern = '*.pdf')
allpdfs <- lapply(mypdfs_list, pdf_data)
Among allpdfs, I would like to select only the pages that contain the "Limites acceptables" character string in the top-right box, e.g. as highlighted in yellow on page 5 of the pdf:
NB: selecting this specific string is the way I found to keep only the pages that contain the tables of interest. The first text pages of each pdf (whose number varies from one pdf to another) do not interest me, so I want to discard them; e.g., in the pdf above, I want to discard the first 4 pages of text (but in another pdf, the first 3 or the first 5 would have to be removed, for example).
In the pdftools::pdf_data output, the "Limites acceptables" string is always located inside the area of coordinates x > 360 & x < 580 & y > 26 & y < 35.
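For illustration, the box test on a single page can be written as a dplyr filter. This is only a sketch using a synthetic tibble in place of a real pdf_data() page (real pages also carry width, height, and space columns):

```r
library(dplyr)

# Synthetic stand-in for one pdf_data() page (x, y, text columns only)
page <- tibble(
  x    = c(120, 410, 470),
  y    = c(100,  30,  30),
  text = c("Introduction", "Limites", "acceptables")
)

# Tokens falling inside the top-right box described above
page %>% filter(x > 360, x < 580, y > 26, y < 35)
# keeps the "Limites" and "acceptables" rows
```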
Question: is it possible, using a function (map, lapply, or similar, possibly combined with e.g. filter), to keep only these pages (thus discarding the initial text pages) across all the lists of imported pdfs?
Of course open to any other approach!
Thanks
A slightly complicated solution, but it works:
# import all pdfs from the folder
mypdfs_list <- list.files(pattern = '*.pdf')
allpdfs <- lapply(mypdfs_list, pdf_data)
# map over nested lists: set 'page_ok' = 1 on rows where the fixed area contains the marker tokens
allpdfs <- map(allpdfs, ~ .x %>%
  map(~ mutate(., page_ok = case_when(
    x > 360 & x < 580 & y > 26 & y < 35 & text %in% c("Li", "mi") ~ 1,
    TRUE ~ 0
  ))))
# map over nested lists: fill 'page_ok' with 1 for the whole page if any row is 1
allpdfs <- map(allpdfs, ~ .x %>%
  map(~ mutate(., page_ok = if_else(any(page_ok == 1), 1, 0))))
# map over nested lists: keep only the tibbles (i.e., pages) whose 'page_ok' sum is not 0
allpdfs <- map(allpdfs, ~ .x %>%
  keep(~ sum(.$page_ok) != 0))
The first, uninteresting text pages were thus deleted. Compare the RStudio screenshots below with those above: the 1st pdf now has 26 pages instead of 29, the 2nd 35 pages instead of 38, the 3rd 26 pages instead of 28...
I would have liked to be able to combine these 3 steps into one. Would there be a simpler solution?
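For what it's worth, the three steps can be collapsed into a single map()/keep() pass: a page is kept as soon as any of its tokens falls inside the box, so the intermediate page_ok column is not needed. A sketch, reusing the same "Li"/"mi" token test as above, illustrated on synthetic tibbles that mimic pdf_data() pages (the real predicate would run on the tibbles returned by pdf_data):

```r
library(tidyverse)

# TRUE as soon as any token of the page lies inside the target box
page_has_marker <- function(page) {
  any(page$x > 360 & page$x < 580 &
      page$y > 26  & page$y < 35  &
      page$text %in% c("Li", "mi"))
}

# Two fake "PDFs": each a list of pages (tibbles with x, y, text columns)
fake_pdf1 <- list(
  tibble(x = c(100, 200), y = c(50, 60), text = c("intro", "page")),  # no marker
  tibble(x = c(400, 420), y = c(30, 30), text = c("Li", "mi"))        # marker page
)
fake_pdf2 <- list(
  tibble(x = 500, y = 30, text = "Li")                                # marker page
)
allpdfs <- list(fake_pdf1, fake_pdf2)

# Single pass: discard pages without the marker
allpdfs_kept <- map(allpdfs, ~ keep(.x, page_has_marker))

lengths(allpdfs_kept)  # 1 1 : the marker-less first page of the 1st pdf was dropped
```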