I need help extracting information from a PDF file in R (for example, https://arxiv.org/pdf/1701.07008.pdf).
I'm using pdftools, but sometimes pdf_info() doesn't work, and in that case I can't manage to extract the information automatically with pdf_text().
NB: tabulizer didn't work on my PC.
Here is the processing I'm doing (sorry, you need to save the PDF and use your own path):
library(pdftools)

# Accumulators for the fields collected across files
title <- key <- auth <- dom <- metadata <- c()

info <- pdf_info(paste0(path_folder, "/", pdf_path))
title <- c(title, info$keys$Title)
key <- c(key, info$keys$Keywords)
auth <- c(auth, info$keys$Author)
dom <- c(dom, info$keys$Subject)
metadata <- c(metadata, info$metadata)
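When pdf_info() errors on a file, I can keep the loop running with tryCatch (a minimal sketch; the failing file is simply skipped, which is exactly what I'd like to avoid):

info <- tryCatch(
  pdf_info(paste0(path_folder, "/", pdf_path)),
  error = function(e) NULL  # pdf_info() failed on this file
)
if (!is.null(info)) {
  title <- c(title, info$keys$Title)
  # ... likewise for key, auth, dom and metadata
}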
I would like to get the title and abstract most of the time.
We will need to make some assumptions about the structure of the PDF we wish to scrape. The code below makes the following assumptions:

- The title is the text on the first page set at a height of 15 (the font size used for the title of this particular paper).
- The abstract is the run of words on the first page between the token "Abstract." and the "Introduction" heading.
library(tidyverse)
library(pdftools)

data <- pdf_data("~/Desktop/scrape.pdf")

# Get the first page
page_1 <- data[[1]]

# Get the title; here we assume it is set at a height of 15
title <- page_1 %>%
  filter(height == 15) %>%
  .$text %>%
  paste0(collapse = " ")

# Get the abstract: the words between "Abstract." and the "Introduction"
# heading (the -2 drops the section number preceding "Introduction")
abstract_start <- which(page_1$text == "Abstract.")[1]
introduction_start <- which(page_1$text == "Introduction")[1]
abstract <- page_1$text[abstract_start:(introduction_start - 2)] %>%
  paste0(collapse = " ")
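If the title size varies from paper to paper, a more general variant is to take whatever text on the first page is set in the largest font (still an assumption, and one that can fail on pages with oversized symbols):

# Assumption: the title is the text in the largest font on page 1
title <- page_1 %>%
  filter(height == max(height)) %>%
  .$text %>%
  paste0(collapse = " ")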
You can, of course, work off of this and impose stricter constraints for your scraper.
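For example, here is a sketch of a safer abstract extractor that returns NA instead of erroring when the markers are missing (get_abstract is a hypothetical helper name, and the "Abstract." / "Introduction" tokens are still assumptions about the layout):

get_abstract <- function(page) {
  abstract_start <- which(page$text == "Abstract.")[1]
  introduction_start <- which(page$text == "Introduction")[1]
  # Bail out if either marker is missing or the range is empty
  if (is.na(abstract_start) || is.na(introduction_start) ||
      introduction_start - 2 < abstract_start) {
    return(NA_character_)
  }
  paste0(page$text[abstract_start:(introduction_start - 2)], collapse = " ")
}

abstract <- get_abstract(page_1)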