[SOLVED] How to extract the title from a PDF document with R

I need help extracting information from a PDF file in R (for example, https://arxiv.org/pdf/1701.07008.pdf).

I'm using pdftools, but pdf_info() sometimes doesn't work, and in those cases I can't manage to extract the information automatically with pdf_text().

Note that tabulizer did not work on my PC.
Here is what I am doing so far (sorry, you will need to save the PDF and use your own path):

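 library(pdftools)

 # Read the metadata of one file; path_folder and pdf_path are set elsewhere
 info <- pdf_info(file.path(path_folder, pdf_path))
 # Append each field to its running vector (title, key, auth, dom and metadata already exist)
 title <- c(title, info$keys$Title)
 key <- c(key, info$keys$Keywords)
 auth <- c(auth, info$keys$Author)
 dom <- c(dom, info$keys$Subject)
 metadata <- c(metadata, info$metadata)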

I would like to get the title and abstract for most files, even when the metadata is missing.
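
If part of the problem is that some files simply lack a key, note that c() silently drops NULL, so a PDF without a Title adds nothing to title and the vectors drift out of step with the list of files. Below is a minimal sketch of a guard, assuming a helper named get_key() (hypothetical, not part of pdftools) and toy lists standing in for info$keys:

```r
# Hypothetical helper: return one metadata key, or NA when it is absent or empty.
get_key <- function(keys, name) {
  value <- keys[[name]]
  if (is.null(value) || length(value) != 1 || is.na(value) || !nzchar(value)) {
    NA_character_
  } else {
    value
  }
}

# Toy stand-ins for info$keys from two files; the second has no Title key.
keys_a <- list(Title = "A paper", Author = "Someone")
keys_b <- list(Author = "Someone else")

title <- character(0)
for (keys in list(keys_a, keys_b)) {
  # One entry per file, NA when the key is missing
  title <- c(title, get_key(keys, "Title"))
}
```

With real files the call would be get_key(info$keys, "Title"), and the NA entries mark the files that need a fallback based on the page text.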

Solution

We will need to make some assumptions about the structure of the pdf we wish to scrape. The code below makes the following assumptions:

Title and abstract are on page 1 (fair assumption?)
The title is the text of height 15 (the height column returned by pdf_data())
The abstract lies between the first occurrence of the token "Abstract." and the first occurrence of the token "Introduction"

library(tidyverse)
library(pdftools)

data = pdf_data("~/Desktop/scrape.pdf")

#Get First page
page_1 = data[[1]]

# Get Title, here we assume it is set in text of height 15
title = page_1%>%
  filter(height == 15)%>%
  .$text%>%
  paste0(collapse = " ")


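# Get Abstract: locate the "Abstract." and "Introduction" tokens on page 1
abstract_start = which(page_1$text == "Abstract.")[1]
introduction_start = which(page_1$text == "Introduction")[1]

# Stop two tokens before "Introduction", which drops the token right before it
# (e.g. the section number in "1 Introduction"); the "Abstract." token is kept
abstract = page_1$text[abstract_start:(introduction_start-2)]%>%
  paste0(collapse = " ")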

You can, of course, work off of this and impose stricter constraints for your scraper.
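
As one possible way to loosen the hard-coded values, here is a minimal sketch in base R. It assumes the title is the text with the largest height on page 1 (which may not hold if the page carries other large text such as a stamp or watermark) and matches the "Abstract" and "Introduction" markers case-insensitively. The helper names extract_title() and extract_abstract() are hypothetical, and the data frame is a toy stand-in for data[[1]]:

```r
# Hypothetical helper: the title is every token set in the largest height on the page.
extract_title <- function(page) {
  if (nrow(page) == 0) return(NA_character_)
  paste(page$text[page$height == max(page$height)], collapse = " ")
}

# Hypothetical helper: tokens strictly between the "Abstract" and "Introduction" markers.
extract_abstract <- function(page) {
  start <- grep("^abstract[.:]?$", page$text, ignore.case = TRUE)[1]
  end <- grep("^introduction$", page$text, ignore.case = TRUE)[1]
  # Return NA instead of failing when a marker is missing or nothing lies between them
  if (is.na(start) || is.na(end) || end <= start + 1) return(NA_character_)
  words <- page$text[(start + 1):(end - 1)]
  # Drop a trailing section number such as the "1" in "1 Introduction"
  if (grepl("^[0-9]+\\.?$", words[length(words)])) words <- words[-length(words)]
  paste(words, collapse = " ")
}

# Toy page with the same text and height columns that pdf_data() returns
page <- data.frame(
  text = c("A", "Short", "Title", "Jane", "Doe", "Abstract.",
           "We", "study", "things.", "1", "Introduction", "Text"),
  height = c(15, 15, 15, 10, 10, 9, 9, 9, 9, 12, 12, 9),
  stringsAsFactors = FALSE
)

toy_title <- extract_title(page)
toy_abstract <- extract_abstract(page)
```

On a real file the input would be pdf_data("~/Desktop/scrape.pdf")[[1]]. Unlike the code above, this version leaves the "Abstract." marker out of the result and returns NA when either marker is not found on the page.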