r pdf text-mining tabulizer

How to extract the title from a PDF document with R


I need help extracting information from a PDF file in R (for example, https://arxiv.org/pdf/1701.07008.pdf).

I'm using pdftools, but sometimes pdf_info() doesn't work, and in that case I can't manage to extract the information automatically with pdf_text().

NB: tabulizer didn't work on my PC.
Here is the processing I'm doing (sorry, you need to save the PDF and use your own path):

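 library(pdftools)

 # Read the document metadata (path_folder and pdf_path are set elsewhere)
 info <- pdf_info(file.path(path_folder, pdf_path))
 # Append each metadata field to its accumulator vector
 title <- c(title, info$keys$Title)
 key <- c(key, info$keys$Keywords)
 auth <- c(auth, info$keys$Author)
 dom <- c(dom, info$keys$Subject)
 metadata <- c(metadata, info$metadata)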

I would like to get the title and abstract in most cases.


Solution

  • We need to make some assumptions about the structure of the PDF we wish to scrape. The code below assumes the following:

    1. The title and abstract are on page 1 (a fair assumption?)
    2. The words of the title have a height of 15 (the height column returned by pdf_data())
    3. The abstract lies between the first occurrence of the token "Abstract." and the first occurrence of the word "Introduction"
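    library(tidyverse)
    library(pdftools)

    # pdf_data() returns one data frame per page, with one row per word
    data = pdf_data("~/Desktop/scrape.pdf")

    # Get the first page
    page_1 = data[[1]]

    # Get the title; here we assume its words have height 15
    title = page_1 %>%
      filter(height == 15) %>%
      pull(text) %>%
      paste0(collapse = " ")

    # Get the abstract: everything from "Abstract." up to "Introduction"
    abstract_start = which(page_1$text == "Abstract.")[1]
    introduction_start = which(page_1$text == "Introduction")[1]

    # Stop two tokens early to drop the token just before "Introduction"
    # (assumed to be a section number)
    abstract = page_1$text[abstract_start:(introduction_start - 2)] %>%
      paste0(collapse = " ")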

    You can, of course, work off of this and impose stricter constraints for your scraper.
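
As one possible way to loosen the hard-coded values, here is a minimal sketch in base R. It is not part of the original answer. The helper name extract_title_abstract is hypothetical, and the sample page is made-up data shaped like one element of pdf_data() (only the text and height columns are used). It assumes the title is the tallest text on the page, which can misfire if the page has larger text than the title (for example a watermark), and it matches "Abstract" and "Introduction" while ignoring case and trailing punctuation. Unlike the code above, it leaves the word "Abstract" itself out of the result.

```r
# Hypothetical helper: `page` is a data frame shaped like one element of
# pdftools::pdf_data(), with at least the columns `text` and `height`.
extract_title_abstract <- function(page) {
  # Assumption: the title is set in the tallest text on the page.
  title_rows <- page$height == max(page$height)
  title <- paste(page$text[title_rows], collapse = " ")

  # Match section markers loosely: ignore case and trailing punctuation.
  words <- tolower(gsub("[[:punct:]]+$", "", page$text))
  abstract_start <- which(words == "abstract")[1]
  intro_start <- which(words == "introduction")[1]

  # Return NA instead of failing when either marker is missing.
  abstract <- NA_character_
  if (!is.na(abstract_start) && !is.na(intro_start) &&
      intro_start > abstract_start + 1) {
    body <- page$text[(abstract_start + 1):(intro_start - 1)]
    # Drop a trailing section number such as "1" before "Introduction".
    if (grepl("^[0-9]+\\.?$", body[length(body)])) {
      body <- body[-length(body)]
    }
    abstract <- paste(body, collapse = " ")
  }

  list(title = title, abstract = abstract)
}

# Made-up sample page, for illustration only.
page <- data.frame(
  text = c("A", "Sample", "Title", "Jane", "Doe", "Abstract.",
           "We", "study", "things.", "1", "Introduction", "Text"),
  height = c(15, 15, 15, 10, 10, 9, 9, 9, 9, 12, 12, 9),
  stringsAsFactors = FALSE
)

result <- extract_title_abstract(page)
str(result)
```

With a real file, the same function could be applied to pdf_data(path)[[1]]. The NA result for the abstract gives a simple way to flag documents that need manual handling.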