r pdf tabulizer

R tabulizer encoding or security


I have been practicing with the tabulizer package in R and have run into the following problem. Unfortunately I can't offer a reproducible example, as the PDF is my firm's property, but I will describe the problem in detail.

I'm trying to read a PDF that has a start and end date in the upper-right corner. When I open the PDF they look normal:

Start: 01-Mar-2018
  End: 31-Mar-2018

Now the fun part. When I highlight them and copy them with Ctrl+C, here is the result when pasted into R:

:tttt: 11-rrr-8118
tt:: 11-rrr-8118

This is exactly the same kind of nonsense that extract_text(path, pages=1) gives: a lot of t::ttttt:ttt... My question: is there some security on this PDF, do I just need to figure out the correct encoding, or, because this PDF is created automatically by a system, is there some weird notation to everything?


Solution

  • I figured it out. The content of this PDF is mainly stored in its metadata (which I didn't know), and a great tool in R for accessing PDF metadata is pdftools.

    library(pdftools)
    
    pdf_info(path.pdf)
    

    From its output you can wrangle out all the important metadata bits.
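
For anyone hitting the same garbled output, here is a minimal sketch of what inspecting that metadata might look like. The helper name `inspect_pdf` and the file name `report.pdf` are made up for illustration and are not from the original post. It only uses `pdf_info()` and `pdf_fonts()` from pdftools. The `encrypted` field speaks to the "is there security on this PDF" question, and the font table may hint at whether the text is drawn with an unusual embedded font, which is one possible (not confirmed here) reason copied text comes out as nonsense.

```r
# Sketch only: `inspect_pdf` and "report.pdf" are hypothetical names.
inspect_pdf <- function(path) {
  info  <- pdftools::pdf_info(path)   # document-level metadata as a named list
  fonts <- pdftools::pdf_fonts(path)  # one row per font referenced in the file

  list(
    encrypted = info$encrypted,       # TRUE if the file is encrypted
    keys      = info$keys,            # e.g. Title, Producer, Creator, if present
    created   = info$created,
    modified  = info$modified,
    fonts     = fonts[, c("name", "type", "embedded")]
  )
}

path <- "report.pdf"  # placeholder; point this at your own file
if (requireNamespace("pdftools", quietly = TRUE) && file.exists(path)) {
  str(inspect_pdf(path))
}
```

Which entries show up under `keys` depends entirely on the software that produced the PDF, so it is worth printing the whole list with `str()` before picking fields out of it.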