I have been practicing with the tabulizer package in R and have run into the following problem. Unfortunately I can't offer a reproducible example, as the PDF is the firm's property, but I will describe the problem in detail.
I'm trying to read a PDF that has a start and end date in the upper-right corner. When I open the PDF, they look normal:
Start: 01-Mar-2018
End: 31-Mar-2018
Now the fun part. When I highlight them and copy them with Ctrl+C, this is the result when pasted into R:
:tttt: 11-rrr-8118
tt:: 11-rrr-8118
This is exactly the same kind of nonsense that extract_text(path, pages=1) gives: a lot of t::ttttt:ttt... My question is: is there some kind of security on this PDF, do I just need to figure out the correct encoding, or, because this PDF is generated automatically by a system, is there some unusual notation behind everything?
I figured it out. It turns out this PDF is mostly generated from metadata (I didn't know that), and a great tool in R for accessing metadata in PDFs is pdftools.
library(pdftools)
# path is the path to the PDF file, as in the extract_text() call above
pdf_info(path)
and you can wrangle out all the important metadata bits.
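As a rough sketch of what that wrangling can look like: pdf_info() returns a list, and the elements below are the ones I would check first. Since the real PDF can't be shared, the file here is a hypothetical stand-in generated by R itself, so the actual keys in your document will differ. The encrypted field also speaks to the original question of whether the PDF has some security on it.

```r
library(pdftools)

# Hypothetical stand-in for the real file: a one-page PDF written by R's pdf() device.
path <- tempfile(fileext = ".pdf")
grDevices::pdf(path)
plot(1:10)
invisible(dev.off())

info <- pdf_info(path)

names(info)      # which metadata elements are available
info$encrypted   # TRUE would mean the file is encrypted
info$keys        # named list of document info entries (e.g. Title, Producer)
info$created     # creation timestamp
info$modified    # last-modified timestamp
```

Which element holds the dates you need depends on how the generating system wrote the file, so it is worth printing the whole list once before picking fields out of it.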