pdf pdfbox

How can I copy/extract text from a specific kind of PDF file?


I would like to copy a few text values (extract them via PDFBox) from a set of given multi-page PDF files. However, for some of the pages in these documents this is not possible, whereas for other pages it is. Here is an example of such a document (just 3 pages).

AD-2.LFAB.pdf

In this example document I cannot copy any text from the first page, although it works for the other two pages. I am not talking about the text inside the graphics. I am mainly interested in the text in the surrounding text boxes.

My questions are:

  1. Can anybody tell me what the reason for this failure is?
  2. Is there any way to fix this problem? (I don't have access to the sources of the document.)

I have to admit that I am not a PDF expert. Therefore I have to ask.


Solution

  • As you found, and as has been mentioned in the comments, the file has not been written well.

    A simple first test is to use Adobe Acrobat Reader to save the file as accessible text ("save as" accessible text) and check whether the result is actually readable. Doing so shows that the first page is garbled, and the second page is not much better, because the left and right columns have been mixed on each line.

    However, all is not lost, because we can try a different approach using pdftotext with the -layout option. That way we can extract the first-page headings by coordinate offsets, ready for parsing into a data frame or other output file.

    You will need to specify exactly which block of text you want by passing -x ## -y ## -W ## -H ## for each region of interest: -x and -y set the top-left corner of the crop area, and -W and -H set its width and height.

    Here we can test the steps needed to produce the desired output, and then adjust them, for example to separate the left column from the right column on page 2. Once you have a good result, save the commands into a file or a script so that they can be run on hundreds of similar inputs.

    Windows test

    echo/ && echo Page 1 && echo/ && pdftotext -layout -nopgbrk -enc UTF-8 -f 1 -l 1 -W 1000 -H 120 AD-2.LFAB.pdf - && echo/ && echo Page2 && echo/ && pdftotext -layout -nopgbrk -enc UTF-8 -f 2 -l 2 -W 800 -H 120 AD-2.LFAB.pdf -
    

    You can simplify that down to whatever you need, for example by collecting the extracted text in a file:

    echo/ && echo Page 1 && echo/ 
    pdftotext -layout -nopgbrk -enc UTF-8 -f 1 -l 1 -W 1000 -H 120 AD-2.LFAB.pdf - >temp.txt
    echo/ && echo Page2 && echo/ 
    pdftotext -layout -nopgbrk -enc UTF-8 -f 2 -l 2 -W 800 -H 120 AD-2.LFAB.pdf - >>temp.txt
    type temp.txt
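
To run the same extraction over many similar files, the commands above could be wrapped in a small script. The following is only a rough sketch, assuming Python 3 and a pdftotext binary on the PATH. The names build_command, REGIONS, extract_headers and extract_all are made up for illustration, and the crop sizes are simply the ones from the test commands above, so they would need adjusting for other layouts.

```python
import subprocess
from pathlib import Path


def build_command(pdf, page, x=0, y=0, width=0, height=0):
    """Return the pdftotext argument list for one crop region on one page."""
    cmd = ["pdftotext", "-layout", "-nopgbrk", "-enc", "UTF-8",
           "-f", str(page), "-l", str(page)]
    # Only pass the crop options that are set; unset ones keep pdftotext's defaults.
    for flag, value in (("-x", x), ("-y", y), ("-W", width), ("-H", height)):
        if value:
            cmd += [flag, str(value)]
    # "-" as the output file name sends the text to stdout.
    return cmd + [str(pdf), "-"]


# Assumed regions (page number -> crop box), copied from the commands above.
REGIONS = {
    1: {"width": 1000, "height": 120},
    2: {"width": 800, "height": 120},
}


def extract_headers(pdf, regions=REGIONS, run=subprocess.run):
    """Run pdftotext once per region and return {page: extracted text}."""
    result = {}
    for page, box in regions.items():
        done = run(build_command(pdf, page, **box),
                   capture_output=True, check=True)
        result[page] = done.stdout.decode("utf-8")
    return result


def extract_all(folder="."):
    """Extract the configured regions from every PDF in a folder."""
    return {pdf.name: extract_headers(pdf)
            for pdf in sorted(Path(folder).glob("*.pdf"))}
```

Calling extract_all() in the folder that holds the PDFs returns one dictionary per file, keyed by page number, which can then be parsed further or written out. The run parameter exists only so that the pdftotext call can be swapped out when trying the script without the binary installed.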