java pdf pdfbox

PDFBox reads nothing from a Skia-created PDF file


I am reading a PDF file (created by Skia/PDF m118 Google Docs Renderer) with PDFBox, but no text is extracted. The document has only one page and does not contain images. I am trying to read the content with PDFTextStripper.

Any idea how to read a Skia/PDF m118 Google Docs file with PDFBox?

I can open it with Acrobat Reader.

Code snippet

    Document dataDocument = new Document();
    try {
        PDFTextStripper pdfTextStripper = new PDFTextStripper();

        pdfTextStripper.setParagraphStart("/t");
        pdfTextStripper.setSortByPosition(true);

        for (int i = 0; i < document.getNumberOfPages(); i++) {
            pdfTextStripper.setStartPage(i);
            pdfTextStripper.setEndPage(i);
            for (String line : pdfTextStripper.getText(document).split(pdfTextStripper.getParagraphStart())) {
                if (!line.isBlank() && line.length() > 3) {
                    dataDocument.getText().add(line);
                }
            }
            dataDocument.getText().add(":page=" + i);
        }

...

PDFBox version

    implementation 'org.apache.pdfbox:pdfbox:2.0.30'

PDF file

Here is a link to the document: https://jmp.sh/s/vTpGFHq6nLjzWXzBfxIA

Zlaja


Solution

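  • Both setStartPage() and setEndPage() expect 1-based page numbers (see the Javadoc), but your loop passes the 0-based index i. For a one-page document the requested range is therefore page 0 to page 0, which matches no page, so no text is extracted. Change your code to: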

    pdfTextStripper.setStartPage(i + 1);
    pdfTextStripper.setEndPage(i + 1);
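
To see why the 0-based index selects nothing, here is a minimal sketch of a 1-based page range filter. It is a simplified model and not PDFBox code: the class PageRangeDemo and the method selectedPages are hypothetical names, and the only assumption taken from the answer above is that pages are numbered from 1 and the start and end pages are inclusive.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of a 1-based page range filter.
// This is NOT PDFBox's implementation; it only mimics 1-based page numbering.
class PageRangeDemo {

    // Returns the 1-based page numbers that fall inside [startPage, endPage].
    static List<Integer> selectedPages(int pageCount, int startPage, int endPage) {
        List<Integer> selected = new ArrayList<>();
        for (int pageNo = 1; pageNo <= pageCount; pageNo++) {
            if (pageNo >= startPage && pageNo <= endPage) {
                selected.add(pageNo);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        int pageCount = 1; // single-page document, as in the question

        for (int i = 0; i < pageCount; i++) {
            // Passing the loop index directly asks for "page 0", which does not exist.
            System.out.println("range " + i + ".." + i + " selects "
                    + selectedPages(pageCount, i, i));
            // Shifting by one asks for the first real page.
            System.out.println("range " + (i + 1) + ".." + (i + 1) + " selects "
                    + selectedPages(pageCount, i + 1, i + 1));
        }
    }
}
```

In this model the range 0..0 selects no page and the range 1..1 selects page 1, which matches the behaviour described in the question: a single-page document yields no text until the index is shifted by one.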