java pdfbox

Extracting text from pdf (java using pdfbox library) from a tables with merged cells


Inspired by the discussion "Extracting text from pdf (java using pdfbox library) from a table's rows with different heights", I am now able to read "normal" tables perfectly. Kudos to mkl.

The problem is that I cannot figure out how to read data from tables in which several cells are merged into one. I will continue brainstorming, but if somebody has an idea how to improve mkl's code in the class PdfBoxFinder so that it can process tables with merged cells, I would appreciate it. I will post the solution here if I find one myself. Thanks to all in advance.

I tried to detect merged cells based on their text, but that is not very effective: this approach has to deal with too many types of tables. I'm looking for a more generic solution. I will try checking the x positions of the text next, but I'm not there yet. A demo is available on GitHub.
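
To make the x-position idea concrete, here is a minimal sketch of one way it could work. It assumes the x coordinates of the vertical grid lines and the horizontal extent of a text run are already known (in PDFBox 2.x the latter could come, for example, from TextPosition.getXDirAdj() and getWidthDirAdj()). The class TextSpanSketch and its methods are hypothetical names and are not part of PdfBoxFinder. Text that reaches across more than one column hints at a merged cell; short text in a merged cell would not be detected this way, so this is only a heuristic.

```java
/** Hypothetical helper for illustration; not part of PdfBoxFinder. */
class TextSpanSketch {

    /**
     * Returns how many grid columns the horizontal range [textMinX, textMaxX]
     * touches. columnEdges holds the x coordinates of the vertical grid lines,
     * sorted in ascending order (assumption: at least two edges).
     */
    static int columnsSpanned(double[] columnEdges, double textMinX, double textMaxX) {
        int first = columnIndex(columnEdges, textMinX);
        int last = columnIndex(columnEdges, textMaxX);
        return last - first + 1;
    }

    /** Index of the column containing x, clamped to the valid column range. */
    static int columnIndex(double[] columnEdges, double x) {
        int index = -1;
        for (double edge : columnEdges) {
            if (edge <= x) {
                index++; // x lies to the right of (or on) this edge
            }
        }
        int lastColumn = columnEdges.length - 2;
        return Math.max(0, Math.min(index, lastColumn));
    }
}
```

A result greater than 1 for a text run would mark the cells it crosses as candidates for merging.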

Example files are:
Merged cells https://github.com/pdob-git/testarea-pdfbox2/blob/Pdob-stack_78001237/pl.pdob.input/merged_cells_example.pdf
Regular tables https://github.com/pdob-git/testarea-pdfbox2/blob/Pdob-stack_78001237/pl.pdob.input/regular_table.pdf

How the code currently recognizes the tables is shown in the following files:
Merged cells https://github.com/pdob-git/testarea-pdfbox2/blob/Pdob-stack_78001237/pl.pdob.results/merged_cells_example.pdf-rectangles.pdf
Regular tables https://github.com/pdob-git/testarea-pdfbox2/blob/Pdob-stack_78001237/pl.pdob.results/regular_table.pdf-rectangles.pdf

Regular tables are recognized correctly; the issue is with merged cells.
A document with a merged bottom row (source file for merged cells)

is recognized as a regular table: the bottom row is detected as 3 cells, but it should be a single cell (result file for merged cells).
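
One generic way to tell a merged row from a regular one is to check whether a vertical separator line is actually drawn between two neighbouring cells. The following is a minimal sketch under these assumptions: the cells of one row are given left to right as rectangles, and the vertical line segments found on the page are available as Line2D objects. The class MergedCellSketch, its methods, and the tolerance parameter are hypothetical and are not part of PdfBoxFinder.

```java
import java.awt.geom.Line2D;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of PdfBoxFinder. */
class MergedCellSketch {

    /**
     * Merges horizontally adjacent cells of one row when no vertical line
     * is drawn across their shared edge. rowCells must be ordered left to right.
     */
    static List<Rectangle2D> mergeRow(List<Rectangle2D> rowCells,
                                      List<Line2D> verticalLines,
                                      double tolerance) {
        List<Rectangle2D> merged = new ArrayList<>();
        Rectangle2D current = null;
        for (Rectangle2D cell : rowCells) {
            if (current == null) {
                current = (Rectangle2D) cell.clone();
            } else if (hasSeparator(current.getMaxX(), cell.getMinY(), cell.getMaxY(),
                    verticalLines, tolerance)) {
                // A real border exists: close the current cell and start a new one.
                merged.add(current);
                current = (Rectangle2D) cell.clone();
            } else {
                // No border between the two cells: treat them as one merged cell.
                current = current.createUnion(cell);
            }
        }
        if (current != null) {
            merged.add(current);
        }
        return merged;
    }

    /** True if some vertical line at x covers the whole range [minY, maxY]. */
    static boolean hasSeparator(double x, double minY, double maxY,
                                List<Line2D> verticalLines, double tolerance) {
        for (Line2D line : verticalLines) {
            double lineMinY = Math.min(line.getY1(), line.getY2());
            double lineMaxY = Math.max(line.getY1(), line.getY2());
            boolean sameX = Math.abs(line.getX1() - x) <= tolerance;
            boolean coversRow = lineMinY <= minY + tolerance && lineMaxY >= maxY - tolerance;
            if (sameX && coversRow) {
                return true;
            }
        }
        return false;
    }
}
```

For the example above, the inner vertical lines stop at the top of the bottom row, so its three detected cells would collapse into one rectangle, while the rows above keep their three cells.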

Demo code:

package pl.pdob.pdfTables;

import mkl.testarea.pdfbox2.extract.PdfBoxFinder;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;

import java.awt.Color;
import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;

/**
 * Class to demonstrate issue with tables with merged cells<br>
 * <a href="https://stackoverflow.com/questions/78001237/extracting-text-from-pdf-java-using-pdfbox-library-from-a-tables-with-merged-c">
 * Extracting text from pdf (java using pdfbox library) from a tables with merged cells
 * </a>
 * <br>
 * Method Drawing found rectangles taken from {@link mkl.testarea.pdfbox2.extract.ExtractBoxedText}<br>
 * and modified
 */
public class Stack78001237 {


    private static final File RESULT_FOLDER = new File("pl.pdob.results");

    private static final File INPUT_FOLDER = new File("pl.pdob.input");

    private static final String EXAMPLE_PDF = "regular_table.pdf";
//    private static final String EXAMPLE_PDF = "merged_cells_example.pdf";


    static {

        if (!INPUT_FOLDER.exists()) {
            //noinspection ResultOfMethodCallIgnored
            INPUT_FOLDER.mkdirs();
        }

        if (!RESULT_FOLDER.exists()) {
            //noinspection ResultOfMethodCallIgnored
            RESULT_FOLDER.mkdirs();
        }
    }

    public static void main(String[] args) throws IOException {
        Stack78001237 stack78001237 = new Stack78001237();
        stack78001237.drawBoxes(EXAMPLE_PDF);
    }

    @SuppressWarnings("SameParameterValue")
    private void drawBoxes(String fileName) throws IOException {
        File file = new File(INPUT_FOLDER, fileName);

        try (PDDocument document = PDDocument.load(file)) {
            for (PDPage page : document.getDocumentCatalog().getPages()) {
                PdfBoxFinder boxFinder = new PdfBoxFinder(page);
                boxFinder.processPage(page);

                try (PDPageContentStream canvas = new PDPageContentStream(document, page, PDPageContentStream.AppendMode.APPEND, true, true)) {
                    canvas.setStrokingColor(Color.RED);
                    for (Rectangle2D rectangle : boxFinder.getBoxes().values()) {
                        canvas.addRect((float)rectangle.getX(), (float)rectangle.getY(), (float)rectangle.getWidth(), (float)rectangle.getHeight());
                    }
                    canvas.stroke();
                }
            }
            document.save(new File(RESULT_FOLDER, fileName + "-rectangles.pdf"));
        }
    }
}

Stack78001237.java

The problem is that PdfBoxFinder.java works perfectly, but only with regular tables.
I'm currently digging into how to solve this. If I knew the solution, I would not bother the Stack Overflow community with this question.


Solution

  • I have solved this issue.
    The approach is similar to the original solution and has the following steps:

    The demo is in the files:
    MainDemo.java
    ExtractBoxedTextMergedCellsTest.java