[SOLVED] Extract Tables from complicated document in Python

I am trying to extract tables from a PDF, but the tables have a complicated layout.

I have tried various libraries (unstructured, pdfplumber, pymupdf, etc.), but none of them produces readable output. The column headings are printed in reverse.

Any idea what I can do or which libraries I can use? I am new to Python, so this might be a silly question.

I tested Unstructured (both the API and non-API versions), pdfplumber, pdf2image, and pymupdf. I would accept output in any format, as long as the columns are read properly.

Should I try writing code to post-process the output so that the column names are printed horizontally?

Thanks coders

Solution

Try Spire.PDF for Python.

pip install Spire.Pdf

Here is how you can extract tables from PDF with it:

from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()
# Load the sample PDF file
doc.LoadFromFile("Table.pdf")

# Create a list to store the extracted data
builder = []

# Create a PdfTableExtractor object
extractor = PdfTableExtractor(doc)

# Loop through the pages
for pageIndex in range(doc.Pages.Count):
    # Extract tables from a specific page
    tableList = extractor.ExtractTable(pageIndex)

    if tableList is not None and len(tableList) > 0:
        # Loop through the tables in the list
        for table in tableList:
            # Get row number and column number of a certain table
            row = table.GetRowCount()
            column = table.GetColumnCount()

            # Loop through the row and column
            for i in range(row):
                for j in range(column):
                    # Get text from the specific cell
                    text = table.GetText(i, j)

                    # Add the text to the list
                    builder.append(text + " ")
                builder.append("\n")
            builder.append("\n")

# Write the content of the list into a text file
with open("Table.txt", "w", encoding="utf-8") as file:
    file.write("".join(builder))
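The loop above joins cells with spaces, so a cell that itself contains a space or comma becomes hard to tell apart from its neighbours. If that matters, writing the rows as CSV may be easier to work with. The following is only a sketch using Python's standard csv module. It assumes each table row has already been collected as a list of strings, for example with [table.GetText(i, j) for j in range(column)] inside the row loop. The helper name rows_to_csv and the sample data are made up for illustration.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of rows (each a list of cell strings) to CSV text."""
    buffer = io.StringIO()
    # lineterminator="\n" keeps line endings the same on every platform
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Treat missing cells (None) as empty strings
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Hypothetical sample data standing in for cells read from a table
sample_rows = [["Name", "Unit price"], ["Widget, large", "4.50"]]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

The csv writer quotes any cell that contains the delimiter, so "Widget, large" stays a single column when the file is read back.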

Reference: https://www.e-iceblue.com/Tutorials/Python/Spire.PDF-for-Python/Program-Guide/Table/Python-Extract-Tables-from-PDF.html

Note: I work for the company that developed this module.
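On the question of post-processing the headings yourself: if an extractor still returns the column headings back to front, a small fix-up step may be enough. The sketch below assumes that "reversed" means the characters of each heading cell are in reverse order (for example "emaN" instead of "Name"), which can happen with rotated text, and that the first row of the table is the header. Both are assumptions about the document in question, and the function name fix_reversed_header is hypothetical.

```python
def fix_reversed_header(rows):
    """Return a copy of rows with the characters of each header cell reversed.

    Assumes rows[0] is the header row. Other rows are copied unchanged.
    """
    if not rows:
        return []
    # Reverse each header string; leave empty or missing cells as they are
    header = [cell[::-1] if cell else cell for cell in rows[0]]
    return [header] + [list(row) for row in rows[1:]]


# Hypothetical sample data: header characters are back to front
extracted = [["emaN", "egA"], ["Alice", "30"]]
fixed = fix_reversed_header(extracted)
print(fixed)
```

If the headings are reversed in some other way (for example the column order rather than the characters), the same idea applies, but the list itself would be reversed instead of each string.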