python parsing libraries pdfparser

Extract tables from a complicated PDF document in Python


I am trying to extract tables from a PDF, but the table has a complicated format (image omitted).

I have tried several libraries (unstructured, pdfplumber, PyMuPDF, etc.), but none of them produces readable output. The column headings are printed in reverse (image omitted).

Any idea what I can do or which libraries I can use? I am new to Python, so this might be a silly question.

I tested Unstructured (both the API and non-API versions), pdfplumber, pdf2image, and PyMuPDF. I expected to be able to parse the table into any format, as long as the columns are read properly.

Should I try writing code to modify the output so that the column names are printed horizontally?

Thanks, coders.


Solution

  • Try Spire.PDF for Python.

    pip install Spire.Pdf
    

    Here is how you can extract tables from PDF with it:

    from spire.pdf.common import *
    from spire.pdf import *
    
    # Create a PdfDocument object
    doc = PdfDocument()
    # Load the sample PDF file
    doc.LoadFromFile("Table.pdf")
    
    # Create a list to store the extracted data
    builder = []
    
    # Create a PdfTableExtractor object
    extractor = PdfTableExtractor(doc)
    
    # Loop through the pages
    for pageIndex in range(doc.Pages.Count):
        # Extract tables from a specific page
        tableList = extractor.ExtractTable(pageIndex)
    
        if tableList is not None and len(tableList) > 0:
            # Loop through the tables in the list
            for table in tableList:
                # Get row number and column number of a certain table
                row = table.GetRowCount()
                column = table.GetColumnCount()
    
                # Loop through the row and column
                for i in range(row):
                    for j in range(column):
                        # Get text from the specific cell
                        text = table.GetText(i, j)
    
                        # Add the text to the list
                        builder.append(text + " ")
                    builder.append("\n")
                builder.append("\n")
    
    # Write the content of the list into a text file
    with open("Table.txt", "w", encoding="utf-8") as file:
        file.write("".join(builder))
    

    Reference: https://www.e-iceblue.com/Tutorials/Python/Spire.PDF-for-Python/Program-Guide/Table/Python-Extract-Tables-from-PDF.html

    Note: I work for the company that developed this module.
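
On the question of fixing the headings in code: if a library returns the table as rows of cell strings and only the heading text comes out character-reversed, a small post-processing step may be enough. The following is a minimal sketch under that assumption. The function name fix_reversed_headers, its header_rows parameter, and the sample data are hypothetical and do not come from any of the libraries mentioned above. It will not help if the characters are scrambled in some other way.

```python
def fix_reversed_headers(rows, header_rows=1):
    """Return a copy of rows with the text in the first header_rows rows reversed."""
    fixed = []
    for index, row in enumerate(rows):
        if index < header_rows:
            # Reverse each cell's characters, e.g. "emaN" -> "Name".
            # Empty or None cells are passed through unchanged.
            fixed.append([cell[::-1] if cell else cell for cell in row])
        else:
            # Data rows are copied as they are
            fixed.append(list(row))
    return fixed


# Hypothetical extracted table: the header row is reversed, the data row is not
rows = [
    ["emaN", "etaD", "tnuomA"],
    ["Alice", "2024-01-05", "10.50"],
]

print(fix_reversed_headers(rows))
# prints [['Name', 'Date', 'Amount'], ['Alice', '2024-01-05', '10.50']]
```

If the headings span more than one row, pass a larger header_rows value.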