I am trying to extract tables from a PDF, but the table is formatted like this:
I have tried various libraries (unstructured, pdfplumber, PyMuPDF, etc.), but none of them gives readable output. The column headings are printed in reverse, like so:
Any idea what I can do or which libraries I can use? I am new to Python, so this might be a silly question.
I tested Unstructured (both the API and non-API versions), pdfplumber, pdf2image, and PyMuPDF. I was expecting output in any format, as long as the columns are read properly.
Should I try writing code to modify the output so that the column names are printed horizontally?
Thanks coders
Try Spire.PDF for Python.
pip install Spire.Pdf
Here is how you can extract tables from PDF with it:
from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()
# Load the sample PDF file
doc.LoadFromFile("Table.pdf")
# Create a list to store the extracted data
builder = []
# Create a PdfTableExtractor object
extractor = PdfTableExtractor(doc)
# Loop through the pages
for pageIndex in range(doc.Pages.Count):
    # Extract tables from a specific page
    tableList = extractor.ExtractTable(pageIndex)
    if tableList is not None and len(tableList) > 0:
        # Loop through the tables in the list
        for table in tableList:
            # Get row number and column number of a certain table
            row = table.GetRowCount()
            column = table.GetColumnCount()
            # Loop through the row and column
            for i in range(row):
                for j in range(column):
                    # Get text from the specific cell
                    text = table.GetText(i, j)
                    # Add the text to the list
                    builder.append(text + " ")
                builder.append("\n")
            builder.append("\n")
# Write the content of the list into a text file
with open("Table.txt", "w", encoding="utf-8") as file:
    file.write("".join(builder))
Note: I work for the company that developed this module.
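On the idea from the question of fixing the headings in code: whichever library does the extraction, the headings can be flipped afterwards if they come out as character-reversed strings (for example "emaN" instead of "Name"). That is an assumption, since the actual output is not shown in the question. The sketch below works on a plain list of rows (lists of strings, possibly with None for empty cells); fix_reversed_header and the sample data are hypothetical.

```python
def fix_reversed_header(rows, header_rows=1):
    """Return a copy of rows with the text in the first header_rows rows reversed.

    Assumes each heading was extracted with its characters in reverse order.
    Non-string cells (such as None for an empty cell) are left untouched.
    """
    fixed = []
    for index, row in enumerate(rows):
        if index < header_rows:
            fixed.append(
                [cell[::-1] if isinstance(cell, str) else cell for cell in row]
            )
        else:
            fixed.append(list(row))
    return fixed


# Made-up sample resembling a table whose rotated headings were read backwards
extracted = [["emaN", "egA", None], ["Alice", "30", "x"]]
print(fix_reversed_header(extracted))
```

If the headings span several rows, header_rows can be raised accordingly. Headings that are scrambled in some other way (for instance one character per line) would need a different fix.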