I am using pdfplumber
with python to extract data from the following pdf file
import pdfplumber
pdf_file = 'D:/Input/Book1.pdf'
pdf = pdfplumber.open(pdf_file)
page = pdf.pages[0]
text = page.extract_text()
table = page.extract_tables()
for line in text.split("\n"):
print(line)
when I use page.extract_tables()
I only get the row headers and not associated data in the table.
Since extract_tables() not working I am using page.extract_text()
to loop line by line. But, it this extract_text() omitting the empty cells in the table data while reading that line.
Below data got extracted when using
extract_text()
Weekly test report with multiple lines of hedder of the each page of report
col1 col2 col3 Start End Col Group
Name Name Name Date Date Col5 Col6 Col7 Currency
123 ABC 26/8/2024 26/8/2024 1000 20000 26/8/2024 USD
456 DEF New 26/8/2024 2000 15000 27/8/2024 INR
789 GES DDD 26/8/2024 26/8/2023 4000 20/4/2024 AUD
I want to create a df using the table data that I have it in the pdf. any help would be aprreciated.
It seems there is no direct way I can access table data using extract_tables()
since my table doesn't have outlines.
To Read the table I need to use
extract_table()
withtable_settings
.
import os
import pandas as pd
import pdfplumber
pdf_file = 'D:/Input/Book1.pdf'
page_vertical_lines = [40,102,152,203,253,305,360,408,458,510]
table_settings={"vertical_strategy": "explicit",
"horizontal_strategy": "text"
,"explicit_vertical_lines": page_vertical_lines,
}
pdf = pdfplumber.open(pdf_file)
df1 = pd.DataFrame(pdf.pages[0].extract_table(table_settings))
This code Extracted the required number of columns from the pdf file.
Challenge here is getting
explicit_vertical_lines
position
I have used matplotlib
library to identity the correct postion with below code. you need to adjust the postion according to your report columns.
import pdfplumber
import matplotlib.pyplot as plt
pdf_file = 'D:/Input/Book1.pdf'
with pdfplumber.open(pdf_file) as pdf:
page = pdf.pages[0]
im = page.to_image()
vertical_lines = [40,102,152,203,253,305,360,408,458,510]
for x in vertical_lines:
im.draw_vline(x,stroke="blue",stroke_width=1)
plt.imshow(im.annotated)
plt.show()
when you run this code it gives blue lines as per vertical_lines
values on the pdf as shown below in the picture. Adjust the vertical_lines
values as per your need.