I have a sample medical report, and at the top of each page of the PDF there is a table that contains personal information.
I have been trying to remove/crop the personal information table from all pages of that sample_pdf by finding the layout values of the table. I am new to pdfplumber
and not sure if that's the right approach. Below is the code that I have tried; I am not able to get the layout values
of the table, even though I am able to get a red box around the table using pdfplumber.
Code that I have tried:
import pdfplumber

sample_data = []
sample_path = r"local_path_file"
with pdfplumber.open(sample_path) as pdf:
    pages = pdf.pages
    for p in pages:
        sample_data.append(p.extract_tables())
    print(sample_data)
    pages[0].to_image()
I am able to identify the first table with the code below:
pages[0].to_image().debug_tablefinder()
But when I try the code below to extract the tables, I get nothing:
with pdfplumber.open(sample_path) as pdf:
    pages = pdf.pages[0]
    print(pages.extract_tables())
output: []
The problem is specific to this sample PDF. When I used a similar PDF report, I was able to crop it based on the table boundaries like this:
pages[0].find_tables()[0].bbox
output:
(25.19059366666667, 125.0, 569.773065, 269.64727650000003)
This shows the part that I want to get rid of:
p0.crop((25.19059366666667, 125.0, 569.773065, 269.64727650000003)).to_image().debug_tablefinder()
The crop below runs from y0 = 269.64, where the top table ends, to almost the bottom of the page at y1 = 840, and from the leftmost part of the page at x0 = 0 to nearly the right edge at x1 = 590:
p0.crop((0, 269.0, 590, 840)).to_image()
pdfplumber 0.11.4
The issue arises because pdfplumber filters out tables with a single cell. This behavior is controlled by the following line in the library's source code:
# File: pdfplumber/table.py
def cells_to_tables(cells: List[T_bbox]) -> List[List[T_bbox]]:
    ...
    filtered = [t for t in _sorted if len(t) > 1]  # single-cell tables are excluded here
    return filtered
We can modify the locally installed package to allow single-cell tables by returning _sorted instead of filtered (use caution: this has not been tested beyond this specific case, where it works):
# pdfplumber/table.py
def cells_to_tables(cells: List[T_bbox]) -> List[List[T_bbox]]:
    ...
    return _sorted  # Return all tables, including single-cell ones
For a more robust approach, we could make a feature enhancement, like adding an allow_one_cell_table option to the TableSettings class and then taking it into account when extracting tables:
table_settings = {"allow_one_cell_table": True} # fictitious property
page.extract_tables(table_settings)
While this is not currently supported, the topic has been discussed in pdfplumber's GitHub issues.
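In the meantime, one possible way to reach the single cell without patching the package is to read the cells that the table finder detected before the single-cell filter runs. The sketch below assumes that `page.debug_tablefinder()` returns a `TableFinder` whose `cells` attribute lists every detected cell as an `(x0, top, x1, bottom)` tuple, and that the header table is the topmost cell on the page. `topmost_cell` and `header_cell_bbox` are hypothetical helper names, not part of pdfplumber.

```python
from typing import Iterable, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def topmost_cell(cells: Iterable[BBox]) -> Optional[BBox]:
    """Return the cell closest to the top of the page, or None if there are no cells."""
    cells = list(cells)
    if not cells:
        return None
    # Sort key: smallest top first, then leftmost.
    return min(cells, key=lambda c: (c[1], c[0]))


def header_cell_bbox(path: str, page_index: int = 0) -> Optional[BBox]:
    """Hypothetical helper: bbox of the topmost detected cell on one page."""
    import pdfplumber  # imported lazily so topmost_cell works without the library

    with pdfplumber.open(path) as pdf:
        finder = pdf.pages[page_index].debug_tablefinder()
        # Assumption: finder.cells still contains cells of single-cell tables.
        return topmost_cell(finder.cells)
```

If the assumption holds for your file, `header_cell_bbox(sample_path)` should give a box comparable to what `find_tables()[0].bbox` returned for the similar report.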
If modifying the code isn't an option, we can manually inspect the PDF structure. For the provided example, the single-cell table at the top of the page can be found as the first rectangular object. Here's how we can identify and visualize it (this snippet works for the first page of the document, but you can adapt it for other pages as needed):
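with pdfplumber.open(sample_path) as pdf:
    page = pdf.pages[0]
    rt = page.rects[0]  # first rectangle on the page: the single-cell table
    bbox = (rt['x0'], rt['top'], rt['x1'], rt['bottom'])
    page.crop(bbox).to_image(resolution=400).show()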
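To adapt this to every page, a minimal sketch could crop each page to the region below that rectangle and extract text from what remains. It assumes the first rect on each page is the personal information table, which may not hold for other documents. `region_below` and `text_without_header` are hypothetical names introduced here for illustration.

```python
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def region_below(page_bbox: BBox, header_bottom: float) -> Optional[BBox]:
    """Return the part of page_bbox below header_bottom, or None if nothing is left."""
    x0, top, x1, bottom = page_bbox
    new_top = max(top, header_bottom)
    if new_top >= bottom:
        return None
    return (x0, new_top, x1, bottom)


def text_without_header(path: str) -> List[str]:
    """Extract the text of each page, skipping everything above the header rect's bottom."""
    import pdfplumber  # imported lazily so region_below works without the library

    texts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            if not page.rects:
                # No rectangle found: keep the whole page.
                texts.append(page.extract_text())
                continue
            header = page.rects[0]  # assumption: first rect is the header table
            body = region_below(page.bbox, header["bottom"])
            texts.append(page.crop(body).extract_text() if body else "")
    return texts
```

Note that this only leaves the table out of what is extracted; the PDF file itself is unchanged.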
P.S. Regarding your main goal - removing data from a PDF - pdfplumber might not be the best choice. It's designed for data extraction, not PDF editing.
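As one illustration of an editing-oriented tool, and not something tested against this document, the sketch below uses PyMuPDF's redaction annotations to erase the content inside a box on every page and save a new file. PyMuPDF is my own suggestion here. The sketch assumes the table occupies the same box on every page and that the bbox found with pdfplumber can be reused as-is, which should be checked for pages that are rotated or have an offset page box. `pad_bbox` and `redact_region` are hypothetical helper names.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def pad_bbox(bbox: BBox, margin: float) -> BBox:
    """Grow a bbox by margin on every side so border lines are covered too."""
    x0, top, x1, bottom = bbox
    return (x0 - margin, top - margin, x1 + margin, bottom + margin)


def redact_region(src_path: str, dst_path: str, bbox: BBox, margin: float = 1.0) -> None:
    """Erase everything inside bbox on every page and write the result to dst_path."""
    import fitz  # PyMuPDF; imported lazily so pad_bbox works without the library

    doc = fitz.open(src_path)
    for page in doc:
        # Assumption: the same box applies to every page of the report.
        page.add_redact_annot(fitz.Rect(*pad_bbox(bbox, margin)))
        page.apply_redactions()  # removes the content under the annotation
    doc.save(dst_path)
    doc.close()
```

Unlike cropping, redaction is meant to take the underlying content out of the saved file, which matters for personal data; it is still worth opening the output and searching for the removed text to confirm.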