python pdfplumber

Problem with recognizing single-cell tables in pdfplumber


I have a sample medical report, and at the top of each page of the PDF there is a table that contains personal information.

I have been trying to remove or crop out the personal information table from every page of that sample (sample_pdf) by finding the table's layout values, i.e. its bounding box. I am new to pdfplumber and not sure this is the right approach. Below is the code I have tried. I cannot get the table's layout values, even though pdfplumber draws a red box around the table.

Code that I have tried:

import pdfplumber

sample_data = []
sample_path = r"local_path_file"

with pdfplumber.open(sample_path) as pdf:
    pages = pdf.pages
    for p in pages:
        # one list of extracted tables per page
        sample_data.append(p.extract_tables())

print(sample_data)
pages[0].to_image()

sample_image

I am able to identify the first table using the code below:

pages[0].to_image().debug_tablefinder()

table_identified

However, when I try to extract the tables with the code below, I get nothing:

with pdfplumber.open(sample_path) as pdf:
    page = pdf.pages[0]   # first page only
    print(page.extract_tables())

output: []


Update

This particular sample PDF still fails, but with a similar PDF report I was able to crop based on the table's boundaries like this:

pages[0].find_tables()[0].bbox

output:

(25.19059366666667, 125.0, 569.773065, 269.64727650000003)

This shows the part that I want to get rid of (p0 is pages[0]):

p0.crop((25.19059366666667, 125.0, 569.773065, 269.64727650000003)).to_image().debug_tablefinder()

The crop below keeps everything from just under the top table (y0 = 269, the table ends at 269.64) to almost the bottom of the page (y1 = 840), and from the left edge (x0 = 0) to nearly the right edge (x1 = 590):

p0.crop((0, 269.0, 590, 840)).to_image()
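The hard-coded numbers above can also be derived from the page itself. This is only a minimal sketch, assuming the personal-information table is the first one `find_tables()` returns (as far as I can tell, pdfplumber sorts tables top to bottom) and that everything below it should be kept. `box_below` and `body_below_first_table` are hypothetical helper names, not part of pdfplumber:

```python
def box_below(table_bbox, page_bbox):
    """Return the part of page_bbox that lies below table_bbox.

    Both boxes use pdfplumber's (x0, top, x1, bottom) order.
    """
    page_x0, page_top, page_x1, page_bottom = page_bbox
    table_bottom = table_bbox[3]
    # Keep the cut line inside the page so crop() gets a box within the page.
    top = min(max(table_bottom, page_top), page_bottom)
    return (page_x0, top, page_x1, page_bottom)


def body_below_first_table(page):
    """Crop a pdfplumber page to the region under its first detected table."""
    tables = page.find_tables()
    if not tables:
        # Nothing detected (as with the problematic sample): leave the page as is.
        return page
    return page.crop(box_below(tables[0].bbox, page.bbox))
```

With an open document this would be called as `body_below_first_table(pages[0]).to_image()`, or in a loop over `pdf.pages`.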

Solution

  • Understanding the Issue

    pdfplumber 0.11.4

    The issue arises because pdfplumber filters out tables with a single cell. This behavior is controlled by the following line in the library's source code:

    # File: pdfplumber/table.py
    
    def cells_to_tables(cells: List[T_bbox]) -> List[List[T_bbox]]:
        ...
        filtered = [t for t in _sorted if len(t) > 1]   # single-cell tables are excluded here
        return filtered
    

    Ad-hoc Workaround

    We can modify the locally installed package to allow single-cell tables by returning _sorted instead of filtered (use caution: this works for this specific case but is otherwise untested):

    # pdfplumber/table.py
    
    def cells_to_tables(cells: List[T_bbox]) -> List[List[T_bbox]]:
        ...
        return _sorted   # Return all tables, including single-cell ones
    

    A more robust approach would be a feature enhancement in the library itself, such as adding an allow_one_cell_table option to the TableSettings class and taking it into account when extracting tables:

    table_settings = {"allow_one_cell_table": True}    # fictitious property
    page.extract_tables(table_settings)
    

    While this is not currently supported, the topic has been discussed in pdfplumber's GitHub issues.
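Until such an option exists, the single cell may still be reachable without editing the installed package. The sketch below assumes that the `TableFinder` returned by `page.debug_tablefinder()` exposes `cells` (every detected cell bbox) and `tables` (each with its own `cells`), as it appears to in 0.11.4. That is an assumption about internals that may change between versions. `cells_outside_tables` and `single_cell_regions` are hypothetical helper names:

```python
def cells_outside_tables(finder):
    """Return cell bboxes the table finder detected but did not keep in any table."""
    kept = {tuple(cell) for table in finder.tables for cell in table.cells}
    return [cell for cell in finder.cells if tuple(cell) not in kept]


def single_cell_regions(page, table_settings=None):
    """Crop a pdfplumber page to each detected cell that belongs to no table."""
    # Assumption: debug_tablefinder() returns an object with .cells and .tables.
    finder = page.debug_tablefinder(table_settings)
    return [page.crop(bbox) for bbox in cells_outside_tables(finder)]
```

On the sample page, `single_cell_regions(pdf.pages[0])` would be expected to contain the personal-information box, since that cell is detected (the red box) but filtered out of the table list.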

    Alternative Approach

    If modifying the code isn't an option, we can manually inspect the PDF structure. For the provided example, the single-cell table at the top of the page can be found as the first rectangular object. Here's how we can identify and visualize it (this snippet works for the first page of the document, but you can adapt it for other pages as needed):

    page = pdf.pages[0]
    rt = page.rects[0]   # first rectangle drawn on the page
    bbox = (rt['x0'], rt['top'], rt['x1'], rt['bottom'])
    page.crop(bbox).to_image(resolution=400).show()
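Picking `rects[0]` depends on the order in which rectangles happen to be stored, which may differ on other pages. A hedged sketch that selects the box by geometry instead, assuming the header table is the topmost rectangle spanning at least half the page width. The 0.5 threshold is an arbitrary guess and `find_header_rect` is a hypothetical helper, not a pdfplumber function:

```python
def find_header_rect(rects, page_width, min_width_ratio=0.5):
    """Pick the topmost rectangle that spans at least min_width_ratio of the page.

    rects is a list of dicts with 'x0', 'top', 'x1', 'bottom' keys, as in page.rects.
    Returns an (x0, top, x1, bottom) tuple, or None if no rectangle qualifies.
    """
    wide = [r for r in rects if (r["x1"] - r["x0"]) >= min_width_ratio * page_width]
    if not wide:
        return None
    rt = min(wide, key=lambda r: r["top"])
    return (rt["x0"], rt["top"], rt["x1"], rt["bottom"])
```

For each page this would be used as `bbox = find_header_rect(page.rects, page.width)`, followed by `page.crop(bbox)` when `bbox` is not `None`.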
    

    extracted single-cell table


    P.S. Regarding your main goal - removing data from a PDF - pdfplumber might not be the best choice. It's designed for data extraction, not PDF editing.