python, pdf, pdfplumber

How to extract both text and tables with pdfplumber


With the pdfplumber library, you can extract the text of a PDF page, or you can extract the tables from a PDF page.

The issue is that I can't seem to find a way to extract text and tables together, in the order they appear on the page. Essentially, if the PDF is formatted in this way:

text 1
table name
___________
| Header 1 |
------------
| row 1    |
------------

text 2

I would like the output to be:

["text 1",
 "table name",
 [["Header 1"], ["row 1"]],
 "text 2"]

In this example you could run extract_text from pdfplumber:

import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())

but that returns everything, table contents included, as plain text. You could run extract_tables instead, but that gives you only the tables. I need a way to extract both text and tables at the same time.

Is this built into the library in some way that I don't understand? If not, is it possible at all?

Edit: Answered

This comes directly from the accepted answer, with a slight tweak to make it work (the tables are keyed on 'top' rather than 'doctop'). Thanks so much!

from operator import itemgetter

import pdfplumber

def check_bboxes(word, table_bbox):
    """
    Check whether word is inside a table bbox.
    """
    l = word['x0'], word['top'], word['x1'], word['bottom']
    r = table_bbox
    return l[0] > r[0] and l[1] > r[1] and l[2] < r[2] and l[3] < r[3]


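# `page` is a pdfplumber page, e.g. one of pdf.pages from the snippet above.
tables = page.find_tables()
table_bboxes = [i.bbox for i in tables]
# Tag each extracted table with its top edge so it can be ordered with the words.
tables = [{'table': i.extract(), 'top': i.bbox[1]} for i in tables]
# Keep only the words that are not inside any table.
non_table_words = [word for word in page.extract_words() if not any(
    check_bboxes(word, table_bbox) for table_bbox in table_bboxes)]
lines = []
# Group words and tables into lines by vertical position.
for cluster in pdfplumber.utils.cluster_objects(
        non_table_words + tables, itemgetter('top'), tolerance=5):
    if 'text' in cluster[0]:
        lines.append(' '.join([i['text'] for i in cluster]))
    elif 'table' in cluster[0]:
        lines.append(cluster[0]['table'])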

Edit July 19th 2022:

Updated the key argument to use itemgetter, which pdfplumber's cluster_objects function now requires (rather than a string).
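
For anyone unfamiliar with it: itemgetter('top') just builds a function that looks up the 'top' key of whatever dict it is given, which is what the clustering call uses to read each item's vertical position. A minimal sketch with made-up word dicts (the values are invented for illustration and only loosely shaped like what extract_words returns):

```python
from operator import itemgetter

# Hypothetical word dicts; real ones from page.extract_words() carry more keys.
words = [
    {"text": "text", "top": 10.0},
    {"text": "1", "top": 10.4},
]

# itemgetter("top") returns a callable equivalent to: lambda d: d["top"]
get_top = itemgetter("top")
tops = [get_top(w) for w in words]
print(tops)
```

The same callable works on the table dicts above, since they also carry a 'top' key.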


Solution

  • You can get tables' bounding boxes and then filter out all of the words inside them, something like this:

    def check_bboxes(word, table_bbox):
        """
        Check whether word is inside a table bbox.
        """
        l = word['x0'], word['top'], word['x1'], word['bottom']
        r = table_bbox
        return l[0] > r[0] and l[1] > r[1] and l[2] < r[2] and l[3] < r[3]
    
    
    tables = page.find_tables()
    table_bboxes = [i.bbox for i in tables]
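    # Note: bbox[1] is measured from the top of the page, while a word's 'doctop' is
    # measured from the top of the document, so the two only line up on the first page.
    tables = [{'table': i.extract(), 'doctop': i.bbox[1]} for i in tables]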
    non_table_words = [word for word in page.extract_words() if not any(
        [check_bboxes(word, table_bbox) for table_bbox in table_bboxes])]
    lines = []
    for cluster in pdfplumber.utils.cluster_objects(non_table_words+tables, 'doctop', tolerance=5):
        if 'text' in cluster[0]:
            lines.append(' '.join([i['text'] for i in cluster]))
        elif 'table' in cluster[0]:
            lines.append(cluster[0]['table'])
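
To see the idea without a PDF at hand, here is a rough, self-contained sketch of the same approach on made-up coordinates. Everything in it is an assumption for illustration: the word boxes and table bbox are invented, inside mirrors check_bboxes, and cluster_by is a simplified stand-in for pdfplumber.utils.cluster_objects (not the library's implementation).

```python
from operator import itemgetter


def inside(word, bbox):
    """True if the word's box lies strictly inside bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    return (word["x0"] > x0 and word["top"] > top
            and word["x1"] < x1 and word["bottom"] < bottom)


def cluster_by(items, key, tolerance):
    """Simplified stand-in for cluster_objects: sort by key, then start a new
    group whenever the gap to the previous item exceeds `tolerance`."""
    clusters = []
    for item in sorted(items, key=key):
        if clusters and key(item) - key(clusters[-1][-1]) <= tolerance:
            clusters[-1].append(item)
        else:
            clusters.append([item])
    return clusters


def w(text, x0, top, x1, bottom):
    """Build a hypothetical word dict with the keys the approach relies on."""
    return {"text": text, "x0": x0, "top": top, "x1": x1, "bottom": bottom}


# Invented layout matching the example in the question.
words = [
    w("text", 10, 10, 30, 20), w("1", 34, 10, 40, 20),
    w("table", 10, 30, 35, 40), w("name", 39, 30, 60, 40),
    w("Header", 12, 52, 45, 62), w("1", 49, 52, 55, 62),  # inside the table
    w("row", 12, 72, 30, 82), w("1", 34, 72, 40, 82),     # inside the table
    w("text", 10, 100, 30, 110), w("2", 34, 100, 40, 110),
]
table_bbox = (8, 50, 80, 90)
table = {"table": [["Header 1"], ["row 1"]], "top": table_bbox[1]}

# Drop the words that fall inside the table, then order everything by 'top'.
outside = [x for x in words if not inside(x, table_bbox)]
lines = []
for cluster in cluster_by(outside + [table], itemgetter("top"), tolerance=5):
    if "text" in cluster[0]:
        lines.append(" ".join(i["text"] for i in cluster))
    else:
        lines.append(cluster[0]["table"])
print(lines)
```

Note that the containment test uses strict inequalities, so a word whose box touches the table's edge counts as outside the table.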