pythonpdfplumber

pdfplumber table-extract inconsistent columns and stripping spaces


Pdfplumber is the most accurate tool I have found so far for extracting text from a PDF, plus it can extract table data in rows and columns. I have encountered two problems with the table function.

  1. a wide column of text (e.g. a description) may be split into smaller columns, or may not.
  2. when joining the split strings to re-form the descriptive column, original white space at start and end of each split-string has been removed, resulting in incorrect re-assembly. All advice would be appreciated.

This sample extracts a table from each PDF. The two tables are identical except that the 2nd has fewer lines. Issue 1: The table from the first shows the leftmost column split into three columns, while the the identical data in the 2nd table is not split. Is it possible to avoid splitting a column? Issue 2: in splitting the first column into 3 parts, whitespace is removed between the parts. I.e. 'Balance at 31 December 2020' is split into 'Balance at 31 Decem', 'ber 2020', ''. Simply joining the parts restores the text to 'Balance at 31 December 2020' - correct. However 'Total comprehensive income for the year' is split into 'Total compre', 'hensive', 'income for', 'the year' and joining the parts results in 'Total comprehensiveincome forthe year' - wrong.

Links to the PDF files: The file in which pdfplumber splits first column: https://www.dropbox.com/s/qlqr27s29vk79j4/pdfdoc-sheet3.pdf?dl=0 The file in which pdfplumber keeps the first column intact: https://www.dropbox.com/s/0cz8szmph847sin/pdfdoc-sheet4.pdf?dl=0

Sample code:

    import pdfplumber

    filepaths = ('C:/ProgramData/PythonProgs/pdfdoc-sheet3.pdf',\
                 'C:/ProgramData/PythonProgs/pdfdoc-sheet4.pdf')
    for filepath in filepaths:
        print('----------------------------------------')
        pdf = pdfplumber.open(filepath)
        for page in pdf.pages:
            text = page.extract_text()
            textlines = text.split('\n')
            tablelines = page.extract_table(table_settings=\
                    {"vertical_strategy": "text", \
                     "horizontal_strategy": "text", \
                     "snap_tolerance":5}) # snap_tolernace 4 - 9 works
            for i in range(len(tablelines)):
                print(i, tablelines[i])

Output:

0 ['', '', '', '', 'Notes', 'Share']
1 ['', '', '', '', '', 'capital']
2 ['', '', '', '', '', '']
3 ['Balance at 1', 'January', '2021', '', '', '12,000']
4 ['Dividends', '', '', '', '', '-']
5 ['Issue of shar', 'e capital', 'on exercis', 'e of', '', '270']
6 ['Employee sh', 'are-base', 'd compens', 'ation', '', '-']
7 ['Issue of shar', 'e capital', 'on private', 'placement', '', '1,500']
8 ['Transactions', 'with own', 'ers', '', 'note1', '1,770']
9 ['Profit for the', 'year', '', '', 'note2', '-']
10 ['Other compre', 'hensive', 'income', '', '', '-']
11 ['Total compre', 'hensive', 'income for', 'the year', 'note3', '-']
12 ['Balance at 3', '1 Decem', 'ber 2021', '', '', '13,770']
13 ['Balance at 1', 'January', '2020', '', '', '12,000']
14 ['Employee sh', 'are-base', 'd compens', 'ation', '', '-']
15 ['Transactions', 'with own', 'ers', '', '', '-']
16 ['Profit for the', 'year', '', '', '', '-']
17 ['Other compre', 'hensive', 'income', '', '', '-']
18 ['Total compre', 'hensive', 'income for', 'the year', '', '-']
19 ['Balance at 3', '1 Decem', 'ber 2020', '', '', '12,000']
--------------  C:/ProgramData/PythonProgs/pdfdoc-sheet4.pdf  ----------------
0 ['', 'Notes', 'Share']
1 ['', '', 'capital']
2 ['', '', '']
3 ['Balance at 1 January 2021', '', '12,000']
4 ['Transactions with owners', 'note1', '1,770']
5 ['Profit for the year', 'note2', '-']
6 ['Other comprehensive income', '', '-']
7 ['', '', '']
8 ['Total comprehensive income for the year', 'note3', '-']
9 ['', '', '']
10 ['Balance at 31 December 2020', '', '12,000']

Solution

  • Can you use the headers as vertical line markers? enter image description here

    headers = [
       page1.search('Notes')[0],
       page1.search('Share\s+capital')[0],
    ]
    
    vlines = [
        1,
        headers[0]['x0'],
        headers[1]['x0'],
        headers[1]['x1'],
    ]
    
    hlines = [line['top'] for line in page1.vertical_edges]
    
    # we need to add top/bottom lines to get first/last rows
    hlines.insert(0, headers[-1]['bottom'])
    hlines.append(page1.vertical_edges[-1]['bottom'] + 10)
    
    """
    im = page1.to_image(300)
    im.draw_vlines(vlines, stroke_width=3)
    im.draw_hlines(hlines, stroke_width=3)
    im.save('lines.png')
    """
    
    page1.extract_table(dict(
       explicit_vertical_lines = vlines,
       explicit_horizontal_lines = hlines,
    ))
    
    [['Balance at 1 January 2021', '', '12,000'],
     ['Dividends', '', '-'],
     ['Issue of share capital on exercise of', '', '270'],
     ['Employee share-based compensation', '', '-'],
     ['Issue of share capital on private placement', '', '1,500'],
     ['Transactions with owners', 'note1', '1,770'],
     ['Profit for the year', 'note2', '-'],
     ['Other comprehensive income', '', '-'],
     ['Total comprehensive income for the year', 'note3', '-'],
     ['Balance at 31 December 2021', '', '13,770'],
     ['Balance at 1 January 2020', '', '12,000'],
     ['Employee share-based compensation', '', '-'],
     ['Transactions with owners', '', '-'],
     ['Profit for the year', '', '-'],
     ['Other comprehensive income', '', '-'],
     ['Total comprehensive income for the year', '', '-'],
     ['Balance at 31 December 2020', '', '12,000']]
    

    Doing the same for page 2:

    [['Balance at 1 January 2021', '', '12,000'],
     ['Transactions with owners', 'note1', '1,770'],
     ['Profit for the year', 'note2', '-'],
     ['Other comprehensive income', '', '-'],
     ['Total comprehensive income for the year', 'note3', '-'],
     ['Balance at 31 December 2020', '', '12,000']]