python, pdf, web-scraping, tabula

Extract table from PDF - text in different rows


I have a bunch of PDFs like this: https://www.premera.com/documents/045632_2023.pdf

Where there is a large table: Important Questions | Answers | Why this matters. I need to extract that table and have it as CSV. I need CSVs that look like the table in the PDF. The problem: the text of one cell gets split into multiple rows.

I've seen others who had a similar problem, but they treated the PDFs as images: What is the best way to extract text contained within a table in a pdf using python? I still need to use the PDF as "text" (not as an image).

When I use:

df = tabula.read_pdf('045632_2023.pdf', pages = 'all')

df[0]

The table looks like this. See for example rows 2-4 for the first column (Important Questions). That question, which in the original table was in one cell, is now split into three rows.

[screenshot of the resulting DataFrame, with the first question split across rows 2-4]

When I use:

df = tabula.read_pdf('045632_2023.pdf', pages='all', lattice=True)

df[0]

This partially solves the problem; see row 1 of the first column, where the question is now in a single cell. However, plenty of sentences/cells are still split across several rows.

[screenshot of the resulting DataFrame with lattice=True]

How would you solve this? Again, I need to extract that table as a CSV that looks like the table in the PDF.


Solution

  • It's somewhat understandable that the other solution used the PDFs as images since, as explained in the pypdf docs,

    Typically, tables are just absolutely positioned text. In the worst case, every single letter could be absolutely positioned. That makes it hard to tell where columns / rows are.

    If you can find any other source for the data [are the files available in any other format?], I would strongly suggest trying to extract tables from those instead.

    Sometimes there's a pattern in how the table has been scrambled, and then you can try to shift the contents around [see this example]. I don't see any patterns here, but if you're more familiar with the data, you might spot one; a rough sketch of that approach follows below.
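
    [A minimal sketch of such a pattern-based fix, assuming the pattern is "a continuation row has an empty first column" - which your PDFs may not follow. merge_continuation_rows is a hypothetical helper, not one of the pastebin functions below:]

    import pandas as pd
    import tabula

    def merge_continuation_rows(df, key_col=0):
        # treat any row whose first column is empty as a continuation
        # of the row above, and concatenate its cells onto that row
        merged = []
        for _, row in df.iterrows():
            cells = list(row)
            if merged and pd.isna(cells[key_col]):
                prev = merged[-1]
                for i, cell in enumerate(cells):
                    if pd.notna(cell):
                        prev[i] = cell if pd.isna(prev[i]) else f'{prev[i]} {cell}'
            else:
                merged.append(cells)
        return pd.DataFrame(merged, columns=df.columns)

    dfs = tabula.read_pdf('045632_2023.pdf', pages='all', lattice=True)
    fixed = merge_continuation_rows(dfs[0])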

    But if this is your only option, then I have this set of functions which try to extract a list of dictionaries [each dictionary potentially representing a row] from a PDF file using the minecart library. It's still glitchy and rather limited [it can't blindly find tables in any document the way tabula tries to - instead, it requires information about the table columns, and preferably the first row/line as well], but you might find it useful for your issue here.

    Of them, pdfPg_table is the shortest method - along with passing the document, you have to specify the page index and [optionally] the starting positions of the columns and first row. [The first page is parsed by default; if the first row's position isn't specified, it just splits up the whole page; and if the columns' left-border positions aren't specified, you have to tell it how many columns there should be instead, and it guesses the left borders from the frequency of character positions.]

    You could just call it like pdfPg_table('045632_2023.pdf', 3) for 3 columns [from the first page, by default] [view output] and then trim off the unnecessary rows and columns; but getRows_fromPdfDoc (which depends on all the other functions) accepts hints to figure out where the table starts and where the columns divide:

    import pandas
    # PASTE EVERYTHING FROM https://pastebin.com/trKxysh6 [defines getRows_fromPdfDoc and its helpers]
    
    colHead = ['Important Questions', 'Answers', 'Why This Matters:']
    pandas.DataFrame(getRows_fromPdfDoc(
        pDoc='045632_2023.pdf',  startSeq=' '.join(colHead), colHint=colHead, skipRows=1
    )).drop(['col_0'], axis='columns').dropna(thresh=2, axis='rows').set_axis(colHead, axis='columns')
    

    [output: the extracted table as a DataFrame]

    [Once you have the DataFrame, you can of course use .to_csv to save as CSV.]
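
    [A one-liner sketch, assuming the DataFrame above is bound to a name like df; the filename is just an example:]

    df.to_csv('045632_2023_table1.csv', index=False)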


    getRows_fromPdfDoc is also capable of extracting a table that spans multiple pages [as long as it has the same header on every page]:

    csSamps = [ 'Common Medical Event', 'Services You May Need', '$30 copayment', 
                'Non-Preferred: 40% coinsurance', 'Limitations, Exceptions, & Other']
    headTxt = ' '.join(csSamps[:2] + ['What You Will Pay'] + csSamps[-1:])
    rList = getRows_fromPdfDoc('045632_2023.pdf', headTxt, csSamps, 2)
    

    But at this point, you might notice that merged cells are left empty except in the first row. If a cell is merged only across rows, it's not difficult to fill the empty cells in from the previous rows:

    prev_c1 = ''  # last non-empty value seen in col_1
    for ri, r in enumerate(rList):
        if r['col_1']: prev_c1 = r['col_1']    # remember the latest filled cell
        else: rList[ri]['col_1'] = prev_c1     # fill the merged (empty) cell downward
    
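    [Equivalently, if you've already wrapped rList in a DataFrame, pandas can do the forward-fill in one step - a sketch, assuming the empty cells are empty strings:]

    import pandas as pd

    df = pd.DataFrame(rList)
    # treat empty strings as missing, then carry the last non-empty
    # value downward through the merged cells
    df['col_1'] = df['col_1'].replace('', pd.NA).ffill()
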

    However, if cells are merged across columns, or split in any way, then the merged columns get split back apart, and partially filled rows are added for the split row. [I expect splits across columns to remain entirely undetected, with the contents merged back into a single cell.]

    Also, note that there's no way to extract nested tables with the current functions, although it's definitely not impossible. I might be able to come up with more reliable methods if I can figure out how to detect the background color of the characters and when they cross a cell border...