I have a bunch of PDFs like this: https://www.premera.com/documents/045632_2023.pdf
Each has a large table: Important Questions | Answers | Why this matters. I need to extract that table and save it as a CSV that looks like the table in the PDF. The problem: the text of one cell gets split across multiple rows.
I've seen that others have had a similar problem, but they used the PDFs as images: What is the best way to extract text contained within a table in a pdf using python? I still need to use the PDF as "text" (not as an image).
When I use:
df = tabula.read_pdf('045632_2023.pdf', pages='all')
df[0]
The table looks like this. See, for example, rows 2-4 of the first column (Important Questions): that question, which was a single cell in the original table, is now split across three rows.
When I use:
df = tabula.read_pdf('045632_2023.pdf', pages='all', lattice=True)
df[0]
This partially solves the problem; see row 1 of the first column, where the question is now in a single cell. However, there are still plenty of sentences/cells that are split across several rows.
How would you solve this problem? I need to extract that table and save it as a CSV that looks like the table in the PDF.
It's somewhat understandable that the other solution used the PDFs as images since, as explained in the pypdf docs,
Typically, tables are just absolutely positioned text. In the worst case, every single letter could be absolutely positioned. That makes it hard to tell where columns/rows are.
If you can find any other source for the data [are the files available in any other format?], I would strongly suggest trying to extract tables from those instead.
Sometimes there's a pattern in how the table has been scrambled, and then you can try to shift the contents around [see this example]; I don't see any pattern here, but if you're more familiar with the data, you might. The sketch below shows the general shape of such a fix.
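For instance, assuming the split rows in your lattice=True output leave the first column blank (NaN), a pass like this merges each continuation row back into the row above it [a rough sketch, not tested against this exact file; the blank-first-column assumption may not hold on every page]:
import pandas
import tabula
dfs = tabula.read_pdf('045632_2023.pdf', pages='all', lattice=True)
df = dfs[0]
rows = []
for _, r in df.iterrows():
    # a blank first column suggests this is a continuation of the previous row
    if rows and pandas.isna(r.iloc[0]):
        for c in df.columns:
            if pandas.notna(r[c]):
                rows[-1][c] = f"{rows[-1][c]} {r[c]}".strip()
    else:
        rows.append(r.fillna('').to_dict())
merged = pandas.DataFrame(rows)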
But if this is your only option, then I have this set of functions which try to extract a list of dictionaries [with each dictionary potentially representing a row] from a PDF file using the minecart library. It's still glitchy and rather limited [it can't blindly find tables from any document like tabula tries to; instead, it requires information about the table columns, and preferably the first row/line as well], but you might find it useful for your issue here.
Of them, pdfPg_table is the shortest method: along with passing the document, you have to specify the page index and [optionally] the starting positions of the columns and of the first row. [The first page is the default page to parse; if the first row's position isn't specified, it just splits up the whole page; and if the columns' left-border positions aren't specified, you have to instead tell it how many columns there should be, and it guesses the left borders based on the frequency of character positions.]
You could just call it like pdfPg_table('045632_2023.pdf', 3) for 3 columns [from the first page, by default] [view output] and then trim off the unnecessary rows and columns, but getRows_fromPdfDoc (which depends on all the other functions) accepts hints to figure out where the table starts and how the columns are divided:
startSeq: the text expected at the beginning of the table ["Important Questions Answers Why This Matters:" in the example below], and
colHint: can just be the number of columns like in the example above, but it's more reliable to send the horizontal positions of the left borders, or [more feasibly] substrings from the beginnings of the left-aligned contents of each column. [These can be the column headers (as below), but not always.]
import pandas
# PASTE EVERYTHING FROM https://pastebin.com/trKxysh6
colHead = ['Important Questions', 'Answers', 'Why This Matters:']
pandas.DataFrame(getRows_fromPdfDoc(
    pDoc='045632_2023.pdf', startSeq=' '.join(colHead), colHint=colHead, skipRows=1
)).drop(['col_0'], axis='columns').dropna(thresh=2, axis='rows').set_axis(colHead, axis='columns')
[Once you have the DataFrame, you can of course use .to_csv to save it as a CSV.]
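For example, if df holds the DataFrame built above [the output filename here is just a placeholder]:
df.to_csv('045632_2023.csv', index=False)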
getRows_fromPdfDoc is also capable of extracting a table that spans multiple pages [as long as it has the same header on every page]:
csSamps = ['Common Medical Event', 'Services You May Need', '$30 copayment',
           'Non-Preferred: 40% coinsurance', 'Limitations, Exceptions, & Other']
headTxt = ' '.join(csSamps[:2] + ['What You Will Pay'] + csSamps[-1:])
rList = getRows_fromPdfDoc('045632_2023.pdf', headTxt, csSamps, 2)
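As before, the returned list of row dictionaries can be wrapped in a DataFrame for inspection or export:
pandas.DataFrame(rList)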
But at this point, you might notice that merged cells are left empty except in their first row. If a cell is merged only across rows, it's not difficult to fill the blanks in from the previous rows:
# carry the last non-empty value of col_1 down into the empty cells below it
prev_c1 = ''
for r in rList:
    if r['col_1']:
        prev_c1 = r['col_1']
    else:
        r['col_1'] = prev_c1
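The same fill-down can be done in pandas with ffill [a sketch, assuming the empty cells are empty strings, as above]:
import numpy
filled = pandas.DataFrame(rList)
filled['col_1'] = filled['col_1'].replace('', numpy.nan).ffill()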
However, if cells are merged across columns, or split in any way, then the merged columns get split back up, and partially filled rows are added for the split row. [I expect splits across columns will remain entirely undetected, and the contents will just be merged back into a single cell.]
Also, note that there's no way to extract nested tables with the current functions, although it's definitely not impossible. I might be able to come up with more reliable methods if I can figure out how to detect the background color of the characters and when they cross a cell border.