Using reportlab, I made two one-page PDFs, each containing a single table.
The data in the table is:
data1 = [['00', '', '02', '', '04'],
         ['', '11', '', '13', ''],
         ['20', '', '22', '23', '24'],
         ['30', '31', '32', '', '34']]
The goal is to get the rows, including the empty cells. If the table has borders, there is no problem.
But if the table has no borders, the code below returns no tables at all.
Any ideas why?
The PDFs are identical except that pdf1 has no borders and pdf2 has borders.
import pdfplumber

with pdfplumber.open(path2pdf + savename1) as pdf1:
    # Get the first page of the document
    page = pdf1.pages[0]
    # Get the text of the page
    text = page.extract_text()
    # Get all the tables on this page
    tables = page.extract_tables()
    # Iterate over each table
    for table in tables:
        # Iterate over each row of the table
        for row in table:
            print(row)
If I open pdf2 instead of pdf1, I get the required result.
EDIT: I tried the following, but I get an error. I am not sure how I should format the call:
pdf_table = page.extract_tables(vertical_strategy='text', horizontal_strategy='text')
Traceback (most recent call last):
  File "/usr/lib/python3.8/idlelib/run.py", line 559, in runcode
    exec(code, self.locals)
  File "<pyshell#70>", line 1, in <module>
TypeError: extract_tables() got an unexpected keyword argument 'vertical_strategy'
According to the pdfplumber documentation, page.extract_tables() accepts a set of table extraction settings.
By default, the strategy is to use the page's vertical and horizontal lines as cell separators, which is why a table without borders yields no results. You can, however, specify an alternative extraction strategy.
The method can be customised with the following settings:
{
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "snap_x_tolerance": 3,
    "snap_y_tolerance": 3,
    "join_tolerance": 3,
    "join_x_tolerance": 3,
    "join_y_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "keep_blank_chars": False,
    "text_tolerance": 3,
    "text_x_tolerance": 3,
    "text_y_tolerance": 3,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": 3,
    "intersection_y_tolerance": 3,
}
The settings you most likely need are the vertical and horizontal strategies. Setting either one to "text" works as follows:
"vertical_strategy": "text" deduces the (imaginary) lines that connect the left, right, or center of words on the page, and uses those lines as the borders of potential table cells. At least min_words_vertical words must share the same alignment.
"horizontal_strategy": "text" does the same, but using the tops of words. At least min_words_horizontal words must share the same alignment.
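To make the alignment rule concrete, here is a toy sketch of the idea behind the "text" vertical strategy. It is not pdfplumber's actual implementation: it only looks at left edges, and the function name candidate_vertical_lines and the tolerance parameter are hypothetical. The word dictionaries use an "x0" key for the left edge, mirroring the shape of the entries returned by page.extract_words().

```python
def candidate_vertical_lines(words, min_words_vertical=3, tolerance=3):
    """Return left-edge x positions shared by at least min_words_vertical words.

    Simplified illustration only: pdfplumber also considers right edges
    and centers, which this sketch ignores.
    """
    clusters = []  # each cluster holds x0 values that lie close together
    for x0 in sorted(word["x0"] for word in words):
        if clusters and x0 - clusters[-1][-1] <= tolerance:
            clusters[-1].append(x0)
        else:
            clusters.append([x0])
    # Keep only clusters with enough aligned words; report their leftmost edge.
    return [cluster[0] for cluster in clusters if len(cluster) >= min_words_vertical]


# Hypothetical word positions: three words near x=10, two at x=50, three near x=90.
sample_words = [{"x0": x} for x in (10, 11, 10, 50, 50, 90, 90, 91)]
lines = candidate_vertical_lines(sample_words)
```

With the default min_words_vertical of 3, the two words at x=50 do not form a line. This is why a sparse column in a small table can be missed unless the threshold is lowered.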
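Pass the settings as a dictionary through the table_settings argument, not as individual keyword arguments (that is what caused the TypeError):
.extract_table(table_settings={<put settings you need in here>})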
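A minimal sketch of that call for the borderless PDF, assuming the "text" strategy is enough on its own. The names TEXT_SETTINGS and extract_borderless_tables are hypothetical, and the tolerances may need tuning for a particular layout.

```python
# Only the overridden keys are listed here; the sketch relies on
# pdfplumber falling back to its defaults for everything else.
TEXT_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}


def extract_borderless_tables(pdf_path, page_number=0, settings=TEXT_SETTINGS):
    """Return every table found on one page, as lists of rows."""
    import pdfplumber  # third-party; imported lazily so the settings can be inspected without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        # The settings go in as a single dict, not as separate keyword arguments.
        return page.extract_tables(table_settings=settings)
```

Calling extract_borderless_tables(path2pdf + savename1) would then take the place of the bare page.extract_tables() call in the question.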
It is often helpful to crop the page with Page.crop(bounding_box) before trying to extract the table.
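As a rough sketch of cropping first: pdfplumber bounding boxes are (x0, top, x1, bottom) tuples measured from the top-left corner of the page. The helpers inset_bbox and extract_table_from_region are hypothetical names, and the margin values are placeholders to adjust for the actual page.

```python
def inset_bbox(page_width, page_height, left=0, top=0, right=0, bottom=0):
    """Build an (x0, top, x1, bottom) box by trimming margins off the page."""
    return (left, top, page_width - right, page_height - bottom)


def extract_table_from_region(pdf_path, margins, page_number=0):
    """Crop one page to the given margins, then extract a table from the crop."""
    import pdfplumber  # third-party; imported lazily so inset_bbox works without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        bbox = inset_bbox(page.width, page.height, **margins)
        cropped = page.crop(bbox)
        return cropped.extract_table(
            table_settings={
                "vertical_strategy": "text",
                "horizontal_strategy": "text",
            }
        )


# Placeholder margins on an A4-sized page (595 x 842 points).
example_box = inset_bbox(595, 842, left=50, top=100, right=50, bottom=100)
```

Cropping removes headers, footers, and other stray text that the "text" strategy might otherwise mistake for table columns or rows.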