I am new to pdfplumber, and I am impressed by how well it extracts text from tables.
It is easy to use for tables that fill a whole page, but in my case I am working with topological schematics that have some tables embedded in them.
It fails to extract the first column and the last row of every table in the document. I have tried tweaking several parameters in the table_settings variable, but I have not been able to get a better result (if I use "text" instead of "lines" as the strategy, the rest of the text in the schematic is treated as a table too).
Any help with this? I am using Python 3.9.8, and the PDF used for testing is schematic.pdf.
The source code is as follows:
import pdfplumber

pdf_file = "Schematic.pdf"
tables = []
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    tbl = pages[0].extract_tables()
    print(f'{tbl}')
Some of the edges in the PDF look like lines but are not objects that pdfplumber classifies as lines (for example, they may be stored as curves). In such cases, all of the page's curves and edges can be passed explicitly as lines. The following table settings worked for this case, where page is the page being extracted (pages[0] in the question):
{
    "vertical_strategy": "explicit",
    "horizontal_strategy": "explicit",
    "explicit_vertical_lines": page.curves + page.edges,
    "explicit_horizontal_lines": page.curves + page.edges,
    "intersection_tolerance": 15,
}
With these settings, the extracted rows are:
['(cid:47)(cid:44)(cid:54)(cid:55)(cid:36)(cid:3)(cid:39)(cid:40)(cid:3)(cid:39)(cid:40)(cid:54)(cid:57)(cid:203)(cid:50)(cid:54)', None, None, None, None, None]
['(cid:49)(cid:158)', 'PK', 'VEL.', '(cid:49)(cid:158)', 'PK', 'VEL.']
['A64', '3+100', '100 Km/h', 'A66', '3+365', '100 Km/h']
['A65', '3+189', '100 Km/h', 'S2MSU2', '5+884', '100 Km/h']
['A67', '3+363', '100 Km/h', 'S4MSU1', '6+052', '100 Km/h']
['', '', '', '', '', '']
['(cid:54)(cid:40)(cid:102)(cid:36)(cid:47)(cid:40)(cid:54)', None, None, None]
['NOMBRE', 'PK', 'NOMBRE', 'PK']
['E3', '3+720', 'EMSUF2', '5+766']
['E4', '3+784', 'EMSUF1', '5+766']
['B004F2', '4+295', 'SMSUM2', '6+185']
['B004F1', '4+295', 'SMSUM1', '6+188']
['', '', '', '']
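For completeness, here is a minimal sketch of how the settings from the answer might be wired into the script from the question. The helper names (build_table_settings, drop_empty_rows, extract_schematic_tables) are hypothetical and not part of pdfplumber. Dropping the all-empty trailing rows is an extra step that the answer does not include. The file name "Schematic.pdf" is taken from the question.

```python
def build_table_settings(page, intersection_tolerance=15):
    """Build the settings from the answer: every curve and edge on the
    page is passed to the table finder as an explicit line."""
    lines = page.curves + page.edges
    return {
        "vertical_strategy": "explicit",
        "horizontal_strategy": "explicit",
        "explicit_vertical_lines": lines,
        "explicit_horizontal_lines": lines,
        "intersection_tolerance": intersection_tolerance,
    }


def drop_empty_rows(table):
    """Remove rows whose cells are all None or empty strings
    (an optional clean-up step, not part of the original answer)."""
    return [
        row for row in table
        if any(cell not in (None, "") for cell in row)
    ]


def extract_schematic_tables(pdf_path, page_index=0):
    """Extract all tables from one page using the explicit-line settings."""
    # Imported here so the two helpers above can be used and tested
    # without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        tables = page.extract_tables(build_table_settings(page))
    return [drop_empty_rows(table) for table in tables]


# Example usage (assumes Schematic.pdf is in the working directory):
# for table in extract_schematic_tables("Schematic.pdf"):
#     for row in table:
#         print(row)
```

The intersection_tolerance of 15 is the value that worked for this particular PDF. Other documents may need a different value, so it is exposed as a parameter here.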