python-3.x, text-extraction, pdfminer, python-camelot, excalibur-py

Extract a table with fixed size and position from PDF files in Python


Say I have many PDF files similar to the one from here:

I would like to extract the following table and save it as an Excel file:

[image of the table to extract]

I'm able to extract the table and save it as an Excel file manually with the Excalibur package.

After installing Excalibur with pip3, I initialize the metadata database using:

$ excalibur initdb

And then start the webserver using:

$ excalibur webserver

Then go to http://localhost:5000 and start extracting tabular data from PDFs.

I wonder whether it's possible to do this automatically for multiple PDF files with a Python script, using packages such as excalibur-py, camelot, pdfminer, etc., since the size and position of the table are fixed across reports for the same city.

You may download other report files from this link.

Many thanks in advance.


Solution

  • Using Camelot, you can build a pipeline like this:

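    import os
    import camelot
    
    # Placeholder paths and coordinates; replace them with your own values.
    files_list = ['FIRST_PATH', 'SECOND_PATH']  # add more paths as needed
    # Each region is a string of the form 'x1,y1,x2,y2' in PDF coordinate space.
    regions = ['REGION_COORDINATES_1', 'REGION_COORDINATES_2']
    
    for filepath in files_list:
        tables = camelot.read_pdf(filepath, pages='1-end', table_regions=regions)
        # Name the output after the input so one PDF does not overwrite another.
        output_path = os.path.splitext(filepath)[0] + '.xlsx'
        tables.export(output_path, f='excel')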
    

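    The table_regions parameter should be used when you know only the approximate position of the table on the page; if you know its exact position, use table_areas instead. Both take a list of strings of the form 'x1,y1,x2,y2', where (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner in PDF coordinate space.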

    You can read more about these parameters and other topics in the Camelot documentation.
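Since the question says the table has a fixed size and position, here is a minimal sketch of the table_areas variant that processes a whole folder and writes one workbook per PDF. The helper names (area_string, output_path, extract_all), the folder layout, and the default flavor are assumptions made for illustration; they are not part of Camelot. Writing .xlsx through pandas also assumes openpyxl is installed.

```python
from pathlib import Path


def area_string(x1, y1, x2, y2):
    """Format a bounding box as the 'x1,y1,x2,y2' string used by table_areas.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner.
    PDF coordinates start at the bottom-left of the page, so y1 > y2.
    """
    if x1 >= x2 or y1 <= y2:
        raise ValueError("expected x1 < x2 and y1 > y2 (PDF coordinate space)")
    return f"{x1},{y1},{x2},{y2}"


def output_path(pdf_path, out_dir):
    """Return <out_dir>/<pdf stem>.xlsx so each PDF gets its own workbook."""
    return Path(out_dir) / (Path(pdf_path).stem + ".xlsx")


def extract_all(pdf_dir, out_dir, area, pages="1", flavor="lattice"):
    """Extract the table at a fixed area from every PDF in pdf_dir.

    Hypothetical helper: 'lattice' suits tables with ruling lines,
    'stream' suits tables separated by whitespace.
    """
    import camelot  # imported lazily so the helpers above work without it

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        tables = camelot.read_pdf(
            str(pdf_path), pages=pages, flavor=flavor, table_areas=[area]
        )
        if tables.n == 0:
            # Nothing detected in the given area; skip this file.
            print(f"no table found in {pdf_path.name}")
            continue
        target = output_path(pdf_path, out_dir)
        # Save the first detected table as a plain sheet without index or header.
        tables[0].df.to_excel(target, index=False, header=False)
        written.append(target)
    return written
```

A call such as extract_all('reports', 'out', area_string(50, 700, 550, 400)) would then process every PDF in a reports folder; the coordinates here are made up and have to be replaced with the real position of the table, which may differ from one city's report layout to another.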