
Referencing the last page in a PDF with tabula?


I want to parse tables from the last page of a batch of PDF documents, but the number of pages varies from document to document. What I do know is that the last page is the same in all of them. Currently I read every page:

all_tables_stream = tabula.read_pdf(path, password=password, stream=True, pages='all')

Is there an elegant way to do this where I don't have to scrape all pages in the document just to get to the tables on the final page?


Solution

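  • First get the number of pages, for example with pyPdf, then pass that number to tabula as the pages argument so that only the last page is parsed.

    import pyPdf
    import tabula

    # Count the pages first so that only the last one is handed to tabula.
    with open(path, mode='rb') as f:
        reader = pyPdf.PdfFileReader(f)
        if reader.isEncrypted:
            reader.decrypt(password)  # needed before the page count can be read
        n = reader.getNumPages()

    # tabula numbers pages from 1, so n is the last page.
    all_tables_stream = tabula.read_pdf(path, password=password, stream=True, pages=n)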
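
pyPdf was written for Python 2 and is no longer maintained; its successor is pypdf. As a minimal sketch, assuming pypdf and tabula-py are installed, the same idea might look like the helper below. The function name read_last_page_tables is made up for illustration and is not part of either library.

```python
def read_last_page_tables(path, password=None):
    """Return the tables tabula finds on the last page of the PDF at `path`.

    Hypothetical helper for illustration; not part of tabula-py or pypdf.
    """
    # Imported lazily so that defining the helper does not require either
    # library to be importable (tabula-py also needs a Java runtime).
    from pypdf import PdfReader
    import tabula

    reader = PdfReader(path)
    if reader.is_encrypted:
        # The page list cannot be read until the document is decrypted.
        reader.decrypt(password or "")

    # tabula numbers pages from 1, so the page count is the last page.
    last_page = len(reader.pages)

    return tabula.read_pdf(path, password=password, stream=True, pages=last_page)
```

For a batch of documents, the helper can be called once per file path, so each PDF is opened only to count its pages and tabula only ever parses a single page.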