python tabula

How can I extract PDF tables without using tabula?


I have a working script that reads PDF tables using the tabula package. However, tabula depends on Java 8, and we are restricted to Java 6 or below because of some internal tools. How can we read the tables from the PDF without tabula?

from tabula import read_pdf
# current_file is the path to the PDF being processed
df_list = read_pdf(current_file, pages="all", lattice=True)


Solution

  • How to convert a PDF document to an Excel spreadsheet:

    Option 1, using the PDFTables API:

    1. Install the pdftables_api package with pip install git+https://github.com/pdftables/python-pdftables-api.git
    2. Create a PDFTables account to get an API key

    Once you have everything installed you can run this code:

    import pdftables_api
    
    c = pdftables_api.Client('my-api-key')
    c.xlsx('input.pdf', 'output')
    # Replace c.xlsx with c.csv to convert to CSV
    # Replace c.xlsx with c.xml to convert to XML
    # Replace c.xlsx with c.html to convert to HTML
    # This code is taken from the documentation
    

    Don't forget to replace my-api-key with your API key, input.pdf with the path to your PDF, and output with the path where the Excel document should be saved.

    Option 2, using textract to read the pdf and then writing to the spreadsheet using xlwt:

    1. Install textract with pip install textract
    2. Install xlwt with pip install xlwt

    Once you have installed the dependencies, you can run the following code:

    import textract
    from xlwt import Workbook
    
    wb = Workbook()
    sheet1 = wb.add_sheet('Sheet 1')  # worksheet to write the extracted data into
    
    # Change this to the path of your file; process() returns the text as bytes
    text = textract.process("path/to/file.extension")
    

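    I don't know how your PDF is organized, so you will have to work out how to write the text to the Excel document from there. You can use sheet1.write(1, 0, 'Data'), where 1 and 0 are the row and column indices of the cell and sheet1 is a worksheet created with wb.add_sheet(...). Call wb.save(...) at the end to write the file to disk.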
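
    As a rough sketch of that last step, the code below splits the extracted text into rows and cells and writes them with sheet.write. It assumes one table row per line and columns separated by a tab or by two or more spaces, which may not match your PDF. The helper names (text_to_rows, write_rows, pdf_text_to_xls) and the output path are made up for illustration.

```python
import re


def text_to_rows(text):
    """Split extracted text into a list of rows, each a list of cell strings.

    Assumption: one table row per line, cells separated by a tab or by
    two or more whitespace characters. Adjust the pattern for your PDF.
    """
    if isinstance(text, bytes):  # textract.process returns bytes
        text = text.decode("utf-8")
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip empty lines
            rows.append(re.split(r"\s{2,}|\t", line))
    return rows


def write_rows(sheet, rows):
    """Write every cell with sheet.write(row_index, column_index, value)."""
    for r, cells in enumerate(rows):
        for c, value in enumerate(cells):
            sheet.write(r, c, value)


def pdf_text_to_xls(pdf_path, xls_path):
    """Hypothetical end-to-end helper: PDF text -> .xls file."""
    # Imported here so the two helpers above work without these packages.
    import textract
    from xlwt import Workbook

    wb = Workbook()
    sheet1 = wb.add_sheet("Sheet 1")
    write_rows(sheet1, text_to_rows(textract.process(pdf_path)))
    wb.save(xls_path)
```

    Calling pdf_text_to_xls("path/to/file.pdf", "output.xls") would then produce the spreadsheet. If the columns in your PDF are not separated this way, only the pattern in text_to_rows needs to change.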

    I personally think you should use the PDFTables API instead of doing the conversion manually.