pythonpandaspdftabulatabula-py

In Python what is the best way to read a pdf table with no outline?


I am trying to read data from a table in a pdf into a pandas dataframe. I am able to do so using tabula-py when the pdf has outlines around the table, but when I try on the pdf without an outline the script produces an error.

For example, I am looking at the pdfs available from two different urls. I have downloaded the pdfs from the urls and saved them as 'JSE Opts.pdf' and 'JSE Divs.pdf' respectively.

import requests
import pandas as pd

url='https://clientportal.jse.co.za/JSE%20Equity%20Derivatives/Dividends/ED_DividendsReport.pdf'
response = requests.get(url)
fname = 'JSE Divs.pdf'
f= open(fname, 'wb')
f.write(response.content)
f.close()        
    
url='https://clientportal.jse.co.za/JSE%20Equity%20Derivatives/Options%20Daily%20Traded%20Report/ED_OptionsDailyTradedReport.pdf'
response = requests.get(url)
fname = 'JSE Opts.pdf'
f= open(fname, 'wb')
f.write(response.content)
f.close()

I am able to read the 'JSE Opts.pdf' into a pandas dataframe using the code:

import tabula as tb

pdf = './JSE Opts.pdf'
data = tb.read_pdf(pdf,pages = 1)
data = data[0]
print(data)

When I try to do the same for 'JSE Divs.pdf', I get errors and tabula-py is only able to read the header:

pdf = './JSE Divs.pdf'
data = tb.read_pdf(pdf,pages = 1)
data = data[0]
print(data)

I suspect that this is because there are no lines around the table. If that is the case, what is the best way to go about reading the data from 'JSE Divs.pdf' into pandas?


Solution

  • I was able to read the data into a string using pdfplumber, save the string as a CSV file (after cleaning the data to suit my needs) and then import into pandas.

    import pdfplumber
    pdf = pdfplumber.open("./JSE Divs.pdf")
    
    text = ''
    i = 0
    while True:
        try:
            text += pdf.pages[i].extract_text() + '\n'
            i = i+1
        except IndexError:
            break
    
    for replace_s in [' DN',' CA1',' ANY',' CSH',' PHY',' QUANTO']:
        text = text.replace(replace_s,'')
    
    while True:
        try:
            idx = text.index('EXO')
            replace_s =text[idx-1:idx+8]
            text = text.replace(replace_s,'')
        except ValueError:
            break
    
    cols ='EXPIRY_s,USYM,EXPIRY,EX_DATE,CUM_PV_DIVS,CUM_DIVS,ISIN,INSTR_ID\n'
    text = text[text.index('Div\n')+4:]
    text = cols + text
    text = text.replace(' ',',')
    
    f = open('divs.csv','w')
    f.write(text)
    f.close()