pythonpandaspdftabula

Fix missing header while reading table in PDF file using Tabula


I want to extract tabular data from PDF file image below Statistical Import/Export Data of year 1992 to 1993 in PDF file

The data I want is the numerical data to create a data set. I have tried using tabula.py library but the data I am getting is not usable or correctable: Tabular Data extracted by Tabula Library.

Code Snippet :

# Modules Imported
import tabula
from tabula import read_pdf
from tabulate import tabulate
import pandas as pd

# PDF file is being read and tables are being stored in a dataframe
df = read_pdf(PDF_FILE_PATH, stream=True, pages = "all")

Also, another snippet of code is yielding same result even when using area and columns parameters

pdf_reader = pypdf.PdfReader(PDF_FILE_PATH)

num_pages = len(pdf_reader.pages)
num_pages

table_pdf = read_pdf(
    PDF_FILE_PATH,
    guess = False,
    pages = 1,
    stream = True,
    encoding = 'utf-8',
    area = (0, 0, 200, 250),
    columns = None,
)


Any suggestions would be highly appreciated.

Original file here: PBS Govt Statistical Tabular Data


Solution

  • When you have fixed pitch text like from a LinePrinter you can use ye olde methods from a Century Long Time Ago. So ever since I can remember Excel 4 ? had import MSDos text and so I usually use Office 97 for such outputs, anyway here using 2013 as it's only a decade old we can see the inputs in order.

    enter image description here

    So program wise it is very manual, one line of code and a Human Eye.

    Above data frame was generated by

    poppler\22.02>pdftotext -nopgbrk -fixed 6  imports.pdf
    

    The real MacroCoding will be needed to tidy up the third column in a humane logical fashion, to handle any missing wrapping stripped on conversion.

    However to remove those unwanted page headings it is best to use Modern equivalent of DOS Findstr, so to exclude page numbers and ---- we need to run a second line of code.

    Let's check it works in reverse first, we want to be sure they will be excluded, and here we note that after 1005 pages there are a secondary 12 to consider.

    enter image description here

    That was good enough so pre-process data.

    findstr /v /l "IMPORTS Value \| ity ------" imports.txt > cleaner.txt

    enter image description here

    So the text is now good for import in columns to Excel but watch out for that third column ensure it's NO and rightwards.

    Going back to that exclude all headings (including 2 ! sets on page 1) we would need to just export the first set first so now we want:

    enter image description here

    So the whole command set is

    pdftotext -nopgbrk -fixed 6 -f 1 -l 1 -H 140 -W 900 imports.pdf DataFrame.txt
    pdftotext -nopgbrk -fixed 6  imports.pdf -| findstr /v /l "IMPORTS Value \| ity ------" imports.txt >> DataFrame.txt
    

    That's under 10 seconds to run the 2 line program

    enter image description here

    Then Python or Excel /e or /m (with a suitable import macro) to load DataFrame.txt