I want to extract tabular data from the PDF file in the image below: Statistical Import/Export Data of year 1992 to 1993 in PDF file.
The data I want is the numerical data, from which to create a data set. I have tried the tabula-py library, but the data I am getting is neither usable nor correctable: Tabular Data extracted by Tabula Library.
Code snippet:
# Modules Imported
import tabula
from tabula import read_pdf
from tabulate import tabulate
import pandas as pd
# PDF file is being read and tables are being stored in a dataframe
df = read_pdf(PDF_FILE_PATH, stream=True, pages="all")
Also, another snippet of code yields the same result, even when the area and columns parameters are used:
import pypdf

pdf_reader = pypdf.PdfReader(PDF_FILE_PATH)
num_pages = len(pdf_reader.pages)
print(num_pages)  # number of pages in the PDF

table_pdf = read_pdf(
    PDF_FILE_PATH,
    guess=False,
    pages=1,
    stream=True,
    encoding="utf-8",
    area=(0, 0, 200, 250),
    columns=None,
)
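One thing worth checking (an assumption on my part about what is going wrong): tabula-py's `area` is `(top, left, bottom, right)` measured in PDF points, so `(0, 0, 200, 250)` only covers a small corner of the page. Here is a minimal sketch of a helper that turns page-relative fractions into a point-based `area` tuple; the function name `relative_area_to_points` is my own, and the page size would normally come from pypdf's `page.mediabox`:

```python
def relative_area_to_points(page_width, page_height, top, left, bottom, right):
    """Convert page-relative fractions (0.0-1.0) into tabula-py's
    (top, left, bottom, right) tuple, measured in PDF points."""
    return (top * page_height, left * page_width,
            bottom * page_height, right * page_width)

# Example: the top half of a US-Letter page (612 x 792 points).
area = relative_area_to_points(612, 792, 0.0, 0.0, 0.5, 1.0)
# The resulting tuple can be passed straight to read_pdf(..., area=area).
```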
Any suggestions would be highly appreciated.
Original file here: PBS Govt Statistical Tabular Data
When you have fixed-pitch text, such as output from a line printer, you can use ye olde methods from a century ago. For as long as I can remember (Excel 4?), Excel has had an MS-DOS text import, so I usually use Office 97 for such outputs; anyway, here using 2013, as it's only a decade old, we can see the inputs in order.
Program-wise it is very manual: one line of code and a human eye.
The data frame above was generated by:
poppler\22.02>pdftotext -nopgbrk -fixed 6 imports.pdf
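If you would rather drive pdftotext from Python than from the shell, here is a rough sketch; the `pdftotext_cmd` wrapper is hypothetical and simply reproduces the command line above via subprocess:

```python
import shutil
import subprocess

def pdftotext_cmd(pdf_path, out_path, pitch=6):
    # Reproduces: pdftotext -nopgbrk -fixed 6 imports.pdf imports.txt
    return ["pdftotext", "-nopgbrk", "-fixed", str(pitch), pdf_path, out_path]

# Only attempt the conversion if poppler's pdftotext is on PATH.
if shutil.which("pdftotext"):
    subprocess.run(pdftotext_cmd("imports.pdf", "imports.txt"), check=True)
```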
The real macro coding will be needed to tidy up the third column in a humane, logical fashion, to handle any wrapped lines that were stripped on conversion.
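As a rough sketch of what that tidy-up could look like in Python: I am assuming here that a wrapped continuation line is one containing no digits, which gets appended onto the row before it; the real layout of the PBS file may need a different test.

```python
def merge_wrapped_lines(lines):
    """Append continuation lines (assumed here to contain no digits)
    onto the previous data row."""
    merged = []
    for line in lines:
        text = line.rstrip()
        if merged and text and not any(ch.isdigit() for ch in text):
            merged[-1] += " " + text.strip()  # re-join the wrapped description
        else:
            merged.append(text)
    return merged

rows = merge_wrapped_lines([
    "Edible oils          123   456",
    "   (hydrogenated)",          # wrapped fragment of the row above
    "Raw cotton           789   345",
])
```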
However, to remove those unwanted page headings it is best to use the modern equivalent of DOS findstr; so, to exclude the page numbers and the ---- rules, we need to run a second line of code.
Let's check it works in reverse first: we want to be sure the headings will be excluded, and here we note that after 1005 pages there is a secondary set of 12 to consider.
That was good enough, so pre-process the data:
findstr /v /l "IMPORTS Value \| ity ------" imports.txt > cleaner.txt
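For anyone not on Windows, the same exclusion filter can be done in Python. findstr /v /l drops every line containing any of the space-separated literal tokens; this sketch mimics that (the `\|` in the findstr command is the escaped pipe character, written plainly below):

```python
import os

# Tokens copied from the findstr command; any line containing one is dropped.
HEADING_TOKENS = ("IMPORTS", "Value", "|", "ity", "------")

def drop_heading_lines(lines, tokens=HEADING_TOKENS):
    """Keep only lines that contain none of the heading tokens."""
    return [line for line in lines if not any(tok in line for tok in tokens)]

# Equivalent of: findstr /v /l "..." imports.txt > cleaner.txt
if os.path.exists("imports.txt"):
    with open("imports.txt", encoding="utf-8") as src, \
         open("cleaner.txt", "w", encoding="utf-8") as dst:
        dst.writelines(drop_heading_lines(src.readlines()))
```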
So the text is now good for a columnar import into Excel, but watch out for that third column: check it carefully, and everything rightwards of it.
Going back to excluding all the headings (including the 2 (!) sets on page 1): we need to export the first set on its own first, so now we want:
So the whole command set is
pdftotext -nopgbrk -fixed 6 -f 1 -l 1 -H 140 -W 900 imports.pdf DataFrame.txt
pdftotext -nopgbrk -fixed 6 imports.pdf - | findstr /v /l "IMPORTS Value \| ity ------" >> DataFrame.txt
That's under 10 seconds to run the two-line program.
Then use Python, or Excel /e or /m (with a suitable import macro), to load DataFrame.txt.
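On the Python side, a fixed-width DataFrame.txt can be loaded with pandas' `read_fwf`. A minimal sketch with made-up column positions and made-up sample rows (the real widths have to be read off the pdftotext output):

```python
import io
import pandas as pd

# Hypothetical fixed-width sample; widths and values are illustrative only.
sample = (
    f"{'Wheat':<20}{1234:>8}{5678:>9}\n"
    f"{'Raw cotton':<20}{90:>8}{120:>9}\n"
)

df = pd.read_fwf(
    io.StringIO(sample),
    colspecs=[(0, 20), (20, 28), (28, 37)],  # adjust to the real layout
    names=["Commodity", "Y1992", "Y1993"],
)
```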