So I'm trying to scrape the following PDF: https://www.mcw.gov.cy/mcw/dca/dca.nsf/All/6D0FA9590337C56CC22589FD001CD4FB/$file/AIRCRAFT%20REGISTER%20%2030%20JUN%202023%20public.pdf
I've written a Python script using pdfplumber that seems to get me most of the way there, but what I'm struggling with is formatting the data into neat rows that match the original PDF.
The script is shown below:
import pandas as pd
import pdfplumber
import numpy as np

pdf_file = 'C:/Users/xxx/Downloads/AIRCRAFT REGISTER 30 JUN 2023 public.pdf'

all_tables = []
with pdfplumber.open(pdf_file) as pdf:
    for page in pdf.pages:
        # Crop to the table area of the page before running table detection
        cropped_page = page.crop(bbox=(0, 0, 825, 535))
        table_settings = {
            "vertical_strategy": "text",
            "horizontal_strategy": "text",
        }
        table = cropped_page.extract_table(table_settings=table_settings)
        if table:
            for row in table:
                if row[0] or not all_tables:
                    # First column populated: start a new record
                    all_tables.append(row)
                else:
                    # Continuation row: merge each cell into the previous record
                    # (`or ''` guards against None cells from pdfplumber)
                    all_tables[-1] = [f"{a or ''} {b or ''}".strip() for a, b in zip(all_tables[-1], row)]

df_tables = pd.DataFrame(all_tables)
df_tables.replace([None, np.nan], '', inplace=True)
df_tables.reset_index(drop=True, inplace=True)
print(df_tables.head())

csv_file = 'C:/Users/xxx/Downloads/Cyprus_register_cleaned.csv'
df_tables.to_csv(csv_file, index=False, encoding='utf-8')
Any help, strategy, or guidance on how I can clean this data up post-extraction would be appreciated.
So this will get you close. It uses PyMuPDF (imported as fitz), which can extract tables.
import pandas as pd
import numpy as np
import fitz  # PyMuPDF

pdf_file = r'AIRCRAFT REGISTER 30 JUN 2023 public.pdf'

list_of_list = []
doc = fitz.open(pdf_file)
for page in doc:
    tabs = page.find_tables()  # locate tables on the page
    if tabs.tables:
        # Extract the first detected table as a list of rows
        list_of_list.append(tabs[0].extract())

columns = {
    0: 'S/N',
    1: 'REG MARKS',
    2: 'MANUFACTURER/AIRCRAFT TYPE',
    3: 'AIRCRAFT S/N',
    4: 'CLASS ICAO',
    5: 'MTOW/Kg',
    6: 'REG. DATE',
    7: 'C of A / ARC Expiry',
    8: 'AIRCRAFT OWNER / OPERATOR',
    9: 'ENGINE TYPE',
    10: 'No of ENG',
    11: 'PROPELLER TYPE',
    12: 'AIRCRAFT BASE',
    13: 'PAX No.'
}

# One key per page actually extracted; a hard-coded range would break
# if the page count ever differs
categories = ['page%s' % i for i in range(len(list_of_list))]
df = pd.concat([pd.DataFrame(d) for d in list_of_list], keys=categories)
df = df.rename(columns=columns)
df.head(30)
You will want to do some cleanup, such as deleting the first two rows from each page. But you might want to keep the table header, which is in row zero of each page (see the sketch after the cleanup code below).
# Make a new column with the table title: strip digits, the 'S/N' header text
# and dots from the first column, then forward-fill the remaining title text
df['Title'] = (df['S/N'].str.replace(r'[0-9]', '', regex=True)
                        .str.replace('S/N', '')
                        .str.replace('.', '', regex=False)
                        .replace('', np.nan)
                        .ffill())

# Drop the top two rows from each page's table
df = df.drop([0, 1], level=1, axis=0)
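If you decide to keep the header, something like this sketch could work in place of the columns dict above (it assumes 'page0' is the first key in categories and that row zero really does hold clean header text on every page; untested):

# Promote the first page's header row to column names, then drop the
# title/header rows from every page as in the drop above
df.columns = df.loc[('page0', 0)]
df = df.drop([0, 1], level=1, axis=0)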
Using keys in pd.concat gives you a multi-level index, so you can still keep track of which page each row came from.
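For example, a minimal sketch of working with that index (assuming the df built above; 'page3' is just one of the keys from categories, and the output filename mirrors the one in the question):

# Select every row that was extracted from a single page via the first index level
page3_rows = df.loc['page3']

# Or flatten the page key into an ordinary column before writing out
df_flat = df.reset_index(level=0).rename(columns={'level_0': 'Page'})
df_flat.to_csv('Cyprus_register_cleaned.csv', index=False, encoding='utf-8')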