python, pandas, pdfplumber

pdfplumber table extraction inconsistent formatting of rows


So I'm trying to scrape the following PDF: https://www.mcw.gov.cy/mcw/dca/dca.nsf/All/6D0FA9590337C56CC22589FD001CD4FB/$file/AIRCRAFT%20REGISTER%20%2030%20JUN%202023%20public.pdf

I've written a Python script using pdfplumber that gets me most of the way there, but I'm struggling to format the extracted data into neat rows that match the original PDF.

The script is shown below:

import pandas as pd
import pdfplumber
import numpy as np

pdf_file = 'C:/Users/xxx/Downloads/AIRCRAFT REGISTER  30 JUN 2023 public.pdf'
all_tables = []

with pdfplumber.open(pdf_file) as pdf:
    for page in pdf.pages:
        cropped_page = page.crop(bbox=(0, 0, 825, 535))

        table_settings = {
            "vertical_strategy": "text",
            "horizontal_strategy": "text",
        }

        table = cropped_page.extract_table(table_settings=table_settings)

        if table:
            for row in table:
                # A value in the first column marks the start of a new record;
                # rows with an empty first cell are continuation lines
                if row[0] or not all_tables:
                    all_tables.append(row)
                else:
                    # Merge the continuation row into the previous record, cell by cell
                    all_tables[-1] = [f"{a} {b}".strip() for a, b in zip(all_tables[-1], row)]

df_tables = pd.DataFrame(all_tables)

df_tables.replace([None, np.nan], '', inplace=True)

df_tables.reset_index(drop=True, inplace=True)

print(df_tables.head())

csv_file = 'C:/Users/xxx/Downloads/Cyprus_register_cleaned.csv'  
df_tables.to_csv(csv_file, index=False, encoding='utf-8')

Any help, strategy, or guidance on how I can clean this data up post-extraction would be appreciated.


Solution

  • So this will get you close. It uses PyMuPDF, which can extract tables.

    import pandas as pd
    import numpy as np
    import fitz  # PyMuPDF
    
    pdf_file = r'AIRCRAFT REGISTER  30 JUN 2023 public.pdf'
    
    list_of_list = []
    doc = fitz.open(pdf_file)
    for page in doc:
        tabs = page.find_tables()
        if tabs.tables:
            list_of_list.append(tabs[0].extract())
    
    columns = {
        0:'S/N',
        1:'REG MARKS',
        2:'MANUFACTURER/AIRCRAFT TYPE',
        3:'AIRCRAFT S/N',
        4:'CLASS ICAO',
        5:'MTOW/Kg',
        6:'REG. DATE',
        7:'C of A / ARC Expiry',
        8:'AIRCRAFT OWNER / OPERATOR',
        9:'ENGINE TYPE',
        10:'No of ENG',
        11:'PROPELLER TYPE',
        12:'AIRCRAFT BASE',
        13:'PAX No.'
        }
            
    
    # One key per extracted table, so the concat keys always match
    categories = ['page%s' % i for i in range(len(list_of_list))]
    df = pd.concat([pd.DataFrame(d) for d in list_of_list], keys=categories)
    df = df.rename(columns=columns)
    df.head(30)
    
    

    You will want to do some clean-up, e.g. deleting the first two rows from each page. But you might want to keep the table header, which is in row zero of each page.

    # Make a new column with the table title
    df['Title'] = (
        df['S/N']
        .str.replace(r'[0-9]', '', regex=True)
        .str.replace('S/N', '')
        .str.replace('.', '', regex=False)  # literal dot, not a regex wildcard
        .replace('', np.nan)
        .ffill()
    )
    
    # Drop the top two rows from each table
    df = df.drop([0, 1], level=1, axis=0)
    

    This code gives you a MultiIndex, so you can still keep track of which page each row came from.
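
    To illustrate what the MultiIndex buys you, here is a minimal sketch (using toy data, not the actual register) of selecting rows by page and of flattening the index back into an ordinary column:

    ```python
    import pandas as pd

    # Toy stand-ins for the per-page tables returned by find_tables()
    pages = [
        pd.DataFrame({'REG MARKS': ['5B-AAA', '5B-BBB']}),
        pd.DataFrame({'REG MARKS': ['5B-CCC']}),
    ]
    df = pd.concat(pages, keys=['page0', 'page1'])

    # Select every row that came from the first page
    page0 = df.loc['page0']

    # Or expose the page label as a normal column and flatten the index;
    # the unnamed outer level comes back as 'level_0', renamed here to 'PAGE'
    flat = df.reset_index(level=0).rename(columns={'level_0': 'PAGE'})
    ```

    From there you can filter, group by page, or drop the page information entirely once the data is clean.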