So I'm struggling a little bit on one particular aspect of my data extraction script, i've successfully got the script to extract the data but i'm not sure how to deal with merging multi-line cells. For example: I have some data that is spread across two rows (highlighted)
but i want them merged in to one row in to the appropriate column. so instead of:
AÉRO-SPORT DU GRAND-DUCHÉ
DE LUXEMBOURG A.S.B.L.
it reads AÉRO-SPORT DU GRAND-DUCHÉ DE LUXEMBOURG A.S.B.L.
I want this to apply across all columns.
Here is the script that i've written so far:
import pdfplumber
import pandas as pd
pdf_path = "C:/Users/xxxx/Downloads/relev-aronefs-11-12-2023.pdf"
all_rows = []
explicit_vertical_lines = [45, 100, 250, 410, 480, 640, 780]
with pdfplumber.open(pdf_path) as pdf:
for page_num in range(len(pdf.pages)):
page = pdf.pages[page_num]
if page_num == 0:
top_region = 50
else: # For all other pages
top_region = 0
cropped_page = page.crop((0, top_region, page.width, page.height))
table = cropped_page.extract_table({
"vertical_strategy": "explicit",
"explicit_vertical_lines": explicit_vertical_lines,
"horizontal_strategy": "text",
})
if table:
if page_num == 0:
all_rows += table[1:]
else:
all_rows += table
df = pd.DataFrame(all_rows)
df.columns = ['Immat', 'Constructeur', 'Type d’aéronef', 'SN aéronef', 'Propriétaire', 'Exploitant']
output_csv_path = "C:/Users/xxxx/Downloads/extracted_table_data.csv"
df.to_csv(output_csv_path, index=False)
print(df.head())
If you want to download the pdf i used then its available here:
Any help/guidance would be appreciated.
If you can't avoid the empty rows upfront, here is a quick solution :
out = (
df
# to get rid of the duplicated headers and footers
.query("Immat != 'Immat' and ~Exploitant.str.contains('Page')")
.pipe( # here we ffill the Immat, then groupby & agg/join
lambda x: x.groupby(
x["Immat"].replace(
{"": None}).ffill(), sort=False)
.agg(lambda s: " ".join(s).strip()) # << we join the rows here
)
# to make a default index
.reset_index(drop=True)
)
Output (of the highlighted entry) :
Immat | Constructeur | Type d’aéronef | SN aéronef | Propriétaire | Exploitant | |
---|---|---|---|---|---|---|
2 | LX-AIB | THE NEW PIPER AIRCRAFT, INC | PA28-161 Cadet | 28-41086 | AÉRO-SPORT DU GRAND-DUCHÉ DE LUXEMBOURG A.S.B.L. | AÉRO-SPORT DU GRAND-DUCHÉ DE LUXEMBOURG A.S.B.L. |
Ouptut (truncated):
print(out)
Immat Constructeur Type d’aéronef SN aéronef Propriétaire Exploitant
0 LX-ABC BOMBARDI... BD-700-1... 9213 DMH AVIA... GLOBAL J...
1 LX-AIA THE NEW ... PA18C 18-1011 PIPER CL... PIPER CL...
2 LX-AIB THE NEW ... PA28-161... 28-41086 AÉRO-SPO... AÉRO-SPO...
3 LX-AIC REIMS AV... F172L 0852 AÉRO-SPO... AÉRO-SPO...
4 LX-AID REIMS AV... F172N 1972 AÉRO-SPO... AÉRO-SPO...
.. ... ... ... ... ... ...
267 LX-XMR STORCH A... Slepcev ... 136 PROPRIÉT... EXPLOITA...
268 LX-YCV BOEING C... B747-4R7F 35805 CARGOLUX... CARGOLUX...
269 LX-ZAP PILATUS ... PC-24 315 ZAPLEX S.A. JETFLY A...
270 LX-ZEN CIRRUS D... SR22 2972 PROPRIÉT... EXPLOITA...
271 LX-ZIO BOMBARDI... BD-700-1... 9369 ER FRANZ... GLOBAL J...
[272 rows x 6 columns]