I am trying to convert a PDF file into CSV using Python and the below code. Earlier it was working; however, recently it is not working. I am getting interchanged column contents in the converted CSV file.
Guide me to fix this column issue in my code. I am only concerned about the first page of the PDF conversion as I need to remove the first rows of the table.
#!/usr/bin/env python3
import tabula
import pandas as pd
import csv
pdf_file='/pdf2xls/Input.pdf'
column_names=['Product','Batch No','Machin No','Time','Date','Drum/Bag No','Tare Wt.kg','Gross Wt.kg',
'Net Wt.kg','Blender','Remarks','Operator']
# Page 1 processing
df1 = tabula.read_pdf(pdf_file, pages=1,area=(95,20, 800, 840),columns=[93,180,220,252,310,315,333,367,
410,450,480,520]
,pandas_options={'header': None}) #(top,left,bottom,right)
df1[0]=df1[0].drop(columns=5)
df1[0].columns=column_names
#df1[0].head(2)
#df1[0].to_csv('result.csv')
result = pd.DataFrame(df1[0]) # concate both the pages and then write to CSV
result.to_csv("/pdf2xls/Input.csv")
Assuming your pdf have always at least two pages with a footer in the last one, you can try :
# pip install pdfplumber
import pdfplumber
import pandas as pd
pdf = pdfplumber.open("23JJ0WL139.pdf")
tables = []
for p in pdf.pages:
ta = p.extract_tables()[0]
if str(p) == "<Page:1>":
header = ta[4]
tables.append(pd.DataFrame(ta[5:]))
else:
tables.append(pd.DataFrame(ta))
df = pd.concat(tables).iloc[:-3].set_axis(header, axis=1)
Output :
print(df)
Product Batch No Machin\nNo ... Net\nWt.kg Blender Operator
0 GC950 23JJ0WL139 WB_101 ... 51.40 Anand
1 GC950 23JJ0WL139 WB_101 ... 51.60 Anand
2 GC950 23JJ0WL139 WB_101 ... 51.20 Anand
3 GC950 23JJ0WL139 WB_101 ... 51.20 Anand
4 GC950 23JJ0WL139 WB_101 ... 51.80 Anand
.. ... ... ... ... ... ... ...
11 GC950 23JJ0WL139 WB_101 ... 51.60 RAHUL
12 GC950 23JJ0WL139 WB_101 ... 51.60 RAHUL
13 GC950 23JJ0WL139 WB_101 ... 51.80 RAHUL
14 GC950 23JJ0WL139 WB_101 ... 51.40 RAHUL
15 GC950 23JJ0WL139 WB_101 ... 51.80 RAHUL
[140 rows x 11 columns]