I am trying to convert a PDF file to CSV using Python and wrote the code below. It used to work, but recently the column contents in the converted CSV file have come out interchanged.
How can I fix this column issue in my code?
#!/usr/bin/env python3
import tabula
import pandas as pd
import csv
pdf_file = '/pdf2xls/Input.pdf'
column_names = ['Product', 'Batch No', 'Machin No', 'Time', 'Date', 'Drum/Bag No',
                'Tare Wt.kg', 'Gross Wt.kg', 'Net Wt.kg', 'Blender', 'Remarks', 'Operator']
# Page 1 processing
df1 = tabula.read_pdf(
    pdf_file,
    pages=1,
    area=(95, 20, 800, 840),  # (top, left, bottom, right)
    columns=[93, 180, 220, 252, 310, 315, 333, 367, 410, 450, 480, 520],
    pandas_options={'header': None},
)
df1[0] = df1[0].drop(columns=5)
df1[0].columns = column_names
#df1[0].head(2)
#df1[0].to_csv('result.csv')
result = pd.DataFrame(df1[0])  # concatenate both pages, then write to CSV
result.to_csv("/pdf2xls/Input.csv")
You can use pdfplumber:
# pip install pdfplumber
import pandas as pd
import pdfplumber
# pdf_file is the path defined in the question
pdf = pdfplumber.open(pdf_file)
tables = pdf.pages[0].extract_tables()
(
    pd.DataFrame(
        # get the second table and skip the last three rows
        data=tables[1][:-3],
        # get the last row of the first table
        columns=tables[0][-1],
    )
    .replace("", float("nan"))  # get rid of the empty strings
    # .to_csv("out.csv", index=False)  # uncomment to make a fresh csv
)
Output:
    Product    Batch No  Machin\nNo   Time        Date  Drum/\nBag\nNo  Tare\nWt.kg  Gross\nWt.kg  Net\nWt.kg  Blender  Operator
 0    L1050  23JJ0AL051      WB-102  01:07  16-10-2023               1        57.20       1398.80     1341.60      NaN      Amit
 1    L1050  23JJ0AL051      WB-102  01:22  16-10-2023               2        57.40       1398.80     1341.40      NaN      Amit
 2    L1050  23JJ0AL051      WB-102  01:33  16-10-2023               3        58.20       1399.60     1341.40      NaN      Amit
 3    L1050  23JJ0AL051      WB-102  01:44  16-10-2023               4        58.80       1400.60     1341.80      NaN      Amit
 4    L1050  23JJ0AL051      WB-102  01:55  16-10-2023               5        57.20       1399.00     1341.80      NaN      Amit
..      ...         ...         ...    ...         ...             ...          ...           ...         ...      ...       ...
20    L1050  23JJ0AL051      WB-102  05:42  16-10-2023              21        57.40       1398.60     1341.20      NaN      Amit
21    L1050  23JJ0AL051      WB-102  05:52  16-10-2023              22        57.40       1399.00     1341.60      NaN      Amit
22    L1050  23JJ0AL051      WB-102  06:00  16-10-2023              23        57.40       1398.80     1341.40      NaN      Amit
23    L1050  23JJ0AL051      WB-102  06:10  16-10-2023              24        57.80       1399.60     1341.80      NaN      Amit
24    L1050  23JJ0AL051      WB-102  06:19  16-10-2023              25        57.80       1399.40     1341.60      NaN      Amit
[25 rows x 11 columns]
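Note that the extracted headers still carry the line breaks from the PDF (for example Machin\nNo), so they will not match the column names used in the question. Below is a minimal sketch of one way to tidy them and write the CSV using only the standard library. The helpers clean_header and table_to_csv are hypothetical names introduced here, not part of pdfplumber, and the sample rows are a subset of the values shown above.

```python
import csv
import io


def clean_header(cell):
    # Collapse line breaks and repeated whitespace inside one header cell.
    if cell is None:
        return ""
    text = " ".join(cell.split())
    # Rejoin names that were wrapped right after a slash: "Drum/ Bag No" -> "Drum/Bag No".
    return text.replace("/ ", "/")


def table_to_csv(header, rows, out):
    # Write one extracted table (a list of rows of strings) to an open text file.
    writer = csv.writer(out)
    writer.writerow([clean_header(cell) for cell in header])
    for row in rows:
        # Empty cells may arrive as None or ""; write both as empty fields.
        writer.writerow(["" if cell is None else cell for cell in row])


# Sample values copied from the output above (a subset of the columns).
header = ["Product", "Machin\nNo", "Drum/\nBag\nNo", "Tare\nWt.kg", "Blender"]
rows = [
    ["L1050", "WB-102", "1", "57.20", ""],
    ["L1050", "WB-102", "2", "57.40", None],
]

buffer = io.StringIO()
table_to_csv(header, rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

With the variables from the answer, the call would presumably look like table_to_csv(tables[0][-1], tables[1][:-3], f), where f comes from open("out.csv", "w", newline=""). This assumes the table layout on the page matches the one described above.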