Tags: python, python-3.x, csv, tabula-py

PDF to CSV - converted CSV has interchanged contents of the columns


I am trying to convert a PDF file to CSV using Python and the code below. It used to work, but recently the converted CSV file has come out with the contents of its columns interchanged.

Please help me fix this column issue in my code. I am only concerned with converting the first page of the PDF, since I need to remove the first rows of the table.

#!/usr/bin/env python3
import tabula
import pandas as pd
import csv

pdf_file='/pdf2xls/Input.pdf'
column_names=['Product','Batch No','Machin No','Time','Date','Drum/Bag No','Tare Wt.kg','Gross Wt.kg',
              'Net Wt.kg','Blender','Remarks','Operator']

# Page 1 processing
df1 = tabula.read_pdf(
    pdf_file,
    pages=1,
    area=(95, 20, 800, 840),  # (top, left, bottom, right)
    columns=[93, 180, 220, 252, 310, 315, 333, 367, 410, 450, 480, 520],
    pandas_options={'header': None},
)

df1[0]=df1[0].drop(columns=5)
df1[0].columns=column_names
#df1[0].head(2)

#df1[0].to_csv('result.csv')

result = pd.DataFrame(df1[0]) # concatenate both the pages and then write to CSV
result.to_csv("/pdf2xls/Input.csv")

Solution

  • Assuming your PDF always has at least two pages, with a footer on the last one, you can try:

    # pip install pdfplumber
    import pdfplumber
    import pandas as pd
    
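    # Open the PDF; the context manager closes the file when done
    with pdfplumber.open("23JJ0WL139.pdf") as pdf:
        tables = []
        for p in pdf.pages:
            ta = p.extract_tables()[0]  # first table found on the page
            if p.page_number == 1:
                header = ta[4]  # row 4 of page 1 holds the column names
                tables.append(pd.DataFrame(ta[5:]))  # data starts after the header
            else:
                tables.append(pd.DataFrame(ta))
    
    # Drop the last 3 rows (the footer on the last page), then apply the header
    df = pd.concat(tables).iloc[:-3].set_axis(header, axis=1)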
    

    Output:

    print(df)
    
       Product    Batch No Machin\nNo  ... Net\nWt.kg Blender Operator
    0    GC950  23JJ0WL139     WB_101  ...      51.40            Anand
    1    GC950  23JJ0WL139     WB_101  ...      51.60            Anand
    2    GC950  23JJ0WL139     WB_101  ...      51.20            Anand
    3    GC950  23JJ0WL139     WB_101  ...      51.20            Anand
    4    GC950  23JJ0WL139     WB_101  ...      51.80            Anand
    ..     ...         ...        ...  ...        ...     ...      ...
    11   GC950  23JJ0WL139     WB_101  ...      51.60            RAHUL
    12   GC950  23JJ0WL139     WB_101  ...      51.60            RAHUL
    13   GC950  23JJ0WL139     WB_101  ...      51.80            RAHUL
    14   GC950  23JJ0WL139     WB_101  ...      51.40            RAHUL
    15   GC950  23JJ0WL139     WB_101  ...      51.80            RAHUL
    
    [140 rows x 11 columns]
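
The printed header still contains embedded line breaks (for example `Machin\nNo`), which would carry over into the CSV column names. Below is a minimal sketch of one way to clean them before writing. It uses only the standard library, and `clean_header`, `rows_to_csv`, and the sample data are hypothetical names and values made up for illustration; they are not part of pdfplumber or pandas.

```python
import csv
import io


def clean_header(names):
    """Collapse line breaks and repeated whitespace inside each header cell."""
    # Empty cells may come through as None, so map those to empty strings.
    return ["" if n is None else " ".join(str(n).split()) for n in names]


def rows_to_csv(header, rows, out):
    """Write one cleaned header row followed by the data rows."""
    writer = csv.writer(out)
    writer.writerow(clean_header(header))
    writer.writerows(rows)


# Hypothetical sample shaped like the extracted table (not real PDF output).
header = ["Product", "Batch No", "Machin\nNo", "Net\nWt.kg"]
rows = [["GC950", "23JJ0WL139", "WB_101", "51.40"]]

buffer = io.StringIO()
rows_to_csv(header, rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

With the DataFrame from the answer, the same helper could be applied as `df.columns = clean_header(df.columns)` followed by `df.to_csv("Input.csv", index=False)`, assuming the row index is not wanted in the file.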