[SOLVED] How to Convert PDF file into CSV file using Python Pandas

I have a PDF file that I need to convert into a CSV file. An example of the PDF is at https://online.flippingbook.com/view/352975479/ and the code I used is:

import re
import parse
import pdfplumber
import pandas as pd
from collections import namedtuple
file = "Battery Voltage.pdf"
lines = []
total_check = 0

with pdfplumber.open(file) as pdf:
    pages = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            print(line)

With the above script I am not getting the proper output: for the Time column, "AM" ends up on the next line. The output I am getting is like this

Solution

For cases like these, build a parser that converts the unusable data into something you can use.

The logic below converts that exact file to a CSV, but it will only work with that specific file's contents.

Note that for this specific file you can ignore the AM/PM as the time is in 24h format.

import pdfplumber


file = "Battery Voltage.pdf"
skiplines = [
    "Battery Voltage",
    "AM",
    "PM",
    "Sr No DateTIme Voltage (v) Ignition",
    ""
]


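with open("output.csv", "w") as outfile:
    # Write the header row; fields are separated by semicolons.
    header = "serialnumber;date;time;voltage;ignition\n"
    outfile.write(header)
    with pdfplumber.open(file) as pdf:
        for page in pdf.pages:
            for line in page.extract_text().split('\n'):
                # Skip the title, the column header, and the stray AM/PM lines.
                if line.strip() in skiplines:
                    continue
                # Split on whitespace and rejoin the fields with semicolons.
                outfile.write(";".join(line.split())+"\n")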
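If the extracted text ever contains a line that does not have exactly five fields, the loop above would write a misaligned row. A minimal sketch of the same filtering step using the standard csv module is shown here, assuming the five-column layout from the header. The names `rows_from_lines`, `lines_to_csv`, and the sample text are hypothetical and only stand in for what `page.extract_text()` might return; they are not taken from the real PDF.

```python
import csv
import io

# Lines that carry no row data in this particular PDF (same idea as skiplines).
SKIPLINES = {"Battery Voltage", "AM", "PM", "Sr No DateTIme Voltage (v) Ignition", ""}
EXPECTED_FIELDS = 5  # serialnumber, date, time, voltage, ignition


def rows_from_lines(lines):
    """Yield one list of fields per data line, skipping headers and stray AM/PM."""
    for line in lines:
        stripped = line.strip()
        if stripped in SKIPLINES:
            continue
        fields = stripped.split()
        if len(fields) != EXPECTED_FIELDS:
            # Unexpected layout: skip rather than write a misaligned row.
            continue
        yield fields


def lines_to_csv(lines, outfile):
    """Write the parsed rows to an open text file using csv.writer."""
    writer = csv.writer(outfile, delimiter=";", lineterminator="\n")
    writer.writerow(["serialnumber", "date", "time", "voltage", "ignition"])
    writer.writerows(rows_from_lines(lines))


# Hypothetical sample text; the real rows in the PDF may look different.
sample_text = (
    "Battery Voltage\n"
    "Sr No DateTIme Voltage (v) Ignition\n"
    "1 01/06/2021 13:05:22 12.4 ON\n"
    "PM\n"
    "2 01/06/2021 13:10:22 12.6 OFF\n"
    "PM\n"
)
buffer = io.StringIO()
lines_to_csv(sample_text.split("\n"), buffer)
csv_output = buffer.getvalue()
```

With a real file you would pass `page.extract_text().split("\n")` for each page and an opened output file instead of the in-memory buffer.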

EDIT

So, JSON files in Python are basically just a list of dict items (yes, that's an oversimplification).

The only thing you need to change is the way you actually process the lines. The actual meat of the logic doesn't change...

import pdfplumber
import json


file = "Battery Voltage.pdf"
skiplines = [
    "Battery Voltage",
    "AM",
    "PM",
    "Sr No DateTIme Voltage (v) Ignition",
    ""
]
result = []


with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        for line in page.extract_text().split("\n"):
            if line.strip() in skiplines:
                continue
            serialnumber, date, time, voltage, ignition = line.split()
            result.append(
                {
                    "serialnumber": serialnumber,
                    "date": date,
                    "time": time,
                    "voltage": voltage,
                    "ignition": ignition,
                }
            )

with open("output.json", "w") as outfile:
    json.dump(result, outfile)
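Note that the tuple unpacking above raises a ValueError on any line that does not split into exactly five parts, and every value ends up as a string in the JSON. A rough sketch of a more defensive per-line conversion is given here, assuming the serial number is an integer and the voltage is a decimal number. The helper `line_to_record` and the sample lines are hypothetical, not taken from the actual file.

```python
import json


def line_to_record(line):
    """Convert one data line into a dict, or return None if it does not fit."""
    fields = line.split()
    if len(fields) != 5:
        return None
    serialnumber, date, time, voltage, ignition = fields
    try:
        return {
            "serialnumber": int(serialnumber),  # assumed to be an integer
            "date": date,
            "time": time,
            "voltage": float(voltage),  # assumed to be a decimal number
            "ignition": ignition,
        }
    except ValueError:
        # A field that should be numeric was not; treat the line as invalid.
        return None


# Hypothetical sample lines; the real rows in the PDF may look different.
sample_lines = [
    "1 01/06/2021 13:05:22 12.4 ON",
    "PM",
    "2 01/06/2021 13:10:22 12.6 OFF",
]
records = [r for r in map(line_to_record, sample_lines) if r is not None]
json_text = json.dumps(records, indent=2)
```

Because invalid lines return None, the stray AM/PM lines are dropped without needing a separate skip list, though keeping the skip list makes the intent more explicit.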