pythonexcelcsvxlsx

Issue with Number Formatting When Converting CSV to Excel Using Pandas


I'm having trouble with a Python script that converts CSV files to Excel format using pandas. The issue is that numbers in scientific notation in the CSV file are not being correctly interpreted as numbers in the resulting Excel file. Instead, they appear in an undesired scientific format. Here's a simplified version of my function:

import pandas as pd
import os
from tkinter import filedialog, messagebox

def xlsx_conversion():
    file_paths = filedialog.askopenfilenames(filetypes=[("Text and CSV files", "*.txt *.csv")])
    if not file_paths:
        return

    for file_path in file_paths:
        try:
            sep = '\t' if file_path.lower().endswith('.txt') else ','
            df = pd.read_csv(file_path, sep=sep)

            for col in df.columns:
                df[col] = pd.to_numeric(df[col], errors='ignore')

            excel_file_path = os.path.splitext(file_path)[0] + '.xlsx'
            with pd.ExcelWriter(excel_file_path, engine='openpyxl') as writer:
                df.to_excel(writer, index=False, float_format="%.3f")

        except Exception as e:
            messagebox.showerror("Error", f"Error converting file {file_path}: {e}")
            return

    messagebox.showinfo("Success", "All files have been converted successfully!")

xlsx_conversion()

Here is an example of my csv file :

enter image description here

And this is what i have after using the function :

enter image description here

The data in row 5 and below are strings, not numeric, and thus unusable for creating graphs, which is the ultimate goal.


Solution

  • Analysis of the problem

    The problem is due to the fact that your columns are not purely numeric.

    If I provide this .csv file, I get numeric values in the Excel file:

    Frequency,Random
    5.0e+3,6.01e-4
    5.0e+3,6.01e-4
    5.0e+3,6.01e-4
    5.0e+3,6.01e-4
    

    Your file has some additional material before the headers, and that prevents pandas.read_csv from detecting that the columns are numerical.

    When I use this file, all my numbers end up as strings in the Excel file, just like you get:

    !Keysignth,Easdf,MYA1234,A.04.00
    !Date: Thur
    
    BEGIN CH1_DATA
    Frequency,Random
    5.0e+3,6.01e-4
    5.0e+3,6.01e-4
    5.0e+3,6.01e-4
    5.0e+3,6.01e-4
    

    The key mistake was to assume that errors="ignore" meant "ignore values that cannot be converted, but keep doing the rest of the column". That's not what happens: conversion does not happen at all if any error happens, and errors="ignore" causes that to be silently ignored.

    Using header=3 to skip the metadata

    Now, you can give read_csv a header parameter, which indicates where the column headers are, and then it will only start reading columnar data after the headers.

    In this specific case, using df = pd.read_csv(file_path, sep=sep, header=3) indicated that the headers were after the three rows of metadata.

    Caveat: the resulting Excel file does not contain those meta data lines at the top, since read_csv ignores all contents before the header line.

    Keeping the metadata by converting only below them

    Presumably you want to keep your metadata.

    You can call pd.to_numeric with a slice, telling it to start converting only from row 3 (to skip the three rows of metadata):

    df.loc[3:, col] = pd.to_numeric(df[col][3:])
    

    and now the Excel file has all the metadata, and the numeric data is properly numeric and ready to graph.

    Making it more general

    If your number of metadata rows is not constant, you might need to write some logic to find the first numerical row.

    One heuristic could be to look for the first numeric value in the first column:

    for row, value in enumerate(df[df.columns[0]]):
        try:
            pd.to_numeric(value)
            first_numeric_row = row
            break
        except ValueError:
            pass
    else:
        print("Warning: no numeric data in the first column")
        first_numeric_row = row + 1
    

    And then use

    df.loc[first_numeric_row:, col] = pd.to_numeric(df[col][first_numeric_row:])
    

    to convert the numerical data.

    I am making a lot of assumptions here, starting with presuming that all columns have the same range of numerical data, and that no non-numeric data exists after the first numeric row. You'll have to adjust this to your real requirements. You might want to look for BEGIN CH1_DATA instead if you know it's always there, for example.

    Side note: remove float_format

    You're passing float_format="%.3f" to to_excel(). Don't do that. It will force rounding your data before writing it to the Excel file. As @chrslg pointed out, 6e-4, 6.04e-4 or 6.12e-4 will all become 0.001. Once you've converted to numeric, just let the writer handle numeric values for you.