pythonpandasdataframecsv

How to prevent Pandas to_csv double quoting empty fields in output csv


I currently have a sample python script that reads a csv with double quotes as a text qualifier and removes ascii characters and line feeds in the fields via a dataframe. It then outputs the dataframe to a csv. However the output csv is double quoting empty fields. see below:

import pandas as pd
import re
import csv

# Load the CSV file into a DataFrame
file_path = "\\\\Mylocation\\Original_facility_udfs.csv"  # Replace with your CSV file path
df = pd.read_csv(file_path, quotechar='"')

# Define a function to clean a single cell
def clean_field(value):
    if isinstance(value, str):
        # Remove line feeds (\n, \r) and non-ASCII characters
        value = re.sub(r'[\n\r]', ' ', value)  # Replace line feeds with a space
        value = re.sub(r'[^\x00-\x7F]', '', value)  # Remove non-ASCII characters
    return value

# Apply the cleaning function to all DataFrame fields
df_cleaned = df.map(clean_field)

# Save the cleaned DataFrame back to a new CSV file
cleaned_file_path = "\\\\Mylocation\\facility_udfs.csv" # Output file path
df_cleaned.to_csv(
    cleaned_file_path,
    index=False,
    quotechar ='"',
    quoting=csv.QUOTE_ALL, # Ensure double quotes remain as text qualifiers
    lineterminator='\n'  # Set line feed (\n) as the row terminator
)

print(f"Cleaned CSV saved to: {cleaned_file_path}")

My current output is as follows

"date_key","facility_key","udf_type","udf_area_indic","udf_area"
"20240830","251","GL Unit Code","Facility for Type","",""
"20240830","251","Cost Center","Facility for Type","",""

the desired output should be

"date_key","facility_key","udf_type","udf_area_indic","udf_area"
"20240830","251","GL Unit Code","Facility for Type",,
"20240830","251","Cost Center","Facility for Type",,

Solution

  • There is currently no direct way to do this. Python 3.12 added a csv.QUOTE_NOTNULL option, but it is currently not supported in pandas.

    If you don't have tricky strings that need proper escaping, you could "escape" the values yourself with:

    (df.map(lambda x: '' if pd.isna(x) else f'"{x}"')
       .rename(columns=lambda x: f'"{x}"')
       .to_csv(cleaned_file_path,
               index=False,
               quotechar ='"', 
               quoting=csv.QUOTE_NONE,
               lineterminator='\n' 
    )
    

    Output:

    "date_key","facility_key","udf_type","udf_area_indic","udf_area"
    "251","GL Unit Code","Facility for Type",,
    "251","Cost Center","Facility for Type",,
    

    NB. your input was unclear, in my case "20240830" was the index, and thus absent from the output.