I currently have a sample python script that reads a csv with double quotes as a text qualifier and removes ascii characters and line feeds in the fields via a dataframe. It then outputs the dataframe to a csv. However the output csv is double quoting empty fields. see below:
import pandas as pd
import re
import csv
# Load the CSV file into a DataFrame
file_path = "\\\\Mylocation\\Original_facility_udfs.csv" # Replace with your CSV file path
df = pd.read_csv(file_path, quotechar='"')
# Define a function to clean a single cell
def clean_field(value):
if isinstance(value, str):
# Remove line feeds (\n, \r) and non-ASCII characters
value = re.sub(r'[\n\r]', ' ', value) # Replace line feeds with a space
value = re.sub(r'[^\x00-\x7F]', '', value) # Remove non-ASCII characters
return value
# Apply the cleaning function to all DataFrame fields
df_cleaned = df.map(clean_field)
# Save the cleaned DataFrame back to a new CSV file
cleaned_file_path = "\\\\Mylocation\\facility_udfs.csv" # Output file path
df_cleaned.to_csv(
cleaned_file_path,
index=False,
quotechar ='"',
quoting=csv.QUOTE_ALL, # Ensure double quotes remain as text qualifiers
lineterminator='\n' # Set line feed (\n) as the row terminator
)
print(f"Cleaned CSV saved to: {cleaned_file_path}")
My current output is as follows
"date_key","facility_key","udf_type","udf_area_indic","udf_area"
"20240830","251","GL Unit Code","Facility for Type","",""
"20240830","251","Cost Center","Facility for Type","",""
the desired output should be
"date_key","facility_key","udf_type","udf_area_indic","udf_area"
"20240830","251","GL Unit Code","Facility for Type",,
"20240830","251","Cost Center","Facility for Type",,
There is currently no direct way to do this. Python 3.12 added a csv.QUOTE_NOTNULL
option, but it is currently not supported in pandas.
If you don't have tricky strings that need proper escaping, you could "escape" the values yourself with:
(df.map(lambda x: '' if pd.isna(x) else f'"{x}"')
.rename(columns=lambda x: f'"{x}"')
.to_csv(cleaned_file_path,
index=False,
quotechar ='"',
quoting=csv.QUOTE_NONE,
lineterminator='\n'
)
Output:
"date_key","facility_key","udf_type","udf_area_indic","udf_area"
"251","GL Unit Code","Facility for Type",,
"251","Cost Center","Facility for Type",,
NB. your input was unclear, in my case "20240830" was the index, and thus absent from the output.