pythonpandascsv

How to strip quotes from CSV table?


I'm using the pandas library to convert CSVs to other data types. My CSV has the fields quoted, like this:

"Version", "Date", "Code", "Description", "Tracking Number", "Author"
"0.1", "22AUG2022", , "Initial Draft", "NR", "Sarah Marshall"
"0.2", "23SEP2022", "D", "Update 1", "N-1234", "Bill Walter"
"0.3", "09MAY2023", "A\, C", "New update.", "N-1235", "George Orwell"

The problem is that when I read the CSV with pandas.read_csv('myfile.csv'), the quotes are included in the values:

   Version        "Date"   "Code"     "Description"  "Tracking Number"           "Author"
0      0.1   "22AUG2022"            "Initial Draft"               "NR"   "Sarah Marshall"
1      0.2   "23SEP2022"      "D"        "Update 1"           "N-1234"      "Bill Walter"
2      0.3   "09MAY2023"   "A, C"     "New update."           "N-1235"    "George Orwell"

So these quotes are included when converting to HTML:

<table>
  <thead>
    <tr style="text-align: right">
      <th></th>
      <th>Version</th>
      <th>"Date"</th>
      <th>"Code"</th>
      <th>"Description"</th>
      <th>"Tracking Number"</th>
      <th>"Author"</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0.1</td>
      <td>"22AUG2022"</td>
      <td></td>
      <td>"Initial Draft"</td>
      <td>"NR"</td>
      <td>"Sarah Marshall"</td>
    </tr>
...

I've tried quoting=csv.QUOTE_NONE, but it didn't fix it--in fact, it actually added quotes to the Version column.

I found this question--the answers essentially says to strip out any quotes in post processing.

I can of course loop through the CSV rows and strip out the leading/trailing quotes for each of the values, but since quoted values are common with CSV I'd expect that there would be a parameter or easy way to enable/disable the quotes in the rendered output but I can't find something like that.

Is there a way to accomplish this?


Solution

  • The leading spaces in your input seem to be throwing Pandas off (and some other CSV processors I can think of).

    Try the skipinitialspace=True option to make Pandas ignore every space between a comma and a quote char:

    import pandas as pd
    
    df = pd.read_csv("input.csv", skipinitialspace=True)
    print(df)
    
       Version       Date  ... Tracking Number          Author
    0      0.1  22AUG2022  ...              NR  Sarah Marshall
    1      0.2  23SEP2022  ...          N-1234     Bill Walter
    2      0.3  09MAY2023  ...          N-1235   George Orwell
    
    [3 rows x 6 columns]
    

    Expounding a bit, you can see the difference, where quotes are present, between the default of not skipping the initial space and skipping:

    from io import StringIO
    
    data = """
    Col1
    "foo"
     "foo"
    " foo"
    """.strip()
    
    print("no skip")
    
    df = pd.read_csv(StringIO(data))
    for _, x in df.iterrows():
        print(repr(x["Col1"]))
    
    print()
    print("skip initial")
    
    df = pd.read_csv(StringIO(data), skipinitialspace=True)
    for _, x in df.iterrows():
        print(repr(x["Col1"]))
    
    no skip
    'foo'
    ' "foo"'
    ' foo'
    
    skip initial
    'foo'
    'foo'
    ' foo'