pythonpandasdataframecsv

Pandas read in CSV using single quote as quotechar throws SyntaxError: incomplete input


I am currently trying to read in a .csv with the following structure:

 samplecsv = """ 'column A', 'column b', 'column c',
 'valueA', 'valueb', 'valuec,d',
 'valueA', 'valueb', 'valuecd',
 'valueA', 'valueb', 'valuecd',
 'valueA', 'valueb', 'valuec,d'
 """

Note that I don't include it as a string but I stored the data in the same structure in a .txt file.

I want python to ignore the comma in 'valuec,d' for the separation, for this value to be stored in a single column later. I found out that you can provide a to then be ignored quotechar as argument, yet I can’t get it to run with single quote. I have tried:

import pandas as pd

DF = pd.read_csv(r'Myfilepath', sep = ',', quotechar =''') 

As well as:

DF = pd.read_csv(r'Myfilepath', sep = ',', quotechar ='''') 

I also tried:

DF = pd.read_csv(r'Myfilepath', sep = ',', quotechar ="'")

All 3 versions gave me SyntaxError: incomplete input. What am I doing wrong?

Note: The question marked as duplicate only explains the difference between single and double quotes but not how to exclude them in pd.read_csv or why this doesn't work in this case.


Solution

  • I don't get any errors with this data. The fields contain extra whitespace though, which requires the skipinitialspace parameter. The following code:

    import pandas as pd
    samplecsv = """ 'column A', 'column b', 'column c',
    'valueA', 'valueb', 'valuec,d',
     'valueA', 'valueb', 'valuecd',
    'valueA', 'valueb', 'valuecd',
    'valueA', 'valueb', 'valuec,d'
    """
    from io import StringIO
    df=pd.read_csv(StringIO(csv), sep = ',', quotechar ="'",skipinitialspace=True)
    df
    

    Produces

      column A column b  column c  Unnamed: 3
    0   valueA   valueb  valuec,d         NaN
    1   valueA   valueb   valuecd         NaN
    2   valueA   valueb   valuecd         NaN
    3   valueA   valueb  valuec,d         NaN
    

    The final empty column is sue to the trailing , in most lines. It can be removed with dropna:

    >>> df.dropna(axis=1,how='all')
      column A column b  column c
    0   valueA   valueb  valuec,d
    1   valueA   valueb   valuecd
    2   valueA   valueb   valuecd
    3   valueA   valueb  valuec,d