python pandas csv teradata-sql-assistant

Remove newlines inside column content in CSV files


I have the following sample CSV file:

'TEXT';'DATE'
'hello';'20/02/2002'
'hello!
how are you?';'21/02/2002'

As you can see, the separator between columns is ; and the content of each column is delimited by '. This causes problems when I process the file with pandas, because pandas treats line breaks as row separators. That is, it interprets the line break between "hello!" and "how are you?" as the end of a row.

What I need is to remove the newlines within the content of each column, so that the file looks like this:

'TEXT';'DATE'
'hello';'20/02/2002'
'hello! how are you?';'21/02/2002'

Simply removing every \r\n sequence would not work, because then I would lose the row separation. What can I try? I'm using Teradata SQL Assistant to generate the CSV file.


Solution

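  • You can use the sep= and quotechar= parameters of pd.read_csv. With quotechar set, line breaks inside a quoted field are read as part of the value rather than as row separators: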

    import pandas as pd

    df = pd.read_csv('your_file.csv', sep=';', quotechar="'")
    print(df)
    

    Prints:

                         TEXT        DATE
    0                   hello  20/02/2002
    1  hello!\r\nhow are you?  21/02/2002
    

    If you also want to replace the newlines inside the values:

    df['TEXT'] = df['TEXT'].str.replace('\r', '').str.replace('\n', ' ')
    print(df)
    

    Prints:

                      TEXT        DATE
    0                hello  20/02/2002
    1  hello! how are you?  21/02/2002
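
Since the question asks for a cleaned file rather than just a cleaned DataFrame, here is a minimal sketch of how the result could be written back in the same format. It assumes the sample data from the question, held in a hypothetical in-memory string `raw` instead of a real file, and it uses a single regex replacement in place of the two chained replacements above. The `cleaned` variable name is also just for illustration.

```python
import csv
import io

import pandas as pd

# Hypothetical in-memory stand-in for the exported file.
raw = "'TEXT';'DATE'\n'hello';'20/02/2002'\n'hello!\nhow are you?';'21/02/2002'\n"

df = pd.read_csv(io.StringIO(raw), sep=';', quotechar="'")

# Collapse any embedded line break (\r\n or \n) into a single space.
df['TEXT'] = df['TEXT'].str.replace(r'\r?\n', ' ', regex=True)

# Serialize with the original separator and quote character,
# quoting every field as in the source file.
cleaned = df.to_csv(sep=';', quotechar="'", quoting=csv.QUOTE_ALL, index=False)
print(cleaned)
```

To write to disk instead of a string, pass a path as the first argument to `to_csv`, for example `df.to_csv('cleaned_file.csv', ...)`, where the file name is a placeholder.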