pythoncsvopencsv

Parsing non standard csv file in python


I am trying to parse csv file that I don't control. The file contains quoted text like ...,text ""text"",... and I want the parsed field to look like text "text".

Java Opencsv can parse this file just fine, but python library needs the whole field to be quoted (...,"text ""text""",...) to properly unquote the text.

I've prepared a test suite that illustrates my problem.

test.csv

col1,col2
idk,text ""text""
idk,"text ""text"""
idk,"text, text"
idk,text """"text""""
idk,"text """"text"""""

Expected output:

[
    {'col1': 'idk', 'col2': 'text "text"'},
    {'col1': 'idk', 'col2': 'text "text"'},
    {'col1': 'idk', 'col2': 'text, text'},
    {'col1': 'idk', 'col2': 'text ""text""'},
    {'col1': 'idk', 'col2': 'text ""text""'},
]

Test program:

import csv
import unittest
from typing import Dict, List


def parse_input(path: str) -> List[Dict[str, str]]:
    with open(path, 'r') as f:
        reader = csv.DictReader(f)
        return list(reader)


class TestStringMethods(unittest.TestCase):
    def test_parsing(self) -> None:
        parsed = parse_input('./test.csv')
        self.assertEqual(
            parsed,
            [
                {
                    'col1': 'idk',
                    'col2': 'text "text"',
                },
                {
                    'col1': 'idk',
                    'col2': 'text "text"',
                },
                {
                    'col1': 'idk',
                    'col2': 'text, text',
                },
                {
                    'col1': 'idk',
                    'col2': 'text ""text""',
                },
                {
                    'col1': 'idk',
                    'col2': 'text ""text""',
                },
            ],
        )


if __name__ == '__main__':
    unittest.main()

I haven't found any parameters that would alow me to do what I want. I thought doublequote parameter would work, but default value is already True.

I have tried python built-in csv lib, pandas and clevercsv. They all treat the double quotes as simple characters not as one escaped double quote. Generally that is a good thing, but now I need to properly parse this file preferably without writing my own parser.

Do you have some suggestions on some python libraries which can parse this file?


Solution

  • Thanks for suggestions.

    I solved this problem by preprocessing rows while reading the file. I used regex to match all fields containing "" which are not inside double quotes, then I insert double quotes around the field.

    It din't have a big impact on performance and I didn't notice any problems.