I am trying to parse csv file that I don't control. The file contains quoted text like ...,text ""text"",...
and I want the parsed field to look like text "text"
.
Java Opencsv can parse this file just fine, but python library needs the whole field to be quoted (...,"text ""text""",...
) to properly unquote the text.
I've prepared a test suite
that illustrates my problem.
test.csv
col1,col2
idk,text ""text""
idk,"text ""text"""
idk,"text, text"
idk,text """"text""""
idk,"text """"text"""""
Expected output:
[
{'col1': 'idk', 'col2': 'text "text"'},
{'col1': 'idk', 'col2': 'text "text"'},
{'col1': 'idk', 'col2': 'text, text'},
{'col1': 'idk', 'col2': 'text ""text""'},
{'col1': 'idk', 'col2': 'text ""text""'},
]
Test program:
import csv
import unittest
from typing import Dict, List
def parse_input(path: str) -> List[Dict[str, str]]:
with open(path, 'r') as f:
reader = csv.DictReader(f)
return list(reader)
class TestStringMethods(unittest.TestCase):
def test_parsing(self) -> None:
parsed = parse_input('./test.csv')
self.assertEqual(
parsed,
[
{
'col1': 'idk',
'col2': 'text "text"',
},
{
'col1': 'idk',
'col2': 'text "text"',
},
{
'col1': 'idk',
'col2': 'text, text',
},
{
'col1': 'idk',
'col2': 'text ""text""',
},
{
'col1': 'idk',
'col2': 'text ""text""',
},
],
)
if __name__ == '__main__':
unittest.main()
I haven't found any parameters that would alow me to do what I want. I thought doublequote
parameter would work, but default value is already True
.
I have tried python built-in csv lib, pandas and clevercsv. They all treat the double quotes as simple characters not as one escaped double quote. Generally that is a good thing, but now I need to properly parse this file preferably without writing my own parser.
Do you have some suggestions on some python libraries which can parse this file?
Thanks for suggestions.
I solved this problem by preprocessing rows while reading the file.
I used regex to match all fields containing ""
which are not inside double quotes, then I insert double quotes around the field.
It din't have a big impact on performance and I didn't notice any problems.