I am trying to parse some CSV files with a few thousand rows each. The files are structured with a comma as the delimiter and quotation marks enclosing every field. I want to use pd.read_csv() to parse the data without simply discarding the faulty lines via the on_bad_lines='skip' argument.
Example:
"Authors","Title","ID","URN"
"Overflow, Stack, Doe, John Society of overflowers "Stack" (something) ","50 years of overflowing in "Stack" : 50 years of stacking---",""117348359","URN:ND_649C52T1A9K1JJ51"
"Tyson, Mike","Me boxing","525166266","URN:SD_95125N2N3523N5KB"
"Smith, Garry",""Funny name" : a book about names","951992851","URN:ND_95N5J2N352BV525N25"
"The club of clubs","My beloved, the "overflow stacker"","9551236651","URN:SD_955K2B61J346F1N25"
I have tried to illustrate the problematic structure of the CSV file. In the example above, only the second data row would get parsed without problems; the others would have issues because of unescaped quotation marks or commas enclosed within the field bounds.
When I run the script with the following call:
df = pd.read_csv(path, engine='python', delimiter=",", quotechar='"', encoding="utf-8", on_bad_lines='warn')
I get warnings on the problematic lines:

ParserWarning: Skipping line 1477: ',' expected after '"'

and similar errors when I try the default engine or pyarrow.
Now, what I want to accomplish is to pass a handler function to the on_bad_lines= argument that skips a line if it cannot be parsed but, at the same time, stores whatever values were found in a new variable, as a dictionary or a list, so I can add that data back manually later in the code. I have tried the following function and passed it as the value for on_bad_lines=, but the bad_lines list just ended up empty anyway:
bad_lines = []

def handle_bad_lines(bad_line):
    print(f"Skipping bad line: {bad_line}")
    bad_lines.append(bad_line)
    return None
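For completeness, this is how I wired the handler into the call; as far as I can tell from the pandas docs, a callable for on_bad_lines= is only supported with engine='python':

import pandas as pd

df = pd.read_csv(
    path,                           # same path as above
    engine="python",                # a callable on_bad_lines needs the python engine
    delimiter=",",
    quotechar='"',
    encoding="utf-8",
    on_bad_lines=handle_bad_lines,  # the handler defined above
)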
Thank you.
I wonder if you can fix your bad CSV first, then run the good CSV through Pandas; don't try to fix it in Pandas (wrong tool for the job).
I don't know how representative your sample CSV really is, but it looks like someone hand-rolled their own CSV encoder by doing something like enclosing every item in a list with double quotes, then joining those elements with commas:
data = [
    ["amy bob", "i"],
    ["cam, deb", "ii"],
    ['"el" fay', "iii"],
]

for row in data:
    quoted = [f'"{x}"' for x in row]
    print(",".join(quoted))
which prints:
"amy bob","i"
"cam, deb","ii"
""el" fay","iii"
Given your sample, would the original data have looked something like this?
[
['Authors', 'Title', 'ID', 'URN' ],
['..."Stack"...', '...in "Stack"...', '"117348359', 'URN:ND_649C52T1A9K1JJ51' ],
['Tyson, Mike', 'Me boxing', '525166266', 'URN:SD_95125N2N3523N5KB' ],
['Smith, Garry', '"Funny name"...', '951992851', 'URN:ND_95N5J2N352BV525N25'],
['The club of...', '..."overflow stacker"', '9551236651', 'URN:SD_955K2B61J346F1N25' ],
]
Not sure about the double quote in the ID of the first record, "117348359. I assume that's a typo you made typing up the sample.
If so, you might start by checking two assumptions: every line begins and ends with a quote, and the sequence

","

only appears in between fields; you don't expect to see that sequence in the data itself:

N_COLS = 4

with open("input-bad.csv") as f:
    for i, line in enumerate(f, start=1):
        line = line.strip()
        if line[0] != '"' or line[-1] != '"':
            print(f"line {i} doesn't have bookend quotes: {line}")
        if (n := line.count('","')) != N_COLS - 1:
            print(f"line {i} appears to have {n + 1} cols: {line}")
Adding a "bad record" at the end of your sample:
['Mr Baz', 'foo"," the bar', '99999999', 'URN:TX_77777777']
Mr Baz","foo"," the bar","99999999","URN:TX_77777777
would print:
line 6 doesn't have bookend quotes: Mr Baz","foo"," the bar","99999999","URN:TX_77777777
line 6 appears to have 5 cols: Mr Baz","foo"," the bar","99999999","URN:TX_77777777
Hopefully that doesn't print out anything, or only a small number of records you could deal with (I don't know how you'd deal with them, though).
If so, then you can fix the file and output good CSV:
import csv
import sys

def process_line(line: str) -> str:
    """
    Strip surrounding whitespace, remove bookend quotes.
    """
    return line.strip()[1:-1]

writer = csv.writer(sys.stdout)

with open("input-bad.csv") as f:
    for line in f:
        line = process_line(line)
        fields = line.split('","')
        writer.writerow(fields)
Running that on your sample I get (with leading spaces for readability):
Authors, Title, ID, URN
"...""Stack""...", "...in ""Stack""...", """117348359", URN:ND_649C52T1A9K1JJ51
"Tyson, Mike", Me boxing, 525166266, URN:SD_95125N2N3523N5KB
"Smith, Garry", """Funny name""...", 951992851, URN:ND_95N5J2N352BV525N25
The club of..., "...""overflow stacker""...", 9551236651, URN:SD_955K2B61J346F1N25
Again, ""117348359 looks weird, but I'll leave that for you.