I'm trying to split a CSV file into an array, but encountering two problems: quoted commas and empty elements.
I tried numerous possibilities. I've not found a a regex that works 100%.
The CSV looks like:
123,2.99,AMO024,Title,"Description, more info",,123987564
The regex I tried:
thisLine.split(/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/)
In my output array the fifth element comes out as 123987564 not an empty string.
Instead of using a split, I think it would be easier to simply execute a match and process all the found matches.
This expression will:
Regex: (?:^|,)(?=[^"]|(")?)"?((?(1)[^"]*|[^,"]*))"?(?=,|$)
Sample Text
123,2.99,AMO024,Title,"Description, more info",,123987564
ASP example using the non-java expression
Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True
regEx.MultiLine = True
sourcestring = "your source string"
regEx.Pattern = "(?:^|,)(?=[^""]|("")?)""?((?(1)[^""]*|[^,""]*))""?(?=,|$)"
Set Matches = regEx.Execute(sourcestring)
For z = 0 to Matches.Count-1
results = results & "Matches(" & z & ") = " & chr(34) & Server.HTMLEncode(Matches(z)) & chr(34) & chr(13)
For zz = 0 to Matches(z).SubMatches.Count-1
results = results & "Matches(" & z & ").SubMatches(" & zz & ") = " & chr(34) & Server.HTMLEncode(Matches(z).SubMatches(zz)) & chr(34) & chr(13)
next
results=Left(results,Len(results)-1) & chr(13)
next
Response.Write "<pre>" & results
Matches using the non-java expression
Group 0 gets the entire substring which includes the comma
Group 1 gets the quote if it's used
Group 2 gets the value not including the comma
[0][0] = 123
[0][1] =
[0][2] = 123
[1][0] = ,2.99
[1][1] =
[1][2] = 2.99
[2][0] = ,AMO024
[2][1] =
[2][2] = AMO024
[3][0] = ,Title
[3][1] =
[3][2] = Title
[4][0] = ,"Description, more info"
[4][1] = "
[4][2] = Description, more info
[5][0] = ,
[5][1] =
[5][2] =
[6][0] = ,123987564
[6][1] =
[6][2] = 123987564
As Boris pointed out CSV format will escape a double quote "
as a double double quote ""
. Although this requirement wasn't included by the OP, if your text includes double double quotes then you'll want to use a this modified expression:
Regex: (?:^|,)(?=[^"]|(")?)"?((?(1)(?:[^"]|"")*|[^,"]*))"?(?=,|$)
See also: https://regex101.com/r/y8Ayag/1
It should also be pointed out that Regex is a pattern matching tool not a parsing engine. Therefore if your text includes double double quotes it will still contain the double double quotes after pattern matching is completed. With this solution you'd still need to search for the double double quotes and replace them in your captured text.