regexcsvasp-classicsplitquotes

Split a CSV where some entries have double quotes


I'm trying to split a CSV file into an array, but encountering two problems: quoted commas and empty elements.

I tried numerous possibilities. I've not found a a regex that works 100%.

The CSV looks like:

123,2.99,AMO024,Title,"Description, more info",,123987564

The regex I tried:

thisLine.split(/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/)

In my output array the fifth element comes out as 123987564 not an empty string.


Solution

  • Description

    Instead of using a split, I think it would be easier to simply execute a match and process all the found matches.

    This expression will:

    Regex: (?:^|,)(?=[^"]|(")?)"?((?(1)[^"]*|[^,"]*))"?(?=,|$)

    enter image description here

    Example

    Sample Text

    123,2.99,AMO024,Title,"Description, more info",,123987564
    

    ASP example using the non-java expression

    Set regEx = New RegExp
    regEx.Global = True
    regEx.IgnoreCase = True
    regEx.MultiLine = True
    sourcestring = "your source string"
    regEx.Pattern = "(?:^|,)(?=[^""]|("")?)""?((?(1)[^""]*|[^,""]*))""?(?=,|$)"
    Set Matches = regEx.Execute(sourcestring)
      For z = 0 to Matches.Count-1
        results = results & "Matches(" & z & ") = " & chr(34) & Server.HTMLEncode(Matches(z)) & chr(34) & chr(13)
        For zz = 0 to Matches(z).SubMatches.Count-1
          results = results & "Matches(" & z & ").SubMatches(" & zz & ") = " & chr(34) & Server.HTMLEncode(Matches(z).SubMatches(zz)) & chr(34) & chr(13)
        next
        results=Left(results,Len(results)-1) & chr(13)
      next
    Response.Write "<pre>" & results
    

    Matches using the non-java expression

    Group 0 gets the entire substring which includes the comma
    Group 1 gets the quote if it's used
    Group 2 gets the value not including the comma

    [0][0] = 123
    [0][1] = 
    [0][2] = 123
    
    [1][0] = ,2.99
    [1][1] = 
    [1][2] = 2.99
    
    [2][0] = ,AMO024
    [2][1] = 
    [2][2] = AMO024
    
    [3][0] = ,Title
    [3][1] = 
    [3][2] = Title
    
    [4][0] = ,"Description, more info"
    [4][1] = "
    [4][2] = Description, more info
    
    [5][0] = ,
    [5][1] = 
    [5][2] = 
    
    [6][0] = ,123987564
    [6][1] = 
    [6][2] = 123987564
    

    Edited

    As Boris pointed out CSV format will escape a double quote " as a double double quote "". Although this requirement wasn't included by the OP, if your text includes double double quotes then you'll want to use a this modified expression:

    Regex: (?:^|,)(?=[^"]|(")?)"?((?(1)(?:[^"]|"")*|[^,"]*))"?(?=,|$)

    See also: https://regex101.com/r/y8Ayag/1

    It should also be pointed out that Regex is a pattern matching tool not a parsing engine. Therefore if your text includes double double quotes it will still contain the double double quotes after pattern matching is completed. With this solution you'd still need to search for the double double quotes and replace them in your captured text.