ruby csv ruby-1.9.3 ruby-1.9.2 ruby-2.0

quote_char causing fits in ruby CSV import


I have a simple CSV file that uses the | (pipe) as a quote character. After upgrading my Rails app from Ruby 1.9.2 to 1.9.3, I'm getting a "CSV::MalformedCSVError: Missing or stray quote in line 1" error.

If I pop open vim and replace the | with regular quotes, single quotes or even "=", the file works fine, but | and * result in the error. Anyone have any thoughts on what might be causing this? Here's a simple one-liner that can reproduce the error:

@csv = CSV.read("public/sample_file.csv", {quote_char: '|', headers: false})

I also reproduced this in Ruby 2.0, and in irb without loading Rails.

Edit: here are some sample lines from the CSV

|076N102                 |,|CARD                                    |,|         1|,|NEW|,|PCS       |
|07-1801                 |,|BASE                                    |,|        18|,|NEW|,|PCS       |
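
To try this without the original file, the two sample lines can be parsed straight from a string. This is only a sketch: `parse_pipe_quoted` is a made-up helper name, and the rows are copied from the sample above. On the versions reported here as affected (1.9.3 and 2.0) it would be expected to take the rescue branch; on a Ruby whose parser accepts | as a quote character it should return two rows.

```ruby
require 'csv'

# Two rows copied from the sample above (pipe-quoted, space-padded fields).
SAMPLE =
  "|076N102                 |,|CARD                                    |,|         1|,|NEW|,|PCS       |\n" \
  "|07-1801                 |,|BASE                                    |,|        18|,|NEW|,|PCS       |\n"

# Hypothetical helper: returns the parsed rows, or nil if the parser
# rejects the input with CSV::MalformedCSVError.
def parse_pipe_quoted(data)
  CSV.parse(data, quote_char: '|', headers: false)
rescue CSV::MalformedCSVError => e
  warn "parse failed: #{e.message}"
  nil
end

rows = parse_pipe_quoted(SAMPLE)
if rows.nil?
  puts "malformed"
else
  puts "#{rows.length} rows, first field: #{rows[0][0].inspect}"
end
```

Parsing from a string rather than a file keeps the file system and Rails out of the picture, so any failure can only come from the CSV parser itself.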

Solution

  • I think you've just discovered a bug in Ruby's CSV module. From csv.rb:

    # csv.rb, line 1587
    @re_chars = /#{%"[-][\\.^$?*+{}()|# \r\n\t\f\v]".encode(@encoding)}/
    

    This Regexp is used to escape characters that conflict with special regular-expression symbols, including your pipe character |. I don't see any reason for the leading [-]; if you remove it, your example starts to work.

    Edit: a hyphen inside a character class (the part enclosed in brackets []) has to be escaped only when it is not the leading character, so I had to update the fixed Regexp:

    # csv.rb, line 1587 (patched)
    @re_chars = /#{%"(?<!\\[)-(?=.*\\])|[\\.^$?*+{}()|# \r\n\t\f\v]".encode(@encoding)}/
    
    CSV.read('sample.csv', {quote_char: '|'})
    # [["076N102                 ",
    #  "CARD                                    ",
    #  "         1", "NEW", "PCS       "],
    # ["07-1801                 ",  
    #  "BASE                                    ",
    #  "        18", "NEW", "PCS       "]]
    

    Since most regex engines, Ruby's included, do not support lookbehind expressions with quantifiers, I had to write the check for the left bracket as a negative lookbehind. As a side effect, it also matches a hyphen whose bracket pair is missing its left bracket. If you find a better solution, please leave a comment.

    I'd be glad to hear any comments before I file a bug report at ruby-lang.org.
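
The behaviour of the two patterns discussed in the answer can be checked outside of csv.rb. The sketch below rebuilds both from the quoted strings (leaving out the `.encode(@encoding)` call) and applies them through `escape_with`, a hypothetical helper that prefixes every match with a backslash. It illustrates the idea of escaping and is not a copy of csv.rb's own code.

```ruby
# Character class shared by both patterns, as quoted from csv.rb.
SPECIALS = "[\\.^$?*+{}()|# \r\n\t\f\v]"

# Pattern as quoted in the answer: "[-]" is its own one-character class,
# so a match needs a hyphen followed by one of the special characters.
ORIGINAL = Regexp.new("[-]" + SPECIALS)

# Pattern proposed in the answer: either a hyphen not preceded by "["
# and followed later by "]", or a single special character.
PROPOSED = Regexp.new("(?<!\\[)-(?=.*\\])|" + SPECIALS)

# Hypothetical helper: put a backslash in front of every match.
def escape_with(pattern, str)
  str.gsub(pattern) { |m| "\\" + m }
end

["|", "-|", "[-]", "a-b]"].each do |input|
  puts "#{input.inspect}: original -> #{escape_with(ORIGINAL, input).inspect}, " \
       "proposed -> #{escape_with(PROPOSED, input).inspect}"
end
```

With the original pattern a lone | comes back unescaped, because nothing matches without a preceding hyphen. The proposed pattern escapes it, leaves the hyphen in "[-]" alone, and does escape the hyphen in "a-b]", which is the side effect with a missing left bracket mentioned in the answer.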