ruby csv byte-order-mark zero-width-space

Mysterious leading "empty" character at the beginning of a string that came from a CSV file


While reading a CSV file into an Array, I noticed that the very first array element, which is a String, contains a leading "" character.

For example:

str = contacts[0][0]
p str

gives me...

"SalesRepName"

Then by sheer chance I happened to try:

str = contacts[0][0].split(//)
p str

and that gave me...

["", "S", "a", "l", "e", "s", "R", "e", "p", "N", "a", "m", "e"]

I've checked every other element in the array and this is the only one that has a string containing leading "".


Solution

  • Before I could post this question, I stumbled upon the answer. Writing up the question gave me the idea of checking the code point of this "" character.

    str = contacts[0][0].split(//)
    p str[0].codepoints
    

    gave me

    [65279]

    Searching for character 65279 (U+FEFF, a Unicode code point well outside the ASCII range), I found this answer: https://stackoverflow.com/a/6784805/3170942

    According to SLaks:

    It's a zero-width no-break space. It's more commonly used as a byte-order mark (BOM).
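To make the connection between 65279 and the BOM concrete, here is a minimal sketch. The string `with_bom` is a made-up stand-in for `contacts[0][0]`, not data from the actual file. It shows that the invisible character is U+FEFF, that it is encoded as the three bytes EF BB BF in UTF-8, and that it could also be removed by hand after reading:

```ruby
# Hypothetical stand-in for contacts[0][0]: a header cell that still
# carries the BOM character from the start of the file.
with_bom = "\uFEFFSalesRepName"

first_codepoint = with_bom.codepoints.first   # the invisible first character
bom_bytes       = with_bom.bytes.first(3)     # its UTF-8 encoding

# Manual cleanup: strip U+FEFF only when it is the very first character.
cleaned = with_bom.sub(/\A\uFEFF/, "")

p first_codepoint   # 65279, i.e. 0xFEFF
p bom_bytes         # [239, 187, 191], i.e. EF BB BF
p cleaned           # "SalesRepName"
```

Stripping by hand works, but it has to be repeated wherever the file is read, which is why handling it at open time is nicer.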

    This, in turn, led me to the solution here: https://stackoverflow.com/a/7780559/3170942
    In this response, knut provided an elegant solution, which looked like this:

    File.open('file.txt', "r:bom|utf-8"){|file|
      text_without_bom = file.read
    }
    

    The key element I was looking for was the "r:bom|utf-8" mode string, which tells Ruby to skip a leading BOM, if there is one, and read the file as UTF-8. So I adapted it to my code, which became this:

    CSV.foreach($csv_path + $csv_file, "r:bom|utf-8") do |row|
      contacts << row
    end
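For anyone who wants to reproduce the problem and the fix without my file, here is a self-contained sketch. The temporary file, the `Region` column, and the sample row are all made up for illustration. It writes a CSV that starts with a UTF-8 BOM, then reads it once with a plain UTF-8 mode and once with "r:bom|utf-8":

```ruby
require "csv"
require "tempfile"

# Create a throwaway CSV file that begins with the UTF-8 BOM (EF BB BF).
# The file name and its contents are invented for this example.
file = Tempfile.new(["contacts", ".csv"])
file.binmode
file.write("\xEF\xBB\xBFSalesRepName,Region\nAlice,West\n")
file.close

# Read without BOM handling: the BOM ends up inside the first cell.
plain = []
CSV.foreach(file.path, "r:utf-8") { |row| plain << row }

# Read with BOM handling: the BOM is skipped when the file is opened.
stripped = []
CSV.foreach(file.path, "r:bom|utf-8") { |row| stripped << row }

p plain[0][0].codepoints.first   # code point of the stray first character
p stripped[0][0]                 # header cell without the BOM

file.unlink
```

If the file has no BOM at all, the "bom|utf-8" mode should simply read it as ordinary UTF-8, so it seems safe to use for files of unknown origin.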
    

    I spent hours on this stupid problem. Hopefully, this will save you some time!