Tags: ruby, amazon-web-services, amazon-s3, aws-sdk-ruby

File encoding issue when downloading file from AWS S3


I have a CSV file in AWS S3 that I'm trying to open in a local temp file. This is the code:

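require 'aws-sdk-s3'
require 'tempfile'

s3 = Aws::S3::Resource.new
bucket = s3.bucket(bucket_name)  # bucket_name: placeholder for the bucket name
obj = bucket.object(object_key)  # object_key: placeholder for the object key
temp = Tempfile.new('temp.csv')
obj.get(response_target: temp)   # streams the object body into temp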

It pulls the file from AWS and writes it to a new temp file whose name starts with 'temp.csv'. For some files, the obj.get(..) line throws the following error:

WARN: Encoding::UndefinedConversionError: "\xEF" from ASCII-8BIT to UTF-8
WARN: /Users/.rbenv/versions/2.5.0/lib/ruby/2.5.0/delegate.rb:349:in `write'
/Users/.rbenv/versions/2.5.0/lib/ruby/2.5.0/delegate.rb:349:in `block in delegating_block'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-core-3.21.2/lib/seahorse/client/http/response.rb:62:in `signal_data'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-core-3.21.2/lib/seahorse/client/net_http/handler.rb:83:in `block (3 levels) in transmit'
...
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-s3-1.13.0/lib/aws-sdk-s3/client.rb:2666:in `get_object'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-s3-1.13.0/lib/aws-sdk-s3/object.rb:657:in `get'

The stack trace shows the error is raised while the .get call from the AWS SDK for Ruby writes to the temp file.

Things I've tried:

When uploading the file (object) to AWS S3, you can specify content_encoding, so I tried setting that to UTF-8:

obj.upload_file({file path}, content_encoding: 'utf-8')

Also when you call .get you can set response_content_encoding:

obj.get(response_target: temp, response_content_encoding: 'utf-8')

Neither of those works; both result in the same error as above. I really expected that to do the trick. In the AWS S3 dashboard I can see that the content encoding is indeed set by the code, but it doesn't appear to make a difference.

It does work when I change the Tempfile line in the first code snippet above to:

temp = Tempfile.new('temp.csv', encoding: 'ascii-8bit')

But I'd prefer to upload and/or download the file from AWS S3 with the proper encoding. Can someone explain why specifying the encoding on the tempfile works? Or how to make it work through the AWS S3 upload/download?

Important to note: The problematic character in the error message appears to be just a stray symbol added at the beginning of the auto-generated file I'm working with. I'm not worried about reading that character correctly, since it gets ignored when I parse the file anyway.


Solution

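  • I don't have a full answer to all your questions, but I think I have a general solution: always put the temp file into binary mode. That way the AWS gem simply dumps the data from the bucket into the file without any re-encoding. In binary mode Ruby treats the file's contents as ASCII-8BIT and skips encoding conversion, which is presumably also why your encoding: 'ascii-8bit' version works: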

    Step 1 (put the Tempfile into binmode):

    temp = Tempfile.new('temp.csv')
    temp.binmode
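
To see the effect of binmode without touching AWS, here is a minimal sketch. It assumes the SDK hands the Tempfile binary (ASCII-8BIT) chunks, and it forces the text-mode file to UTF-8 explicitly so the failure is reproducible outside the original app. SAMPLE_CHUNK and write_chunk are made-up names for illustration only.

```ruby
require 'tempfile'

# Hypothetical stand-in for data streamed from S3: a UTF-8 BOM plus CSV
# text, tagged as binary (ASCII-8BIT) the way raw network chunks are.
SAMPLE_CHUNK = "\xEF\xBB\xBFid,name\n1,Ada\n".b

# Writes the chunk to a fresh Tempfile and returns either the bytes that
# ended up on disk or the encoding error raised by the write.
def write_chunk(chunk, binary:)
  temp = Tempfile.new('temp.csv', encoding: 'UTF-8')
  temp.binmode if binary
  temp.write(chunk)
  temp.flush
  File.binread(temp.path)
rescue Encoding::UndefinedConversionError => e
  e
ensure
  temp&.close!
end

p write_chunk(SAMPLE_CHUNK, binary: false).class
p write_chunk(SAMPLE_CHUNK, binary: true) == SAMPLE_CHUNK
```

The text-mode write should fail with the same Encoding::UndefinedConversionError as in the question, while the binmode write should store the bytes unchanged.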
    

    You will still have one problem, however: your UTF-8 file now starts with a 3-byte BOM (byte order mark, EF BB BF). Its first byte is the \xEF from your error message.

    I don't know where this BOM came from. Was it there when the file was uploaded? If so, it might be a good idea to strip the 3-byte BOM before uploading.
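
If you do want to strip the BOM before uploading, something like the following might work. It is only a sketch that operates on a local file path; UTF8_BOM_BYTES, bytes_without_bom and strip_bom! are hypothetical helper names, not part of Ruby or the AWS SDK.

```ruby
# The UTF-8 byte order mark as raw bytes.
UTF8_BOM_BYTES = "\xEF\xBB\xBF".b

# Returns the file's bytes with a leading UTF-8 BOM removed, if present.
def bytes_without_bom(path)
  data = File.binread(path)
  return data unless data.start_with?(UTF8_BOM_BYTES)

  data.byteslice(3, data.bytesize - 3)
end

# Rewrites the file in place only when a BOM was actually found.
# Returns true if the file was changed, false otherwise.
def strip_bom!(path)
  cleaned = bytes_without_bom(path)
  return false if cleaned.bytesize == File.size(path)

  File.binwrite(path, cleaned)
  true
end
```

You would call strip_bom!(file_path) on the local file and then upload it as before.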

    However, if you set up your system as below, it will not matter, because Ruby supports transparent reading of UTF-8 with or without a BOM and returns the string correctly whether or not the BOM is present in the file:

    Step 2 (process the file using bom|utf-8):

    File.read(temp.path, encoding: "bom|utf-8")
    # or...
    CSV.read(temp.path,  encoding: "bom|utf-8")
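
A quick way to check what bom|utf-8 does is to read the same file both ways. This is a sketch using a scratch file rather than a real S3 download; read_both_ways is a made-up helper name.

```ruby
require 'tempfile'

# Writes a BOM-prefixed CSV to a scratch file (binary mode, as in Step 1)
# and reads it back twice to compare the two encoding options.
def read_both_ways
  Tempfile.create('bom-demo') do |f|
    f.binmode
    f.write("\xEF\xBB\xBFid,name\n1,Ada\n".b)
    f.flush
    {
      plain:     File.read(f.path, encoding: 'utf-8'),
      bom_aware: File.read(f.path, encoding: 'bom|utf-8')
    }
  end
end

result = read_both_ways
p result[:plain].start_with?("\uFEFF")  # does the plain read keep the BOM?
p result[:bom_aware]
```

The plain utf-8 read should keep the BOM as a leading U+FEFF character, while the bom|utf-8 read should start directly with the header row.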
    

    I think this covers all your bases. Whether you receive files encoded as BOM + UTF-8 or as plain UTF-8, they will be processed correctly this way, with no extra header characters in the final string and no errors when the AWS gem saves them.

    Another option (from OP)

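    Use obj.get.body instead, which bypasses the whole issue with response_target and Tempfile. As far as I know, without a response_target the body is an in-memory IO-like object that you can call .read on.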
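
If you go the body route, you still get raw bytes, so you may want to tag them as UTF-8 and drop the BOM yourself. A rough sketch, assuming the body responds to read like a StringIO; body_to_utf8 is a hypothetical helper and fake_body stands in for obj.get.body so this runs without AWS.

```ruby
require 'stringio'

# Turns an IO-like response body into a UTF-8 string without a leading BOM.
def body_to_utf8(body)
  text = body.read.force_encoding(Encoding::UTF_8)
  text.delete_prefix("\uFEFF")
end

# Hypothetical stand-in for obj.get.body.
fake_body = StringIO.new("\xEF\xBB\xBFid,name\n1,Ada\n".b)
p body_to_utf8(fake_body)
```

With the real SDK the call would presumably look like body_to_utf8(obj.get.body), and the result can be passed to CSV.parse.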

    Useful references:
    Is there a way to remove the BOM from a UTF-8 encoded file?
    How to avoid tripping over UTF-8 BOM when reading files
    What's the difference between UTF-8 and UTF-8 without BOM?
    How to write BOM marker to a file in Ruby