google-cloud-platform gsutil google-cloud-sdk non-unicode

UnicodeEncodeError while transferring ".eml" file to Google Cloud Platform (gsutil v4.6.1 on Linux)


While transferring files from a Linux system to Google Cloud Platform with the gsutil cp command, the transfer fails on some old ".eml" files. The failure occurs while gsutil processes the file content (not just the file name), which contains non-English characters that are not encoded in Unicode.

The command attempted was:

gsutil cp "/home/darsenlu/Home/mail/Pan/Fw_ japanese_lyrics.eml" gs://darsen_backup_monthly/

The error message was:

UnicodeEncodeError: 'ascii' codec can't encode character '\udca8' in position 22881: ordinal not in range(128)
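The character '\udca8' in the message is a lone surrogate, which is how Python's "surrogateescape" error handler represents an undecodable byte 0xA8. The following is a minimal sketch of that mechanism, assuming the file content was decoded as ASCII with surrogateescape and later re-encoded strictly as ASCII. The sample bytes are hypothetical, and this is not confirmed to be the exact gsutil code path.

```python
# Hypothetical sample: ASCII text followed by the byte 0xA8 named in the error.
raw = b"Yahoo \xa8C"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
text = raw.decode("ascii", errors="surrogateescape")
escaped_char = text[6]  # the character standing in for byte 0xA8

# Strict ASCII encoding now fails with the same kind of error as above.
try:
    text.encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# Encoding with the same handler restores the original bytes unchanged.
round_trip = text.encode("ascii", errors="surrogateescape")

print(error_message)
```

Because the escape is reversible, the bytes themselves are not damaged. The failure only appears when something encodes the escaped string without the surrogateescape handler.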

gsutil rsync gives a very similar error. Position 22881 (0x5961) turns out to be near the end of the multi-part e-mail source file. The following hex dump shows the file content at that position:

00005960: 20a8 43a4 d1b3 a320 5961 686f 6f21 a95f   .C.... Yahoo!._
00005970: bcaf 203e 2020 7777 772e 7961 686f 6f2e  .. >  www.yahoo.
00005980: 636f 6d2e 7477 0d0a                      com.tw..
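The offset arithmetic can be checked against the first row of the dump. This is a small sketch, assuming the position reported in the error corresponds one-to-one to a byte offset in the file; the helper name non_ascii_offsets is made up for illustration.

```python
from typing import List


def non_ascii_offsets(data: bytes, base: int = 0) -> List[int]:
    """Return absolute offsets of bytes >= 0x80; base is the offset of data[0]."""
    return [base + i for i, b in enumerate(data) if b >= 0x80]


# First row of the hex dump above, which starts at file offset 0x5960.
row = bytes.fromhex("20a843a4d1b3a3205961686f6f21a95f")
offsets = non_ascii_offsets(row, base=0x5960)
first = offsets[0]

print(first, hex(first))  # 22881 0x5961
```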

We see the byte 0xa8 at position 0x5961, which is the source of the problem, as the error message indicates. For some reason gsutil tries to encode the text. Opening the file in a terminal that supports Chinese characters shows this:

< 每天都 Yahoo!奇摩 >  www.yahoo.com.tw
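The rendering above can be cross-checked by decoding the non-ASCII bytes from the hex dump with Python's built-in "big5" codec. This is only a sketch to confirm the encoding guess; the variable names are illustrative.

```python
# Non-ASCII byte pairs copied from the hex dump, in file order.
before_yahoo = bytes.fromhex("a843a4d1b3a3")  # the bytes preceding " Yahoo!"
after_yahoo = bytes.fromhex("a95fbcaf")       # the bytes following "Yahoo!"

decoded_before = before_yahoo.decode("big5")
decoded_after = after_yahoo.decode("big5")

# Rebuild the visible part of the line as the terminal displayed it.
line = f"< {decoded_before} Yahoo!{decoded_after} >"
print(line)
```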

The first Chinese character, "每", is 0xa843 when encoded in Big5. A simple workaround is to rename the file extension from ".eml" to something else, such as ".eml.bak", so that gsutil does not process the file content. Unfortunately, during a bulk transfer it is difficult to know in advance which files contain such non-English characters, and the whole process can be interrupted multiple times.
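One way to find candidate files before a bulk transfer is to scan for ".eml" files that contain any byte outside the ASCII range. The sketch below assumes that such a byte is what triggers the failure, which is confirmed here only for the one file shown; the function name find_non_ascii_eml and the example path are hypothetical.

```python
import os
from typing import Iterator, Tuple


def find_non_ascii_eml(root: str) -> Iterator[Tuple[str, int]]:
    """Yield (path, offset of first byte >= 0x80) for each .eml file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(".eml"):
                continue
            path = os.path.join(dirpath, name)
            # Read in binary mode so no decoding is attempted.
            with open(path, "rb") as fh:
                data = fh.read()
            offset = next((i for i, b in enumerate(data) if b >= 0x80), None)
            if offset is not None:
                yield path, offset


# Example (hypothetical path):
# for path, offset in find_non_ascii_eml("/home/user/mail"):
#     print(f"{path}: first non-ASCII byte at offset {offset} (0x{offset:x})")
```

Each file is read fully into memory, which is fine for typical e-mail sizes but would need chunked reads for very large files. The reported files could then be renamed or excluded before running gsutil.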

The following is the full error output:

darsenlu@devmodel:~/Home$ gsutil cp "/home/darsenlu/Home/mail/Pan/Fw_ japanese_lyrics.eml" gs://darsen_backup_monthly/
Copying file:///home/darsenlu/Home/mail/Pan/Fw_ japanese_lyrics.eml [Content-Type=message/rfc822]...
Traceback (most recent call last):
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gsutil", line 21, in <module>
    gsutil.RunMain()
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gsutil.py", line 122, in RunMain
    sys.exit(gslib.__main__.main())
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/__main__.py", line 444, in main
    user_project=user_project)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/__main__.py", line 780, in _RunNamedCommandAndHandleExceptions
    _HandleUnknownFailure(e)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/__main__.py", line 639, in _RunNamedCommandAndHandleExceptions
    user_project=user_project)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/command_runner.py", line 411, in RunNamedCommand
    return_code = command_inst.RunCommand()
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/commands/cp.py", line 1124, in RunCommand
    seek_ahead_iterator=seek_ahead_iterator)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/command.py", line 1525, in Apply
    arg_checker, should_return_results, fail_on_error)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/command.py", line 1596, in _SequentialApply
    worker_thread.PerformTask(task, self)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/command.py", line 2316, in PerformTask
    results = task.func(cls, task.args, thread_state=self.thread_gsutil_api)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/commands/cp.py", line 709, in _CopyFuncWrapper
    preserve_posix=cls.preserve_posix_attrs)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/commands/cp.py", line 924, in CopyFunc
    preserve_posix=preserve_posix)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/copy_helper.py", line 3957, in PerformCopy
    gzip_encoded=gzip_encoded)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/copy_helper.py", line 2250, in _UploadFileToObject
    parallel_composite_upload, logger)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/copy_helper.py", line 2066, in _DelegateUploadFileToObject
    elapsed_time, uploaded_object = upload_delegate()
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/copy_helper.py", line 2227, in CallNonResumableUpload
    gzip_encoded=gzip_encoded_file)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/copy_helper.py", line 1762, in _UploadFileToObjectNonResumable
    gzip_encoded=gzip_encoded)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/cloud_api_delegator.py", line 388, in UploadObject
    gzip_encoded=gzip_encoded)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/gcs_json_api.py", line 1712, in UploadObject
    gzip_encoded=gzip_encoded)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/gcs_json_api.py", line 1534, in _UploadObject
    global_params=global_params)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/third_party/storage_apitools/storage_v1_client.py", line 1182, in Insert
    upload=upload, upload_config=upload_config)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/third_party/apitools/apitools/base/py/base_api.py", line 703, in _RunMethod
    download)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/third_party/apitools/apitools/base/py/base_api.py", line 679, in PrepareHttpRequest
    upload.ConfigureRequest(upload_config, http_request, url_builder)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/third_party/apitools/apitools/base/py/transfer.py", line 763, in ConfigureRequest
    self.__ConfigureMultipartRequest(http_request)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/third_party/apitools/apitools/base/py/transfer.py", line 823, in __ConfigureMultipartRequest
    g.flatten(msg_root, unixfrom=False)
  File "/usr/lib/python3.6/email/generator.py", line 116, in flatten
    self._write(msg)
  File "/usr/lib/python3.6/email/generator.py", line 181, in _write
    self._dispatch(msg)
  File "/usr/lib/python3.6/email/generator.py", line 214, in _dispatch
    meth(msg)
  File "/usr/lib/python3.6/email/generator.py", line 272, in _handle_multipart
    g.flatten(part, unixfrom=False, linesep=self._NL)
  File "/usr/lib/python3.6/email/generator.py", line 116, in flatten
    self._write(msg)
  File "/usr/lib/python3.6/email/generator.py", line 181, in _write
    self._dispatch(msg)
  File "/usr/lib/python3.6/email/generator.py", line 214, in _dispatch
    meth(msg)
  File "/usr/lib/python3.6/email/generator.py", line 361, in _handle_message
    payload = self._encode(payload)
  File "/usr/lib/python3.6/email/generator.py", line 412, in _encode
    return s.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\udca8' in position 22881: ordinal not in range(128)

The Linux system is Ubuntu 18.04.4 LTS (GNU/Linux 4.15.0-76-generic x86_64).


Solution

  • I took your string with Chinese characters and was able to reproduce your error. Updating to gsutil 4.62 fixed it. Here's the merged PR and issue tracker as reference.

    Update Cloud SDK by running:

    gcloud components update