amazon-s3 apache-spark jets3t parquet apache-spark-sql

EntityTooLarge error when uploading a 5G file to Amazon S3

Amazon S3 file size limit is supposed to be 5T according to this announcement, but I am getting the following error when uploading a 5G file

'/mahler%2Fparquet%2Fpageview%2Fall-2014-2000%2F_temporary%2F_attempt_201410112050_0009_r_000221_2222%2Fpart-r-222.parquet' XML Error Message: 
  <?xml version="1.0" encoding="UTF-8"?>
  <Error>
    <Code>EntityTooLarge</Code>
    <Message>Your proposed upload exceeds the maximum allowed size</Message>
    <ProposedSize>5374138340</ProposedSize>
    ...
    <MaxSizeAllowed>5368709120</MaxSizeAllowed>
  </Error>

This makes it seem like S3 is only accepting 5G uploads. I am using Apache Spark SQL to write out a Parquet data set using SchemRDD.saveAsParquetFile method. The full stack trace is

org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed for '/mahler%2Fparquet%2Fpageview%2Fall-2014-2000%2F_temporary%2F_attempt_201410112050_0009_r_000221_2222%2Fpart-r-222.parquet' XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>EntityTooLarge</Code><Message>Your proposed upload exceeds the maximum allowed size</Message><ProposedSize>5374138340</ProposedSize><RequestId>20A38B479FFED879</RequestId><HostId>KxeGsPreQ0hO7mm7DTcGLiN7vi7nqT3Z6p2Nbx1aLULSEzp6X5Iu8Kj6qM7Whm56ciJ7uDEeNn4=</HostId><MaxSizeAllowed>5368709120</MaxSizeAllowed></Error>
        org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.storeFile(Jets3tNativeFileSystemStore.java:82)
        sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        java.lang.reflect.Method.invoke(Method.java:606)
        org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        org.apache.hadoop.fs.s3native.$Proxy10.storeFile(Unknown Source)
        org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.close(NativeS3FileSystem.java:174)
        org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
        org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
        parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:321)
        parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:111)
        parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
        org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:305)
        org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
        org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:745)

Is the upload limit still 5T? If it is why am I getting this error and how do I fix it?

Solution

The object size is limited to 5 TB. The upload size is still 5 GB, as explained in the manual:

Depending on the size of the data you are uploading, Amazon S3 offers the following options:

Upload objects in a single operation—With a single PUT operation you can upload objects up to 5 GB in size.

Upload objects in parts—Using the Multipart upload API you can upload large objects, up to 5 TB.

http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html

Once you do a multipart upload, S3 validates and recombines the parts, and you then have a single object in S3, up to 5TB in size, that can be downloaded as a single entitity, with a single HTTP GET request... but uploading is potentially much faster, even on files smaller than 5GB, since you can upload the parts in parallel and even retry the uploads of any parts that didn't succeed on first attempt.