javascript · node.js · amazon-web-services · amazon-s3 · nodejs-stream

Use Read Stream from AWS S3 getObject to read and upload to a different bucket


I am trying to read a file in .gz format from a third-party AWS S3 bucket. I need to process the data in the file and upload the file to our own S3 bucket.

For reading the file, I am creating a read stream from S3.getObject as shown below:

const fileStream = externalS3.getObject({Bucket: <bucket-name>, Key: <key>}).createReadStream();

To make the code more efficient, I plan to use the same fileStream both for processing the contents and for uploading to our own S3. With the code below, the file is not uploaded to the internal S3 bucket.

import stream from "stream";
import { createGunzip } from "zlib";
import JSONStream from "JSONStream";

const uploadStream = fileStream.pipe(new stream.PassThrough());
const readStream = fileStream.pipe(new stream.PassThrough());

await internalS3.upload({Bucket:<bucket-name>, Key: <key>, Body: uploadStream})
.on("httpUploadProgress", progress => {console.log(progress)})
.on("error", error => {console.log(error)})
.promise();

readStream.pipe(createGunzip())
.on("error", err =>{console.log(err)})
.pipe(JSONStream.parse())
.on("data", data => {console.log(data)});

However, the code below successfully uploads the file to the internal S3 bucket.

const uploadStream = fileStream.pipe(new stream.PassThrough());


await internalS3.upload({Bucket:<bucket-name>, Key: <key>, Body: uploadStream})
.on("httpUploadProgress", progress => {console.log(progress)})
.on("error", error => {console.log(error)})
.promise();

What am I doing wrong here?

NOTE: If I use separate fileStreams to upload and read data, it works fine. However, I need to achieve this using the same fileStream.


Solution

  • As mentioned by the OP, the files being uploaded to S3 are relatively large (~1 GB). Two streams are created here by piping the single fileStream:

    const uploadStream = fileStream.pipe(new stream.PassThrough());
    const readStream = fileStream.pipe(new stream.PassThrough());
    

    While the operations on readStream are less time consuming, uploadStream is responsible for uploading the file to a remote location (S3 in this case) over a network, which takes considerably longer. This also means that readStream pulls/requests data from fileStream at a higher rate. By the time readStream has finished, fileStream has already been fully consumed, and the .upload call to the AWS SDK hangs. See this issue.

    You can fix this by using this library to synchronise the two streams; an example of how to achieve that can be found here. A minimal sketch of the underlying idea is shown below.
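
    As a rough sketch of that idea (this is not the linked library's API; externalS3 and internalS3 are assumed to be AWS SDK v2 S3 clients as in the question, and the bucket/key placeholders are kept from the question): start consuming both PassThrough branches before awaiting either one, so that neither branch's internal buffer fills up and stalls the shared fileStream.

    import stream from "stream";
    import { createGunzip } from "zlib";
    import JSONStream from "JSONStream";

    const fileStream = externalS3
      .getObject({ Bucket: <bucket-name>, Key: <key> })
      .createReadStream();

    const uploadStream = fileStream.pipe(new stream.PassThrough());
    const readStream = fileStream.pipe(new stream.PassThrough());

    // Kick off the upload, but do not await it yet.
    const uploadPromise = internalS3
      .upload({ Bucket: <bucket-name>, Key: <key>, Body: uploadStream })
      .promise();

    // Start draining the processing branch immediately, so it cannot
    // back-pressure fileStream while the upload is still in flight.
    const processPromise = new Promise((resolve, reject) => {
      readStream
        .pipe(createGunzip())
        .on("error", reject)
        .pipe(JSONStream.parse())
        .on("data", data => console.log(data))
        .on("end", resolve)
        .on("error", reject);
    });

    // Wait for both consumers to finish together.
    await Promise.all([uploadPromise, processPromise]);

    With both branches being drained concurrently, pipe backpressure throttles fileStream to the pace of the slower consumer instead of deadlocking it.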