I am trying to read a .gz file from a third-party AWS S3 bucket. I need to process the data in the file and upload the file to our own S3 bucket.
For reading the file, I am creating a read stream from S3.getObject as shown below:
const fileStream = externalS3.getObject({Bucket: <bucket-name>, Key: <key>}).createReadStream();
To make the code more efficient, I am planning to use the same fileStream for both processing the contents and uploading to our own S3 bucket. The code below does not upload the file to the internal S3 bucket:
import stream from "stream";
import { createGunzip } from "zlib";
import JSONStream from "JSONStream";
const uploadStream = fileStream.pipe(new stream.PassThrough());
const readStream = fileStream.pipe(new stream.PassThrough());
await internalS3.upload({Bucket:<bucket-name>, Key: <key>, Body: uploadStream})
.on("httpUploadProgress", progress => {console.log(progress)})
.on("error", error => {console.log(error)})
.promise();
readStream.pipe(createGunzip())
.on("error", err =>{console.log(err)})
.pipe(JSONStream.parse())
.on("data", data => {console.log(data)});
However, the code below successfully uploads the file to the internal S3 bucket:
const uploadStream = fileStream.pipe(new stream.PassThrough());
await internalS3.upload({Bucket:<bucket-name>, Key: <key>, Body: uploadStream})
.on("httpUploadProgress", progress => {console.log(progress)})
.on("error", error => {console.log(error)})
.promise();
What am I doing wrong here?
NOTE: If I use separate fileStreams to upload and read data, it works fine. However, I need to achieve this using the same fileStream.
The files you are trying to upload to S3 are relatively large (~1 GB), as mentioned by the OP. Two streams are created here by piping the single fileStream:
const uploadStream = fileStream.pipe(new stream.PassThrough());
const readStream = fileStream.pipe(new stream.PassThrough());
While the operations on readStream are less time consuming, uploadStream is responsible for uploading the file to a remote location, in this case S3, over a network, which takes relatively more time. This also means that readStream is pulling data from fileStream at a higher rate. By the time readStream has finished, fileStream is already consumed, and the .upload call to the aws-sdk hangs. See this issue.
You can fix this by using this library to synchronise the two streams. An example of how to achieve that can be found here.
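For illustration, here is a minimal sketch of the underlying idea, assuming the same fileStream, internalS3 client, and placeholder bucket/key as in the question: start consuming both PassThrough branches before awaiting either one, so that neither branch's buffer fills up and stalls the shared source. This is one way to keep the two consumers in sync; it is not necessarily the exact approach taken by the linked library.

import stream from "stream";
import { createGunzip } from "zlib";
import JSONStream from "JSONStream";

// Fork the single source into two branches, as in the question.
const uploadStream = fileStream.pipe(new stream.PassThrough());
const readStream = fileStream.pipe(new stream.PassThrough());

// Start the upload but do NOT await it yet, so the processing
// branch below is wired up while the upload is still in flight.
const uploadPromise = internalS3
    .upload({Bucket: <bucket-name>, Key: <key>, Body: uploadStream})
    .promise();

// Wrap the gunzip + JSON parsing pipeline in a promise so it can
// be awaited together with the upload.
const processPromise = new Promise((resolve, reject) => {
    readStream
        .pipe(createGunzip())
        .on("error", reject)
        .pipe(JSONStream.parse())
        .on("data", data => console.log(data))
        .on("end", resolve)
        .on("error", reject);
});

// Both branches now drain concurrently, so neither one leaves the
// shared fileStream paused indefinitely waiting for the other.
await Promise.all([uploadPromise, processPromise]);

Because both consumers are attached and draining before anything is awaited, backpressure from the slower S3 upload simply slows the shared fileStream down instead of deadlocking it.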