I've been trying to transfer a large file using LWP (or a web service API that depends on LWP), and no matter how I approach it, the process falls apart at a certain point. On a whim, I watched top while my script ran and noticed that memory usage balloons to over 40GB right before things start failing.
I thought the issue was the S3 APIs I tried initially, so I decided to use LWP::UserAgent to talk to the server myself. Unfortunately the issue remains with plain LWP: memory usage still balloons, and while the script runs longer before failing, it got about halfway through the transfer and then died with a segmentation fault.
Simply reading the file I want to transfer in segments works just fine and never takes memory usage above 1.4GB:
use POSIX qw(ceil);

my $filename = "/backup/2022-12-13/accounts/backup.tar.gz";
my $size = -s $filename;
my $chunkSize = (1024*1024*100);
my $parts = ceil($size / $chunkSize);

# open 9.6 GB file
open(my $file, '<', $filename) or die("Error reading file, stopped");
binmode($file);

for (my $i = 0; $i <= $parts; $i++) {
    my $chunk;
    my $offset = $i * $chunkSize + 1;
    read($file, $chunk, $chunkSize, $offset);

    # Code to do what I need to do with the chunk goes here.
    sleep(5);

    print STDOUT "Uploaded $i of $parts.\n";
}
However, adding in the LWP code suddenly raises the memory usage significantly and, as I said, eventually ends in a segmentation fault (at 55% of the transfer). Here's a minimal, complete, reproducible example:
use POSIX;
use HTTP::Request::Common;
use LWP::UserAgent;
use Amazon::S3;
use Net::Amazon::Signature::V4;

my $awsSignature = Net::Amazon::Signature::V4->new( $config{'access_key_id'}, $config{'access_key'}, 'us-east-1', 's3' );

# Get Upload ID from Amazon.
our $simpleS3 = Amazon::S3->new({
    aws_access_key_id     => $config{'access_key_id'},
    aws_secret_access_key => $config{'access_key'},
    retry                 => 1
});
my $bucket   = $simpleS3->bucket($bucketName);
my $uploadId = $bucket->initiate_multipart_upload('somebigobject');

my $filename = "/backup/2022-12-13/accounts/backup.tar.gz";
my $size = -s $filename;
my $chunkSize = (1024*1024*100);
my $parts = ceil($size / $chunkSize);
my %partList;

# open 9.6 GB file
open(my $file, '<', $filename) or die("Error reading file, stopped");
binmode($file);

for (my $i = 0; $i <= $parts; $i++) {
    my $chunk;
    my $offset = $i * $chunkSize + 1;
    read($file, $chunk, $chunkSize, $offset);

    # Code to do what I need to do with the chunk goes here.
    my $request = HTTP::Request::Common::PUT("https://bucket.s3.us-east-1.amazonaws.com/somebigobject?partNumber=" . ($i + 1) . "&uploadId=" . $uploadId);
    $request->header('Content-Length' => length($chunk));
    $request->content($chunk);
    my $signed_request = $awsSignature->sign( $request );

    my $ua = LWP::UserAgent->new();
    my $response = $ua->request($signed_request);
    my $etag = $response->header('Etag');

    # Try to make sure nothing lingers after this loop ends.
    $signed_request = '';
    $request = '';
    $response = '';
    $ua = '';

    ($partList{$i + 1}) = $etag =~ m#^"(.*?)"$#;

    print STDOUT "Uploaded $i of $parts.\n";
}
The same issue occurs -- just even sooner in the process -- if I use Paws::S3, Net::Amazon::S3::Client or Amazon::S3. It appears each chunk somehow stays in memory. As the code progresses I can see a gradual but significant increase in memory usage until it hits that wall at around 40GB. Here's the bit that replaces sleep(5) in the real-world code:
$partList{$i + 1} = $bucket->upload_part_of_multipart_upload('some-big-object', $uploadId, $i + 1, $chunk);
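One quick way to see whether the chunks themselves are growing is to log each chunk's length right before that call. This is only a diagnostic sketch, not part of the real code:

# Diagnostic only: report how large each chunk actually is before uploading.
# If chunks much larger than $chunkSize show up here, the growth is in the
# buffer itself rather than in LWP or the S3 client.
printf STDERR "part %d: chunk is %d bytes (expected at most %d)\n",
    $i + 1, length($chunk), $chunkSize;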
The final code that fails because it uses so much memory:
use POSIX;
use Amazon::S3;

our $simpleS3 = Amazon::S3->new({
    aws_access_key_id     => $config{'access_key_id'},
    aws_secret_access_key => $config{'access_key'},
    retry                 => 1
});
my $bucket = $simpleS3->bucket($bucketName);

my $filename = "/backup/2022-12-13/accounts/backup.tar.gz";
my $size = -s $filename;
my $chunkSize = (1024*1024*100);
my $parts = ceil($size / $chunkSize);
my %partList;
my $uploadId = $bucket->initiate_multipart_upload('some-big-object');

# open 9.6 GB file
open(my $file, '<', $filename) or die("Error reading file, stopped");
binmode($file);

for (my $i = 0; $i <= $parts; $i++) {
    my $chunk;
    my $offset = $i * $chunkSize + 1;
    read($file, $chunk, $chunkSize, $offset);

    # Code to do what I need to do with the chunk goes here.
    $partList{$i + 1} = $bucket->upload_part_of_multipart_upload('some-big-object', $uploadId, $i + 1, $chunk);

    print STDOUT "Uploaded $i of $parts.\n";
}
The problem wasn't actually LWP or the S3 API, but a stupid error in how I was reading the file. I was using read($file, $chunk, $chunkSize, $offset); thinking that $offset was an offset into the file. It isn't: the fourth argument to read is an offset into the target scalar, so Perl was padding $chunk with that many filler ("\0") bytes before appending the data it read. That produced chunks that kept growing until the process finally crashed. Instead, the code needs to be:
seek($file, $offset, 0);
read($file, $chunk, $chunkSize);
This produces the expected chunk size. (Note that seek offsets are counted from byte 0, so $offset should really be $i * $chunkSize, without the + 1.)
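To make the difference concrete, here is a minimal sketch showing that the fourth argument to read pads the target scalar rather than moving the file position. The file path is just a hypothetical small file for demonstration; any readable file works:

use strict;
use warnings;

my $path = '/etc/hostname';    # hypothetical small file, only for the demo
open(my $fh, '<', $path) or die "Cannot open $path: $!";
binmode($fh);

# OFFSET form: pads $padded with "\0" bytes up to position 1000 and then
# appends the bytes read, so length($padded) is 1000 plus the bytes read.
my $padded = '';
read($fh, $padded, 10, 1000);
print "with OFFSET: length is ", length($padded), " bytes\n";

# seek + plain read: the scalar contains only the bytes actually read.
seek($fh, 0, 0);
my $plain;
read($fh, $plain, 10);
print "with seek(): length is ", length($plain), " bytes\n";

close($fh);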