I thought I'd try using mmap
to search a multi-gigabyte file without running out of memory. I tested on a file that did actually fit in memory: the File::Slurp
version took less than a minute, but the File::Map
version was still running after many minutes, so I killed it.
I tested smaller files and found that the File::Map
version got progressively slower as file size increased (2x size => 4x time), while the File::Slurp
version scaled linearly (2x size => 2x time, i.e. roughly constant time per byte).
Am I not using the module correctly, or does File::Map
always get slow on large files?
for n in 1 4 16 32 64 128 256 512 4096; do
seq $n | xargs -I@ seq 100000 > data
ls -l data
time perl -MFile::Slurp -e '
$s = read_file("data");
$re = qr/^(99999|12345|4325|11111|50000)$/m;
while ($s =~ m/$re/g){ ++$matches }
print $matches;
'
time perl -MFile::Map=:all -e '
map_file $s, "data";
advise $s, "sequential";
$re = qr/^(99999|12345|4325|11111|50000)$/m;
while ($s =~ m/$re/g){ ++$matches }
print $matches;
'
done
(sizes are in bytes; all times are CPU seconds)

n | size | matches | usr(slurp) | usr(slurp)/n | sys(slurp) | sys(slurp)/n | usr(map) | usr(map)/n | sys(map) | sys(map)/n |
---|---|---|---|---|---|---|---|---|---|---|
1 | 588895 | 5 | 0.033 | 0.033 | 0.007 | 0.007 | 0.014 | 0.014 | 0.001 | 0.001 |
4 | 2355580 | 20 | 0.051 | 0.013 | 0.007 | 0.002 | 0.032 | 0.008 | 0.005 | 0.001 |
16 | 9422320 | 80 | 0.109 | 0.007 | 0.015 | 0.001 | 0.138 | 0.009 | 0.012 | 0.001 |
32 | 18844640 | 160 | 0.184 | 0.005 | 0.024 | 0.001 | 0.400 | 0.013 | 0.021 | 0.001 |
64 | 37689280 | 320 | 0.328 | 0.005 | 0.049 | 0.001 | 2.666 | 0.042 | 4.305 | 0.067 |
128 | 75378560 | 640 | 0.629 | 0.005 | 0.079 | 0.001 | 10.014 | 0.078 | 17.638 | 0.138 |
256 | 150757120 | 1280 | 1.220 | 0.005 | 0.162 | 0.001 | 40.237 | 0.157 | 73.829 | 0.288 |
512 | 301514240 | 2560 | 2.423 | 0.005 | 0.323 | 0.001 | 158.729 | 0.310 | 302.041 | 0.590 |
4096 | 2412113920 | 20480 | 19.468 | 0.005 | 2.424 | 0.001 | ? | ? | ? | ? |
Instead of manually building the table from the ls
and time
output, here's a Perl Benchmark
version, following @TLP's suggestion (warning output elided). It likewise indicates that File::Slurp
's cost per byte stays roughly constant while File::Map
gets disproportionately slower as the file grows:
#!/bin/bash
for n in 1 4 16 32 64 128 256 512; do
seq $n | xargs -I@ seq 100000 > data$n
done
perl -MBenchmark=cmpthese -MFile::Slurp -MFile::Map=:all -e '
@n = (1,4,16,32,64,128,256,512);
sub test_slurp {
my ($s,$re,$matches);
$s = read_file($f);
$re = qr/^(99999|12345|4325|11111|50000)$/m;
while ($s =~ m/$re/g){ ++$matches }
}
sub test_map {
my ($mm,$re,$matches);
map_file $mm, $f;
advise $mm, "sequential";
$re = qr/^(99999|12345|4325|11111|50000)$/m;
while ($mm =~ m/$re/g){ ++$matches }
}
for $n (@n) {
$f = "data$n";
cmpthese(-1, { "map($n)" => \&test_map, "slurp($n)" => \&test_slurp });
}
'
Rate map(1) slurp(1)
map(1) 198/s -- -1%
slurp(1) 200/s 1% --
Rate map(4) slurp(4)
map(4) 38.3/s -- -20%
slurp(4) 48.1/s 26% --
Rate map(16) slurp(16)
map(16) 6.60/s -- -48%
slurp(16) 12.6/s 91% --
Rate map(32) slurp(32)
map(32) 1.98/s -- -62%
slurp(32) 5.17/s 161% --
s/iter map(64) slurp(64)
map(64) 7.93 -- -96%
slurp(64) 0.350 2166% --
s/iter map(128) slurp(128)
map(128) 31.6 -- -98%
slurp(128) 0.730 4233% --
s/iter map(256) slurp(256)
map(256) 129 -- -99%
slurp(256) 1.55 8244% --
s/iter map(512) slurp(512)
map(512) 521 -- -99%
slurp(512) 2.82 18372% --
Perl's copy-on-write ("COW") mechanism can't be used for the memory-mapped string, since there's no unused space at the end of the string buffer in which to store the COW data. So when it comes time to make a copy of the scalar for $&
, the whole string buffer (the actual data) is copied. And this happens on every successful match (and possibly on unsuccessful matches too).
Take this program:
#!/usr/bin/perl
use strict;
use warnings;
use Devel::Peek qw( Dump );
use File::Map qw( map_file advise );
my $re = qr/a/;
my $s;
if ( $ARGV[0] == 0 ) {
$s = "aaaaa"; # case 0: a plain string
}
elsif ( $ARGV[0] == 1 ) {
map_file $s, "a"; # case 1: match directly against the mapped file (file "a" contains "aaaaa\n")
advise $s, "sequential";
}
else {
map_file my $t, "a"; # case 2: map the file...
advise $t, "sequential";
$s = $t; # ...but match against a copy of it
}
Dump( $s );
while ( $s =~ /$re/g ) {
Dump( $s );
last;
}
We first run without File::Map.
$ ./a.pl 0
SV = PV(0x58d5924eaee0) at 0x58d5925260e8
REFCNT = 1
FLAGS = (POK,IsCOW,pPOK) <--- COW flag is on.
PV = 0x58d592581d50 "aaaaa"\0
CUR = 5
LEN = 16
COW_REFCNT = 1 <--- We're sharing with the constant.
SV = PVMG(0x58d5925625c0) at 0x58d5925260e8
REFCNT = 1
FLAGS = (SMG,POK,IsCOW,pPOK) <--- COW flag still on.
IV = 0
NV = 0
PV = 0x58d592581d50 "aaaaa"\0
CUR = 5
LEN = 16
COW_REFCNT = 2 <--- A third scalar is now sharing it too.
MAGIC = 0x58d592581d70
MG_VIRTUAL = &PL_vtbl_mglob
MG_TYPE = PERL_MAGIC_regex_global(g)
MG_FLAGS = 0x40
BYTES
MG_LEN = 1
Since 5.20, Perl uses a copy-on-write ("COW") mechanism for strings, allowing scalars to share a string buffer. The string buffer is only copied when an attempt to change it occurs.
In the above output, we see that the string buffer becomes shared on a successful match. This is because a copy of the scalar being matched against was made for $&
, $1
, etc. to access. (A single copy of the scalar is made. $&
, $1
, etc. are all magic variables that access this one copy.)
Again, because of the COW mechanism, the scalar is copied but not the string buffer. So this is very efficient.
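(As an aside, the same sharing can be observed without the regex engine at all. Here's a minimal sketch, with illustrative strings and variable names, that uses Devel::Peek to show a plain assignment sharing a buffer and the first write forcing the real copy.)
#!/usr/bin/perl
use strict;
use warnings;
use Devel::Peek qw( Dump );
my $x = "aaaaa";         # shares its buffer with the "aaaaa" constant
my $y = $x;              # still no buffer copy; the scalars share one PV
Dump( $x );              # expect IsCOW and a non-zero COW_REFCNT
substr( $y, 0, 1, "b" ); # the first write to $y forces a real copy of the buffer
Dump( $y );              # $y now has its own buffer and IsCOW is gone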
Now, let's run using File::Map.
$ ./a.pl 1
SV = PVMG(0x6006cd285470) at 0x6006cd249178
REFCNT = 1
FLAGS = (SMG,RMG,POK,READONLY,pPOK)
IV = 0
NV = 0
PV = 0x7eb64e26e000 "aaaaa\n"
CUR = 6
LEN = 0
MAGIC = 0x6006cd2aa4b0
MG_VIRTUAL = 0x7eb64e278da0
MG_TYPE = PERL_MAGIC_ext(~)
MG_FLAGS = 0x30
DUP
LOCAL
MG_PTR = 0x6006cd2544f0 ""
SV = PVMG(0x6006cd285470) at 0x6006cd249178
REFCNT = 1
FLAGS = (SMG,RMG,POK,READONLY,pPOK) <--- COW flag isn't set.
IV = 0
NV = 0
PV = 0x7eb64e26e000 "aaaaa\n"
CUR = 6
LEN = 0
MAGIC = 0x6006cd2473b0
MG_VIRTUAL = &PL_vtbl_mglob
MG_TYPE = PERL_MAGIC_regex_global(g)
MG_FLAGS = 0x40
BYTES
MG_LEN = 1
MAGIC = 0x6006cd2aa4b0
MG_VIRTUAL = 0x7eb64e278da0
MG_TYPE = PERL_MAGIC_ext(~)
MG_FLAGS = 0x30
DUP
LOCAL
MG_PTR = 0x6006cd2544f0 ""
The COW mechanism stores its bookkeeping at the end of the string buffer, in allocated space that's not currently being used by the string. But there's no such spare space in a memory-mapped file (notice LEN = 0 in the dump above, versus LEN = 16 for the plain scalar). This prevents the COW mechanism from being used.
The copy of the scalar that's made for $&
, $1
, etc. to access must therefore copy the string buffer as well.
Finally, let's use File::Map, but pass a copy of the memory-mapped file to the regex engine.
$ ./a.pl 2
SV = PVMG(0x6475db959130) at 0x6475db8f0660
REFCNT = 1
FLAGS = (POK,pPOK)
IV = 0
NV = 0
PV = 0x6475db8bd770 "aaaaa\n"\0
CUR = 6
LEN = 16
SV = PVMG(0x6475db959130) at 0x6475db8f0660
REFCNT = 1
FLAGS = (SMG,POK,IsCOW,pPOK) <--- COW flag now on.
IV = 0
NV = 0
PV = 0x6475db8bd770 "aaaaa\n"\0
CUR = 6
LEN = 16
COW_REFCNT = 1 <--- An additional scalar now shares the string.
MAGIC = 0x6475db9757d0
MG_VIRTUAL = &PL_vtbl_mglob
MG_TYPE = PERL_MAGIC_regex_global(g)
MG_FLAGS = 0x40
BYTES
MG_LEN = 1
Some extra space was allocated when the copy was made, allowing the COW mechanism to work as intended.
This suggests that making a pre-emptive copy of the memory-mapped file would address the problem. Let's adjust your benchmark code to do that.
#!/bin/bash
for n in 1 4 16 32 64 128 256 512; do
seq $n | xargs -I@ seq 100000 > data$n
done
perl -MBenchmark=cmpthese -MFile::Slurp -MFile::Map=:all -e '
@n = (1,4,16,32,64,128,256,512);
sub test_slurp {
my ($s,$re,$matches);
$s = read_file($f);
$re = qr/^(99999|12345|4325|11111|50000)$/m;
while ($s =~ m/$re/g){ ++$matches }
}
sub test_map {
my ($mm,$re,$matches);
map_file $mm, $f;
advise $mm, "sequential";
$re = qr/^(99999|12345|4325|11111|50000)$/m;
while ($mm =~ m/$re/g){ ++$matches }
}
sub test_modified_map {
my ($mm,$re,$matches);
map_file $mm, $f;
advise $mm, "sequential";
my $s = $mm;
$re = qr/^(99999|12345|4325|11111|50000)$/m;
while ($s =~ m/$re/g){ ++$matches }
}
for $n (@n) {
$f = "data$n";
cmpthese(-1, {
"slurp($n)" => \&test_slurp,
"map($n)" => \&test_map,
"map($n)*" => \&test_modified_map,
});
print "\n";
}
'
Rate map(1) map(1)* slurp(1)
map(1) 146/s -- -3% -4%
map(1)* 150/s 3% -- -1%
slurp(1) 153/s 4% 1% --
Rate map(4) map(4)* slurp(4)
map(4) 29.0/s -- -10% -33%
map(4)* 32.1/s 11% -- -26%
slurp(4) 43.6/s 50% 36% --
Rate map(16) slurp(16) map(16)*
map(16) 8.57/s -- -22% -22%
slurp(16) 11.0/s 28% -- -0%
map(16)* 11.0/s 28% 0% --
Rate map(32) map(32)* slurp(32)
map(32) 3.60/s -- -35% -36%
map(32)* 5.50/s 53% -- -2%
slurp(32) 5.61/s 56% 2% --
(warning: too few iterations for a reliable count)
Rate map(64) map(64)* slurp(64)
map(64) 6.71e-02/s -- -98% -98%
map(64)* 3.25/s 4746% -- -1%
slurp(64) 3.28/s 4785% 1% --
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
Rate map(128) map(128)* slurp(128)
map(128) 1.56e-02/s -- -99% -99%
map(128)* 1.42/s 8984% -- -9%
slurp(128) 1.56/s 9906% 10% --
^C
Indeed, the problem is gone.
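Applied to the one-liner from the question, the pre-emptive copy would look something like this (a sketch based on the benchmark change above; note that the copy does bring the whole file into memory):
perl -MFile::Map=:all -e '
map_file $mm, "data";
advise $mm, "sequential";
$s = $mm;   # pre-emptive copy so the COW mechanism can be used for $& etc.
$re = qr/^(99999|12345|4325|11111|50000)$/m;
while ($s =~ m/$re/g){ ++$matches }
print $matches;
'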