I want to create and manipulate large arrays of (4 byte) integers in-memory. By large, I mean on the order of hundreds of millions of elements. Each cell in the array will act as a counter for a position on a chromosome. All I need is for it to fit in memory, and have fast (O(1)) access to elements. The thing I'm counting is not a sparse feature, so I can't use a sparse array.
I can't do this with a regular Perl list, because Perl (at least on my machine) uses 64 bytes per element, so the genomes of most of the organisms I work with are just too big. I've tried storing the data on disk via SQLite and hash tying, and though they work, they are very slow, especially on ordinary drives. (They work reasonably well when I run on a 4-drive RAID 0.)
I thought I could use PDL arrays, because PDL stores its arrays just as C does, using only 4 bytes per element. However, I found the update speed to be excruciatingly slow compared to Perl lists:
use PDL;
use Benchmark qw/cmpthese/;
my $N = 1_000_000;
my @perl = (0 .. $N - 1);
my $pdl = zeroes $N;
cmpthese(-1,{
perl => sub{
$perl[int(rand($N))]++;
},
pdl => sub{
# note that I'm not even incrementing here just setting to 1
$pdl->set(int(rand($N)), 1);
}
});
Returns:
         Rate  pdl perl
pdl   481208/s   -- -87%
perl 3640889/s 657%   --
Does anyone know how to increase PDL set() performance, or know of a different module that can accomplish this?
I cannot tell what sort of performance you will get, but I recommend using the vec function (documented in perlfunc), which treats a string as a packed array of fixed-width integer fields. I have experimented and found that my Perl will tolerate a string up to 500_000_000 characters long, which corresponds to 125,000,000 32-bit values.
my $data = "\0" x 500_000_000;
vec($data, 0, 32)++; # Increment data[0]
vec($data, 100_000_000, 32)++; # Increment data[100_000_000]
If this isn't enough, there may be something in the build of Perl that controls the limit. Alternatively, if you think you can get by with smaller fields, say 16-bit counts (which hold values up to 65,535), vec will accept field widths of any power of 2 up to 32.
Edit: I believe the string size limit is related to the 2GB maximum private working set on 32-bit Windows processes. If you are running Linux or have a 64-bit perl you may be luckier than me.
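As a rough sketch of how the vec approach could be wrapped for per-position counters: the helper names new_counters, inc_count and get_count below are my own inventions, not part of any module, and the array is kept tiny for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers: not from any module, just thin wrappers around vec.

# Allocate $n 32-bit counters, all zero, as one packed string (4 bytes each).
sub new_counters {
    my ($n) = @_;
    return "\0" x ($n * 4);
}

# Increment counter $i in place. The string is passed by reference
# so the (potentially huge) buffer is never copied.
sub inc_count {
    my ($buf, $i) = @_;
    vec($$buf, $i, 32)++;
    return;
}

# Read counter $i.
sub get_count {
    my ($buf, $i) = @_;
    return vec($$buf, $i, 32);
}

my $counts = new_counters(1_000);    # small size, for illustration only
inc_count(\$counts, 42) for 1 .. 3;
printf "%d %d %d\n",
    get_count(\$counts, 42), get_count(\$counts, 41), length $counts;
```

The subroutine calls add overhead of their own, so in a hot loop it is probably better to call vec inline.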
I have added a vec case to your benchmark program like this:
my $vec = "\0" x ($N * 4);
cmpthese(-3,{
perl => sub{
$perl[int(rand($N))]++;
},
pdl => sub{
# note that I'm not even incrementing here just setting to 1
$pdl->set(int(rand($N)), 1);
},
vec => sub {
vec($vec, int(rand($N)), 32)++;
},
});
giving these results
         Rate  pdl  vec perl
pdl   472429/s   -- -76% -85%
vec  1993101/s 322%   -- -37%
perl 3157570/s 568%  58%   --
so using vec is roughly two-thirds the speed of a native array (1,993,101/s versus 3,157,570/s). Presumably that's acceptable.
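If the counts later need to be read out in bulk, one option is unpack. This is my suggestion and was not part of the benchmark above. perlfunc documents that vec with 16-bit or 32-bit fields stores values in big-endian order, matching the pack formats n and N, so a slice of the string can be decoded in one call. A small sketch, with a made-up array size:

```perl
use strict;
use warnings;

my $n   = 10;                  # tiny size, for illustration only
my $vec = "\0" x ($n * 4);     # ten 32-bit counters, all zero

vec($vec, 2, 32) += 5;
vec($vec, 7, 32)++;

# Decode every counter in one call. "N" is a 32-bit big-endian unsigned
# integer, which is the layout vec uses for 32-bit fields.
my @all = unpack "N*", $vec;

# Decode only counters 5..8 by slicing the underlying bytes first.
my @window = unpack "N*", substr($vec, 5 * 4, 4 * 4);

print "@all\n";
print "@window\n";
```

For a genome-sized string it would make sense to decode a window at a time, since unpacking everything into a Perl list brings back the per-element memory overhead the packed string was meant to avoid.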