Tags: arrays, perl, pdl

C-like arrays in Perl


I want to create and manipulate large arrays of 4-byte integers in memory. By large, I mean on the order of hundreds of millions of elements. Each cell in the array will act as a counter for a position on a chromosome. All I need is for it to fit in memory and to have fast, O(1) access to elements. The feature I'm counting is not sparse, so I can't use a sparse array.

I can't do this with a regular Perl array, because perl (at least on my machine) uses 64 bytes per element, so the genomes of most of the organisms I work with are simply too big. I've tried storing the data on disk via SQLite and hash tying, and though they work, they are very slow, especially on ordinary drives. (They work reasonably well when I run on a 4-drive RAID 0.)

I thought I could use PDL arrays, because PDL stores its arrays just as C does, using only 4 bytes per element. However, I found update speed to be excruciatingly slow compared to Perl arrays:

use PDL;
use Benchmark qw/cmpthese/;

my $N = 1_000_000;
my @perl = (0 .. $N - 1);
my $pdl = zeroes $N;  # note: zeroes() defaults to double; use "zeroes long, $N" for 4-byte ints

cmpthese(-1,{ 
    perl => sub{
        $perl[int(rand($N))]++;
    },
    pdl => sub{
        # note that I'm not even incrementing here just setting to 1
        $pdl->set(int(rand($N)), 1);
    }
});

Returns:

          Rate  pdl perl
pdl   481208/s   -- -87%
perl 3640889/s 657%   --    

Does anyone know how to increase pdl set() performance, or know of a different module that can accomplish this?


Solution

  • I cannot tell what sort of performance you will get, but I recommend using the vec function, documented in perlfunc, to split a string into bit fields. I have experimented and found that my Perl will tolerate a string up to 500_000_000 characters long, which corresponds to 125,000,000 32-bit values.

    my $data = "\0" x 500_000_000;
    vec($data, 0, 32)++;            # Increment data[0]
    vec($data, 100_000_000, 32)++;  # Increment data[100_000_000]
    my $count = vec($data, 0, 32);  # Read data[0] back (now 1)
    

    If this isn't enough, there may be something in the build of Perl that controls the limit. Alternatively, if you think you can get away with smaller fields, say 16-bit counts, vec will accept field widths of any power of two up to 32.
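    To illustrate the 16-bit option: halving the field width halves the memory, but vec fields are unsigned and wrap around silently at 65535, so it only works if your counts stay small. A minimal sketch (sizes here are made up for illustration):

    ```perl
    use strict;
    use warnings;

    my $n    = 10;                  # number of counters (hypothetical size)
    my $data = "\0" x ($n * 2);     # 2 bytes per 16-bit field

    vec($data, 3, 16) += 5;         # data[3] += 5
    vec($data, 3, 16)++;            # data[3]++
    print vec($data, 3, 16), "\n";  # prints 6

    # The caveat: 16-bit fields wrap around, 65535 + 1 == 0
    vec($data, 0, 16) = 65535;
    vec($data, 0, 16)++;
    print vec($data, 0, 16), "\n";  # prints 0
    ```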

    Edit: I believe the string size limit is related to the 2GB maximum private working set of a 32-bit Windows process. If you are running Linux or have a 64-bit Perl you may be luckier than me.
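    You can check whether your perl was built with 64-bit integers and pointers (and so can allocate strings well past 2GB, subject to RAM) using the core Config module; a quick sketch:

    ```perl
    use strict;
    use warnings;
    use Config;

    # Report the sizes this perl was built with. A 64-bit build
    # (ivsize and ptrsize of 8) is not limited to 2GB strings.
    printf "integer size: %d bytes\n", $Config{ivsize};
    printf "pointer size: %d bytes\n", $Config{ptrsize};
    print  "this looks like a 64-bit perl\n" if $Config{ptrsize} >= 8;
    ```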


    I have extended your benchmark program like this:

    my $vec = "\0" x ($N * 4);
    
    cmpthese(-3,{ 
        perl => sub{
            $perl[int(rand($N))]++;
        },
        pdl => sub{
            # note that I'm not even incrementing here just setting to 1
            $pdl->set(int(rand($N)), 1);
        },
        vec => sub {
            vec($vec, int(rand($N)), 32)++; 
        },
    });
    

    giving these results

              Rate  pdl  vec perl
    pdl   472429/s   -- -76% -85%
    vec  1993101/s 322%   -- -37%
    perl 3157570/s 568%  58%   --
    

    So using vec runs at about two-thirds the speed of a native array. Presumably that's acceptable.
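    Putting the pieces together, here is a minimal counter sketch along these lines (chromosome length and observed positions are made up for illustration). One convenient property, per perlfunc: 32-bit vec fields are laid out in big-endian byte order, so unpack "N*" can bulk-extract all the counts in one pass.

    ```perl
    use strict;
    use warnings;

    # Hypothetical sizes; a real genome would use hundreds of
    # millions of positions.
    my $len  = 1_000;               # positions on the chromosome
    my $data = "\0" x ($len * 4);   # one 32-bit counter per position

    # Tally some (made-up) observed positions.
    for my $pos (42, 42, 42, 999) {
        vec($data, $pos, 32)++;
    }

    print vec($data, 42, 32), "\n";   # prints 3
    print vec($data, 999, 32), "\n";  # prints 1

    # 32-bit vec fields are stored big-endian, so "N*" (unsigned
    # 32-bit big-endian) recovers the whole array of counts.
    my @counts = unpack "N*", $data;
    print scalar(@counts), "\n";      # prints 1000
    ```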