I am working on a Perl project that involves building a hash with about 17 million keys. This is too big to be stored in memory (my laptop's memory will only hold about 10 million keys). I know that the solution is to store the data on disk, but I'm having trouble executing this in practice. Here's what I've tried:
DB_File
use strict;
use DB_File;
my $libfile = shift;
my %library;
tie %library, "DB_File", $libfile;
for (my $a = 1; $a < 17000000; $a++) {
    # Some code to generate key and value #
    $library{$key} = $value;
}
This gives me "Segmentation fault: 11" partway through the loop, for reasons I don't understand.
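For reference, a variant of the same attempt with explicit open flags, a check that the tie actually succeeded, and cache/size hints for the underlying Berkeley DB would look roughly like this (the cache size, file mode, and placeholder key/value generation are illustrative only, not something I've verified at this scale):

use strict;
use warnings;
use Fcntl;
use DB_File;

my $libfile = shift;

# Tuning hints for the underlying Berkeley DB hash; both values are
# illustrative guesses.
$DB_HASH->{'cachesize'} = 100 * 1024 * 1024;   # bytes of in-memory cache
$DB_HASH->{'nelem'}     = 17_000_000;          # expected number of keys

my %library;
tie %library, "DB_File", $libfile, O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "Cannot tie $libfile: $!";

for (my $a = 1; $a < 17000000; $a++) {
    my ($key, $value) = ("key$a", "value$a");   # placeholder generation
    $library{$key} = $value;
}

untie %library;   # flush and close the file cleanly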
BerkeleyDB
use strict;
use BerkeleyDB;
my $libfile = shift;
my $library = BerkeleyDB::Hash->new(
    -Filename => $libfile,
    -Flags    => DB_CREATE,
);
for (my $a = 1; $a < 17000000; $a++) {
    # Some code to generate key and value #
    $library->db_put($key, $value);
}
This seems to work well for about the first 15 million keys, but then slows down dramatically and finally freezes completely near the end of the loop. I don't think this is a memory issue; if I break the loop into four pieces, put them in four separate programs, and run them sequentially (adding ~4 million records to the database each time), the first three complete successfully, but the fourth one hangs when the database has about 15 million keys. So it seems like maybe BerkeleyDB can only handle ~15 million keys in a hash???
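One thing that might be related to the slowdown: Berkeley DB's default cache is quite small, so once the working set of hash pages no longer fits in it, each insert can turn into random disk I/O. A hedged sketch with a larger -Cachesize and a -Nelem hint (the sizes and the placeholder key/value generation are made up, not tested at this scale):

use strict;
use warnings;
use BerkeleyDB;

my $libfile = shift;

# Larger cache plus a hint about the expected number of keys; 512 MB and
# 17 million are illustrative values only.
my $library = BerkeleyDB::Hash->new(
    -Filename  => $libfile,
    -Flags     => DB_CREATE,
    -Cachesize => 512 * 1024 * 1024,
    -Nelem     => 17_000_000,
) or die "Cannot open $libfile: $! $BerkeleyDB::Error";

for (my $a = 1; $a < 17000000; $a++) {
    my ($key, $value) = ("key$a", "value$a");   # placeholder generation
    $library->db_put($key, $value) == 0
        or die "db_put failed: $BerkeleyDB::Error";
}

$library->db_close();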
DBM::Deep
use strict;
use DBM::Deep;
my $libfile = shift;
my $library = DBM::Deep->new($libfile);
for (my $a = 1; $a < 17000000; $a++) {
    # Some code to generate key and value #
    $library->put($key => $value);
}
From preliminary tests this seems to work ok, but it's REALLY slow: about 5 seconds per thousand keys, or ~22 hours to run the whole loop. I'd prefer to avoid this if at all possible.
I'd be very grateful for suggestions on troubleshooting one of these packages, or ideas about other options for accomplishing the same thing.
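To give a concrete idea of the kind of "other option" I mean: a single key/value table in SQLite, bulk-loaded through DBI/DBD::SQLite inside large transactions, would be fine too. A rough, untested sketch (table name, batch size, and key/value generation are arbitrary):

use strict;
use warnings;
use DBI;

my $libfile = shift;

my $dbh = DBI->connect("dbi:SQLite:dbname=$libfile", "", "",
    { RaiseError => 1, AutoCommit => 0 });    # AutoCommit off: batch commits below

$dbh->do("CREATE TABLE IF NOT EXISTS library (k TEXT PRIMARY KEY, v TEXT)");

my $sth = $dbh->prepare("INSERT OR REPLACE INTO library (k, v) VALUES (?, ?)");

for (my $a = 1; $a < 17000000; $a++) {
    my ($key, $value) = ("key$a", "value$a");   # placeholder generation
    $sth->execute($key, $value);
    $dbh->commit if $a % 1_000_000 == 0;        # commit every million rows
}

$dbh->commit;
$dbh->disconnect;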
UPDATE
Switching to btree may improve performance for a HUGE BerkeleyDB when the keys are inserted and accessed in sorted order, because it reduces the number of disk I/O operations required.
Case study: in one case reported on news:comp.mail.sendmail, creation time for a HUGE BerkeleyDB dropped from a few hours for a hash to 20 minutes for a btree with "key sorted" appends. Even that was too long, so the person switched to software that could query the SQL database directly, avoiding the need to dump the SQL database into BerkeleyDB at all. (virtusertable, sendmail -> postfix)
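A minimal sketch of what the btree variant might look like with the BerkeleyDB module, assuming the keys can be generated (or pre-sorted) in ascending order; the cache size and key format are placeholders:

use strict;
use warnings;
use BerkeleyDB;

my $libfile = shift;

my $library = BerkeleyDB::Btree->new(
    -Filename  => $libfile,
    -Flags     => DB_CREATE,
    -Cachesize => 256 * 1024 * 1024,   # illustrative
) or die "Cannot open $libfile: $! $BerkeleyDB::Error";

# Appending keys in sorted order keeps btree page splits and random I/O low.
for (my $a = 1; $a < 17000000; $a++) {
    my $key   = sprintf("key%08d", $a);   # placeholder: generated in sorted order
    my $value = "value$a";                # placeholder
    $library->db_put($key, $value) == 0
        or die "db_put failed: $BerkeleyDB::Error";
}

$library->db_close();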