perl optimization sqlite berkeley-db

SQLite Optimization for Millions of Entries?


I'm trying to tackle a problem by using a SQLite database and Perl modules. In the end, there will be tens of millions of entries I need to log. The only unique identifier for each item is a text string for the URL. I'm thinking of doing this in two ways:

Way #1: Have three tables: Good, Bad, and Unsorted. (I need to check the HTML and decide whether I want it.) Say we have 1 billion pages total, about 333 million URLs in each table. When I have a new URL to add, I need to check whether it's in any of the three tables, and add it to Unsorted if it isn't already there. I would also be moving a lot of rows around with this option.

Way #2: Have two tables, Master and Good. Master has all 1 billion page URLs, and Good has the 333 million that I want. For a new URL I need to do the same check, except this time I am only querying one table, and I would never delete a row from Master, only add the data to Good.

So basically, I need to know the best setup to quickly query a huge SQLite database to see if a text string of ~20 characters is already present, and add it if it isn't.
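For illustration, the check-then-add step could be sketched like this. It is a minimal sketch, assuming the DBI and DBD::SQLite modules are installed; only the Master table from Way #2 is shown, and the table name `master`, the column name `url`, and the helpers `open_url_db` and `add_url_if_new` are made up for the example. Declaring the URL as the primary key lets SQLite do the uniqueness check through its index, and `INSERT OR IGNORE` folds the lookup and the insert into one statement.

```perl
use strict;
use warnings;
use DBI;

# Open (or create) the database. ":memory:" keeps this demo self-contained;
# a real run would pass a file path instead.
sub open_url_db {
    my ($path) = @_;
    my $dbh = DBI->connect("dbi:SQLite:dbname=$path", "", "",
        { RaiseError => 1, AutoCommit => 1 });
    # PRIMARY KEY on url creates an index, so the duplicate check is an
    # index lookup rather than a full table scan.
    $dbh->do("CREATE TABLE IF NOT EXISTS master (url TEXT PRIMARY KEY)");
    return $dbh;
}

# Returns true if the URL was new and got inserted, false if it was
# already present (the insert is then silently skipped).
sub add_url_if_new {
    my ($dbh, $url) = @_;
    my $rows = $dbh->do("INSERT OR IGNORE INTO master (url) VALUES (?)",
        undef, $url);
    return $rows == 1;    # do() returns "0E0" when no row was inserted
}

my $dbh = open_url_db(":memory:");
for my $url ("http://example.com/a", "http://example.com/b", "http://example.com/a") {
    printf "%s: %s\n", $url, add_url_if_new($dbh, $url) ? "added" : "already known";
}
```

When adding many URLs at once, wrapping a batch of inserts in a single transaction (`$dbh->begin_work` ... `$dbh->commit`) usually helps a lot, since SQLite otherwise commits after every statement.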

Edit: I'm now trying to get Berkeley DB to work using the Perl module, but no dice. Here's what I have:

use BerkeleyDB;

$dbFolder = 'C:\somedirectory';
my $env = BerkeleyDB::Env->new ( -Home => $dbFolder );

my $db  = BerkeleyDB::Hash->new (
-Filename => "fred.db", 
-Env => $env );
my $status = $db->db_put("apple", "red");

And when I run this, I get the following:

Can't call method "db_put" on an undefined value at C:\Directory\perlfile.pl line 42, <STDIN> line 1.

Solution

  • If $db is undefined, opening the database is failing, and you should inspect $! and $BerkeleyDB::Error to see why.

    Have you created the database already? If not, you need -Flags => DB_CREATE.

    Working example:

    use strict;
    use warnings;
    use BerkeleyDB;
    
    my $dbFolder = '/home/ysth/bdbtmp/';
    
    my $db  = BerkeleyDB::Hash->new (
        -Filename => "$dbFolder/fred.db", 
        -Flags => DB_CREATE,
    ) or die "couldn't create: $!, $BerkeleyDB::Error.\n";
    
    my $status = $db->db_put("apple", "red");
    

    I couldn't get BerkeleyDB::Env to do anything useful, though; whatever I tried, the constructor returned undef.
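One thing that may be worth trying for the environment, offered as an assumption rather than something verified against the setup above: the BerkeleyDB module's documentation opens environments with `-Flags => DB_CREATE | DB_INIT_MPOOL`, and without `DB_CREATE` the constructor can only join an environment that already exists. A minimal sketch along those lines, using a temporary directory as a stand-in for the real home folder (which must already exist):

```perl
use strict;
use warnings;
use BerkeleyDB;
use File::Temp qw(tempdir);

# Stand-in for the real home directory; Berkeley DB will not create it.
my $dbFolder = tempdir(CLEANUP => 1);

# DB_CREATE lets the environment be created if it is not there yet;
# DB_INIT_MPOOL initialises the shared memory pool the database needs.
my $env = BerkeleyDB::Env->new(
    -Home  => $dbFolder,
    -Flags => DB_CREATE | DB_INIT_MPOOL,
) or die "couldn't open environment: $!, $BerkeleyDB::Error\n";

# With -Env set, the filename is resolved relative to the environment home.
my $db = BerkeleyDB::Hash->new(
    -Filename => "fred.db",
    -Env      => $env,
    -Flags    => DB_CREATE,
) or die "couldn't create database: $!, $BerkeleyDB::Error\n";

# db_put and db_get return 0 on success.
my $status = $db->db_put("apple", "red");
die "db_put failed: $status\n" if $status != 0;

my $colour;
$db->db_get("apple", $colour) == 0
    or die "db_get failed: $BerkeleyDB::Error\n";
print "apple is $colour\n";
```

If the constructor still returns undef, `$BerkeleyDB::Error` should say why.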