Perl: Create index for tab file with two columns

I have got a huge tab-seperated file with up to 200 million rows (normally around 20 million) and two columns: the first column contains an ASCII word with up to 40 chars the second contains an integer.

I would like to do the following steps:

sort by first column
delete duplicate rows to make all rows unique
read out all rows for given entry in first column

I have a memory limit of 3 GB (so read all data into a hash won't work), unlimited hard disk space and want to run the script on a single core. I am intending to run several scripts in parallel, so the read and write operations on the hard disk shouldn't be to high.

How should proceed with the implementation of my script (in Perl) considering the size of the file?

Which algorithm do you recommend for the first step considering the size of the file?

Step 3 is the most complex part I think. How should I handle this? I am not familiar with indexing algorithms. Could you suggest one that is best for the problem? Are there any Perl modules that I could use?

Does it make sense to first convert the file into a binary file (Like converting SAM to BAM)? If yes, have you got any instructions or algorithms for converting and handling such files?

Solution

Reading the entire file into a SQLite database would be my first attempt.

Define the table like:

create table mytuples (
    mykey varchar(40),
    myval integer,
    constraint tuple_pk primary key(mykey, myval) on conflict ignore
);

A simple script using DBI which ignores insert errors should do it.

Untested, and error checks omitted

#!/usr/bin/env perl

use strict; use warnings;
use autodie;

use DBI;

my ($infile) = (@ARGV);

open my $in, '<', $infile;

my $dbh = DBI->connect('dbi:SQLite:some.db', undef, undef, {
        AutoCommit => 0,
        RaiseError => 0,
    },
);

while (my $line = <$in>) {
    my ($key, $val) = split ' ', $line;
    $dbh->do(q{INSERT INTO mytuples VALUES(?, ?)}, undef, $key, $val);
}

$dbh->commit;
$dbh->disconnect;

This may end up slower than sort and grep on the command line for the initial processing, but you may appreciating the flexibility of having SQL at your disposal.