I am trying to sort a large file with about a billion records (each containing four integers). The file size will grow beyond 50 GB.
I am testing my code with 400 million records (about a 6 GB file). My disk configuration looks like this:
disk=/var/tmp/stxxl,50G,syscall delete
My machine has 16 GB of RAM and 8 physical processors (Intel i7); the stxxl version is 1.4.1. If I run the code with 200 million records, it takes about 5 minutes. But when I run it with 400 million records, it seems to run out of disk space. My questions are:
1) Why is my code running out of disk space when sorting even a 6 GB file? Please review it (only a few important lines are attached).
2) Is 5 minutes a reasonable time for my PC to sort 200 million records? If it is, I wonder whether stxxl can sort 5 billion records within a day.
3) Do you think stxxl is a good choice for this kind of problem? I also have access to a cluster with MPI installed.
CODE (inspired by examples/algo/sort_file.cpp and examples/algo/phonebills.cpp):
size_t memory_to_use = (1 * 1024) * 1024 * 1024ul;  // 1 GiB of internal memory for the sort
typedef stxxl::vector<my_type, 1, stxxl::lru_pager<8>, block_size> vector_type;
vector_type v;  // external-memory vector that will hold all records
std::copy(std::istream_iterator<my_type>(in),
          std::istream_iterator<my_type>(),
          std::back_inserter(v));
stxxl::sort(v.begin(), v.end(), Cmp(), memory_to_use);
Each vector element or record is a tuple of four unsigned numbers:
struct my_type
{
    typedef unsigned short key_type;
    typedef std::tuple<key_type, key_type, key_type, key_type> key4tuple;
    ...
};
If you only want to sort, consider using the stxxl::sorter.
It should require only the expected amount of disk space (the total size of your data), and it should sort at around 100 MB/s or more, depending on your disk(s) and on how expensive comparisons are relative to the size of the data type.
The stxxl::sort() function does more work and needs additional disk space, because it writes out extra temporary data.
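A rough sketch of how a sorter-based pipeline could look is below. The record layout, the operator>>/operator<< overloads, the comparison key, and reading from std::cin are assumptions made for illustration; note that comparators used with stxxl sorting also have to provide min_value()/max_value() sentinel elements.

#include <stxxl/sorter>
#include <iostream>
#include <limits>

// Assumed record layout: "a tuple of four unsigned numbers", as described in the question.
struct my_type
{
    unsigned a, b, c, d;
};

std::istream& operator>>(std::istream& is, my_type& t)
{
    return is >> t.a >> t.b >> t.c >> t.d;
}

std::ostream& operator<<(std::ostream& os, const my_type& t)
{
    return os << t.a << ' ' << t.b << ' ' << t.c << ' ' << t.d;
}

// Comparators for stxxl sorting must supply min_value()/max_value() sentinels.
struct Cmp
{
    bool operator()(const my_type& x, const my_type& y) const
    {
        return x.a < y.a;  // order by the first component, just for illustration
    }
    my_type min_value() const
    {
        my_type t = { 0, 0, 0, 0 };
        return t;
    }
    my_type max_value() const
    {
        const unsigned m = std::numeric_limits<unsigned>::max();
        my_type t = { m, m, m, m };
        return t;
    }
};

int main()
{
    const size_t memory_to_use = 1024ul * 1024 * 1024;  // 1 GiB of internal memory

    stxxl::sorter<my_type, Cmp> sorter(Cmp(), memory_to_use);

    // Input phase: push all records (reading from stdin here for illustration).
    my_type rec;
    while (std::cin >> rec)
        sorter.push(rec);

    sorter.sort();  // switch from the input phase to the output phase

    // Output phase: stream the records out in sorted order.
    while (!sorter.empty())
    {
        std::cout << *sorter << '\n';
        ++sorter;
    }
    return 0;
}

The sorter streams the data through two phases (push everything, then read it back in sorted order), so on disk it needs little more than the sorted runs themselves, i.e. roughly the size of your data.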
Also see my tutorial video :).