tokyo-cabinet

Why does Tokyo Tyrant slow down exponentially even after adjusting bnum?


Has anyone successfully used Tokyo Cabinet / Tokyo Tyrant with large datasets? I am trying to upload a subgraph of the Wikipedia data source. After hitting about 30 million records, inserts slow down exponentially. This occurs with both the HDB and BDB databases. For the HDB case I adjusted bnum to 2-4x the expected number of records, which gave only a slight speed-up. I also set xmsiz to about 1GB, but I still hit a wall.

It seems that Tokyo Tyrant is basically an in-memory database, and once you exceed xmsiz or your available RAM, you get a barely usable database. Has anyone else encountered this problem? Were you able to solve it?
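
For context, the tuning described above looks roughly like the sketch below, assuming direct use of the Tokyo Cabinet C API (the path and the parameter values are illustrative; with Tokyo Tyrant the same tuning options are passed to ttserver along with the database file name):

    #include <tchdb.h>
    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
      TCHDB *hdb = tchdbnew();

      /* bnum at 2-4x the expected record count; apow/fpow left at defaults */
      if (!tchdbtune(hdb, 120000000LL, -1, -1, 0))
        fprintf(stderr, "tune error: %s\n", tchdberrmsg(tchdbecode(hdb)));

      /* xmsiz: how much of the file gets memory-mapped (about 1GB here) */
      if (!tchdbsetxmsiz(hdb, 1LL << 30))
        fprintf(stderr, "xmsiz error: %s\n", tchdberrmsg(tchdbecode(hdb)));

      if (!tchdbopen(hdb, "wikigraph.tch", HDBOWRITER | HDBOCREAT)) {
        fprintf(stderr, "open error: %s\n", tchdberrmsg(tchdbecode(hdb)));
        return 1;
      }

      /* bulk load: one put per record; this is where throughput collapses */
      tchdbput2(hdb, "page:42", "links:1,2,3");

      tchdbclose(hdb);
      tchdbdel(hdb);
      return 0;
    }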


Solution

  • I think I may have cracked this one, and I haven't seen this solution anywhere else. On Linux, there are generally two reasons that Tokyo starts to slow down. Let's go through the usual culprits. First, your bnum may be set too low; you want it to be at least half the number of items in the hash, preferably more. Second, you want to set your xmsiz close to the size of the bucket array. To get the size of the bucket array, just create an empty db with the correct bnum and Tokyo will initialize the file to the appropriate size. (For example, bnum=200000000 gives an empty db of roughly 1.5GB.) A sketch of this tuning is included at the end of this answer.

    But now you'll notice that it still slows down, albeit a bit farther along. We found that the trick was to turn off journaling in the filesystem: for some reason journaling activity (on ext3) spikes as the hash file grows beyond 2-3GB. (We realized this from I/O spikes that did not correspond to changes in the file on disk, together with CPU bursts from the kjournald daemon.)

    For Linux, just unmount your ext3 partition and remount it as ext2, build your db, and then remount as ext3. With journaling disabled we could build databases with 180 million keys without a problem.
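
    Here is a minimal sketch of the tuning steps above, assuming the Tokyo Cabinet C API; the path and the expected record count are illustrative, and the empty-file trick is used to size xmsiz to the bucket array:

        #include <tchdb.h>
        #include <stdio.h>
        #include <stdint.h>
        #include <sys/stat.h>

        int main(void) {
          const char *path = "wikigraph.tch";    /* illustrative path */
          const int64_t expected = 180000000LL;  /* expected number of records */

          TCHDB *hdb = tchdbnew();

          /* bnum: at least half the expected record count, preferably more */
          tchdbtune(hdb, expected, -1, -1, 0);

          /* creating the empty db materializes the bucket array on disk */
          if (!tchdbopen(hdb, path, HDBOWRITER | HDBOCREAT)) {
            fprintf(stderr, "open error: %s\n", tchdberrmsg(tchdbecode(hdb)));
            return 1;
          }
          tchdbclose(hdb);

          /* the size of the empty file is roughly the size of the bucket array */
          struct stat st;
          if (stat(path, &st) != 0) { perror("stat"); return 1; }
          fprintf(stderr, "bucket array: about %lld MB\n",
                  (long long)(st.st_size >> 20));

          /* set xmsiz close to that size, reopen, and run the bulk load */
          tchdbsetxmsiz(hdb, st.st_size);
          if (!tchdbopen(hdb, path, HDBOWRITER)) {
            fprintf(stderr, "open error: %s\n", tchdberrmsg(tchdbecode(hdb)));
            return 1;
          }
          /* ... puts go here ... */
          tchdbclose(hdb);
          tchdbdel(hdb);
          return 0;
        }

    The second open omits HDBOCREAT because the file already exists; bnum is fixed when the file is created and stored in it, so it does not need to be set again on reopen.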