I'm trying to compare, in Java, a huge amount of data spread across two folders, folder1 and folder2. Each folder contains about 100 files of roughly 10 MB each. Each file holds key,value lines like these (around 500 million lines per folder in total):
RFE023334343432-45,456677
RFE54667565765-5,465368
and so on.
First, every line of every file in folder1 is read and loaded into a RocksDB instance. From the example above, that gives
key = RFE023334343432-45 and
value = 456677
Once my RocksDB is full with folder1's data, then for each line read from folder2 I call get() on that RocksDB to check whether the key extracted from the folder2 line exists in it; get() returns null when the key does not exist. Note that I cannot use RocksDB's keyMayExist() method, because it returns false positives when you manipulate this much data.
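A minimal sketch of the two-pass flow described above, using a plain `HashMap` as an in-memory stand-in for the RocksDB instance (its put()/get() calls mirror the shape of the RocksDB API; the parsing on the last comma and the sample keys are illustrative assumptions, not the asker's actual code):

```java
import java.util.HashMap;
import java.util.Map;

public class FolderCompareSketch {
    // Split a "key,value" line on its last comma; keys like
    // "RFE023334343432-45" contain a '-' but no comma.
    static String[] parseLine(String line) {
        int idx = line.lastIndexOf(',');
        return new String[] { line.substring(0, idx), line.substring(idx + 1) };
    }

    public static void main(String[] args) {
        // Stand-in for the RocksDB instance loaded from folder1.
        Map<String, String> folder1Db = new HashMap<>();

        // Pass 1: load every folder1 line into the store.
        String[] folder1Lines = { "RFE023334343432-45,456677", "RFE54667565765-5,465368" };
        for (String line : folder1Lines) {
            String[] kv = parseLine(line);
            folder1Db.put(kv[0], kv[1]);
        }

        // Pass 2: probe each folder2 key; get() returns null on a miss,
        // which is exactly the existence check the question describes.
        String[] folder2Lines = { "RFE023334343432-45,456677", "RFE99999999999-1,111111" };
        for (String line : folder2Lines) {
            String key = parseLine(line)[0];
            System.out.println(key + " -> " + (folder1Db.get(key) != null ? "present" : "missing"));
        }
    }
}
```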
Performance is fine when the data in folder1 is sorted by key.
But the run takes three times as long when the input data is not sorted (I shuffled it with a shell command). That is odd, because in my test I copied my unsorted folder1 to folder2 (just duplicating the folder), so even though folder1 is unsorted, folder2 is unsorted in exactly the same way as folder1.
My question is: how can I sort my RocksDB by key?
RocksDB always keeps data sorted by key. You can use an iterator to walk the K/V pairs of a RocksDB instance in key order. Here is the API to create an iterator: https://github.com/facebook/rocksdb/blob/v6.22.1/include/rocksdb/db.h#L709-L716
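To see that ordering property in a self-contained way, here is a sketch using the JDK's `TreeMap` as a stand-in for RocksDB's key space (RocksDB's default bytewise comparator keeps keys sorted on disk regardless of insertion order, just as `TreeMap` does in memory; this is an illustration of the behavior, not the RocksDB API itself, whose iterator is created with `db.newIterator()` and walked with `seekToFirst()`/`next()`):

```java
import java.util.Map;
import java.util.TreeMap;

public class SortedIterationSketch {
    public static void main(String[] args) {
        // TreeMap keeps keys sorted, like RocksDB's default bytewise
        // comparator does, no matter what order you insert in.
        Map<String, String> db = new TreeMap<>();
        db.put("RFE54667565765-5", "465368");   // inserted out of order
        db.put("RFE023334343432-45", "456677");

        // Iteration comes back in key order, just as a RocksDB iterator
        // walks keys in comparator order.
        for (Map.Entry<String, String> e : db.entrySet()) {
            System.out.println(e.getKey() + "," + e.getValue());
        }
        // "RFE023334343432-45" is printed first: '0' < '5' bytewise.
    }
}
```

So there is nothing to sort after the fact: the shuffled input only changes how much work the load and lookup passes do, not the final on-disk order.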