java file optimization randomaccessfile

How can I search for a specific number/timestamp in a large sorted file in the most optimized way using Java?


My file consists of logs. Every line is a log entry whose first column is a timestamp, and all lines in the file are sorted by that timestamp. I have to find where a given timestamp occurs in the file, which could be around 10 GB in size. I can check sequentially, line by line, but is there a more optimized way to find the required line?

Edit: I'm thinking of applying binary search, but what approach should I take to apply binary search to a file? Can I use the RandomAccessFile class and file pointers? If so, how can I find the start of the line where my pointer lands, so that I can get the timestamp of that log? Thanks.

Sample log in the file: 2020-01-31T20:12:38.1234Z,field1,field2,etc,.....\n


Solution

  • Option 1 (fastest):

    If possible, create a second file that acts as an index for the log while the log is being generated. For each line, it could record the byte offset at which the line starts and the offset at which it ends (or, equivalently, the line's length in bytes). You could even split this into multiple index files.

    // 1 is line id, 0 is byte start index, 12 is end index 
    1 0 12 
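
As a rough illustration of such an index, here is a minimal sketch that scans an existing log once and writes one `id start end` row per line, in the format shown above. The class and method names (`LineIndexBuilder`, `buildIndex`, `readLineAt`) are hypothetical, and the sketch assumes that `end` is the offset just past the last byte of the line's content (so the newline itself is excluded) and that lines end with `\n`. If you control the program that writes the log, you could emit the same rows at write time instead of rescanning.

```java
import java.io.BufferedInputStream;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class LineIndexBuilder {

    /** Writes one "id start end" row per log line and returns the number of lines indexed. */
    static long buildIndex(Path log, Path index) throws IOException {
        long lineId = 0;
        long start = 0; // offset where the current line begins
        long pos = 0;   // offset of the next byte to be read
        try (InputStream in = new BufferedInputStream(Files.newInputStream(log));
             BufferedWriter out = Files.newBufferedWriter(index)) {
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') {
                    // Assumption: end is exclusive and does not include the newline.
                    out.write(++lineId + " " + start + " " + (pos - 1));
                    out.newLine();
                    start = pos;
                }
            }
            if (pos > start) { // last line without a trailing newline
                out.write(++lineId + " " + start + " " + pos);
                out.newLine();
            }
        }
        return lineId;
    }

    /** Reads the bytes in [start, end) of the log, i.e. one indexed line. */
    static String readLineAt(RandomAccessFile file, long start, long end) throws IOException {
        byte[] buf = new byte[(int) (end - start)];
        file.seek(start);
        file.readFully(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }
}
```

With the offsets known, a binary search can run over line ids and fetch each candidate line with a single seek, rather than hunting for line boundaries in the log itself.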
    

  • Option 2:

    A good solution would be a binary search implementation. For a sorted file this needs on the order of log2(n) line reads instead of the n reads of a linear scan. The idea is to jump to the middle of the file, find the line there, and compare its timestamp with the one you're seeking: if the target is smaller, continue in the left half of the file's bytes; if it is larger, continue in the right half; if they are equal, you've found the line.
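
To address the question of how to find the start of a line after seeking to an arbitrary byte, here is a minimal sketch of a binary search over byte offsets using `RandomAccessFile`. The class and method names (`LogTimestampSearch`, `lineStartAtOrBefore`, `findFirstAtOrAfter`) are hypothetical. The sketch assumes that every line starts with an ISO-8601 timestamp followed by a comma (as in the sample log), that lines end with `\n`, and that the file contains no blank lines. It returns the byte offset of the first line whose timestamp is at or after the target, or the file length if there is none.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.time.Instant;

public class LogTimestampSearch {

    /** Returns the offset of the first byte of the line that contains byte position pos. */
    static long lineStartAtOrBefore(RandomAccessFile file, long pos) throws IOException {
        long p = pos;
        while (p > 0) {
            file.seek(p - 1);
            if (file.read() == '\n') {
                break; // byte p-1 ends the previous line, so a line starts at p
            }
            p--;
        }
        return p;
    }

    /** Parses the first comma-separated column as a timestamp (assumed ISO-8601, e.g. ending in Z). */
    static Instant timestampOf(String line) {
        int comma = line.indexOf(',');
        return Instant.parse(comma < 0 ? line : line.substring(0, comma));
    }

    /** Returns the offset of the first line with timestamp >= target, or file.length() if none. */
    static long findFirstAtOrAfter(RandomAccessFile file, Instant target) throws IOException {
        long lo = 0;             // always the start of a line
        long hi = file.length(); // always the start of a line, or the end of the file
        while (lo < hi) {
            long mid = lo + (hi - lo) / 2;
            long start = lineStartAtOrBefore(file, mid); // back up to the start of mid's line
            file.seek(start);
            String line = file.readLine();
            long next = file.getFilePointer();           // start of the following line
            if (timestampOf(line).isBefore(target)) {
                lo = next;  // this line and everything before it is too early
            } else {
                hi = start; // this line qualifies; look for an earlier one
            }
        }
        return lo;
    }
}
```

To check for an exact match, seek to the returned offset, read that line, and compare its timestamp with the target. The backward scan reads one byte per call, which is acceptable for short log lines; for very long lines, reading a block into a buffer and scanning it in memory would likely be faster.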