java nio randomaccessfile filechannel

Fastest way to read a line in a file


I am using RandomAccessFile to read some information from a large file. RandomAccessFile has a seek method that moves the cursor to a specific position in the file, and I want to read the whole line that starts there. To read that line I use the readLine() method.

I read the whole file once beforehand and created an index that lets me jump to the beginning of any line with seek. This index works fine. I built it based on this answer: https://stackoverflow.com/a/42077860/763368

Since I have to access this file many times, performance matters, so I am looking for other ways to jump to a specific line and read the whole line.

I read that FileChannel with MappedByteBuffer is a good option for reading files quickly, but I haven't seen any solution that does what I want.

P.S.: the lines have different lengths, and I don't know these lengths.

Does anybody have a good solution?

Edit:

The file I want to read has the following format: key\tvalue

The index is a HashMap whose keys are all the keys in that file and whose values are the byte positions (Long).

Let's suppose I want to go to the line with the key "foo"; then I must seek to the position stored for that key, like this:

raf.seek(index.get("foo"))

If I then call raf.readLine(), it returns the whole line with the key "foo".

But I don't want to use RandomAccessFile for this because it is too slow.

This is how I am doing it now in Scala:

val raf = new RandomAccessFile(file,"r")  
raf.seek(position.get(key))
println(raf.readLine)
raf.close

Solution

  • If you already have to read through the file once to find the offsets of the keys, the fastest solution by far is to keep the lines in memory as you read them. If that doesn't work for some reason (e.g. memory constraints), memory-mapped buffers can indeed be a good alternative. This is an outline of the code:

    // Needs: java.io.RandomAccessFile, java.nio.ByteBuffer,
    //        java.nio.channels.FileChannel, java.nio.charset.StandardCharsets
    FileChannel channel = new RandomAccessFile("/some/file", "r").getChannel();
    
    // A single mapping cannot exceed Integer.MAX_VALUE bytes (about 2 GB),
    // so the page is "file size or 2 GB", whichever is smaller
    long pageSize = Math.min(channel.size(), Integer.MAX_VALUE);
    long position = 0;
    ByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, position, pageSize);
    
    ByteBuffer slice;
    int maxLineLength = 30;
    byte[] lineBuffer = new byte[maxLineLength];
    
    // Read line at indices 20 - 25 (6 bytes)
    buffer.position(20);
    slice = buffer.slice();
    slice.get(lineBuffer, 0, 6);
    // Decode only the 6 bytes just read; the rest of lineBuffer may hold stale data
    System.out.println("Starting at 20:" + new String(lineBuffer, 0, 6, StandardCharsets.UTF_8));
    
    // Read line at indices 0 - 10 (11 bytes)
    buffer.position(0);
    slice = buffer.slice();
    slice.get(lineBuffer, 0, 11);
    System.out.println("Starting at 0:" + new String(lineBuffer, 0, 11, StandardCharsets.UTF_8));
    
    

    This code can also be used for very large files. Just call channel.map on the "page" where your key is located, position = keyIndex / pageSize * pageSize (integer division), and then call buffer.position with the offset inside that page: keyIndex - position.
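
    The page arithmetic above could be sketched roughly as follows. This is only an illustration under some assumptions: the class name PagedLineReader and the overlap parameter are hypothetical (the overlap is added here so that a line starting near the end of a page still fits into the mapping), line lengths are assumed to be known, and the tiny page size of 8 bytes is chosen purely to exercise page switches.

    ```java
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Hypothetical helper: keeps the most recently used "page" mapped and reuses it
    // as long as consecutive lookups fall into the same page.
    class PagedLineReader {
        private final FileChannel channel;
        private final long pageSize;
        private final int overlap;      // assumption: at least the maximum line length in bytes
        private long pageStart = -1;    // file offset of the currently mapped page
        private ByteBuffer page;

        PagedLineReader(FileChannel channel, long pageSize, int overlap) {
            // pageSize + overlap must stay <= Integer.MAX_VALUE, the limit of one mapping
            this.channel = channel;
            this.pageSize = pageSize;
            this.overlap = overlap;
        }

        String read(long keyIndex, int lineLength) throws IOException {
            long wanted = keyIndex / pageSize * pageSize;   // integer division rounds down
            if (wanted != pageStart) {
                // Map the page plus the overlap, but never past the end of the file
                long length = Math.min(pageSize + overlap, channel.size() - wanted);
                page = channel.map(FileChannel.MapMode.READ_ONLY, wanted, length);
                pageStart = wanted;
            }
            ByteBuffer view = page.duplicate();             // independent position, same memory
            view.position((int) (keyIndex - pageStart));
            byte[] line = new byte[lineLength];
            view.get(line);
            return new String(line, StandardCharsets.UTF_8);
        }

        public static void main(String[] args) throws IOException {
            Path file = Files.createTempFile("paged", ".tsv");
            file.toFile().deleteOnExit();
            Files.write(file, "foo\t1\nbar\t22\nbaz\t333\n".getBytes(StandardCharsets.UTF_8));
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                PagedLineReader reader = new PagedLineReader(channel, 8, 8);
                System.out.println(reader.read(6, 6));   // line with key "bar", crosses the 8-byte boundary
                System.out.println(reader.read(13, 7));  // line with key "baz", lives in the second page
            }
        }
    }
    ```

    With real data the page size would be far larger (up to the 2 GB mapping limit), and sorting or grouping lookups by page would reduce how often a new mapping is needed.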

    If you really have no way to group accesses to one "page" together, then you don't need the slice. Performance won't be as good, but it lets you simplify the code further:

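    byte[] lineBuffer = new byte[maxLineLength];
    // ...
    // keyIndex is the line's byte offset in the file, lineLength its length in bytes
    ByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, keyIndex, lineLength);
    buffer.get(lineBuffer, 0, lineLength);
    // Decode only the bytes of this line, not the whole reusable buffer
    System.out.println(new String(lineBuffer, 0, lineLength, StandardCharsets.UTF_8));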
    

    Note that the ByteBuffer is not allocated on the JVM heap; it is backed by a memory-mapped file at the OS level. (As of Java 8, you can verify this by looking at the source code and searching for sun.nio.ch.DirectBuffer in the implementation.)
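
    A quick way to observe this without reading the JDK sources is to ask the buffer itself. The following is a minimal sketch (the class name MappedBufferCheck and the temporary file are made up for the demo) that compares a mapped buffer with an ordinary heap buffer using ByteBuffer.isDirect():

    ```java
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    class MappedBufferCheck {
        // Maps a small temporary file and reports whether the resulting buffer is direct
        static boolean mappedBufferIsDirect() throws IOException {
            Path file = Files.createTempFile("mapped", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, new byte[] {1, 2, 3, 4});
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                MappedByteBuffer buffer =
                        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                return buffer.isDirect();
            }
        }

        public static void main(String[] args) throws IOException {
            System.out.println("mapped buffer is direct: " + mappedBufferIsDirect());
            // A buffer from ByteBuffer.allocate lives on the JVM heap
            System.out.println("heap buffer is direct:   " + ByteBuffer.allocate(4).isDirect());
        }
    }
    ```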

    Line size: The best way to get the line size is to store it when you scan through the file, i.e. use Map[String, (Long, Int)] (key to byte offset and line length) instead of the index you are using now. If that doesn't work for you, you should run some tests to find out what is faster:

    This would be the Scala code for the second approach:

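    import java.nio.channels.FileChannel
    import java.nio.charset.StandardCharsets

    // this happens once
    val maxLineLength: Long = 2000 // find this in your initial sequential scan
    val lineBuffer = new Array[Byte](maxLineLength.toInt)
    
    // this is how you read a key
    val bufferLength = maxLineLength min (channel.size() - index("key"))
    val buffer = channel.map(FileChannel.MapMode.READ_ONLY, index("key"), bufferLength)
    var lineLength = 0 // or minLineLength
    // scan for the newline, stopping at the end of the buffer in case the last line has none
    while (lineLength < buffer.limit() && buffer.get(lineLength) != '\n') {
      lineLength += 1
    }
    // lineLength is now the number of bytes before '\n' (subtract 1 only if lines end in \r\n)
    buffer.get(lineBuffer, 0, lineLength)
    // decode only the bytes of this line, not the whole reusable buffer
    println(new String(lineBuffer, 0, lineLength, StandardCharsets.UTF_8))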
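
    For comparison, the first suggestion (storing the line length next to the offset during the initial scan) might look roughly like this in Java. It is a sketch under assumptions: the names OffsetLengthIndex, Entry, buildIndex and decode are hypothetical, the whole file is assumed to fit into a single mapping (under 2 GB), and each line is assumed to have the key\tvalue format from the question.

    ```java
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.HashMap;
    import java.util.Map;

    class OffsetLengthIndex {
        // Hypothetical index entry: byte offset of the line and its length without '\n'
        static final class Entry {
            final long offset;
            final int length;
            Entry(long offset, int length) { this.offset = offset; this.length = length; }
        }

        // One sequential pass: records offset and length of every non-empty line by key
        static Map<String, Entry> buildIndex(ByteBuffer data) {
            Map<String, Entry> index = new HashMap<>();
            int lineStart = 0;
            for (int i = 0; i <= data.limit(); i++) {
                boolean endOfLine = i == data.limit() || data.get(i) == '\n';
                if (!endOfLine) continue;
                if (i > lineStart) {
                    String line = decode(data, lineStart, i - lineStart);
                    int tab = line.indexOf('\t');
                    String key = tab < 0 ? line : line.substring(0, tab);
                    index.put(key, new Entry(lineStart, i - lineStart));
                }
                lineStart = i + 1;
            }
            return index;
        }

        // Reads exactly `length` bytes at `offset`; no scanning for '\n' is needed
        static String decode(ByteBuffer data, long offset, int length) {
            byte[] bytes = new byte[length];
            ByteBuffer view = data.duplicate();   // leaves the shared buffer's position untouched
            view.position((int) offset);
            view.get(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }

        public static void main(String[] args) throws IOException {
            Path file = Files.createTempFile("index", ".tsv");
            file.toFile().deleteOnExit();
            Files.write(file, "foo\t1\nbar\t22\nbaz\t333\n".getBytes(StandardCharsets.UTF_8));
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                ByteBuffer data = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                Map<String, Entry> index = buildIndex(data);
                Entry bar = index.get("bar");
                System.out.println(decode(data, bar.offset, bar.length));
            }
        }
    }
    ```

    The extra Int per key costs some memory, but each lookup then becomes a single bulk get with no byte-by-byte search for the newline.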