I have a huge byte array that needs to be processed. In theory, it should be possible to slice the work into even pieces and assign them to different threads to increase performance on a multi-core machine.
I allocated a ByteBuffer for each thread and had each one process its part of the data. The result is slower than with a single thread, even though I have 8 logical processors. It is also very inconsistent: sometimes the same input takes twice as long to process, or more. Why is that? The data is loaded into memory first, so no more IO operations are performed.
I allocate my ByteBuffers using MappedByteBuffer because it's faster than ByteBuffer.wrap():
public ByteBuffer getByteBuffer() throws IOException
{
    File binaryFile = new File("...");
    // A mapping stays valid after its channel is closed, so close the channel
    // here instead of leaking the file handle.
    try (FileChannel binaryFileChannel = new RandomAccessFile(binaryFile, "r").getChannel())
    {
        return binaryFileChannel.map(FileChannel.MapMode.READ_ONLY, 0, binaryFileChannel.size());
    }
}
I do my concurrent processing using Executors:
int threadsCount = Runtime.getRuntime().availableProcessors();
ExecutorService executorService = Executors.newFixedThreadPool(threadsCount);
ExecutorCompletionService<String> completionService = new ExecutorCompletionService<>(executorService);
for (ByteBufferRange byteBufferRange : byteBufferRanges)
{
    Callable<String> task = () ->
    {
        performTask(byteBufferRange);
        return null;
    };
    completionService.submit(task);
}
// Wait for all tasks to finish
for (ByteBufferRange ignored : byteBufferRanges)
{
    completionService.take().get();
}
executorService.shutdown();
The concurrent performTask() calls use their own ByteBuffer instances to read from the buffer, do calculations and so on. They do not synchronize, write or influence each other. Any ideas what is going wrong, or is this not a good case for parallelization?
The same problem exists with ByteBuffer.wrap() and MappedByteBuffer alike.
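For reference, giving each task its own ByteBuffer over one shared buffer could look roughly like the sketch below. It is not the code from the question: BufferSplitter and split are hypothetical names, and the even split by byte count is an assumption about how the ranges are chosen. Each view has its own position and limit, so tasks can read without synchronizing, while the underlying bytes are shared rather than copied.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original code.
class BufferSplitter
{
    // Splits the remaining bytes of source into `parts` contiguous read-only views.
    static List<ByteBuffer> split(ByteBuffer source, int parts)
    {
        List<ByteBuffer> views = new ArrayList<>();
        int start = source.position();
        int total = source.remaining();
        for (int i = 0; i < parts; i++)
        {
            // Long arithmetic avoids overflow for buffers close to 2 GB.
            int from = start + (int) ((long) total * i / parts);
            int to = start + (int) ((long) total * (i + 1) / parts);
            // duplicate() shares the content but has an independent position and limit,
            // so the source buffer itself is left untouched.
            ByteBuffer view = source.duplicate();
            view.position(from);
            view.limit(to);
            views.add(view.slice().asReadOnlyBuffer());
        }
        return views;
    }
}
```

With a 10-byte buffer and three parts, this yields views of 3, 3 and 4 bytes. Because the views share the same memory, splitting this way does not by itself remove contention for memory bandwidth or page faults on a mapped file.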
As @EJP mentioned, the disk isn't really multi-threaded, though an SSD may help. The point of mapping the buffer is so you don't have to manage the memory yourself; let the OS do it since its virtual memory manager and file system cache are going to be faster than moving it into Java's heap and probably faster than any memory management code you write.
If the processing really can be parallelized, you will probably be better off having a single thread read the entire file, breaking it into chunks (possibly in some intermediate data format), then having your executors work on these chunks. The file reading thread can run concurrently with the other threads, so you don't need to read the whole file to start processing.
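A minimal sketch of that reader-plus-workers layout, under some assumptions: ChunkedFileProcessor, processFile and processChunk are hypothetical names, and processChunk just sums byte values as a stand-in for the real per-chunk work. The calling thread acts as the single file reader and submits each chunk as soon as it has been read, so the workers start before the whole file is in memory.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class, not part of the original code.
class ChunkedFileProcessor
{
    // Placeholder for the real work: sums the unsigned byte values of one chunk.
    static long processChunk(ByteBuffer chunk)
    {
        long sum = 0;
        while (chunk.hasRemaining())
        {
            sum += chunk.get() & 0xFF;
        }
        return sum;
    }

    // Reads the file sequentially on the calling thread and hands chunks to a worker pool.
    static long processFile(Path file, int chunkSize, int workerCount)
            throws IOException, InterruptedException, ExecutionException
    {
        ExecutorService workers = Executors.newFixedThreadPool(workerCount);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ))
        {
            List<Future<Long>> results = new ArrayList<>();
            while (true)
            {
                ByteBuffer chunk = ByteBuffer.allocate(chunkSize);
                while (chunk.hasRemaining() && channel.read(chunk) != -1)
                {
                    // Keep filling this chunk until it is full or the file ends.
                }
                chunk.flip();
                if (!chunk.hasRemaining())
                {
                    break; // nothing left to read
                }
                Callable<Long> task = () -> processChunk(chunk);
                results.add(workers.submit(task));
            }
            long total = 0;
            for (Future<Long> result : results)
            {
                total += result.get(); // waits for each chunk in submission order
            }
            return total;
        }
        finally
        {
            workers.shutdown();
        }
    }
}
```

Note that this sketch queues chunks without any limit, so if the workers are slower than the reader, the queued chunks can grow to the size of the file. A bounded queue between reader and workers would cap that, at the cost of a little more code.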
You may want to try setting the number of executors to cores - 1 so you don't starve the file reading thread. That would give the OS a chance to keep the file reading thread running on a single core without context switching, so you get good IO performance while the other cores do the CPU-intensive work.
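The cores - 1 sizing can be written as a small helper. This is only a sketch: WorkerPoolSizing, workerCount and newWorkerPool are hypothetical names, and the lower bound of one worker is an added assumption so the pool stays valid on a single-core machine.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper, not part of the original code.
class WorkerPoolSizing
{
    // Leaves one core for the file reading thread, but never returns fewer than one worker.
    static int workerCount(int availableCores)
    {
        return Math.max(1, availableCores - 1);
    }

    // Builds a fixed pool sized from the number of logical processors the JVM reports.
    static ExecutorService newWorkerPool()
    {
        int cores = Runtime.getRuntime().availableProcessors();
        return Executors.newFixedThreadPool(workerCount(cores));
    }
}
```

On the 8 logical processors from the question this gives 7 workers. Whether that beats a pool of 8 depends on the workload, so it is worth measuring both.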
FYI, this is what Apache Spark is built for. You may want to look at that if you need to work with larger files or need to process faster than what a single system can do.