I am working on a project that requires streaming data to disk at very high speeds on a single Linux server. An fio benchmark using the command below shows that I should be able to get the desired write speeds (> 40 GB/s) using io_uring.
fio --name=seqwrite --rw=write --direct=1 --ioengine=io_uring --bs=128k --numjobs=4 --size=100G --runtime=300 --directory=/mnt/md0/ --iodepth=128 --buffered=0 --numa_cpu_nodes=0 --sqthread_poll=1 --hipri=1
However, I am not able to replicate this performance with my own code, which makes use of the liburing helper library for io_uring. My current write speed is about 9 GB/s. I suspect that the extra overhead of liburing might be the bottleneck, but I have a few questions to ask about my approach before I give up on the much-prettier liburing code.
A few notes on my approach:
- I am not using writev(), but rather queueing requests that use the normal write() function to write to disk. (I tried gather/scatter IO requests, but this does not seem to have a major impact on my write speeds.)
- I control the number of submission threads my code creates via the NUM_JOBS macro. However, that does not tell me about threads created by the kernel for sq polling. To check for those, I run bpftrace -e 'tracepoint:io_uring:io_uring_submit_sqe {printf("%s(%d)\n", comm, pid);}' in a separate terminal, which shows that the kernel thread(s) dedicated to sq polling are active.
- I have tried using the IORING_SETUP_ATTACH_WQ flag when setting up the rings. If anything, this slowed things down.
The code below is a simplified version that removes a lot of error-handling code for the sake of brevity. However, the performance and behavior of this simplified version are the same as the full-featured code.
#include <fcntl.h>
#include <liburing.h>
#include <unistd.h>
#include <cstring>
#include <thread>
#include <vector>
#include "utilities.h"
#define NUM_JOBS 4 // number of single-ring threads
#define QUEUE_DEPTH 128 // size of each ring
#define IO_BLOCK_SIZE (128 * 1024) // write block size
#define WRITE_SIZE (IO_BLOCK_SIZE * 10000) // total number of bytes to write
#define FILENAME "/mnt/md0/test.txt" // file to write to
char incomingData[WRITE_SIZE]; // will contain the data to write to disk
void writeToFile(int fd, io_uring* ring, char* buffer, int size, int fileIndex);
int main()
{
// Initialize variables
std::vector<std::thread> threadPool;
std::vector<io_uring*> ringPool;
io_uring_params params;
int fds[2];
int bytesPerThread = WRITE_SIZE / NUM_JOBS;
int bytesRemaining = WRITE_SIZE % NUM_JOBS;
int bytesAssigned = 0;
utils::generate_data(incomingData, WRITE_SIZE); // this just fills the incomingData buffer with known data
// Open the file, store its descriptor
fds[0] = open(FILENAME, O_WRONLY | O_TRUNC | O_CREAT, 0644);
// initialize Rings
ringPool.resize(NUM_JOBS);
for (int i = 0; i < NUM_JOBS; i++)
{
io_uring* ring = new io_uring;
// Configure the io_uring parameters and init the ring
memset(&params, 0, sizeof(params));
params.flags |= IORING_SETUP_SQPOLL;
params.sq_thread_idle = 2000;
io_uring_queue_init_params(QUEUE_DEPTH, ring, &params);
io_uring_register_files(ring, fds, 1); // required for sq polling
// Add the ring to the pool
ringPool.at(i) = ring;
}
// Spin up threads to write to the file
threadPool.resize(NUM_JOBS);
for (int i = 0; i < NUM_JOBS; i++)
{
int bytesToAssign = (i != NUM_JOBS - 1) ? bytesPerThread : bytesPerThread + bytesRemaining;
threadPool.at(i) = std::thread(writeToFile, 0, ringPool[i], incomingData + bytesAssigned, bytesToAssign, bytesAssigned);
bytesAssigned += bytesToAssign;
}
// Wait for the threads to finish
for (int i = 0; i < NUM_JOBS; i++)
{
threadPool[i].join();
}
// Cleanup the rings
for (int i = 0; i < NUM_JOBS; i++)
{
io_uring_queue_exit(ringPool[i]);
delete ringPool[i]; // free the ring allocated with new
}
// Close the file
close(fds[0]);
return 0;
}
void writeToFile(int fd, io_uring* ring, char* buffer, int size, int fileIndex)
{
io_uring_cqe *cqe;
io_uring_sqe *sqe;
int bytesRemaining = size;
int bytesToWrite;
int bytesWritten = 0;
int writesPending = 0;
while (bytesRemaining || writesPending)
{
while(writesPending < QUEUE_DEPTH && bytesRemaining)
{
/* In this first inner loop,
* Write up to QUEUE_DEPTH blocks to the submission queue
*/
bytesToWrite = bytesRemaining > IO_BLOCK_SIZE ? IO_BLOCK_SIZE : bytesRemaining;
sqe = io_uring_get_sqe(ring);
if (!sqe) break; // if can't get a sqe, break out of the loop and wait for the next round
io_uring_prep_write(sqe, fd, buffer + bytesWritten, bytesToWrite, fileIndex + bytesWritten);
sqe->flags |= IOSQE_FIXED_FILE;
writesPending++;
bytesWritten += bytesToWrite;
bytesRemaining -= bytesToWrite;
if (bytesRemaining == 0) break;
}
io_uring_submit(ring);
while(writesPending)
{
/* In this second inner loop,
* Handle completions
* Additional error handling removed for brevity
* The functionality is the same as with error handling in the case that nothing goes wrong
*/
int status = io_uring_peek_cqe(ring, &cqe);
if (status == -EAGAIN) break; // if no completions are available, break out of the loop and wait for the next round
io_uring_cqe_seen(ring, cqe);
writesPending--;
}
}
}
Your fio example is using O_DIRECT; your own code is doing buffered IO. That's quite a big change... Outside of that, you're also doing polled IO with fio, while your example is not. Polled IO would set IORING_SETUP_IOPOLL and ensure that the underlying device has polling configured (see poll_queues=X for nvme). I suspect you end up doing IRQ-driven IO with fio anyway, in case that isn't configured correctly to begin with.
A few more notes - fio also sets a few optimal flags, like defer taskrun and single issuer. If the kernel is new enough, that'll make a difference, though nothing crazy for this workload.
And finally, you're using registered files. This is fine obviously, and is a good optimization if you're reusing a file descriptor. But it's not a requirement for SQPOLL, that went away long ago.
In summary, the fio job you are running and the code you wrote do vastly different things. Not an apples to apples comparison.
Edit: the fio job is also 4 threads, each writing to its own file, while your example appears to be 4 threads writing to the same file. This will obviously make things worse, particularly since your example is buffered IO and you're just going to end up with a lot of contention on the inode lock because of that.