First I will tell environment of my PC, background of my question, my problem, than I will explain my exact question.
Environment:
OS: Ubuntu 16.04
Kernel: 4.17.1
CPU: i7-6700k
Memory: 8GB DRAM
Storage: SSD 120GB
Background:
I'm trying to optimizing linux kernel for my specific application. Following is abstract logic of this application.
1. call malloc, allocate the memory space which size is exactly 4KB(page size)
2. Copy predefined data(also, size is 4KB) to allocated memory space.
3. Do computation
4. Free allocated memory space.
This sequence occurs about several thousands to ten thousands times a second.
So I thought copy predefined data to allocated memory space using memcpy() thousands of times every second is very inefficient. But I cannot fix the code of this application.
My problem:
I want to do these copies asynchronously by kernel module, using less CPU cycles as possible. So I'm trying to implement a kernel module that copy this predefined data to free page frames asynchronously in kernel, and managing a pool page frames which has predefined data on them. When my specific application request a page frame, my kernel will give a page frame from this pool.
To copy data asynchronously, I first considered DMA, but intel idma64 of my CPU cannot copy data memory to memory asynchronously. Now, I'm trying to copy this data from secondary storage(SSD) to memory. I found that there is library for asynchronous IO named libaio in linux.
My question:
1. Can I use libaio libraries in kernel module? If not, what kind of library or APIs do I have to use to copy asynchronously in my kernel module?
2. Will libaio(or something else) really do copies without exploiting CPU cycles?
I don't think you need to write a kernel module. A user space thread pool of CPU pinned threads working with a collection of memory maps of files will be as efficient as is possible to implement. Just be careful of "TLB shootdown" i.e. avoid modifying the address space of the process, and throw as much virtual address space as you can at the problem to avoid that. Perhaps a little bit of hinting to the kernel as to what written pages will never be used again via madvise()
, and you should be optimal - sufficient multiple threads will maximise queue depth to the SSD, you want to aim for QD8 to QD16, and you should easily saturate a NVMe link whilst keeping CPU usage below 100%.
Things get harder if you have many NVMe linked SSDs, you may need to consider replacing Linux will something with more scalable storage i/o, but there is a throughput vs scalability tradeoff there. Windows and FreeBSD will scale better with lots of devices if you partition the work up right, but Linux will do much better with a few devices. Good luck!