This question is about low-level async I/O system calls like send + epoll/aio_read and others. I am asking about both network I/O and disk I/O.
The naive way of implementing those async calls would be to create a thread for each asynchronous I/O request, which would then perform the request synchronously. Obviously, this naive solution scales badly with a large number of parallel requests. Even if a thread pool were used, we would still need one thread per parallel request.
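To illustrate the naive model (all names here are made up, this is just the pattern I mean):

    /* Naive model: one thread per blocking request. */
    #include <pthread.h>
    #include <unistd.h>

    struct request {
        int fd;            /* file or socket descriptor */
        char buf[4096];
    };

    static void *serve_one(void *arg)
    {
        struct request *req = arg;
        /* The thread blocks here, so N parallel requests need N threads. */
        read(req->fd, req->buf, sizeof(req->buf));
        /* ... process the data, notify the caller ... */
        return NULL;
    }

    /* The "async" call just spawns a thread that works synchronously. */
    void submit(struct request *req)
    {
        pthread_t t;
        pthread_create(&t, NULL, serve_one, req);
        pthread_detach(t);
    }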
Therefore, I speculate that this is done in the following more efficient way:
For writing/sending data:
Append the send-request to some kernel-internal async I/O queue.
Dedicated "write-threads" are picking up these send-requests in such a way, that the target hardware is fully utilized. For this, a special I/O scheduler might be used.
Depending on the target hardware, the write-requests are eventually dispatched, e.g. via Direct Memory Access (DMA).
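To make this concrete, here is a hypothetical user-space model of such a write path; the queue, the names and the worker pool are my invention, not actual kernel code:

    #include <pthread.h>
    #include <stddef.h>

    struct send_req {
        struct send_req *next;
        int fd;
        const void *data;
        size_t len;
    };

    static struct send_req *send_queue;            /* LIFO for brevity */
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

    /* The "async send": append the request and return immediately. */
    void submit_send(struct send_req *req)
    {
        pthread_mutex_lock(&qlock);
        req->next = send_queue;
        send_queue = req;
        pthread_cond_signal(&qcond);
        pthread_mutex_unlock(&qlock);
    }

    /* Body of each dedicated write-thread (at most one per core). */
    void *write_thread(void *unused)
    {
        for (;;) {
            pthread_mutex_lock(&qlock);
            while (!send_queue)
                pthread_cond_wait(&qcond, &qlock);
            struct send_req *req = send_queue;
            send_queue = req->next;
            pthread_mutex_unlock(&qlock);
            /* dispatch_to_hardware(req);  e.g. set up a DMA transfer */
        }
        return NULL;
    }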
For reading/receiving data:
The hardware raises an I/O interrupt that jumps into an I/O interrupt handler of the kernel.
The interrupt handler appends a notification to a read-queue and returns quickly.
Dedicated "read-threads" pick up the notifications of the read-queue and perform two tasks: 1) Copy the read data to the target buffer if necessary. 2.) Notify the target process in some way if necessary (e.g.epoll, signals,..).
For all of this, there is no need to have more write-threads or read-threads than there are CPU cores. Hence, the scalability problem with many parallel requests would be solved.
How is this implemented in real OS kernels? Which of these speculations are true?
Those "asynchronous" I/O stuffs are another illusion by KERNEL and Driver service. I will take an example of wifi driver. (which is network).
1) When packets come in, the wifi hardware generates interrupts and DMAs the dot11 or dot3 frame to DRAM (this depends on the wifi hardware; nowadays, most modern wifi hardware converts the frames in hardware, or rather in the firmware running on it).
2) The wifi driver (running in the kernel) has to handle many wifi-specific things, but most likely it will form socket buffers (skb) and hand the skbs to the Linux network stack. Typically this happens in NET_RX_SOFTIRQ, or you can create your own thread (a sketch follows this list).
3) Packets arrive at the Linux stack, which can send them to user space. This happens in "__netif_receive_skb_core", and if the packet is an IP packet, the first rx_handler would be "ip_rcv()".
4) IP packets move up to the transport-layer handler, which is udp_rcv() / tcp_rcv(). To get packets to the transport layer, they have to go through the socket layer, and eventually a linked list of packets (you could call it a queue, "Q") forms on the specific socket.
5) As far as I understand, this "Q" is the queue that supplies packets to user space. You can do "async" or "sync" I/O against it.
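To make step 2) concrete, here is roughly what a simple, non-NAPI driver does to hand a received frame to the stack; real drivers add DMA mapping, locking and error handling, and rx_one_frame is a made-up name:

    #include <linux/netdevice.h>
    #include <linux/etherdevice.h>
    #include <linux/skbuff.h>

    /* Called from the driver's RX interrupt (or its bottom half) once the
     * device has DMA'd a frame into "frame". */
    static void rx_one_frame(struct net_device *dev,
                             const void *frame, unsigned int len)
    {
        struct sk_buff *skb = netdev_alloc_skb(dev, len);

        if (!skb) {
            dev->stats.rx_dropped++;
            return;
        }
        skb_put_data(skb, frame, len);             /* copy frame into skb */
        skb->protocol = eth_type_trans(skb, dev);  /* classify, e.g. IP */
        netif_rx(skb);  /* queue for NET_RX_SOFTIRQ; leads to ip_rcv() */
    }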
Now the TX (transmit) path: 1) Packets go down through the kernel's transport layer and IP layer, and eventually your netdev TX handler gets called (hard_start_xmit, nowadays ndo_start_xmit). Basically, if your netdev (e.g. eth0 or wifi0) is an ethernet device, that handler is your ethernet driver's or wifi driver's "TX" function. It is a callback, typically set up when the driver comes up (a skeleton is shown after this list).
2) At this stage, your packets have already been transformed into "skb"s.
3) In the callback, the driver prepares all the headers and descriptors and starts the DMA.
4) Once TX is done on the hardware, the hardware generates an interrupt and you need to free the packet.
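A skeleton of that TX callback; the descriptor/DMA part is device-specific, so it is only hinted at in comments:

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    static netdev_tx_t my_start_xmit(struct sk_buff *skb,
                                     struct net_device *dev)
    {
        /* Step 3: prepare headers/descriptors and kick off the DMA.
         * The skb is freed later, from the TX-done interrupt (step 4),
         * e.g. via dev_kfree_skb_irq(). */
        /* queue_descriptor_and_start_dma(dev, skb);  device-specific */
        return NETDEV_TX_OK;
    }

    static const struct net_device_ops my_netdev_ops = {
        .ndo_start_xmit = my_start_xmit,
        /* .ndo_open, .ndo_stop, ... */
    };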
Here, my point is that your network I/O already works "asynchronously" at the DMA and driver level. Most modern drivers have a separate execution context for this. For TX, it would be a thread, a tasklet, or NET_TX_SOFTIRQ. For RX, if "NAPI" is used, it is NET_RX_SOFTIRQ, but it can also be a thread or a tasklet.
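For example, the usual NAPI pattern looks roughly like this; my_priv and the commented-out pieces are placeholders for driver specifics:

    #include <linux/netdevice.h>
    #include <linux/interrupt.h>

    struct my_priv {
        struct napi_struct napi;
        /* ... rings, registers ... */
    };

    static irqreturn_t my_irq(int irq, void *data)
    {
        struct my_priv *priv = data;
        /* mask further RX interrupts on the device here, then: */
        napi_schedule(&priv->napi);        /* raises NET_RX_SOFTIRQ */
        return IRQ_HANDLED;
    }

    static int my_poll(struct napi_struct *napi, int budget)
    {
        int done = 0;
        /* while (done < budget && frames are ready) {
         *     build skb; napi_gro_receive(napi, skb); done++;
         * } */
        if (done < budget) {
            napi_complete_done(napi, done);
            /* unmask RX interrupts again */
        }
        return done;
    }

    /* Registered at setup time with netif_napi_add() (its exact
     * signature varies between kernel versions). */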
All of this happens independently, triggered by an interrupt or some other event.
"Synchronous I/O" is mostly simulated in upper application layer. So, if you re-write your socket layer in kernel, you can do whatever you want to do since lower layer is already working as you want.