I am writing an application that requires me to stream data over a network interface and write it to disk at very high throughput. The network and file IO components were implemented separately, and both are able to independently achieve the throughput required for the project. The networking side leverages DPDK (more relevant) and the file IO side leverages io_uring (less relevant). To achieve the high file IO throughput that I need, I must use direct IO (O_DIRECT); this is true regardless of the technology used to achieve the file IO. Using the page cache simply is not an option. The application must be zero-copy from the NIC to the NVMe drives we are using for storage.
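For context, the constraint comes from O_DIRECT itself: the user buffer address, the file offset, and the transfer length all have to be multiples of the device's logical block size (4096 bytes in my case). Below is a minimal illustration of that contract using a plain pwrite(); the file path and sizes are just placeholders, and my actual writes go through io_uring:

#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>

// Minimal O_DIRECT illustration (error handling omitted). The file path and the
// 4096-byte block size are placeholders for this example.
int main()
{
    const size_t block_size = 4096;            // assumed logical block size
    const size_t io_size    = 64 * block_size; // transfer length must be a multiple of block_size

    int fd = open("/mnt/nvme/stream.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);

    void *buf = nullptr;
    posix_memalign(&buf, block_size, io_size); // buffer address must be block-aligned
    memset(buf, 0xAB, io_size);

    pwrite(fd, buf, io_size, 0);               // file offset must be block-aligned too

    free(buf);
    close(fd);
    return 0;
}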
I have been unable to align the DPDK message buffers (rte_mbuf) to enable the direct IO. This severely limits my file IO throughput, and if the alignment is not possible, I will likely need to find an alternative to DPDK, which, of course, I would like to avoid. Does anyone know how this memory alignment can be achieved? The message buffers should be aligned to addresses that are multiples of 4096.
There are a number of ways to set up the DPDK mempools (rte_mempool) and message buffers. Right now, I am using rte_pktmbuf_pool_create() (as seen below), which creates a mempool and allocates the message buffers all in one function call, but I am open to a different approach if it helps me get the alignment I need.
rte_pktmbuf_pool_create(name, num_bufs, DPDK_MBUF_CACHE_SIZE, 0, mbuf_size, cpu_socket);
Where...
- DPDK_MBUF_CACHE_SIZE is hardware-determined and is set to 315
- mbuf_size is 9000 + RTE_PKTMBUF_HEADROOM (defined by DPDK to be 128) + RTE_ETHER_HDR_LEN + RTE_ETHER_CRC_LEN (worked out in the short sketch below)

See the following code snippets, which provide a solution. Be sure to read all the way to the bottom before trying to implement something similar in your project. Also, please note that all critical error handling has been removed for the sake of brevity, and should be added back into any similar implementation.
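For reference, with the stock DPDK definitions (RTE_PKTMBUF_HEADROOM = 128, RTE_ETHER_HDR_LEN = 14, RTE_ETHER_CRC_LEN = 4), the sizes work out as follows. This is only an illustrative calculation, not part of the solution itself:

// Illustrative arithmetic only; the constants come from the DPDK headers.
constexpr unsigned mbuf_size    = 9000 + 128 + 14 + 4;                // = 9146 bytes of data room per mbuf
constexpr unsigned element_size = ((mbuf_size + 4095) / 4096) * 4096; // = 12288, i.e. RTE_ALIGN_CEIL(mbuf_size, 4096) below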
register_external_buffers() allocates the external memory areas in huge pages and registers them with DPDK.
#include <sys/mman.h>    // mmap() and the MAP_* flags
#include <rte_common.h>  // RTE_ALIGN_CEIL
#include <rte_memory.h>  // RTE_PGSIZE_1G, rte_extmem_register()
#include <rte_dev.h>     // rte_dev_dma_map()
#include <rte_mbuf.h>    // rte_pktmbuf_extmem

unsigned register_external_buffers(rte_device *device, uint32_t num_mbufs, uint16_t mbuf_size, unsigned socket, rte_pktmbuf_extmem **ext_mem)
{
    rte_pktmbuf_extmem *extmem_array; // Array of external memory descriptors
    unsigned elements_per_zone;       // Memory is reserved and registered in zones
    unsigned n_zones;                 // Number of zones needed to accommodate all mbufs
    uint16_t element_size;            // Size, in bytes, of one mbuf element
    int status;                       // Used to store error codes / return values

    // Round each element up to a 4096-byte boundary so every element starts on a 4 KiB multiple
    element_size = RTE_ALIGN_CEIL(mbuf_size, 4096);
    elements_per_zone = RTE_PGSIZE_1G / element_size;
    n_zones = (num_mbufs / elements_per_zone) + ((num_mbufs % elements_per_zone) ? 1 : 0);
    extmem_array = new rte_pktmbuf_extmem[n_zones];

    for (unsigned extmem_index = 0; extmem_index < n_zones; extmem_index++)
    {
        rte_pktmbuf_extmem *current_extmem = extmem_array + extmem_index;

        // Reserve a pinned, huge-page-backed 1 GiB zone
        current_extmem->buf_ptr = mmap(NULL, RTE_PGSIZE_1G, PROT_READ | PROT_WRITE,
                                       MAP_HUGETLB | MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE | MAP_LOCKED, -1, 0);
        current_extmem->buf_iova = 0;
        current_extmem->buf_len = RTE_PGSIZE_1G;
        current_extmem->elt_size = element_size;

        // Make the zone known to DPDK and map it for DMA, using the virtual address as the IOVA
        rte_extmem_register(current_extmem->buf_ptr, current_extmem->buf_len, NULL, 0, RTE_PGSIZE_1G);
        rte_dev_dma_map(device, current_extmem->buf_ptr, (rte_iova_t) current_extmem->buf_ptr, current_extmem->buf_len);
    }

    *ext_mem = extmem_array;
    return n_zones;
}
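Since all of the error handling was stripped out above, here is a rough idea of the checks that would go back in for each zone. The helper name and the log-and-bail policy are made up for illustration, and the checks assume the usual DPDK convention of returning 0 on success and setting rte_errno on failure:

#include <cerrno>
#include <cstdio>
#include <cstring>
#include <sys/mman.h>
#include <rte_errno.h>
#include <rte_memory.h>
#include <rte_dev.h>
#include <rte_mbuf.h>

// Hypothetical helper: maps and registers a single 1 GiB zone, returning false on the first failure.
// The cleanup strategy for already-mapped zones is application-specific and omitted here.
static bool map_and_register_zone(rte_device *device, rte_pktmbuf_extmem *zone, uint16_t element_size)
{
    zone->buf_ptr = mmap(NULL, RTE_PGSIZE_1G, PROT_READ | PROT_WRITE,
                         MAP_HUGETLB | MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE | MAP_LOCKED, -1, 0);
    if (zone->buf_ptr == MAP_FAILED)
    {
        fprintf(stderr, "mmap of huge-page zone failed: %s (are enough huge pages reserved?)\n", strerror(errno));
        return false;
    }
    zone->buf_iova = 0;
    zone->buf_len = RTE_PGSIZE_1G;
    zone->elt_size = element_size;

    if (rte_extmem_register(zone->buf_ptr, zone->buf_len, NULL, 0, RTE_PGSIZE_1G) != 0)
    {
        fprintf(stderr, "rte_extmem_register failed: %s\n", rte_strerror(rte_errno));
        return false;
    }
    if (rte_dev_dma_map(device, zone->buf_ptr, (rte_iova_t) zone->buf_ptr, zone->buf_len) != 0)
    {
        fprintf(stderr, "rte_dev_dma_map failed: %s\n", rte_strerror(rte_errno));
        return false;
    }
    return true;
}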
Then register_external_buffers might be used as follows:
rte_eth_dev_info dev_info;
rte_pktmbuf_extmem *extmem; // Filled in by register_external_buffers()

rte_eth_dev_info_get(port_id, &dev_info);
unsigned length = register_external_buffers(dev_info.device, num_bufs, mbuf_size, cpu_socket, &extmem);
m_rx_pktbuf_pools.at(cpu_socket) = rte_pktmbuf_pool_create_extbuf(name, num_bufs, DPDK_MBUF_CACHE_SIZE, 0, mbuf_size, cpu_socket, extmem, length);
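To see where the data of the resulting mbufs actually lands, it is easy to allocate a few from the new pool and print their data addresses. A quick sketch; the helper is made up for illustration and would be passed the pool created above:

#include <cstdio>
#include <cstdint>
#include <vector>
#include <rte_mbuf.h>
#include <rte_memory.h>

// Illustrative check: allocate a handful of mbufs and report how each data
// address lines up against 4 KiB and 1 GiB boundaries.
static void check_mbuf_alignment(rte_mempool *pool, unsigned count)
{
    std::vector<rte_mbuf *> mbufs;
    for (unsigned i = 0; i < count; i++)
    {
        rte_mbuf *m = rte_pktmbuf_alloc(pool);
        if (m == NULL)
            break;
        uintptr_t data = (uintptr_t) rte_pktmbuf_mtod(m, void *);
        printf("mbuf %u: data=%p  %% 4 KiB = %zu  %% 1 GiB = %zu\n",
               i, (void *) data, (size_t) (data % 4096), (size_t) (data % RTE_PGSIZE_1G));
        mbufs.push_back(m); // hold on to it so the next alloc returns a different mbuf
    }
    for (rte_mbuf *m : mbufs)
        rte_pktmbuf_free(m);
}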
While this does result in all mbuf data being stored and aligned in external memory areas, the buffers end up aligned to huge-page boundaries, not to the typical 4 KiB pages. This means that while the initial problem was solved, the solution is not very practical for this use case, as the number of usable page boundaries is very limited.