I'm building a high velocity distributed database in Rust, using io_uring, eBPF and the NVMe API, which means I cannot use 99% of the existing libraries/frameworks out there, but instead I need to implement everything from scratch, starting from a custom event loop.
At the moment I have implemented only Unix Domain Sockets/UDP/TCP, without TLS/SSL (due to lack of skills), but I would like to keep the question as generic as possible (UDS/UDP/TCP/QUIC, both datagram and stream, with and without TLS/SSL).
Let's say Alice connects to the database and sends two commands, without waiting for completion:
SET KEY1 PAYLOAD1
SET KEY2 PAYLOAD2
And let's say the payloads are big, big enough not to fit in one packet.
How can I handle this case? How can I detect that two packets belong to the same command?
I thought about putting a RequestID / SessionID in each packet, but I would need to know where a message gets split, or the client could split before sending, but this means detecting the MTU and it would be inefficient.
Which strategies could I adopt to deal with this?
How can I handle this case?
Use an existing transport protocol (such as TCP or QUIC) – or invent your own transport protocol by looking at how existing ones handle it.
I thought about putting a RequestID / SessionID in each packet, but I would need to know where a message gets split,
Messages do not get split in the way you are thinking of. If an IP packet gets fragmented at the sender, it gets reassembled at the receiver, so the payload (e.g. a UDP datagram) that comes out is exactly the same as what was sent. If a 5000-byte UDP datagram was sent originally, then a single 5000-byte UDP datagram is exactly what will come out of recv(), not multiple smaller ones.
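To see the datagram-boundary behaviour for yourself, here is a minimal sketch using only the Rust standard library. The function name `roundtrip_datagram` is made up for this example. Note that on loopback the MTU is usually large enough that no IP fragmentation happens at all, so this only illustrates the socket semantics (one send, one receive), not fragmentation on a real path.

```rust
use std::net::UdpSocket;
use std::time::Duration;

/// Sends one `len`-byte datagram over loopback and returns how many bytes
/// a single receive call yields on the other socket.
/// (Hypothetical helper, for illustration only.)
fn roundtrip_datagram(len: usize) -> std::io::Result<usize> {
    // Port 0 asks the OS to pick any free port.
    let receiver = UdpSocket::bind("127.0.0.1:0")?;
    let sender = UdpSocket::bind("127.0.0.1:0")?;
    // Avoid blocking forever if the datagram is dropped.
    receiver.set_read_timeout(Some(Duration::from_secs(2)))?;

    let payload = vec![0xAB_u8; len];
    sender.send_to(&payload, receiver.local_addr()?)?;

    // The buffer is larger than any UDP datagram, so nothing is truncated.
    let mut buf = vec![0u8; 65_536];
    let (received, _from) = receiver.recv_from(&mut buf)?;
    Ok(received)
}

fn main() -> std::io::Result<()> {
    let n = roundtrip_datagram(5000)?;
    println!("sent one 5000-byte datagram, one receive returned {n} bytes");
    Ok(())
}
```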
or the client could split before sending, but this means detecting the MTU and it would be inefficient.
That is nevertheless practically the only option. Re-fragmenting by gateways is no longer allowed in IPv6 (only the sender can fragment) so it's always the sender who has to pre-fragment according to path MTU.
But in general IP-level fragmentation is unreliable as some networks outright block fragments and other networks block or ignore the ICMP messages that would help MTU discovery (e.g. DNS experiments with EDNS0 negotiating 4096-byte packets were not entirely successful), so application-level fragmenting might be the only option.
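If you do go the application-level route, the RequestID idea from the question could look roughly like the sketch below. Everything here is an assumption made for illustration: the 8-byte header layout (request ID, fragment index, fragment count), the names `fragment` and `Reassembler`, and the 1200-byte chunk size, which is just a conservative guess and not a discovered path MTU. It also leaves out what a real protocol needs on top, such as timeouts for requests whose fragments never arrive, retransmission and memory limits.

```rust
use std::collections::HashMap;

// Hypothetical header: request_id (4 bytes) + index (2) + total (2), big-endian.
const HEADER_LEN: usize = 8;

/// Splits `payload` into datagrams carrying at most `max_chunk` payload bytes,
/// each prefixed with the header above.
fn fragment(request_id: u32, payload: &[u8], max_chunk: usize) -> Vec<Vec<u8>> {
    assert!(max_chunk > 0, "max_chunk must be positive");
    // An empty payload still produces one (empty) fragment.
    let chunks: Vec<&[u8]> = if payload.is_empty() {
        vec![payload]
    } else {
        payload.chunks(max_chunk).collect()
    };
    assert!(chunks.len() <= u16::MAX as usize, "too many fragments");
    let total = chunks.len() as u16;
    chunks
        .iter()
        .enumerate()
        .map(|(index, chunk)| {
            let mut datagram = Vec::with_capacity(HEADER_LEN + chunk.len());
            datagram.extend_from_slice(&request_id.to_be_bytes());
            datagram.extend_from_slice(&(index as u16).to_be_bytes());
            datagram.extend_from_slice(&total.to_be_bytes());
            datagram.extend_from_slice(chunk);
            datagram
        })
        .collect()
}

#[derive(Default)]
struct Reassembler {
    // request_id -> one slot per expected fragment
    pending: HashMap<u32, Vec<Option<Vec<u8>>>>,
}

impl Reassembler {
    /// Feeds one received datagram. Returns the request ID and full payload
    /// once every fragment of that request has arrived, in any order.
    /// Malformed or inconsistent datagrams are ignored.
    fn push(&mut self, d: &[u8]) -> Option<(u32, Vec<u8>)> {
        if d.len() < HEADER_LEN {
            return None;
        }
        let request_id = u32::from_be_bytes([d[0], d[1], d[2], d[3]]);
        let index = u16::from_be_bytes([d[4], d[5]]) as usize;
        let total = u16::from_be_bytes([d[6], d[7]]) as usize;
        if total == 0 || index >= total {
            return None;
        }
        let slots = self
            .pending
            .entry(request_id)
            .or_insert_with(|| vec![None; total]);
        if slots.len() != total {
            return None; // header disagrees with earlier fragments
        }
        slots[index] = Some(d[HEADER_LEN..].to_vec());
        if slots.iter().any(|slot| slot.is_none()) {
            return None; // still waiting for more fragments
        }
        let parts = self.pending.remove(&request_id)?;
        Some((request_id, parts.into_iter().flatten().flatten().collect()))
    }
}

fn main() {
    let payload: Vec<u8> = (0..5000u32).map(|i| (i % 251) as u8).collect();
    let mut datagrams = fragment(7, &payload, 1200);
    datagrams.reverse(); // simulate out-of-order arrival

    let mut reassembler = Reassembler::default();
    let mut done = None;
    for datagram in &datagrams {
        if let Some(message) = reassembler.push(datagram) {
            done = Some(message);
        }
    }
    let (id, bytes) = done.expect("all fragments were delivered");
    println!("request {id}: reassembled {} bytes", bytes.len());
}
```

Because each fragment carries its own index and count, two interleaved SET commands with different request IDs reassemble independently.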
For TCP specifically, each endpoint advertises a "Maximum Segment Size" during the handshake, usually derived from its own interface MTU, and the sender further limits its segments according to the discovered path MTU (so that each TCP segment fits in a single IP packet without requiring IP-level fragmentation).
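Keep in mind that TCP hands the application a byte stream, so those segment boundaries are not visible to you and cannot be used to tell commands apart. A common way to mark where one command ends on a stream is a length prefix. The sketch below assumes a made-up framing (4-byte big-endian length followed by the payload) and hypothetical helper names `write_frame` and `read_frame`. It uses an in-memory buffer in place of a socket. With a non-blocking event loop you would apply the same header-then-body logic to a buffer that accumulates partial reads instead of calling the blocking `read_exact`.

```rust
use std::io::{self, Cursor, Read, Write};

/// Writes one frame: a 4-byte big-endian length, then the payload.
fn write_frame<W: Write>(out: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > u32::MAX as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "payload too large"));
    }
    out.write_all(&(payload.len() as u32).to_be_bytes())?;
    out.write_all(payload)
}

/// Reads one frame, however the bytes were split in transit.
/// `max_len` guards against a peer announcing an absurd length.
fn read_frame<R: Read>(input: &mut R, max_len: usize) -> io::Result<Vec<u8>> {
    let mut header = [0u8; 4];
    input.read_exact(&mut header)?;
    let len = u32::from_be_bytes(header) as usize;
    if len > max_len {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
    }
    let mut payload = vec![0u8; len];
    input.read_exact(&mut payload)?;
    Ok(payload)
}

fn main() -> io::Result<()> {
    // Alice sends two commands back to back without waiting.
    let mut wire = Vec::new();
    write_frame(&mut wire, b"SET KEY1 PAYLOAD1")?;
    write_frame(&mut wire, b"SET KEY2 PAYLOAD2")?;

    // The receiver recovers them one by one from the byte stream.
    let mut stream = Cursor::new(wire);
    for _ in 0..2 {
        let command = read_frame(&mut stream, 1 << 20)?;
        println!("{}", String::from_utf8_lossy(&command));
    }
    Ok(())
}
```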