
Using NdisFIndicateReceiveNetBufferLists for every packet vs. chaining them all together into one receive indication?


I have an NDIS driver where I send received packets to a user-mode service, and the service marks the packets that are OK (not malicious). I then iterate over the packets that are approved to be received, convert each of them back into a proper NET_BUFFER_LIST with one NET_BUFFER, and indicate them one by one using NdisFIndicateReceiveNetBufferLists.

This caused a problem with large file transfers over SMB (copying files from shares): the transfer speed dropped significantly.

As a workaround, I now chain all of the NBLs that are OK together (instead of indicating them one by one) and then indicate all of them at once via NdisFIndicateReceiveNetBufferLists.
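
Roughly, the chaining looks like this (a simplified sketch; filterHandle, approvedNbls, and approvedCount are placeholder names rather than my actual code, and the port number and receive flags are simplified to NDIS_DEFAULT_PORT_NUMBER and 0):

```c
PNET_BUFFER_LIST head = NULL;
PNET_BUFFER_LIST tail = NULL;
ULONG i;

for (i = 0; i < approvedCount; i++)
{
    PNET_BUFFER_LIST nbl = approvedNbls[i];   // one NBL with one NET_BUFFER each
    NET_BUFFER_LIST_NEXT_NBL(nbl) = NULL;     // detach from any previous chain

    if (head == NULL)
    {
        head = nbl;                           // first approved NBL starts the chain
    }
    else
    {
        NET_BUFFER_LIST_NEXT_NBL(tail) = nbl; // append to the chain
    }
    tail = nbl;
}

if (head != NULL)
{
    // One indication for the whole batch instead of approvedCount separate calls.
    NdisFIndicateReceiveNetBufferLists(filterHandle,
                                       head,
                                       NDIS_DEFAULT_PORT_NUMBER,
                                       approvedCount,
                                       0 /* ReceiveFlags */);
}
```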

My question is: will this change cause any issues? Is there any difference between indicating X NBLs one by one versus chaining them together and indicating them all at once (given that most of them might belong to different flows/applications)?

Also, the benefit of chaining packets together is much greater for multi-packet receive than for multi-packet send via FilterSendNetBufferLists. Why is that?


Solution

  • A NET_BUFFER represents one single network frame. (With some appropriate hand-waving for LSO/RSC.)

    A NET_BUFFER_LIST is a collection of related NET_BUFFERs. All the NET_BUFFERs on the same NET_BUFFER_LIST belong to the same "traffic flow" (more on that later); they all share the same metadata and will all have the same offloads performed on them. So we use the NET_BUFFER_LIST to group related packets and to have them share metadata. (There is a short sketch of walking this NBL/NB structure at the end of this answer.)

    The datapath generally operates on batches of multiple NET_BUFFER_LISTs. The entire batch is only grouped together for performance reasons; there's not a lot of implied relation between multiple NBLs within a batch. Exception: most datapath routines take a Flags parameter that can hold flags that make some claims about all the NBLs in a batch, for example, NDIS_RECEIVE_FLAGS_SINGLE_ETHER_TYPE.

    So to summarize, you can indeed safely group multiple NET_BUFFER_LISTs into a single indication, and this is particularly important for perf. You can group unrelated NBLs together, if you like. However, if you are combining batches of NBLs, make sure you clear out any NDIS_XXX_FLAGS_SINGLE_XXX style flags. (Unless, of course, you know that the flags' promise still holds. For example, if you're combining 2 batches of NBLs that both had the NDIS_RECEIVE_FLAGS_SINGLE_ETHER_TYPE flag, and if you verify that the first NBL in each batch has the same EtherType, then it is actually safe to preserve the NDIS_RECEIVE_FLAGS_SINGLE_ETHER_TYPE flag.) A sketch of this flag handling appears at the end of this answer.

    However, note that you generally cannot combine multiple NET_BUFFERs into the same NET_BUFFER_LIST, unless you control the application that generated the payload and you know that the NET_BUFFERs' payloads belong to the same traffic flow. The exact semantics of a traffic flow are a little fuzzy down in the NDIS layer, but you can imagine it means that any NDIS-level hardware offload can safely treat each packet as the same. For example, an IP checksum offload needs to know that each packet has the same pseudo-header. If all the packets belong to the same TCP or UDP socket, then they can be treated as the same flow.

    Also, the benefit of chaining packets together is much greater for multi-packet receive than for multi-packet send via FilterSendNetBufferLists. Why is that?

    Receive is the expensive path, for two reasons. First, the OS has to spend CPU to demux the raw stream of packets coming in from the network. The network could send us packets from any random socket, or packets that don't match any socket at all, and the OS has to be prepared for any possibility. Second, the receive path handles untrusted data, so it has to be cautious about parsing.

    In comparison, the send path is super cheap: the packets just fall down to the miniport driver, which sets up a DMA, and they're blasted out to hardware. Nobody in the send path really cares what's actually in the packet. (The firewall already ran before NDIS saw the packets, so you don't see that cost; and if the miniport is doing any offload, that's paid on the hardware's built-in processor, so it doesn't show up on any CPU you can see in Task Manager.)

    So if you take a batch of 100 packets and break it into 100 calls of 1 packet each on the receive path, the OS has to grind through 100 calls of some expensive parsing functions. Meanwhile, 100 calls through the send path isn't great either, but it costs only a fraction of the CPU that the receive path does.
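
    To make the NET_BUFFER / NET_BUFFER_LIST relationship concrete, here is a minimal sketch of how a filter's receive handler might walk an indicated batch. NetBufferLists stands for the chain handed to FilterReceiveNetBufferLists, and the other variable names are illustrative; the accessor macros and the checksum offload structure are the standard NDIS ones. Note that the offload metadata is read once per NBL, so it covers every NET_BUFFER chained under that NBL, which is exactly why those NET_BUFFERs must belong to the same flow.

    ```c
    PNET_BUFFER_LIST nbl;
    PNET_BUFFER nb;

    // The outer chain is NET_BUFFER_LISTs; each NBL carries one or more
    // NET_BUFFERs (frames) that all share that NBL's metadata.
    for (nbl = NetBufferLists; nbl != NULL; nbl = NET_BUFFER_LIST_NEXT_NBL(nbl))
    {
        // Per-NBL out-of-band info, e.g. the checksum offload results,
        // applies to every frame on this NBL.
        NDIS_TCP_IP_CHECKSUM_NET_BUFFER_LIST_INFO csumInfo;
        csumInfo.Value = NET_BUFFER_LIST_INFO(nbl, TcpIpChecksumNetBufferListInfo);

        for (nb = NET_BUFFER_LIST_FIRST_NB(nbl); nb != NULL; nb = NET_BUFFER_NEXT_NB(nb))
        {
            // One NET_BUFFER is one network frame (modulo LSO/RSC); a real
            // driver would map and inspect the frame data here. All frames
            // on this NBL share csumInfo and the rest of the NBL metadata.
            ULONG frameLength = NET_BUFFER_DATA_LENGTH(nb);
            UNREFERENCED_PARAMETER(frameLength);
            UNREFERENCED_PARAMETER(csumInfo);
        }
    }
    ```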
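
    Here is also a sketch of the flag handling when combining two received batches before re-indicating them. batchA, batchB, flagsA, flagsB, mergedCount, filterHandle, and the FirstNblsShareEtherType check are all placeholder names for illustration; the point is simply that the NDIS_RECEIVE_FLAGS_SINGLE_ETHER_TYPE bit is preserved only when its promise is known to still hold.

    ```c
    PNET_BUFFER_LIST last = batchA;
    ULONG mergedFlags = 0;

    // Append batchB to the end of batchA so both become one indication.
    while (NET_BUFFER_LIST_NEXT_NBL(last) != NULL)
    {
        last = NET_BUFFER_LIST_NEXT_NBL(last);
    }
    NET_BUFFER_LIST_NEXT_NBL(last) = batchB;

    // Keep NDIS_RECEIVE_FLAGS_SINGLE_ETHER_TYPE only if both batches carried
    // it and the first NBL of each batch really has the same EtherType;
    // otherwise leave all NDIS_XXX_FLAGS_SINGLE_XXX bits cleared.
    if ((flagsA & NDIS_RECEIVE_FLAGS_SINGLE_ETHER_TYPE) != 0 &&
        (flagsB & NDIS_RECEIVE_FLAGS_SINGLE_ETHER_TYPE) != 0 &&
        FirstNblsShareEtherType(batchA, batchB))   // hypothetical helper
    {
        mergedFlags |= NDIS_RECEIVE_FLAGS_SINGLE_ETHER_TYPE;
    }

    NdisFIndicateReceiveNetBufferLists(filterHandle,
                                       batchA,        // head of the merged chain
                                       NDIS_DEFAULT_PORT_NUMBER,
                                       mergedCount,   // total NBLs in both batches
                                       mergedFlags);
    ```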