When I execute the following Python code on a pcap file:
    if tcp.dport == 80:
        try:
            http = dpkt.http.Request(tcp.data)
        except (dpkt.dpkt.NeedData):
            continue
        except (dpkt.dpkt.UnpackError):
            continue
        if http.method == 'POST':
            print('POST Message')
Packets such as the following ones create a problem:
These are a single HTTP POST message split into two TCP segments, each sent in a separate packet. However, because the first segment is shown as TCP only and the second one is recognised as HTTP, it seems that dpkt.http.Request fails when it tries to read the first segment as HTTP.
So far no problem. It is OK for that to fail, as the first segment is not a full HTTP message. The issue is that the second segment does not seem to be read at all ("POST Message" is not printed). It is ignored as if it did not exist. The only explanation I can think of is that dpkt automatically reads the second segment together with the first, because it recognises that both are segments of the same message.
The issue is that, although both TCP segments are read at once (following the above assumption), the resulting tcp.data is not recognised as an HTTP packet; it is still treated as TCP only, because the first segment of the message is a TCP-only packet.
So what should I do to read the HTTP header and data from such a pcap file?
dpkt only works at the packet level. dpkt.http.Request expects the full HTTP request as input, not only the part contained in the current packet. This means you have to collect the input from all packets belonging to the connection, i.e. reassemble the TCP data stream.
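As a rough illustration of collecting the input per connection, here is a minimal sketch. The helper name feed, the flow key and the parameters parse and incomplete are made up for this example and are not part of dpkt. It assumes the payloads arrive in order, complete and without duplicates, which is not guaranteed in real captures.

```python
from collections import defaultdict


def feed(buffers, flow, payload, parse, incomplete):
    """Append payload to the buffer of `flow` and try to parse the whole buffer.

    buffers    -- dict mapping a flow key to a bytearray (hypothetical layout)
    flow       -- hashable key, e.g. (src_ip, src_port, dst_ip, dst_port)
    parse      -- callable taking bytes and returning a parsed message
    incomplete -- exception type(s) meaning "not enough data yet"

    Returns the parsed message, or None if more data is needed.
    """
    buffers[flow] += payload
    try:
        message = parse(bytes(buffers[flow]))
    except incomplete:
        return None          # keep the buffer and wait for the next segment
    del buffers[flow]        # message complete: start fresh for this flow
    return message


# One buffer per connection; created on first use.
buffers = defaultdict(bytearray)
```

With dpkt, one would presumably call this from the packet loop as feed(buffers, (ip.src, tcp.sport, ip.dst, tcp.dport), tcp.data, dpkt.http.Request, (dpkt.dpkt.NeedData, dpkt.dpkt.UnpackError)) and check the returned request's method once the result is not None. Treating UnpackError as "incomplete" is a simplification: a flow that never contains valid HTTP would be buffered indefinitely.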
Reassembling is not simply concatenating packets. It also means making sure that there are no lost packets and no duplicates, and that the packets get reassembled in the proper order, which might not be the order on the wire. Essentially you need to do everything the OS kernel would do before putting the extracted payload into a socket buffer.
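To make the ordering, duplicate and loss handling concrete, here is a minimal sketch of reassembling one direction of one connection from (sequence number, payload) pairs. The function name reassemble and its parameters are hypothetical. It assumes the initial sequence number is known (for example from the SYN, plus one) and ignores sequence number wrap-around, SYN/FIN accounting and other details a real TCP stack handles.

```python
def reassemble(segments, initial_seq):
    """Rebuild the contiguous byte stream of one direction of a connection.

    segments    -- iterable of (seq, payload) tuples, in any order
    initial_seq -- sequence number of the first payload byte (assumed known)

    Returns the bytes from initial_seq up to the first gap.
    Assumption: sequence numbers do not wrap around 2**32.
    """
    stream = bytearray()
    next_seq = initial_seq
    # Sort by sequence number so that out-of-order segments are handled.
    for seq, payload in sorted(segments, key=lambda s: s[0]):
        if not payload:
            continue                  # pure ACKs carry no data
        end = seq + len(payload)
        if end <= next_seq:
            continue                  # duplicate or full retransmission
        if seq > next_seq:
            break                     # gap: a segment is missing, stop here
        # Skip the bytes we already have (partial overlap), keep the rest.
        stream += payload[next_seq - seq:]
        next_seq = end
    return bytes(stream)
```

The returned bytes could then be handed to dpkt.http.Request. If the function stops at a gap, the result is only a prefix of the stream, so the parser may still report that it needs more data.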
For an example of how parts of this can be done, see Follow HTTP Stream (with decompression). Note that the example there blindly assumes that the packets are already in order, complete and without duplicates - an assumption which is not guaranteed in real life.