I am currently reverse engineering a regularly updated multiplayer game. The networking protocol uses a custom serialization framework, and I am now able to recover a lot of information about the messages being exchanged. For each message I can retrieve its complete structure and its type (e.g. Authentication, Chat, Movement, ...). One problem I am having, however, is that messages and message types are regularly added and removed, and messages may have fields added or removed. The overall order of messages and message types stays the same!
I am now looking for the best way to use this information to match the updated message structures to the old ones, where I have already identified the meaning of some messages and fields. That is, given two sets of messages like the following, how can I transfer the information that was already reverse engineered (shown as comments in the new messages)?
Old Messages:

```
Authentication:
  message Login:
    opcode: 1
    fields:
      - string mail
      - string password
  message LoginResponse:
    opcode: 2
    fields:
      - string token
Chat:
  message ChatSend:
    opcode: 3
    fields:
      - string channel
      - string message
  message ChatReceive:
    opcode: 4
    fields:
      - string channel
      - string user
      - string message
```
New Messages:

```
Type1: # Authentication
  message Unk1: # Login
    opcode: 1
    fields:
      - string unk1 # mail
      - string unk2 # password
      - string unk3 # new field
  message Unk2: # LoginResponse
    opcode: 2
    fields:
      - string unk1 # token
Type2: # new Type
  message Unk3:
    opcode: 3
    fields:
      - Vec3 unk1
      - float unk2
Type3: # Chat
  message Unk4: # ChatSend
    opcode: 4
    fields:
      - string unk1 # channel
      - string unk2 # message
  message Unk5: # new message
    opcode: 5
    fields:
      - string unk1
      - string unk2
  message Unk6: # ChatReceive
    opcode: 6
    fields:
      - string unk1 # channel
      - string unk2 # user
      - string unk3 # message
```
Some additional information: there are around 60 different message types, and per message type there are at most around 100 messages. I would welcome solutions in either pseudocode or Python.
The better and more sustainable solution is to redesign the system that produces the messages so that names and formats stay consistent across updates. That would make it more extensible.

If that is really not an option, here is a possible algorithm you might want to explore, based on calculating string differences with a library such as python-Levenshtein. Let's focus on the outermost data (the types); the same concept applies recursively to the inner data (the messages and fields).
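As a building block, here is a minimal sketch of the distance computation. The `flatten` helper and the dict shape it expects are assumptions for illustration; the point is to serialize each definition into one canonical string before comparing.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, kept to one row of state."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def flatten(msg: dict) -> str:
    # Opcodes shift between versions, so compare on the field list only.
    # The {"fields": [...]} shape is an assumed example layout.
    return "|".join(msg["fields"])
```

Since the new versions only carry placeholder names (`unk1`, `unk2`, ...), flattening to field types and counts rather than names is what makes the distances meaningful.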
Let's say these are the matches between the types in the old and new messages:

| Old Messages | New Messages | Remarks |
|---|---|---|
| O1 | N1 | |
| | N2 | new |
| | N3 | new |
| O2 | N4 | |
| O3 | N5 | |
| O4 | | deleted |
| O5 | | deleted |
| | N6 | new |
| O6 | N7 | |
| | N8 | new |
Where, for example:

```
O1 = Authentication:
       message Login:
         opcode: 1
         fields:
           - string mail
           - string password

N1 = Type1:
       message Unk1:
         opcode: 1
         fields:
           - string unk1
           - string unk2
           - string unk3
```
For each old message, calculate the Levenshtein distance to each new message and select the smallest distance; the smallest distance signifies the closest equivalent string. Let's assume the numbers below are the calculated distances per (O#, N#) pair:
| O# | N1 | N2 | N3 | N4 | N5 | N6 | N7 | N8 | Smallest Distance |
|---|---|---|---|---|---|---|---|---|---|
| O1 | 3 | 10 | 7 | 11 | 14 | 8 | 5 | 12 | N1 |
| O2 | 8 | 9 | 6 | 2 | 9 | 7 | 8 | 17 | N4 |
| O3 | 9 | 7 | 7 | 9 | 3 | 13 | 7 | 6 | N5 |
| O4 | 7 | 9 | 8 | 15 | 16 | 6 | 3 | 10 | N7 |
| O5 | 5 | 7 | 9 | 8 | 11 | 4 | 10 | 5 | N6 |
| O6 | 9 | 6 | 7 | 8 | 8 | 14 | 1 | 11 | N7 |
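The "Smallest Distance" column above is just a per-row argmin; a minimal sketch, using the example matrix:

```python
# Distance matrix from the table above: rows are O1..O6, columns N1..N8.
DIST = [
    [3, 10, 7, 11, 14, 8, 5, 12],
    [8, 9, 6, 2, 9, 7, 8, 17],
    [9, 7, 7, 9, 3, 13, 7, 6],
    [7, 9, 8, 15, 16, 6, 3, 10],
    [5, 7, 9, 8, 11, 4, 10, 5],
    [9, 6, 7, 8, 8, 14, 1, 11],
]

def naive_matches(dist):
    # For each old row, pick the column index of the smallest distance
    # (ties resolved to the first occurrence). 0-based: 0 means N1.
    return [min(range(len(row)), key=row.__getitem__) for row in dist]

print(naive_matches(DIST))  # [0, 3, 4, 6, 5, 6] -> N1, N4, N5, N7, N6, N7
```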
But since the order of messages stays the same, O4 mapping to N7 while the later O5 maps to the earlier N6 is wrong. O6 is also wrong, as it maps to the same N7. So before accepting the smallest distance for the current O, we have to perform additional steps:

1. Check whether any earlier O is mapped to an N that is equal to or later than the currently chosen N (e.g. here O5 maps to N6 while the earlier O4 is mapped to the later N7).
2. Check whether those earlier O's are closer to their mapped N than the current O is to its N.
3. If an earlier O is closer to its N, we can't change that mapping, because its similarity is stronger than the current one. Instead, we choose the 2nd smallest distance for the current O and repeat the same steps.
4. If the current O is closer to the currently chosen N than any of the earlier O's are to their respective N's, then we keep the currently chosen N for the current O, and mark all earlier O's that use an equal or later N as deleted.

With these additional steps, the updated table would be:
| O# | N1 | N2 | N3 | N4 | N5 | N6 | N7 | N8 | Smallest Distance |
|---|---|---|---|---|---|---|---|---|---|
| O1 | 3 | 10 | 7 | 11 | 14 | 8 | 5 | 12 | N1 |
| O2 | 8 | 9 | 6 | 2 | 9 | 7 | 8 | 17 | N4 |
| O3 | 9 | 7 | 7 | 9 | 3 | 13 | 7 | 6 | N5 |
| O4 | 7 | 9 | 8 | 15 | 16 | 6 | 3 | 10 | |
| O5 | 5 | 7 | 9 | 8 | 11 | 4 | 10 | 5 | |
| O6 | 9 | 6 | 7 | 8 | 8 | 14 | 1 | 11 | N7 |
As you can see, O5 was first remapped from N6 (distance of 4) to N8 (distance of 5), since the earlier O4 used the later N7. But then both O4 and O5 were marked as deleted, because O6 was mapped to N7, which is closer (distance of 1) than any of the earlier O's that used an N equal to or later than N7 (namely O4 and O5).
Now, we know that:

- O1 is N1
- O2 is N4
- O3 is N5
- O4 is deleted
- O5 is deleted
- O6 is N7

The remaining N's are newly added, namely N2, N3, N6 and N8.