Tags: python, reverse-engineering, sequence-alignment

Aligning lists of structs for patch analysis


I am currently reverse engineering a regularly updated multiplayer game. The networking protocol uses a custom serialization framework, and I am now able to recover a lot of information about the messages being exchanged. For each message I can retrieve its complete structure and its message type (e.g. Authentication, Chat, Movement, ...). One problem I am having, however, is that messages and message types are regularly added and removed, and messages may also have fields added or removed. The overall order of messages and message types stays the same!

I am now looking for the best way to use this information to match the updated message structures to the old ones, where I have already identified the meaning of some messages and fields. That is, given two sets of messages like the following, how can I transfer the information that was already reverse engineered (shown as comments in the new messages)?

Old Messages:

Authentication:
 message Login:
  opcode: 1
  fields:
  - string mail
  - string password

 message LoginResponse:
  opcode: 2
  fields:
  - string token

Chat:
 message ChatSend:
  opcode: 3
  fields:
  - string channel
  - string message
 message ChatReceive:
  opcode: 4
  fields:
  - string channel
  - string user
  - string message

New Messages:

Type1: # Authentication
 message Unk1: # Login
  opcode: 1
  fields:
  - string unk1 # mail
  - string unk2 # password
  - string unk3 # new field
  
 message Unk2: # LoginResponse
  opcode: 2
  fields:
  - string unk1 # token

Type2: # new Type
 message Unk3:
  opcode: 3
  fields:
  - Vec3 unk1
  - float unk2

Type3: # Chat
 message Unk4: # ChatSend
  opcode: 4
  fields:
  - string unk1 # channel
  - string unk2 # message
 message Unk5: # new message
  opcode: 5
  fields:
  - string unk1
  - string unk2
 message Unk6: # ChatReceive
  opcode: 6
  fields:
  - string unk1 # channel
  - string unk2 # user
  - string unk3 # message

Some additional information: there are around 60 different message types, and per type there are at most around 100 messages. I would welcome solutions in either pseudocode or Python.


Solution

  • The better and more sustainable solution is to redesign the system that produces the messages so that names and format stay consistent. That would make it more extensible.

    If that is really not an option, here is a possible algorithm you might want to explore: calculate string differences using an edit-distance library such as python-Levenshtein. Let's focus on the outermost data (the types) here; the same idea applies recursively to the inner data (the messages and fields).
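Since an edit distance operates on strings, each type first has to be flattened into one comparable string. A minimal sketch, assuming the restored structures are held as plain Python dicts (the dict layout and the `serialize_type` helper are my assumptions; adapt them to whatever your parser actually produces). One deliberate choice here: message names and opcodes are excluded, since those are renamed and renumbered between versions, but you could just as well serialize the full text.

```python
# Flatten one parsed message type into a single string so that a
# string-distance metric can be applied to it. The dict layout is an
# assumption about how the restored structures might be held in Python.

def serialize_type(type_data):
    lines = []
    for msg in type_data["messages"]:
        lines.append("message")      # names/opcodes change across versions,
        lines.extend(msg["fields"])  # so only the field lists are compared
    return "\n".join(lines)

# Old "Authentication" type from the question:
old_auth = {"messages": [
    {"opcode": 1, "fields": ["string mail", "string password"]},
    {"opcode": 2, "fields": ["string token"]},
]}
print(serialize_type(old_auth))
```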

    Let's say these are the matches between the types in the old and new messages:

    Old Messages   New Messages   Remarks
    O1             N1
                   N2             new
                   N3             new
    O2             N4
    O3             N5
    O4                            deleted
    O5                            deleted
                   N6             new
    O6             N7
                   N8             new
    Where, for example, O1 and N1 correspond to:

    Authentication:
     message Login:
      opcode: 1
      fields:
      - string mail
      - string password
    
    Type1:
     message Unk1:
      opcode: 1
      fields:
      - string unk1
      - string unk2
      - string unk3
    

    For each old message, calculate the Levenshtein distance to each new message and select the smallest distance. The smallest distance signifies the closest equivalent string. Let's assume the numbers below are the calculated distances for each Ox : Ny pair:
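This distance step can be sketched as below. The plain-Python DP implementation of the edit distance only exists to keep the example self-contained; in practice the C-backed `Levenshtein.distance` from the python-Levenshtein package is much faster.

```python
# Pairwise Levenshtein distances between serialized old and new structures,
# then the naive per-row "smallest distance" pick described above.

def levenshtein(a, b):
    """Classic two-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def distance_matrix(old_strings, new_strings):
    return [[levenshtein(o, n) for n in new_strings] for o in old_strings]

def naive_matches(matrix):
    """Index of the closest new entry per old entry (ignores ordering)."""
    return [min(range(len(row)), key=row.__getitem__) for row in matrix]
```

As the tables below show, this naive per-row pick is only the starting point; it still has to be reconciled with the ordering constraint.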

    O#   N1  N2  N3  N4  N5  N6  N7  N8   Smallest Distance
    O1    3  10   7  11  14   8   5  12   N1
    O2    8   9   6   2   9   7   8  17   N4
    O3    9   7   7   9   3  13   7   6   N5
    O4    7   9   8  15  16   6   3  10   N7
    O5    5   7   9   8  11   4  10   5   N6
    O6    9   6   7   8   8  14   1  11   N7

    But since the order of messages stays the same, O4 mapping to N7 while O5 maps to the earlier N6 is wrong. O6 is also wrong, as it maps to the same N7. So we have to perform additional steps before choosing the smallest distance.

    With these additional steps, the updated table would be:

    O#   N1  N2  N3  N4  N5  N6  N7  N8   Smallest Distance
    O1    3  10   7  11  14   8   5  12   N1
    O2    8   9   6   2   9   7   8  17   N4
    O3    9   7   7   9   3  13   7   6   N5
    O4    7   9   8  15  16   6   3  10   N7 → deleted
    O5    5   7   9   8  11   4  10   5   N6 → N8 → deleted
    O6    9   6   7   8   8  14   1  11   N7

    As you can see, O5 was remapped from N6 (distance 4) to N8 (distance 5), since O4 had already used the later N7. But then both O4 and O5 were marked as deleted, because O6 maps to N7 with a smaller distance (1) than the earlier O entries that had claimed N7 or a later N (namely O4 and O5).
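The greedy pick-then-repair procedure above can also be cast as a single dynamic program: since ordering is preserved, matching old entries to new ones is exactly a global sequence alignment (Needleman-Wunsch style), where a gap means "added" or "deleted". A sketch, assuming `dist[i][j]` holds the pairwise Levenshtein distances; the `gap` penalty is a hypothetical tuning knob for how costly it is to declare an entry added or deleted.

```python
# Order-preserving matching of old to new entries as a global alignment.
# dist[i][j] = Levenshtein distance between old entry i and new entry j;
# gap = assumed penalty for an unmatched (added/deleted) entry.

def align(dist, n_old, n_new, gap=10):
    INF = float("inf")
    # dp[i][j] = minimal total cost of aligning old[:i] with new[:j]
    dp = [[INF] * (n_new + 1) for _ in range(n_old + 1)]
    dp[0][0] = 0
    for i in range(n_old + 1):
        for j in range(n_new + 1):
            if dp[i][j] == INF:
                continue
            if i < n_old and j < n_new:  # match old i with new j
                dp[i + 1][j + 1] = min(dp[i + 1][j + 1], dp[i][j] + dist[i][j])
            if i < n_old:                # old i was deleted
                dp[i + 1][j] = min(dp[i + 1][j], dp[i][j] + gap)
            if j < n_new:                # new j was added
                dp[i][j + 1] = min(dp[i][j + 1], dp[i][j] + gap)
    # Backtrack to recover the pairing.
    pairs, i, j = [], n_old, n_new
    while i or j:
        if i and j and dp[i][j] == dp[i - 1][j - 1] + dist[i - 1][j - 1]:
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif i and dp[i][j] == dp[i - 1][j] + gap:
            pairs.append((i - 1, None)); i -= 1   # deleted
        else:
            pairs.append((None, j - 1)); j -= 1   # added
    return pairs[::-1]
```

Pairs with `None` on the new side are deletions; `None` on the old side marks newly added entries. The same alignment works unchanged one level down, matching fields within a matched message.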

    Now, we know that: