Tags: python, reverse-engineering, sequence-alignment

Aligning lists of structs for patch analysis


I am currently reverse engineering a regularly updated multiplayer game. The networking protocol uses a custom serialization framework, and I am now able to recover a lot of information about the messages being exchanged. For each message I can retrieve its complete structure and its message type (e.g. Authentication, Chat, Movement, ...). One problem I am having, however, is that messages and message types are regularly added and removed, and messages may also have fields added or removed. The overall order of messages and message types stays the same!

I am now looking for the best way to use this information to match the updated message structures to the old ones, where I have already identified the meaning of some messages and fields. That is, given two sets of messages like the following, how can I transfer the information that was already reverse engineered (shown as comments in the new messages)?

Old Messages:

Authentication:
 message Login:
  opcode: 1
  fields:
  - string mail
  - string password

 message LoginResponse:
  opcode: 2
  fields:
  - string token

Chat:
 message ChatSend:
  opcode: 3
  fields:
  - string channel
  - string message
 message ChatReceive:
  opcode: 4
  fields:
  - string channel
  - string user
  - string message

New Messages:

Type1: # Authentication
 message Unk1: # Login
  opcode: 1
  fields:
  - string unk1 # mail
  - string unk2 # password
  - string unk3 # new field
  
 message Unk2: # LoginResponse
  opcode: 2
  fields:
  - string unk1 # token

Type2: # new Type
 message Unk3:
  opcode: 3
  fields:
  - Vec3 unk1
  - float unk2

Type3: # Chat
 message Unk4: # ChatSend
  opcode: 4
  fields:
  - string unk1 # channel
  - string unk2 # message
 message Unk5: # new message
  opcode: 5
  fields:
  - string unk1
  - string unk2
 message Unk6: # ChatReceive
  opcode: 6
  fields:
  - string unk1 # channel
  - string unk2 # user
  - string unk3 # message

Some additional information: there are around 60 different message types, and per type there are at most around 100 messages. I would welcome solutions in either pseudocode or Python.


Solution

  • The better and more sustainable solution is to redesign the system that produces the messages so that names and format stay consistent. That would make it more extensible.

    If that is really not an option, here is a possible algorithm you might want to explore: calculate string differences using an edit-distance library such as python-Levenshtein. Let's focus on the outermost data (the types) here; the same idea applies recursively to the inner data (the messages and fields).
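Since an edit distance operates on strings, each type first has to be flattened into one comparable string. A minimal sketch, assuming the restored structures are held as plain Python dicts (the dict layout and the `serialize_type` helper are my assumptions; adapt them to whatever your parser actually produces). One deliberate choice here: message names and opcodes are excluded, since those are renamed and renumbered between versions, but you could just as well serialize the full text.

```python
# Flatten one parsed message type into a single string so that a
# string-distance metric can be applied to it. The dict layout is an
# assumption about how the restored structures might be held in Python.

def serialize_type(type_data):
    lines = []
    for msg in type_data["messages"]:
        lines.append("message")      # names/opcodes change across versions,
        lines.extend(msg["fields"])  # so only the field lists are compared
    return "\n".join(lines)

# Old "Authentication" type from the question:
old_auth = {"messages": [
    {"opcode": 1, "fields": ["string mail", "string password"]},
    {"opcode": 2, "fields": ["string token"]},
]}
print(serialize_type(old_auth))
```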

    Let's say these are the matches between the types in the old and new messages:

    Old Messages   New Messages   Remarks
    O1             N1
                   N2             new
                   N3             new
    O2             N4
    O3             N5
    O4                            deleted
    O5                            deleted
                   N6             new
    O6             N7
                   N8             new
    Where, for example, O1 and N1 correspond to:

    Authentication:
     message Login:
      opcode: 1
      fields:
      - string mail
      - string password
    
    Type1:
     message Unk1:
      opcode: 1
      fields:
      - string unk1
      - string unk2
      - string unk3
    

    For each old message, calculate the Levenshtein distance to each new message and select the smallest distance. The smallest distance signifies the closest equivalent string. Let's assume the numbers below are the calculated distances for each Ox : Ny pair:
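This distance step can be sketched as below. The plain-Python DP implementation of the edit distance only exists to keep the example self-contained; in practice the C-backed `Levenshtein.distance` from the python-Levenshtein package is much faster.

```python
# Pairwise Levenshtein distances between serialized old and new structures,
# then the naive per-row "smallest distance" pick described above.

def levenshtein(a, b):
    """Classic two-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def distance_matrix(old_strings, new_strings):
    return [[levenshtein(o, n) for n in new_strings] for o in old_strings]

def naive_matches(matrix):
    """Index of the closest new entry per old entry (ignores ordering)."""
    return [min(range(len(row)), key=row.__getitem__) for row in matrix]
```

As the tables below show, this naive per-row pick is only the starting point; it still has to be reconciled with the ordering constraint.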

    O#   N1  N2  N3  N4  N5  N6  N7  N8   Smallest Distance
    O1    3  10   7  11  14   8   5  12   N1
    O2    8   9   6   2   9   7   8  17   N4
    O3    9   7   7   9   3  13   7   6   N5
    O4    7   9   8  15  16   6   3  10   N7
    O5    5   7   9   8  11   4  10   5   N6
    O6    9   6   7   8   8  14   1  11   N7

    But since the order of messages stays the same, O4 mapping to N7 while O5 maps to the earlier N6 is wrong. O6 is also wrong, as it maps to the same N7. So we have to perform additional steps before choosing the smallest distance.

    With these additional steps, the updated table would be:

    O#   N1  N2  N3  N4  N5  N6  N7  N8   Smallest Distance
    O1    3  10   7  11  14   8   5  12   N1
    O2    8   9   6   2   9   7   8  17   N4
    O3    9   7   7   9   3  13   7   6   N5
    O4    7   9   8  15  16   6   3  10   N7 → deleted
    O5    5   7   9   8  11   4  10   5   N6 → N8 → deleted
    O6    9   6   7   8   8  14   1  11   N7

    As you can see, O5 was remapped from N6 (distance 4) to N8 (distance 5), since O4 had already used the later N7. But then both O4 and O5 were marked as deleted, because O6 maps to N7 with a smaller distance (1) than the earlier O entries that had claimed N7 or a later N (namely O4 and O5).
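The greedy pick-then-repair procedure above can also be cast as a single dynamic program: since ordering is preserved, matching old entries to new ones is exactly a global sequence alignment (Needleman-Wunsch style), where a gap means "added" or "deleted". A sketch, assuming `dist[i][j]` holds the pairwise Levenshtein distances; the `gap` penalty is a hypothetical tuning knob for how costly it is to declare an entry added or deleted.

```python
# Order-preserving matching of old to new entries as a global alignment.
# dist[i][j] = Levenshtein distance between old entry i and new entry j;
# gap = assumed penalty for an unmatched (added/deleted) entry.

def align(dist, n_old, n_new, gap=10):
    INF = float("inf")
    # dp[i][j] = minimal total cost of aligning old[:i] with new[:j]
    dp = [[INF] * (n_new + 1) for _ in range(n_old + 1)]
    dp[0][0] = 0
    for i in range(n_old + 1):
        for j in range(n_new + 1):
            if dp[i][j] == INF:
                continue
            if i < n_old and j < n_new:  # match old i with new j
                dp[i + 1][j + 1] = min(dp[i + 1][j + 1], dp[i][j] + dist[i][j])
            if i < n_old:                # old i was deleted
                dp[i + 1][j] = min(dp[i + 1][j], dp[i][j] + gap)
            if j < n_new:                # new j was added
                dp[i][j + 1] = min(dp[i][j + 1], dp[i][j] + gap)
    # Backtrack to recover the pairing.
    pairs, i, j = [], n_old, n_new
    while i or j:
        if i and j and dp[i][j] == dp[i - 1][j - 1] + dist[i - 1][j - 1]:
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif i and dp[i][j] == dp[i - 1][j] + gap:
            pairs.append((i - 1, None)); i -= 1   # deleted
        else:
            pairs.append((None, j - 1)); j -= 1   # added
    return pairs[::-1]
```

Pairs with `None` on the new side are deletions; `None` on the old side marks newly added entries. The same alignment works unchanged one level down, matching fields within a matched message.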

    Now, we know that: