protocol-buffers, grpc, protoc, grpc-python, protobuf-python

Reverse engineering .proto files from pb2.py generated with protoc


Is it possible to recover .proto files from a generated _pb2.py with protoc? Is the same reverse engineering possible for gRPC?


Solution

  • The format of the _pb2.py file varies between protobuf-python versions, but most of them contain a field called serialized_pb. It holds the entire structure of the original .proto file, serialized in the FileDescriptorProto format:

    serialized_pb=b'\n\x0c...'
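If you want to pull that blob out of a _pb2.py automatically, the bytes literal can be located by parsing the file with the standard ast module, with no protobuf runtime needed. This is a sketch (the helper name is mine), assuming the generator emitted either the older `serialized_pb=b'...'` keyword argument or the newer `AddSerializedFile(b'...')` call:

```python
import ast

def extract_serialized_pb(source: str) -> bytes:
    """Find the serialized FileDescriptorProto bytes in _pb2.py source.

    Older protobuf-python output passes serialized_pb=b'...' as a keyword
    argument; newer output calls AddSerializedFile(b'...') on the default
    descriptor pool. Both shapes are checked here.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        # Older style: some_call(..., serialized_pb=b'...')
        for kw in node.keywords:
            if kw.arg == 'serialized_pb' and isinstance(kw.value, ast.Constant):
                return kw.value.value
        # Newer style: _descriptor_pool.Default().AddSerializedFile(b'...')
        if (isinstance(node.func, ast.Attribute)
                and node.func.attr == 'AddSerializedFile'
                and node.args and isinstance(node.args[0], ast.Constant)):
            return node.args[0].value
    raise ValueError('no serialized descriptor found')
```

The returned bytes are exactly the FileDescriptorProto payload that the steps below wrap into a FileDescriptorSet.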
    

    This can be passed to the protoc compiler to generate code for other languages. However, it first has to be wrapped in a FileDescriptorSet to match the expected format. This can be done in Python:

    import google.protobuf.descriptor_pb2

    # Wrap the FileDescriptorProto in a FileDescriptorSet, the format
    # that protoc's --descriptor_set_in flag expects.
    fds = google.protobuf.descriptor_pb2.FileDescriptorSet()
    fds.file.append(google.protobuf.descriptor_pb2.FileDescriptorProto())
    fds.file[0].ParseFromString(b'\n\x0c... serialized_pb data ....')

    # Human-readable dump, plus the binary descriptor set for protoc.
    with open('myproto.txt', 'w') as f:
        f.write(str(fds))
    with open('myproto.pb', 'wb') as f:
        f.write(fds.SerializeToString())
    

    The snippet above saves a human-readable version to myproto.txt and a format that is nominally compatible with protoc to myproto.pb. The text representation looks like this:

    file {
      name: "XYZ.proto"
      dependency: "dependencyXYZ.proto"
      message_type {
        name: "MyMessage"
        field {
          name: "myfield"
          number: 1
          label: LABEL_OPTIONAL
          type: TYPE_INT32
        }
        ...
    

    For example, C++ headers can now be generated with:

    protoc --cpp_out=. --descriptor_set_in=myproto.pb XYZ.proto
    

    Note that XYZ.proto must match the name of the file recorded in the descriptor set, which you can check in myproto.txt. However, this method quickly becomes difficult if the file has dependencies, because all of those dependencies have to be collected into the same descriptor set. In some cases it may be easier to use the textual representation to rewrite the .proto file by hand.