c binary reverse-engineering arm64

How do I organize my binary parser's code?


I'm building a 64-bit Mach-O binary parser for a Ghidra-style reverse engineering tool. I want the program to annotate its output with human-readable labels showing where we are in the file, using only the file format's own identifiers.

Let me show you an example:

0x100000004: cf fa ed fe
0x100000008: 0c 00 00 01
0x10000000c: 00 00 00 00
0x100000010: 02 00 00 00
0x100000014: 11 00 00 00
0x100000018: 20 04 00 00
0x10000001c: 85 00 20 00
0x100000020: 00 00 00 00
0x100000024: 19 00 00 00 LC_SEGMENT_64

Here, LC_SEGMENT_64 is printed beside the word where the load command starts. I know this because the LC_SEGMENT_64 identifier is 0x19. But if I do this for every possible Mach-O identifier, it's going to get messy. How do I implement this in a good way, without using 50 thousand if-else statements?

My code atm:

#include <errno.h>
#include <mach-o/loader.h>
#include <mach/machine.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BUFFER_SIZE 4
#define ERROR(msg) fprintf(stderr, "ERROR: %s | %s\n", msg, strerror(errno));

void HexPrinter(uint32_t buffer, FILE *binary) {
  uint64_t mem_addr = 0x10000000;
  fread(&buffer, 1, BUFFER_SIZE, binary);
  if (buffer != MH_MAGIC_64) {
    ERROR("Not a 64-bit Mach-O binary");
    return;
  } else {
    printf("0x%llx: %02x %02x %02x %02x\n", mem_addr, (buffer & 0xFF),
           ((buffer >> 8) & 0xFF), ((buffer >> 16) & 0xFF),
           ((buffer >> 24) & 0xFF));
  }

  while ((fread(&buffer, 1, BUFFER_SIZE, binary)) == BUFFER_SIZE) {
    printf("0x%llx: %02x %02x %02x %02x\t", mem_addr, (buffer & 0xFF),
           ((buffer >> 8) & 0xFF), ((buffer >> 16) & 0xFF),
           ((buffer >> 24) & 0xFF));

/* I don't want to write one of these for each identifier */
    if (buffer == LC_SEGMENT_64) {
      printf("LC_SEGMENT_64\n");
    } else {
      printf("\n");
    }

    mem_addr += BUFFER_SIZE;
  }

  if (ferror(binary)) {
    /* main() owns the FILE * and closes it; closing it here as well
       would be a double fclose. */
    ERROR("Error reading file");
  }
}

int main(int argc, char *argv[1]) {
  FILE *binary;
  char *pathname = argv[1];

  uint32_t buffer;

  if (!argv[1]) {
    ERROR("Usage: ./nibBrev <pathname>");
    return (-1);
  }

  binary = fopen(pathname, "r");
  if (!binary) {
    ERROR("Couldn't open file");
    return (-1);
  }

  HexPrinter(buffer, binary);

  fclose(binary);
  return 0;
}

Solution

  • I read the loader.h file linked from this question for the Mach-O format you are working with, and my answer targets what I read there. If that link is out of date or incorrect, adjust based on the header you are actually using.

    The constants you mention are sequential starting at #define LC_SEGMENT 0x1 and ending at #define LC_BUILD_VERSION 0x32. Create a mapping from these constants to indices in a table of strings like this.

    /* Number of identifiers listed in loader.h, plus one for the
       unused 0th slot. Derive it from the last identifier in the
       list of constants, or count by hand; either works. */
    #define NUM_IDENTIFIERS (LC_BUILD_VERSION + 1)
    static const char *const identifier_strings[NUM_IDENTIFIERS] = {
        /* Probably unused 0th slot. */
        [0] = "",
        [LC_SEGMENT] = "LC_SEGMENT",
        [LC_SYMTAB] = "LC_SYMTAB",
        /* ...continues for rest of identifier constants */
    
        /* careful with constants OR'd with LC_REQ_DYLD bit.
           Or do this to all values to be safe, maybe? */
        [LC_RPATH & ~LC_REQ_DYLD] = "LC_RPATH",
    
        /* ...continues */
        [LC_BUILD_VERSION] = "LC_BUILD_VERSION",
    };
    

    The strings now always stay in sync with the constants, and the table can be reorganized in any order for readability, because the designated-initializer notation [index] = "string" ties each string to its constant rather than to its position.

    Now a lookup function might look like this.

    const char *lookup(uint32_t identifier) {
        /* While the identifiers are sequential be mindful of
           the LC_REQ_DYLD bit that has been OR'd to some in
           the list. That would mess up the indexing. See the
           comment above this constant for more info. */
        identifier &= ~LC_REQ_DYLD;
        if (identifier && identifier < NUM_IDENTIFIERS) {
            return identifier_strings[identifier];
        }
        return NULL;
    }
    

    Then use it the same way as in the other answer.

    const char *id = lookup(buffer);
    if (id) {
        puts(id);
    } else {
        /* puts() takes a string, not a char; print a bare newline. */
        putchar('\n');
    }
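
    For experimenting away from macOS, the table and lookup above can be condensed into a sketch that compiles without <mach-o/loader.h>. The DEMO_ names are hypothetical and exist only to avoid clashing with the real macros. The numeric values are copied by hand from loader.h, so treat them as assumptions and check them against your header. Only a few identifiers are filled in.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values, copied by hand from <mach-o/loader.h>.
   The DEMO_ prefix is hypothetical. */
#define DEMO_LC_REQ_DYLD      0x80000000u
#define DEMO_LC_SEGMENT       0x1u
#define DEMO_LC_SYMTAB        0x2u
#define DEMO_LC_SEGMENT_64    0x19u
#define DEMO_LC_RPATH         (0x1cu | DEMO_LC_REQ_DYLD)
#define DEMO_LC_BUILD_VERSION 0x32u

#define DEMO_NUM_IDENTIFIERS (DEMO_LC_BUILD_VERSION + 1)

/* Slots without a designated initializer are implicitly NULL. */
static const char *const demo_identifier_strings[DEMO_NUM_IDENTIFIERS] = {
    [DEMO_LC_SEGMENT]                   = "LC_SEGMENT",
    [DEMO_LC_SYMTAB]                    = "LC_SYMTAB",
    [DEMO_LC_SEGMENT_64]                = "LC_SEGMENT_64",
    /* Strip the LC_REQ_DYLD bit so the index stays inside the table. */
    [DEMO_LC_RPATH & ~DEMO_LC_REQ_DYLD] = "LC_RPATH",
    [DEMO_LC_BUILD_VERSION]             = "LC_BUILD_VERSION",
};

/* Returns the name for a load command value, or NULL when the value
   is out of range or has no entry in this (partial) table. */
const char *demo_lookup(uint32_t identifier) {
    identifier &= ~DEMO_LC_REQ_DYLD;
    if (identifier < DEMO_NUM_IDENTIFIERS) {
        return demo_identifier_strings[identifier];
    }
    return NULL;
}
```

    With this partial table, demo_lookup(0x19) yields "LC_SEGMENT_64", and an unlisted value such as 0x3 yields NULL, so the caller can fall back to printing a bare newline.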
    

    Warning: I assume you are only interested in the LC_* identifiers. Neither Ted Lyngmo's answer nor mine would work if you added more constants from loader.h to print, because the same values are reused by many different #defines throughout the file.
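
    The same collision affects the hex dump itself: in the example output, the word 02 00 00 00 at 0x100000010 is the header's filetype field, yet a word-by-word comparison would label it LC_SYMTAB. One way around this, sketched below under stated assumptions, is to look up only the words that really are a load command's cmd field, by reading ncmds from the header and stepping by each command's cmdsize. The DEMO_ names and function names are hypothetical. The layout (a 32-byte mach_header_64 with ncmds at offset 16, followed by commands that start with cmd and cmdsize) is taken from loader.h and should be checked against your copy. The sketch also assumes a little-endian file already loaded into memory.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values from <mach-o/loader.h>; DEMO_ names are hypothetical. */
#define DEMO_MH_MAGIC_64 0xfeedfacfu
#define DEMO_HEADER_SIZE 32u /* assumed sizeof(struct mach_header_64) */
#define DEMO_NCMDS_OFF   16u /* assumed offset of ncmds in the header */

/* Reads a little-endian 32-bit value regardless of host byte order. */
static uint32_t demo_read_le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Walks the load commands in buf and stores each cmd value in
   cmds_out (at most max_out entries). Returns the number of
   well-formed commands walked, or 0 if the header is not a
   64-bit Mach-O header. */
size_t demo_walk_load_commands(const uint8_t *buf, size_t len,
                               uint32_t *cmds_out, size_t max_out) {
    if (len < DEMO_HEADER_SIZE ||
        demo_read_le32(buf) != DEMO_MH_MAGIC_64) {
        return 0;
    }
    uint32_t ncmds = demo_read_le32(buf + DEMO_NCMDS_OFF);
    size_t off = DEMO_HEADER_SIZE;
    size_t found = 0;
    for (uint32_t i = 0; i < ncmds; i++) {
        if (len - off < 8) {
            break; /* no room left for cmd + cmdsize */
        }
        uint32_t cmd = demo_read_le32(buf + off);
        uint32_t cmdsize = demo_read_le32(buf + off + 4);
        if (cmdsize < 8 || cmdsize > len - off) {
            break; /* malformed size; stop instead of overrunning */
        }
        if (found < max_out) {
            cmds_out[found] = cmd;
        }
        found++;
        off += cmdsize; /* skip the payload; its words are not LC_* ids */
    }
    return found;
}
```

    Each value stored in cmds_out can then be passed to the lookup function, while every other word in the dump is printed without a label.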