A somewhat lengthy question so please bear with me.
I am writing a parser to extract Objective-C metadata entities from input Mach-O binaries, and I want to better understand how pointers to metadata entities are stored/encoded in Mach-O files.
#import <Foundation/Foundation.h>
@interface Person : NSObject
- (void) someMethod;
@end
@implementation Person
- (void) someMethod {}
@end
int main() {
return 0;
}
clang++ -target arm64-apple-ios16 -isysroot /path/to/iphoneos_sdk \
-framework Foundation -o test test.m
Output from objdump -s test
...
Contents of section __DATA_CONST.__objc_classlist:
100008000 c0c00000 00000000 ........
...
Contents of section __DATA.__objc_data:
10000c098 01000000 00001080 01000000 00001080 ................
10000c0a8 00000000 00002080 00000000 00000000 ...... .........
10000c0b8 00c00000 00001000 98c00000 00001000 ................
10000c0c8 02000000 00001080 00000000 00002080 .............. .
10000c0d8 00000000 00000000 48c00000 00000000 ........H.......
Note that the class pointer is stored as 0xc0c0 in the __objc_classlist section, while the class is actually located at 0x0001 0000 c0c0 in the __objc_data section.
clang++ -framework Foundation -o test test.m
Output from objdump -s test
...
Contents of section __DATA_CONST.__objc_classlist:
100004000 d8800000 01000000 ........
...
Contents of section __DATA.__objc_data:
1000080b0 00000000 00000000 00000000 00000000 ................
1000080c0 00000000 00000000 00000000 00000000 ................
1000080d0 00800000 01000000 b0800000 01000000 ................
1000080e0 00000000 00000000 00000000 00000000 ................
1000080f0 00000000 00000000 68800000 01000000 ........h.......
In this case, the class pointer is stored as 0x0001 0000 80d8 in the __objc_classlist section, and we can use that address to go to where the class is actually stored in the __objc_data section.
I also noticed other ways in which pointers are encoded. For example, I came across a case for ARM64 targets where a pointer to a metadata entity was stored as 0x0000 9000 0000 3faf while the actual location is 0x0001 0000 3faf.
So, my question is: how does Objective-C/clang encode MD entity pointers in Mach-O files?
You're looking at non-linked data. You need to be aware of dynamic linking operations in order to meaningfully parse this.
So, my question is: how does Objective-C/clang encode MD entity pointers in Mach-O files?
It depends. Specifically, it depends on what runtime linking format for binds and rebases your binary uses. Broadly speaking, there are two formats:
Dyld opcodes.
This is the "old" format and has been used since macOS 10.6. In this format, all fixup metadata is stored separately from the data it applies to, which is why you get clean pointers in your x86_64 binary, surrounded by zeroes. As its name suggests, it's an opcode-based sequence of instructions, which is stored somewhere in __LINKEDIT and is pointed to by the LC_DYLD_INFO/LC_DYLD_INFO_ONLY load command in the Mach-O header. You can dump this info specifically with xcrun dyld_info -opcodes:
% xcrun dyld_info -opcodes test.macos
test.macos [x86_64]:
-opcodes:
rebase opcodes:
0x0000 REBASE_OPCODE_DO_REBASE_IMM_TIMES
0x0018 REBASE_OPCODE_DO_REBASE_ADD_ADDR_ULEB
0x0050 REBASE_OPCODE_DO_REBASE_IMM_TIMES
0x0058 REBASE_OPCODE_DO_REBASE_IMM_TIMES
0x0060 REBASE_OPCODE_DO_REBASE_IMM_TIMES
0x0080 REBASE_OPCODE_DO_REBASE_IMM_TIMES
0x0088 REBASE_OPCODE_DO_REBASE_IMM_TIMES
0x00D0 REBASE_OPCODE_DO_REBASE_IMM_TIMES
0x00D8 REBASE_OPCODE_DO_REBASE_IMM_TIMES
0x00F8 REBASE_OPCODE_DO_REBASE_IMM_TIMES
regular bind opcodes:
0x00E0 BIND_OPCODE_DO_BIND
0x00B0 BIND_OPCODE_DO_BIND
0x00B8 BIND_OPCODE_DO_BIND
0x00C0 BIND_OPCODE_DO_BIND_ADD_ADDR_IMM_SCALED
0x00E8 BIND_OPCODE_DO_BIND
no lazy bind opcodes
no weak bind opcodes
The load command and dyld opcodes are defined in mach-o/loader.h. Use of the opcodes has been somewhat detailed by Jonathan Levin, though for the actual implementation, see MachOAnalyzer.cpp and MachOLayout.cpp in the dyld source.
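For a parser, the rebase half of the opcode stream mostly boils down to "which slots hold pointers". The following is a minimal sketch of an interpreter for it, assuming the opcode values from mach-o/loader.h as I remember them (worth double-checking against the header) and 8-byte pointers. The function names `read_uleb128` and `rebase_locations` are made up for illustration, and the rebase type is ignored.

```python
# Opcode values as defined in mach-o/loader.h (verify against your SDK).
REBASE_OPCODE_MASK = 0xF0
REBASE_IMMEDIATE_MASK = 0x0F
REBASE_OPCODE_DONE = 0x00
REBASE_OPCODE_SET_TYPE_IMM = 0x10
REBASE_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB = 0x20
REBASE_OPCODE_ADD_ADDR_ULEB = 0x30
REBASE_OPCODE_ADD_ADDR_IMM_SCALED = 0x40
REBASE_OPCODE_DO_REBASE_IMM_TIMES = 0x50
REBASE_OPCODE_DO_REBASE_ULEB_TIMES = 0x60
REBASE_OPCODE_DO_REBASE_ADD_ADDR_ULEB = 0x70
REBASE_OPCODE_DO_REBASE_ULEB_TIMES_SKIPPING_ULEB = 0x80

MASK64 = 0xFFFFFFFFFFFFFFFF


def read_uleb128(data, pos):
    """Decode one ULEB128 value; return (value, position after it)."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return result, pos


def rebase_locations(opcodes, pointer_size=8):
    """Return (segment_index, segment_offset) for every slot to rebase."""
    locations = []
    seg, offset, pos = 0, 0, 0
    while pos < len(opcodes):
        byte = opcodes[pos]
        pos += 1
        op = byte & REBASE_OPCODE_MASK
        imm = byte & REBASE_IMMEDIATE_MASK
        if op == REBASE_OPCODE_DONE:
            break
        elif op == REBASE_OPCODE_SET_TYPE_IMM:
            pass  # rebase type is not needed just to locate the slots
        elif op == REBASE_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB:
            seg = imm
            offset, pos = read_uleb128(opcodes, pos)
        elif op == REBASE_OPCODE_ADD_ADDR_ULEB:
            delta, pos = read_uleb128(opcodes, pos)
            offset = (offset + delta) & MASK64
        elif op == REBASE_OPCODE_ADD_ADDR_IMM_SCALED:
            offset += imm * pointer_size
        elif op == REBASE_OPCODE_DO_REBASE_IMM_TIMES:
            for _ in range(imm):
                locations.append((seg, offset))
                offset += pointer_size
        elif op == REBASE_OPCODE_DO_REBASE_ULEB_TIMES:
            count, pos = read_uleb128(opcodes, pos)
            for _ in range(count):
                locations.append((seg, offset))
                offset += pointer_size
        elif op == REBASE_OPCODE_DO_REBASE_ADD_ADDR_ULEB:
            locations.append((seg, offset))
            delta, pos = read_uleb128(opcodes, pos)
            offset = (offset + delta + pointer_size) & MASK64
        elif op == REBASE_OPCODE_DO_REBASE_ULEB_TIMES_SKIPPING_ULEB:
            count, pos = read_uleb128(opcodes, pos)
            skip, pos = read_uleb128(opcodes, pos)
            for _ in range(count):
                locations.append((seg, offset))
                offset += skip + pointer_size
        else:
            raise ValueError(f"unknown rebase opcode 0x{op:02x}")
    return locations


# Hand-made stream (not taken from the binary above): pointer type,
# segment 2 offset 0, one rebase, rebase then skip 0x30, two rebases.
stream = bytes([0x11, 0x22, 0x00, 0x51, 0x70, 0x30, 0x52, 0x00])
locations = rebase_locations(stream)
```

Each returned slot then already contains the unslid pointer in the file, so the parser only needs to read the 8 bytes at that location. Bind opcodes follow the same pattern but additionally carry a symbol name and library ordinal.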
Chained fixups.
This is the "new" format, first introduced in iOS 12 on arm64e. In this format, some of the fixup metadata is stored alongside the target data it applies to, which is what you're seeing in your arm64 binary.
This format was initially only used for arm64e binaries, and whether this is used depends on the target architecture and minimum OS version, but iOS 16 and macOS 13 targets now seem to use it for all architectures (I'm guessing your default macOS target is 12.x or lower).
The way this works is by first segmenting the binary into pages (which may or may not match the hardware page size), and recording the offset of the first value that needs to be operated on in each page. The data at that offset then encodes the information needed to construct a valid pointer at load-time, as well as the offset to the next such value, thereby forming the "fixup chain". Cramming all of this into a 64-bit (or sometimes even 32-bit) value is of course no small feat, so there are many subtly different formats that can be picked from, each optimised for a special use case (see mach-o/fixup-chains.h). Generally, though, the top bit tells you whether it's a bind or a rebase, some number of bits in the middle encode the distance to the next pointer, pointer authentication stuff, etc., and the remaining bits encode the offset from the base of the image (for rebases) or the index into the import symbol table (for binds). Also, only one format can be chosen for the entire binary, so you will likely only have to implement two or three, and will never encounter the rest.
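To make the bit layout concrete, here is a rough sketch of a decoder for one specific format, pointer format 6 (the "generic 64-bit, target vmoffset" one). The field widths are my reading of `dyld_chained_ptr_64_rebase` and `dyld_chained_ptr_64_bind` in mach-o/fixup-chains.h, so check them against the header before relying on this. The function names and the image base of 0x100000000 are assumptions for illustration.

```python
def decode_chained_ptr_64_offset(raw):
    """Split one raw 64-bit chained fixup value (pointer format 6) into fields."""
    next_stride = (raw >> 51) & 0xFFF       # 12 bits, counted in 4-byte strides
    if raw >> 63:                           # top bit set: bind
        return {
            "kind": "bind",
            "ordinal": raw & 0xFFFFFF,      # 24 bits: index into the imports table
            "addend": (raw >> 24) & 0xFF,   # 8 bits
            "next": next_stride,
        }
    return {
        "kind": "rebase",
        "target": raw & 0xFFFFFFFFF,        # 36 bits: offset from the image base
        "high8": (raw >> 36) & 0xFF,        # 8 bits: restored as the top byte
        "next": next_stride,
    }


def rebase_to_vmaddr(raw, image_base=0x100000000):
    """Rebuild the unslid pointer for a rebase; image_base is an assumption."""
    fields = decode_chained_ptr_64_offset(raw)
    if fields["kind"] != "rebase":
        raise ValueError("not a rebase")
    return (image_base + fields["target"]) | (fields["high8"] << 56)


# Raw values as they appear in the arm64 binary from the question.
class_entry = decode_chained_ptr_64_offset(0x000000000000C0C0)
superclass = decode_chained_ptr_64_offset(0x8010000000000001)
class_addr = rebase_to_vmaddr(0x000000000000C0C0)   # 0x10000C0C0
```

This is why the __objc_classlist entry reads as 0xc0c0 while the class lives at 0x0001 0000 c0c0: the stored value is only an offset, and the image base is added at load time. Binds carry no address at all, just an ordinal to look up in the imports table.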
At that point you're left with the list of page offsets that lead to the first value on each page. If chained fixups are used in conjunction with dyld opcodes, then this is encoded somehow (I never looked at it) in the dyld opcode sequence with BIND_OPCODE_THREADED. If this is used stand-alone, then there is an LC_DYLD_CHAINED_FIXUPS load command in the Mach-O header, which points to a struct dyld_chained_fixups_header, which in turn points to a few more structs, encoded as offsets from itself. One of those holds the page starts, another holds the list of imported symbols, etc. See mach-o/fixup-chains.h again for those.
You can use xcrun dyld_info -fixup_chains and xcrun dyld_info -fixup_chain_details to examine this:
% xcrun dyld_info -fixup_chains test.ios
test.ios [arm64]:
-fixup_chains:
seg[2]:
page_size: 0x4000
pointer_format: 6 (generic 64-bit, 4-byte stride, target vmoffset )
segment_offset: 0x00008000
max_pointer: 0x00000000
pages: 1
start[ 0]: 0x0000
seg[3]:
page_size: 0x4000
pointer_format: 6 (generic 64-bit, 4-byte stride, target vmoffset )
segment_offset: 0x0000C000
max_pointer: 0x00000000
pages: 1
start[ 0]: 0x0018
% xcrun dyld_info -fixup_chain_details test.ios
test.ios [arm64]:
-fixup_chain_details:
0x00008000: raw: 0x000000000000C0C0 rebase: (next: 000, target: 0x0000000C0C0, high8: 0x00)
0x0000C018: raw: 0x0090000000007F9C rebase: (next: 018, target: 0x00000007F9C, high8: 0x00)
0x0000C060: raw: 0x0010000000007F9C rebase: (next: 002, target: 0x00000007F9C, high8: 0x00)
0x0000C068: raw: 0x0050000000007F88 rebase: (next: 010, target: 0x00000007F88, high8: 0x00)
0x0000C090: raw: 0x0010000000007FA3 rebase: (next: 002, target: 0x00000007FA3, high8: 0x00)
0x0000C098: raw: 0x8010000000000001 bind: (next: 002, ordinal: 000001, addend: 0)
0x0000C0A0: raw: 0x8010000000000001 bind: (next: 002, ordinal: 000001, addend: 0)
0x0000C0A8: raw: 0x8020000000000000 bind: (next: 004, ordinal: 000000, addend: 0)
0x0000C0B8: raw: 0x001000000000C000 rebase: (next: 002, target: 0x0000000C000, high8: 0x00)
0x0000C0C0: raw: 0x001000000000C098 rebase: (next: 002, target: 0x0000000C098, high8: 0x00)
0x0000C0C8: raw: 0x8010000000000002 bind: (next: 002, ordinal: 000002, addend: 0)
0x0000C0D0: raw: 0x8020000000000000 bind: (next: 004, ordinal: 000000, addend: 0)
0x0000C0E0: raw: 0x000000000000C048 rebase: (next: 000, target: 0x0000000C048, high8: 0x00)
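The tail of that chain can be reproduced by hand from the __objc_data bytes in the question's objdump output. Below is a small sketch of the chain walk, assuming pointer format 6 (a 12-bit "next" field at bit 51, 4-byte stride); `walk_chain` is a made-up name. A real parser would start at the page start from the fixups header (0x0018 here) rather than mid-chain at 0xC098 as this excerpt does.

```python
import struct


def walk_chain(data, start, stride=4):
    """Yield (offset, raw) for each fixup, following 'next' until it is 0."""
    offset = start
    while True:
        (raw,) = struct.unpack_from("<Q", data, offset)  # little-endian 64-bit
        yield offset, raw
        next_stride = (raw >> 51) & 0xFFF
        if next_stride == 0:          # a zero 'next' terminates the chain
            return
        offset += next_stride * stride


# The 80 bytes of __objc_data from the arm64 objdump, starting at 0x10000c098.
objc_data = bytes.fromhex(
    "0100000000001080" "0100000000001080"
    "0000000000002080" "0000000000000000"
    "00c0000000001000" "98c0000000001000"
    "0200000000001080" "0000000000002080"
    "0000000000000000" "48c0000000000000"
)

fixups = list(walk_chain(objc_data, 0))
for offset, raw in fixups:
    kind = "bind" if raw >> 63 else "rebase"
    print(f"0x{0xC098 + offset:08X}: raw: 0x{raw:016X} {kind}")
```

The walk visits the same eight locations that dyld_info lists from 0x0000C098 to 0x0000C0E0, and skips the two all-zero slots that are not part of the chain.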
In the more general case, you could also use xcrun dyld_info -fixups to display any sort of bind or rebase target, no matter whether it uses dyld opcodes or fixup chains under the hood. But I suppose that won't help you much for the purpose of writing a parser.