I'm in the process of writing an x86_64 disassembler, to get a better understanding of the assembly-encoding rules. I have a working version, and I understand most things about prefixes, ModR/M and so on. But I'm a bit unsure what the smartest way is to detect the type of instruction (after all prefixes have been evaluated, and before the ModR/M byte is checked).
Initially, after seeing that an instruction like "push" encodes the register in the first 3 bits of the opcode, I started parsing instructions like this:
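#include <stdint.h>

// Bit-field layout is implementation-defined; this assumes low bits are allocated first.
struct EncodedInstructionByte
{
    uint8_t encoding : 3; // low 3 bits: register index
    uint8_t header : 5;   // high 5 bits: opcode "header"
};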
This seems to work for most instructions, at first. push could be identified by checking "header == 0x0A", pop with 0x0B, and so forth. Though when parsing the setcc/jcc-instructions, they seem to have an actual "header" of 4 bits, and encode the cc-code in their last 4 bits:
struct EncodedInstructionByteCC
{
    uint8_t ccCode : 4; // low 4 bits: condition code
    uint8_t header : 4; // high 4 bits: opcode "header"
};
Some of the instructions didn't even seem to fit that scheme at all, like call-indirect (which is just 0xFF). Note that I was far from encountering or detecting all possible instructions, so my understanding at that point was pretty limited.
Now after looking at the opcode table in the Intel 64 Architecture Software Developer's Manual (Table B13), it seems that the general format is 4-4 bits. For example, "push r" is given as
0101 0 reg
whereas "pop r" would then be
0101 1 reg
Meaning that I could check if the last four bits == 0101b/0x05, and then check the next bit to see if it's push or pop, followed by the reg-index stored in the first 3 bits.
Does that actually make sense, or is the 0101-"header" for both push and pop purely incidental?
I guess I'm having a bit of trouble forming this into one pointed question. Aside from wanting to understand, my end goal is to have a condensed and smart opcode-detection scheme. Seeing how extensive the x86_64 instruction set is, I would not want to manually check all 8 opcode variants for "push r" and "pop r", but rather have a general detection/parsing scheme that works for the entire instruction set. So I'm wondering if starting to identify opcodes by looking at the last 4 bits for a first grouping, and then discerning further, makes any sense; or if this is too broad an attempt at categorization, and I'm seeing patterns that are not really intended. If my approach is not viable, I would appreciate it if someone could suggest an alternative opcode-detection scheme.
You probably just want a table indexed by opcode byte that says how to handle it, with only a few different handlers. That way your code is not too messy, but the data array initializer looks like an opcode map (http://ref.x86asm.net/coder64.html). Actually a few different maps: one for the 0F xx 2-byte opcodes, others for the 0F 3A xx and 0F 38 xx 3-byte opcodes. And then there are prefixes, e.g. F3 90 pause looks like rep nop.
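As a rough illustration of picking the map before the table lookup, here is a minimal sketch in C. The names `opcode_map` and `select_opcode_map` are made up for this example, and it assumes legacy/REX prefixes have already been consumed (VEX/EVEX are not handled).

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical names: which opcode map the final opcode byte indexes into. */
enum opcode_map { MAP_ONE_BYTE, MAP_0F, MAP_0F38, MAP_0F3A };

/* Consumes the escape bytes (0F, 0F 38, 0F 3A) and the opcode byte itself.
 * Returns the number of bytes consumed, or 0 if the input is truncated.
 * Assumes prefixes were already skipped by the caller. */
size_t select_opcode_map(const uint8_t *code, size_t len,
                         enum opcode_map *map, uint8_t *opcode)
{
    size_t i = 0;
    *map = MAP_ONE_BYTE;
    if (i < len && code[i] == 0x0F) {
        *map = MAP_0F;
        i++;
        if (i < len && (code[i] == 0x38 || code[i] == 0x3A)) {
            *map = (code[i] == 0x38) ? MAP_0F38 : MAP_0F3A;
            i++;
        }
    }
    if (i >= len)
        return 0;               /* ran out of bytes before the opcode */
    *opcode = code[i];
    return i + 1;
}
```

The returned map would then select which 256-entry table to index with the opcode byte.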
For the push reg short forms, that's 8 entries pointing to that 5:3 handler, or for cmovcc/jcc/setcc, 16 entries each pointing at the condition-code handler.
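A table like that could be sketched roughly as follows. This is only an illustration: `opcode_entry`, `handler_fn`, the `ROW8`/`ROW16` macros and `decode_one_byte_opcode` are hypothetical names, only a few rows of the one-byte map are filled in, REX.B is ignored, and the Jcc handler formats the mnemonic without reading the rel8 operand.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical handler type: writes the decoded text for one opcode byte. */
typedef void (*handler_fn)(uint8_t opcode, const char *mnemonic,
                           char *buf, size_t buflen);

struct opcode_entry {
    const char *mnemonic;   /* base mnemonic, e.g. "push" or "j" */
    handler_fn  handler;    /* NULL: entry not filled in (in this sketch) */
};

static const char *const reg64[8] = {
    "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"
};
static const char *const cond[16] = {
    "o", "no", "b", "ae", "e", "ne", "be", "a",
    "s", "ns", "p", "np", "l", "ge", "le", "g"
};

/* 5:3 split: register index in the low 3 bits (REX.B ignored here). */
static void handle_reg_low3(uint8_t opcode, const char *mnemonic,
                            char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%s %s", mnemonic, reg64[opcode & 7]);
}

/* 4:4 split: condition code in the low 4 bits. */
static void handle_cc_low4(uint8_t opcode, const char *mnemonic,
                           char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%s%s", mnemonic, cond[opcode & 0x0F]);
}

/* Fill 8 or 16 consecutive entries with the same mnemonic and handler. */
#define ROW8(base, name, fn) \
    [(base) + 0] = {name, fn}, [(base) + 1] = {name, fn}, \
    [(base) + 2] = {name, fn}, [(base) + 3] = {name, fn}, \
    [(base) + 4] = {name, fn}, [(base) + 5] = {name, fn}, \
    [(base) + 6] = {name, fn}, [(base) + 7] = {name, fn}
#define ROW16(base, name, fn) ROW8(base, name, fn), ROW8((base) + 8, name, fn)

static const struct opcode_entry one_byte_map[256] = {
    ROW8(0x50, "push", handle_reg_low3),
    ROW8(0x58, "pop",  handle_reg_low3),
    ROW16(0x70, "j",   handle_cc_low4),   /* short Jcc rel8 */
};

/* Returns 1 and fills buf if the opcode has an entry, else 0. */
int decode_one_byte_opcode(uint8_t opcode, char *buf, size_t buflen)
{
    const struct opcode_entry *e = &one_byte_map[opcode];
    if (!e->handler)
        return 0;
    e->handler(opcode, e->mnemonic, buf, buflen);
    return 1;
}
```

The point of the design is that the bit-splitting logic lives in two small handlers, while the per-opcode knowledge stays in the data.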
The entries could be a struct that also includes the mnemonic as a string, except for opcode bytes like FF and others where the 3-bit ModRM /r field is effectively another three opcode bits (see "How to read the Intel Opcode notation").
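For the FF case, a small sketch of using the ModRM reg field as three extra opcode bits might look like this. The table name `group5` and the function `group5_mnemonic` are made up for the example; the mnemonic spellings follow the ref.x86asm.net style, and /7 is left as NULL because it is not a defined encoding.

```c
#include <stdint.h>
#include <stddef.h>

/* Opcode FF: the ModRM reg field (bits 5:3) selects the instruction.
 * Index 7 (FF /7) is undefined, so it is NULL here. */
static const char *const group5[8] = {
    "inc", "dec", "call", "callf", "jmp", "jmpf", "push", NULL
};

/* Returns the mnemonic for FF /digit, or NULL for an invalid encoding. */
const char *group5_mnemonic(uint8_t modrm)
{
    return group5[(modrm >> 3) & 7];
}
```

So `FF D0` (ModRM = 0xD0, reg field 2) decodes as an indirect call, and `FF E0` (reg field 4) as an indirect jmp.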
You probably don't want separate maps for combinations of prefixes, so you might record which prefixes have been seen before the opcode as bit flags (or as counts, if you want to be able to print rep rep rep add eax, ecx for f3 f3 f3 01 c8, which has 3 meaningless REP/REPZ prefixes).
Currently that's meaningless, reserved for future use, but in practice inapplicable REP prefixes will be silently ignored. (So when CPU vendors want to add extensions that are backwards-compatible with existing CPUs, like performance hints, or like how tzcnt can run as rep bsf and give the same result for inputs other than 0, they can pick an encoding that includes a REP prefix.) No extension has ever required 2 of the same prefix, or two from the same category like F3 F2 (REPZ / REPNZ) on the same instruction.
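Recording prefixes as counts could be sketched like this. `prefix_state` and `scan_prefixes` are hypothetical names, and the sketch deliberately leaves out segment overrides and REX to stay short.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-instruction prefix record; counts rather than flags so
 * that redundant prefixes can be printed back faithfully. */
struct prefix_state {
    uint8_t rep;       /* F3: REP / REPZ */
    uint8_t repne;     /* F2: REPNE / REPNZ */
    uint8_t opsize;    /* 66: operand-size override */
    uint8_t addrsize;  /* 67: address-size override */
    uint8_t lock;      /* F0: LOCK */
};

/* Counts leading legacy prefixes and returns how many bytes were consumed.
 * Segment overrides and REX are not handled in this sketch. */
size_t scan_prefixes(const uint8_t *code, size_t len, struct prefix_state *st)
{
    size_t i = 0;
    *st = (struct prefix_state){0};
    for (; i < len; i++) {
        switch (code[i]) {
        case 0xF3: st->rep++;      break;
        case 0xF2: st->repne++;    break;
        case 0x66: st->opsize++;   break;
        case 0x67: st->addrsize++; break;
        case 0xF0: st->lock++;     break;
        default:   return i;       /* first non-prefix byte: stop */
        }
    }
    return i;
}
```

For `f3 f3 f3 01 c8` this consumes 3 bytes and leaves `rep == 3`, so the printer can choose between emitting `rep` three times or once.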
Anyway, your struct that says what to do next to disassemble the current instruction, given an opcode byte, might have pointers to more data structures describing a different instruction if there's a rep prefix. For example, the entry for 90 is nop with no prefixes, a 2-byte nop (xchg ax,ax) with a 66 prefix, or pause with an F3 prefix. So the struct insn *with_repz member might be a pointer to a struct with const char *mnemonic = "pause";. Or maybe a table of meaningful prefixes that you linear-search? There are lots of ways to go about this.
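One possible shape for that struct, as a sketch only: `with_repz` is the member named above, while `with_66`, `insn_90` and `select_variant` are names invented for this example. Letting F3 win over 66 when both are present is an assumption of the sketch.

```c
#include <stddef.h>

/* Entry for one opcode, with optional prefix-specific variants. */
struct insn {
    const char *mnemonic;
    const struct insn *with_66;    /* variant when a 66 prefix was seen */
    const struct insn *with_repz;  /* variant when an F3 prefix was seen */
};

static const struct insn insn_pause  = { "pause",      NULL, NULL };
static const struct insn insn_nop_66 = { "xchg ax,ax", NULL, NULL };

/* Opcode 90: nop, a 2-byte nop with 66, or pause with F3. */
const struct insn insn_90 = { "nop", &insn_nop_66, &insn_pause };

/* Picks the variant matching the prefixes seen; falls back to the base
 * entry when no special variant exists (the prefix is then just ignored
 * or printed as-is). Assumption: F3 takes priority over 66. */
const struct insn *select_variant(const struct insn *base,
                                  int saw_66, int saw_f3)
{
    if (saw_f3 && base->with_repz)
        return base->with_repz;
    if (saw_66 && base->with_66)
        return base->with_66;
    return base;
}
```

Most entries would leave both variant pointers NULL, so the common case costs nothing beyond two pointer checks.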
Note that VEX and EVEX in 32-bit mode (and 16-bit protected mode) overlap invalid encodings of instructions like les and lds. VEX and EVEX prefixes don't decode in 16-bit real mode; in that case the invalid encodings will #UD fault instead of being VEX prefixes. Apparently some DOS software like SoftPC used C4 intentionally as a trap, and later NTVDM did too. In 64-bit mode, les, lds, and bound don't exist, so C4 and C5 bytes always begin a VEX prefix, and 62 always EVEX. (Slipping between the cracks of invalid encodings in other modes is why they have some fields NOTed and can only encode 16 or 32 registers in 64-bit mode.)
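A decoder that handles both modes could classify these lead bytes roughly as below. The names `lead_kind` and `classify_lead` are hypothetical. Outside 64-bit mode the sketch relies on the fact that les, lds and bound require a memory operand, so a following byte whose top two bits are 11 (which would be a register-form ModRM) is the invalid encoding that VEX/EVEX reuse. Real mode, where the prefixes don't decode at all, is not modeled.

```c
#include <stdint.h>

enum lead_kind {
    LEAD_OTHER,   /* not C4, C5 or 62 */
    LEAD_LEGACY,  /* les / lds / bound with a memory operand */
    LEAD_VEX3,    /* C4: 3-byte VEX prefix */
    LEAD_VEX2,    /* C5: 2-byte VEX prefix */
    LEAD_EVEX     /* 62: EVEX prefix */
};

/* byte0 is the candidate lead byte, byte1 the byte that follows it.
 * is_64bit_mode is nonzero for 64-bit mode; otherwise 32-bit or 16-bit
 * protected mode is assumed. */
enum lead_kind classify_lead(uint8_t byte0, uint8_t byte1, int is_64bit_mode)
{
    enum lead_kind prefix;
    switch (byte0) {
    case 0xC4: prefix = LEAD_VEX3; break;
    case 0xC5: prefix = LEAD_VEX2; break;
    case 0x62: prefix = LEAD_EVEX; break;
    default:   return LEAD_OTHER;
    }
    if (is_64bit_mode)
        return prefix;              /* legacy instructions don't exist */
    /* mod == 11 would be a register operand, invalid for les/lds/bound */
    return ((byte1 & 0xC0) == 0xC0) ? prefix : LEAD_LEGACY;
}
```

This is also where the inverted fields show up in practice: with those bits stored NOTed, every prefix encodable outside 64-bit mode has the top two bits of the second byte set.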