[SOLVED] How does the processor differentiate between OpCodes and Data?

How does the processor differentiate between OpCodes and Data?

I am trying to write a disassembler, and I was wondering how the processor differentiates OpCodes from Data-Bytes.

For example, this is the byte representation of "Hello World": 0x48 0x65 0x6c 0x6c 0x6f 0x20 0x57 0x6f 0x72 0x6c 0x64 0x00

But how does the processor "know" that it is saying "Hello World" and not actually this: _ _ INS INS OUTS AND _ OUTS JB INS _ ADD

An explanation is very welcome.

Solution

The processor knows because the entry points are known. It starts at a known address and decodes in execution order, which is also how you should disassemble a variable-length instruction set. With a fixed-length instruction set you can just walk through memory linearly from the entry point, but with variable-length instructions you need to follow execution order. This is not foolproof, of course: it is pretty easy to trip up a disassembler, so be aware that it can happen and keep track of what you have decoded. I generally make a table that marks, for every byte, whether it is the entry point of an instruction (the opcode, in some ISAs) or one of the non-entry bytes that follow it. That way, if a branch lands in the middle of an instruction, I can stop that path of the disassembler there. Naturally, you have to cover all the possible paths.
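
To make the table idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the five-opcode ISA, the names `ISA`, `trace`, `OP_START` and `OP_TAIL`, and the one-byte absolute branch targets are all made up, not taken from any real processor or tool. It follows execution order from an entry point, marks each byte as an instruction entry or a non-entry byte, and stops a path when a branch lands on a non-entry byte.

```python
# Hypothetical toy ISA (not a real processor):
# opcode -> (mnemonic, total length in bytes, control-flow kind)
ISA = {
    0x00: ("NOP", 1, "seq"),
    0x01: ("LOAD", 2, "seq"),     # LOAD imm8
    0x02: ("JMP", 2, "jump"),     # unconditional jump to an absolute address
    0x03: ("JZ", 2, "branch"),    # conditional: target and fall-through
    0xFF: ("HALT", 1, "stop"),
}

UNKNOWN, OP_START, OP_TAIL = 0, 1, 2  # per-byte classification


def trace(mem, entry):
    """Disassemble mem in execution order, starting at entry."""
    kind_of = [UNKNOWN] * len(mem)  # the table: one mark per byte
    listing = {}                    # address -> decoded text
    conflicts = []                  # addresses where a path had to stop
    work = [entry]                  # paths still to be followed

    while work:
        pc = work.pop()
        # A target outside mem simply ends the path.
        while 0 <= pc < len(mem):
            if kind_of[pc] == OP_START:
                break               # already decoded from another path
            if kind_of[pc] == OP_TAIL:
                conflicts.append(pc)  # landed mid-instruction
                break
            if mem[pc] not in ISA:
                conflicts.append(pc)  # not a valid opcode
                break
            name, length, kind = ISA[mem[pc]]
            tail = range(pc + 1, pc + length)
            if pc + length > len(mem) or any(kind_of[a] != UNKNOWN for a in tail):
                conflicts.append(pc)  # truncated or overlapping instruction
                break

            kind_of[pc] = OP_START
            for a in tail:
                kind_of[a] = OP_TAIL
            operand = mem[pc + 1] if length == 2 else None
            listing[pc] = name if operand is None else f"{name} 0x{operand:02x}"

            if kind == "stop":
                break
            if kind == "jump":
                pc = operand
                continue
            if kind == "branch":
                work.append(operand)  # follow the taken path later
            pc += length

    return listing, kind_of, conflicts


# LOAD 0x03; JZ 0x01; HALT  (the JZ targets the operand byte of LOAD)
listing, kind_of, conflicts = trace(bytes([0x01, 0x03, 0x03, 0x01, 0xFF]), 0)
for addr in sorted(listing):
    print(f"{addr:02x}: {listing[addr]}")
print("stopped paths at:", conflicts)  # conflicts == [1]
```

Bytes that stay `UNKNOWN` after every path has been followed were never reached as code, so they are data, padding, or code the trace could not discover (for example, targets of computed jumps).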

With respect to opcodes vs. data: so long as the toolchain and the programmer did their jobs right, one instruction hands off to the next, jumping over data areas as needed.
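
The hand-off can be seen in a tiny fetch-and-execute loop. This is only a sketch of a made-up machine, not any real processor: the opcodes `0x02` (JMP), `0x04` (PRINTZ) and `0xFF` (HALT), and the function name `run`, are hypothetical. The "Hello World" bytes sit in the middle of the program, but the program counter never lands on them. They are only ever read as data by the PRINTZ instruction.

```python
def run(mem, entry):
    """Execute the toy program; return (addresses fetched as opcodes, output text)."""
    pc = entry
    fetched = []  # every address the "processor" treated as an opcode
    out = []
    while True:
        op = mem[pc]
        fetched.append(pc)
        if op == 0x02:        # JMP addr: continue at the given address
            pc = mem[pc + 1]
        elif op == 0x04:      # PRINTZ addr: read zero-terminated bytes as data
            a = mem[pc + 1]
            while mem[a] != 0:
                out.append(chr(mem[a]))
                a += 1
            pc += 2
        elif op == 0xFF:      # HALT
            break
        else:
            raise ValueError(f"illegal opcode 0x{op:02x} at address {pc}")
    return fetched, "".join(out)


# 0x00: JMP 0x0E          jump over the data
# 0x02: "Hello World\0"   12 data bytes
# 0x0E: PRINTZ 0x02
# 0x10: HALT
program = bytes([0x02, 0x0E]) + b"Hello World\x00" + bytes([0x04, 0x02, 0xFF])

fetched, text = run(program, 0)
print(fetched)  # [0, 14, 16]
print(text)     # Hello World
```

Calling `run(program, 2)` instead starts execution inside the string, and this toy machine raises `ValueError` on the first byte, `0x48`. A real processor given a wrong entry point would try to execute whatever those bytes happen to encode.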

Processors are very dumb. They don't have a lot of real functions: some ALU operations, reading from and writing to addresses, and moving data in and out of registers. Half the job is feeding them programs that follow the rules.