I have been programming some stuff on 16-bit DOS recently for fun. I see a lot of people mentioning that far pointers are slower than near pointers and to avoid them.
Why?
From an assembly point of view, this makes sense. There are several extra instruction involved. You have to store the old value of a segment register, store the new segment into a register, then move that into CS or DS (immediate mode stores aren't a valid opcode in 8086). You can then do whatever you need to do in that segment. After, you have to restore the old value.
It sounds like a lot, but in reality this doesn't eat up a lot of cycles. I guess if every pointer you were using were in a different segment this could add up, but data is usually grouped. So unless you were bouncing all over the place, which is slow for other reasons, the penalty shouldn't be that bad. If you have to hit DRAM, that should dominate the cost, right?
I feel like there is more to this story and I am having a hard time tracking it down. Hoping an 8086 wizard is hanging around who remembers this stuff.
To clarify: I am interested in actual 16-bit processors like the 8086 and the 80286 in real mode.
Why are far pointers slow?
Segment register loads are too complex for a core's front-end to convert into micro-ops. Instead they're "emulated" by micro-ops stored in a little ROM. This begins with some branches (which CPU mode is it?), and typically those branches can't benefit from CPU's branch prediction causing stall/s.
To avoid segment register loads (e.g. when the same far pointer is used multiple times) software tends to use more segment registers (e.g. tends to use ES
, FS
and GS
); and this adds more prefixes (segment override prefixes) to instructions. These extra prefixes can also slow down instruction decoding.
I guess if every pointer you were using were in a different segment this could add up, but data is usually grouped.
Compilers aren't that smart. If a small piece of code uses 4 far pointers that all happen to use the same segment, the compiler won't know that they're all in the same segment and will do expensive segment register loads regardless. To work around that you could describe the data as a structure (e.g. 1 pointer to a structure that has 4 fields instead of 4 different pointers); but that requires the programmer to write software differently.
For an example; if you do something like "int foo(void) { int a; return bar(&a); }" then ss
will probably be passed on the stack and then the callee (bar
) will load that into another segment register, because bar()
has to assume that the pointer could point anywhere.
The other problem is that sometimes data is larger than a segment (e.g. an array of 100000 bytes that doesn't fit in a 64 KiB segment); so someone (the programmer or the compiler) has to calculate and load a different segment to access (parts of) the same data. All pointer arithmetic may need to take this into account (e.g. a trivial looking pointer++;
may become something more like offset++; segment += (offset >> 4); offset &= 0x000F;
that causes a segment register load).
If you have to hit DRAM, that should dominate the cost, right?
For real mode; you're limited to about 640 KiB of RAM and the caches in CPUs are typically much larger, so you can expect that every memory access is going to be a cache hit. In some cases (e.g. cascade lake CPUs with 1 MiB of L2 cache per core) you won't even use the L3 cache (it'll be all L2 hits).
You can also expect that a segment register load is more expensive than a cache hit.
Hoping an 8086 wizard is hanging around who remembers this stuff.
When people say "segmentation is awful and should be avoided" they're not thinking of a CPU that's been obsolete for 40 years (8086), they're thinking of CPUs that were relevant during this century. Some of them may also be thinking of more than just performance alone (especially for assembly language programmers, segmentation is an annoyance/extra burden).