Tags: performance, math, cpu, cpu-architecture, alu

Do CPUs have a hardware "math cache" or dictionary that stores the result of simple math operations for quicker processing?


For simple math operations like sqrt(2), 1 / 2, 1 + 1, 10^2, 1 * 0, tan(45) and so on, do CPUs ever contain some kind of hardware table with the result of these operations so they aren't actually "calculated" but rather fetched from a dictionary or "math cache" of some sort for faster processing? Of course I predict some people here will ask "who defines what a 'simple' math operation is?", but I'm just curious.


Solution

  • No, it's up to software to implement caching if repeated values are common.
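
    As a sketch of what that software-side caching could look like, here's a memoized wrapper around an expensive call. The cached_sqrt name and the use of std::unordered_map are illustrative choices, not any standard API:

    ```cpp
    #include <cmath>
    #include <unordered_map>

    // Hypothetical software "math cache": memoize an expensive computation.
    // Only a win if the same inputs really do repeat; the hash probe itself
    // costs far more than most single instructions.
    double cached_sqrt(double x) {
        static thread_local std::unordered_map<double, double> cache;
        auto it = cache.find(x);
        if (it != cache.end())
            return it->second;       // hit: reuse the stored result
        double r = std::sqrt(x);     // miss: compute and remember it
        cache.emplace(x, r);
        return r;
    }
    ```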

    I've never heard of any CPUs caching results for computation instructions, even slow variable-latency ones like FP division or square root, or integer division. (Although even those aren't that slow on the newest CPUs, like 9-12 cycle latency and one result per 6 cycles throughput for idiv r32 on Zen 4.) Even "big" CPUs with large transistor budgets like x86 performance cores and the Apple M1 family don't do that.

    It would take die area to have such caches, and extra power to check and update them, for no benefit in cases where they miss for almost every op, which would be common since a lot of use-cases process non-redundant data.

    It makes more sense to spend the transistor and power budget on better throughput for all operations, which is what real high-performance CPUs do: multiple execution pipelines for common operations like (SIMD) FP mul or FMA that can each start a new operation every clock cycle.
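
    The software-visible payoff is that independent operations overlap. As a sketch (the unroll factor of 4 is an assumption; the best value depends on the target's FMA latency and pipe count), a reduction written with several independent accumulators can keep multiple FP pipelines busy instead of serializing on one dependency chain:

    ```cpp
    #include <cstddef>

    // Dot product with four independent accumulators, so the CPU can have
    // several multiply-adds in flight at once rather than one serial chain
    // bottlenecked on FP-add/FMA latency.
    double dot(const double* a, const double* b, std::size_t n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i + 0] * b[i + 0];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < n; ++i) s0 += a[i] * b[i];  // leftover elements
        return (s0 + s1) + (s2 + s3);
    }
    ```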


    Caching would actually be counterproductive for any simpler operations, like FP or integer add/sub or multiply. Those have short latencies that aren't data-dependent, so the scheduler for out-of-order exec knows which clock cycle they'll produce their result, and can thus avoid write-back conflicts. E.g. if you started a 3-cycle-latency integer multiply 2 cycles ago, sending a 1-cycle integer add to that execution port would result in both results being ready next cycle, but there's normally only one set of wires to get results from the ALUs on that execution port to the places they need to go (the register file and the bypass forwarding network). See the discussion of standardizing uop latencies in the Sandybridge chapter of Agner Fog's microarch guide; that standardization saved power in the scheduler. (Later Intel uarches have added uops with more different latencies again.)

    Multipliers are power-hungry (since lots of gates switch), but you're not going to save power unless you sacrifice latency for the cache-miss case. To keep latency low, the hardware would have to start the multiply in parallel with checking its cache of recent operands. (That cache would need a 128-bit tag and 64 bits of data for a 64x64 -> 64-bit multiply, so its parallel comparators would have to be much wider than the ones in load/store units.) On a hit you could produce the output after 1 or 2 cycles instead of the usual 3 to 5.
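
    To make the tag-width point concrete, one entry of such a hypothetical multiply-result cache would have to store something like the following (no real CPU has this structure; the layout is purely illustrative):

    ```cpp
    #include <cstdint>

    // One entry of a hypothetical result cache for 64 x 64 -> 64-bit multiply.
    // The tag alone is both source operands: 128 bits to compare per lookup,
    // far wider than the address tags a load/store unit checks.
    struct MulCacheEntry {
        uint64_t operand_a;  // first half of the tag
        uint64_t operand_b;  // second half of the tag
        uint64_t product;    // cached data: the low 64 bits of the result
        bool     valid;
    };
    ```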

    Integer add typically has 1-cycle latency already, with multiple-per-clock throughput (e.g. 4 per clock on recent Intel and AMD). It can also be SIMD-vectorized for high-throughput computations, doing 8 uint32_t adds per instruction (x86 with AVX2), at up to 3 of those instructions per clock. So if you were going to have a cache for those operations, you'd need a cache with 24 read ports! (Or separate small caches in each execution unit.) SIMD integer and FP multiply are also pretty high throughput, like 2 instructions per clock with 4 or 5 cycle latency.
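
    For scale, one AVX2 instruction does all eight of those uint32_t adds; a minimal sketch using x86 intrinsics (assumes a compiler targeting AVX2, e.g. built with -mavx2):

    ```cpp
    #include <cstdint>
    #include <immintrin.h>

    // Eight uint32_t adds per vpaddd instruction; recent x86 cores can run
    // up to 3 of these per clock, i.e. 24 scalar adds per cycle.
    void add8(const uint32_t* a, const uint32_t* b, uint32_t* out) {
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));
        __m256i sum = _mm256_add_epi32(va, vb);  // 8 lanes added in one uop
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), sum);
    }
    ```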

    Integer addition is cheaper to just do than to check a cache for, even in the cache-hit case, especially with SIMD doing multiple adds in parallel for only one instruction (or micro-op) going through the pipeline.

    Of course I predict some people here will ask "who defines what a 'simple' math operation is?"

    One the CPU can do in a single asm instruction, of course. Caching the result of a whole sequence of instructions seems even less likely, even if they're consecutive, and it would cost more power to try to recognize repeating patterns of computation. And intermediate results might be left in registers that also get read before they're overwritten, in which case those outputs are also visible side-effects that need to get cached.

    Cases where this could plausibly be most useful are the slow x87 instructions like fsincos and fyl2x (which might be used in a pow or log function). But x87 is pretty much obsolete and not used by most software; e.g. fsincos is microcoded as 60-120 uops on Ice Lake.

    Most CPUs don't have exponentiation instructions, although x^2 is just a single multiply, which CPUs do support natively.
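
    For instance, squaring is a single FP multiply, and mainstream compilers typically strength-reduce a constant exponent of 2 the same way (the pow transformation is common GCC/Clang behavior, not something the language guarantees):

    ```cpp
    #include <cmath>

    double square(double x) {
        return x * x;             // compiles to one multiply (e.g. mulsd on x86-64)
    }

    double square_via_pow(double x) {
        return std::pow(x, 2.0);  // typically optimized to x * x; there's no hardware pow
    }
    ```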