floating-point, cpu-architecture, hardware, riscv, pipelining

How exactly are RISC-V extensions like F implemented in a pipelined processor?


I understand that a typical floating point operation is significantly slower than a typical integer one, so I wasn't sure what approach(es) would be appropriate for implementing an RV32IF processor in hardware.

One method I could think of: if I knew experimentally that the slowest FP execution takes twice as long as the slowest integer one, I could give floating point two 'execution' stages, making integer instructions follow the typical 5-stage pipeline and floating-point ones a 6-stage one. I understand this would introduce a whole bunch of new hazards to worry about, though, especially since not all FP instructions are made the same, and FDIV in particular could take far longer than twice the longest integer execution time. The other approach would then have to be dynamic scheduling in hardware, like scoreboarding or Tomasulo's algorithm, but that would be quite a bit more work and area to implement.
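A rough sketch of that idea (Python rather than HDL, with made-up execute-stage counts just to show where the extra stages would sit):

```python
# Made-up execute-stage counts: integer ops take 1 EX cycle (classic 5-stage),
# FP add/mul take 2 (a 6-stage pipe), and FDIV wouldn't fit a fixed-length pipe.
EX_STAGES = {"add": 1, "mul": 1, "fadd.s": 2, "fmul.s": 2, "fdiv.s": 20}

def writeback_cycle(op: str, fetch_cycle: int) -> int:
    """Cycle in which the result is written back: IF, ID, n x EX, MEM, WB."""
    return fetch_cycle + 3 + EX_STAGES[op]

for op in EX_STAGES:
    print(f"{op}: fetched at cycle 0, result written back at cycle {writeback_cycle(op, 0)}")
```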

I wanted to understand how this problem is usually tackled, especially if, say, I was resource-constrained.


Solution

  • Similar to how integer multiply works: typically also multi-cycle (with no stalls unless you try to read the result register too soon). But it's easier because the FP registers are separate from the integer registers, so you don't have to detect hazards between them, except for instructions that move data between int / FP regs. (Some ARMv7 CPUs have big stalls on instructions that move data from SIMD/FP registers to integer, since they don't do detailed dependency tracking between the two domains and just stall until all in-flight FP instructions are done, or something like that.)

    In terms of a classic 5-stage RISC pipeline, execution starts in the EX stage, but then you have a separate pipeline for FP operations, with its own write-back stage into the FP register file.
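    A minimal sketch of how the split register files shrink the hazard-detection problem (Python as a behavioral model, not HDL; the instruction representation is invented for illustration): ordinary FP ops only compare FP register numbers against in-flight FP writes, integer ops only compare integer register numbers, and only the int/FP move instructions have to look at both files.

    ```python
    # Behavioral sketch: should a freshly decoded instruction stall, given the
    # destination registers of instructions still in flight in each pipe?
    from dataclasses import dataclass, field

    @dataclass
    class Instr:
        op: str
        int_srcs: tuple = ()   # integer source register numbers (x0..x31)
        fp_srcs: tuple = ()    # FP source register numbers (f0..f31)

    @dataclass
    class InFlight:
        int_dests: set = field(default_factory=set)  # pending integer write-backs
        fp_dests: set = field(default_factory=set)   # pending FP write-backs

    def must_stall(instr: Instr, pending: InFlight) -> bool:
        # Each pipe only checks its own register file; x0 never hazards.
        int_hazard = any(r != 0 and r in pending.int_dests for r in instr.int_srcs)
        fp_hazard = any(r in pending.fp_dests for r in instr.fp_srcs)
        return int_hazard or fp_hazard

    print(must_stall(Instr("fadd.s", fp_srcs=(1, 2)), InFlight(fp_dests={2})))  # True
    print(must_stall(Instr("add", int_srcs=(5, 6)), InFlight(fp_dests={5})))    # False: different file
    print(must_stall(Instr("fmv.x.w", fp_srcs=(3,)), InFlight(fp_dests={3})))   # True: cross-domain move
    ```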

    With instructions of different latencies (e.g. integer mul vs. add, or FP add vs. mul if they have different latencies), WAW and maybe WAR hazards become possible, so yes, hazard detection becomes more complex. So does bypass forwarding, and there's a larger window for true dependencies (RAW hazards).

    There are also write-back conflicts, where two results are ready in the same cycle for different registers in the same register file. That's a type of structural hazard. You can design the pipeline to stall and have one of the results write back in the next cycle.
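    A toy model of both hazards (Python again, with invented latencies; real hardware would do this with comparators on a shift register of in-flight destinations): each issued instruction books the cycle in which it will use the write port, and issue stalls if that cycle is already booked (structural hazard) or if an older, slower instruction will write the same register later (WAW).

    ```python
    # Toy issue check for one register file with a single write port.
    LATENCY = {"fadd.s": 3, "fmul.s": 5}   # invented issue-to-write-back latencies

    class WritebackTracker:
        def __init__(self):
            self.pending = {}  # write-back cycle -> destination register

        def try_issue(self, op, dest, cycle):
            wb_cycle = cycle + LATENCY[op]
            if wb_cycle in self.pending:
                return False   # structural hazard: write port already booked
            if any(c > wb_cycle and r == dest for c, r in self.pending.items()):
                return False   # WAW hazard: an older op writes `dest` even later
            self.pending[wb_cycle] = dest
            return True

    t = WritebackTracker()
    print(t.try_issue("fmul.s", dest=1, cycle=0))  # True: books the write port for cycle 5
    print(t.try_issue("fadd.s", dest=2, cycle=2))  # False: would also need the port at cycle 5
    print(t.try_issue("fadd.s", dest=1, cycle=1))  # False: would write f1 before the older fmul does
    ```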

    Discussion of a classic 5-stage RISC pipeline (https://en.wikipedia.org/wiki/Classic_RISC_pipeline) usually totally ignores these issues, only considering single-cycle latency ALU ops, and a cache with 1-cycle load-use latency (not including the EX cycle for address math).

    Classic MIPS avoided complications for the main pipeline by having mult and div write separate registers, hi and lo, with special semantics for mflo (move from LO) etc. that allowed the CPU to give unpredictable results in corner cases where software did things that could make pipelining hard; i.e. they handled some corner cases by saying "don't do that". Raymond Chen explains it in a nice short article: The MIPS R4000, part 3: Multiplication, division, and the temperamental HI and LO registers.

    RISC-V (and MIPS32r1 and later, with mul writing a GPR instead of mult writing hi/lo) does need to match more register numbers against each other to detect hazards, and across a somewhat longer window of in-flight instructions in the longer pipes.

    For a minimal RISC-V aiming to minimize transistors even if we have to give up significant speed: integer multiply/divide is optional, so you can omit them to keep the integer pipeline clean. Fully stall on moves between FP and integer regs like old ARM, and use a fixed latency for all FP operations (except div/sqrt) so they go down a separate, longer pipe that writes back to the FP register file, separate from the integer one.
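    A sketch of that "just stall" interlock, under the assumption that the FP pipe has a fixed latency so it can be summarized by a count of in-flight FP instructions (the op classification here is illustrative, not a complete decoder):

    ```python
    # Simplest cross-domain interlock: instructions that carry data between the
    # FP and integer worlds wait until the FP pipe has drained, instead of
    # tracking individual FP destination registers.
    CROSS_DOMAIN_OPS = {"fmv.x.w", "fmv.w.x", "feq.s", "flt.s", "fle.s"}

    def stall_in_decode(op: str, fp_instrs_in_flight: int) -> bool:
        return op in CROSS_DOMAIN_OPS and fp_instrs_in_flight > 0

    print(stall_in_decode("fmv.x.w", fp_instrs_in_flight=2))  # True: wait for the FP pipe to drain
    print(stall_in_decode("add", fp_instrs_in_flight=2))      # False: pure integer op doesn't care
    ```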

    For FP divide / sqrt, you could fully stall on those if you didn't care about performance. To let other instructions work while they're in flight, just track the one output FP register from your non-pipelined div/sqrt unit. (It's hard and expensive to pipeline; big x86 cores had throughput ~= latency for divsd and sqrtsd until Bulldozer and Ivy Bridge in the early 2010s, and even then not heavily pipelined. And still not 1/clock, more like 4 to 6 cycle throughput with 13-19 cycle latency.)

    If the first access to the destination register after a div/sqrt is a write, cancel the div/sqrt so another one can start. (Or don't bother: this would only happen if software is dumb, unless you're doing out-of-order exec so mis-speculation is possible.) If the first access is a read, you stall, so efficient software just needs to software-pipeline its FP (or integer) div/sqrt, not trying to read the result until many instructions later.
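    The single-register tracking for a non-pipelined div/sqrt unit might look something like this (Python sketch with a made-up latency; the real thing is a busy bit plus one register-number comparator):

    ```python
    # Track one non-pipelined FP div/sqrt unit by its single destination register.
    class DivSqrtUnit:
        def __init__(self):
            self.busy_dest = None     # FP register number being produced, or None
            self.cycles_left = 0

        def issue(self, dest, latency=20):   # latency is a made-up number
            if self.busy_dest is not None:
                return False                 # only one div/sqrt in flight at a time
            self.busy_dest, self.cycles_left = dest, latency
            return True

        def tick(self):
            if self.busy_dest is not None:
                self.cycles_left -= 1
                if self.cycles_left == 0:
                    self.busy_dest = None    # result written back, register valid again

        def reads(self, reg):
            return reg == self.busy_dest     # True: the reader must stall

        def writes(self, reg):
            if reg == self.busy_dest:        # overwritten before anyone read it:
                self.busy_dest = None        # cancel, so another div/sqrt can start

    u = DivSqrtUnit()
    u.issue(dest=4)
    print(u.reads(4))   # True: fdiv result for f4 isn't ready yet, stall the reader
    u.writes(4)         # an unrelated write to f4 cancels the pending divide
    print(u.reads(4))   # False: nothing in flight any more
    ```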