pipeline, cpu-architecture, processor

Why is Pipelined Processor identified as a SISD?


I'm studying pipelined processors and I noticed that a pipelined processor (single-core processor) is classified as Single Instruction - Single Data in Flynn's taxonomy. I was wondering why.

For a pipelined processor, why "Single Instruction", if there is more than one instruction in the pipeline? Shouldn't it be named Multiple Instruction - Single Data?

Also, what does "Single Data" mean? Every stage of the pipeline works on different data, so I would say that a pipelined processor is "Multiple Data", unless "Single Data" refers to the program. I'm pretty confused.

I hope someone can enlighten me. Thanks a lot.


Solution

  • Flynn's taxonomy is concerned with execution units

    A scalar pipeline still only has one ALU, so it is firmly in the SISD category. Keeping it fed with work more of the time is obviously good, but that doesn't change the fact that your CPU reads one instruction stream and executes its instructions as if each one fully completed before the next one started (e.g. stalling for hazards that bypass forwarding can't avoid).

    So in terms of the programming model, you don't have to explicitly write a parallel program that's aware of how the CPU is executing¹, unlike with SIMD, where you have to explicitly process data in chunks, or MIMD multi-threading, where you have to coordinate multiple threads accessing different parts of your data (and worse, coordinate the places where they access the same data, if you need any reductions like the sum of an array).
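
    To make that contrast concrete, here's a minimal sketch in C (hypothetical function names; the SIMD version assumes an x86 target with SSE and uses intrinsics): the scalar loop is written with no awareness of the pipeline at all, while the SIMD version has to process the data in chunks of 4 in the source itself.

        #include <xmmintrin.h>  /* SSE intrinsics (assumes an x86 target with SSE) */

        /* SISD programming model: one serial instruction stream, written with
           no awareness of how the pipeline overlaps the work. */
        void add_scalar(float *c, const float *a, const float *b, int n) {
            for (int i = 0; i < n; i++)
                c[i] = a[i] + b[i];
        }

        /* SIMD: the source code itself must process the data in chunks of 4. */
        void add_simd(float *c, const float *a, const float *b, int n) {
            int i = 0;
            for (; i + 4 <= n; i += 4) {
                __m128 va = _mm_loadu_ps(a + i);
                __m128 vb = _mm_loadu_ps(b + i);
                _mm_storeu_ps(c + i, _mm_add_ps(va, vb));  /* 4 FP adds per instruction */
            }
            for (; i < n; i++)      /* leftover elements that don't fill a chunk */
                c[i] = a[i] + b[i];
        }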

    Being aware of latencies when scheduling instructions can improve performance on a scalar pipelined CPU, but that's a detail the taxonomy isn't concerned with. Scalar pipelines often don't need much instruction-level parallelism (ILP) to avoid stalls: integer ALU ops normally have 1-cycle latency, so they can run back to back. The earliest pipelined CPUs (like MIPS R2000) had load results ready for the ALU to use after a gap of only 1 extra cycle, so a bit of instruction scheduling, and maybe unrolling + software pipelining, can help, especially if you have chains of floating-point instructions with higher latencies.

    It's assumed you or the compiler does whatever simple local optimizations are necessary for the specific CPU to feed its execution unit(s) with work most cycles, like loop unrolling and even software pipelining to avoid hazards. A serial dependency like sum += arr[i] might just make your code execute slower (depending on the latency of addition vs. how tight the loop is); it doesn't force you to restructure your code, which still runs using the same instructions.
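
    As a minimal sketch of that serial-dependency case (plain C, hypothetical function name): nothing about the source has to change for a pipelined CPU, it just runs no faster than the latency of the add.

        /* Each iteration's add needs the previous iteration's sum, so the loop
           can't go faster than one add per add-latency, no matter how deep or
           wide the pipeline is.  The instructions themselves are unchanged. */
        float sum_serial(const float *arr, int n) {
            float sum = 0.0f;
            for (int i = 0; i < n; i++)
                sum += arr[i];      /* serial dependency chain through sum */
            return sum;
        }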

    Where the taxonomy really loses touch with later developments in computer architecture (it was proposed in 1966 and 1972 papers) is superscalar pipelines, which can look at a serial instruction stream, find instructions that don't depend on each other (i.e. ILP), and execute them in parallel, hopefully achieving greater than 1 instruction per clock on average, and much higher peak throughput when not stalled. Such a pipeline has multiple execution units; for example, AMD Zen 2 has four integer ALUs and four floating-point execution units, and a front-end that can issue/rename up to 6 micro-ops per clock cycle.

    (This is a simplification; it's actually four execution ports, with different functional units sharing the same port: for example, the integer divider is separate hardware from the integer popcount / lzcnt / tzcnt unit and from the simple add/sub/and/or/xor unit.)
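
    As a tiny illustration of the kind of independent work a superscalar core can find on its own in a purely serial program (illustrative C, made-up function name, not tied to any particular CPU):

        /* The two multiplies don't depend on each other, so a superscalar core
           can dispatch them to two different execution units in the same cycle,
           even though the source (and the instruction stream) is purely serial. */
        float dot2(float a, float b, float c, float d) {
            float p0 = a * b;   /* independent */
            float p1 = c * d;   /* independent: can execute in parallel with p0 */
            return p0 + p1;     /* depends on both products */
        }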

    See also Modern Microprocessors: A 90-Minute Guide! for more about superscalar out-of-order exec pipelines, and https://www.realworldtech.com/sandy-bridge/. Out-of-order exec is very good at hiding latencies from cache misses, and also at finding ILP across loop iterations without needing huge amounts of software unrolling and software pipelining.

    Optimizing for wider pipelines makes it more important to unroll a loop like sum += arr[i] with multiple accumulators like sum0 += arr[i+0] / sum1 += arr[i+1] etc. to hide latency, as in Latency bounds and throughput bounds for processors for operations that must occur in sequence and in Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators).
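
    A sketch of that multiple-accumulator unrolling in plain C (hypothetical function name; written out by hand because compilers normally won't reassociate FP math like this without something like -ffast-math):

        /* Four independent dependency chains hide the latency of FP add:
           a new add can start every cycle in each chain instead of waiting
           for the previous result of a single chain. */
        float sum_unrolled(const float *arr, int n) {
            float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;
            int i = 0;
            for (; i + 4 <= n; i += 4) {
                sum0 += arr[i + 0];
                sum1 += arr[i + 1];
                sum2 += arr[i + 2];
                sum3 += arr[i + 3];
            }
            float sum = (sum0 + sum1) + (sum2 + sum3);
            for (; i < n; i++)      /* handle any leftover elements */
                sum += arr[i];
            return sum;
        }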

    If you bottleneck on one serial dependency chain, you're running 1 addition per 4 clock cycles (for example, FP add latency on Intel Skylake), vs. 2 per clock cycle, so that's a factor of 8 between the latency bound and the throughput bound, recoverable by exposing ILP to the CPU pipeline. And that's not even considering SIMD, where each instruction does 4 or 8 FP adds (one instruction being the unit of work the CPU pipeline has to track), so that's another factor of 8 in array-summing throughput from using SIMD vector instructions. That is, it costs the pipeline the same to do 8 floats at a time as to do 1 float at a time. (With modern-style short-vector, fixed-length SIMD, not like older CDC-style vector processors where you tell the CPU about a whole array and it internally loops over it, avoiding the loop overhead that would suck so much on a scalar pipeline. See also Are there any problems for which SIMD outperforms Cray-style vectors? re: memory-memory SIMD vs. Cray vs. modern short-vector SIMD, and modern stuff like AArch64 SVE and the RISC-V V extension.)
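
    For comparison, here's a sketch of the same reduction with SIMD intrinsics (assumes an x86 target with AVX; hypothetical function name, and it still uses only one vector accumulator, so combining it with the multiple-accumulator trick above is what gets the full speedup):

        #include <immintrin.h>  /* AVX intrinsics (assumes an x86 target with AVX) */

        /* One 256-bit vector add per iteration does 8 FP adds: the same
           per-instruction cost to the pipeline as one scalar add, 8x the work. */
        float sum_avx(const float *arr, int n) {
            __m256 vsum = _mm256_setzero_ps();
            int i = 0;
            for (; i + 8 <= n; i += 8)
                vsum = _mm256_add_ps(vsum, _mm256_loadu_ps(arr + i));

            /* Horizontal sum of the 8 lanes, then the scalar tail. */
            float lanes[8];
            _mm256_storeu_ps(lanes, vsum);
            float sum = 0.0f;
            for (int j = 0; j < 8; j++)
                sum += lanes[j];
            for (; i < n; i++)
                sum += arr[i];
            return sum;
        }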

    See also Duncan's taxonomy, a 1990 update to Flynn's which still predates widespread superscalar pipelined CPUs. An early superscalar pipeline was the Intel Pentium in 1993 - dual-issue in-order: it decoded 2 instructions per clock cycle, and could send them both through its pipeline in lockstep if they were compatible.

    Footnote 1: Except when the architecture makes some pipeline details architecturally visible, for example MIPS I having load delay slots: instead of stalling, you'd get unpredictable values if the instruction after a load reads the result. And branch-delay slots, where the instruction after a jump/branch executes even when it's taken.


    Further thoughts:

    According to Wikipedia's article on Flynn's taxonomy, MISD (only?) describes things like redundant / fault-tolerant systems, such as aviation flight-control computers, where the same data is processed multiple times (e.g. 3 or 4 times, with each flight computer "voting" on the right thing to do, so a fault in one can be detected when its result differs from the others).
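
    A toy sketch of that voting idea in C (purely illustrative; real flight-control voters are far more involved):

        /* Three redundant computers process the same inputs; a simple
           majority vote masks a fault in any single one of them. */
        int vote3(int result_a, int result_b, int result_c) {
            if (result_a == result_b || result_a == result_c)
                return result_a;    /* at least two units agree with A */
            return result_b;        /* A is the odd one out; trust the B/C pair
                                       (all three differing is an unmaskable fault) */
        }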

    Wikipedia also mentions a "pipelined" processor in the context of Flynn's taxonomy, but only as an implementation strategy for SIMD (like separate pipelines or "lanes" within an ALU for an instruction like addps xmm0, xmm1 that does four FP additions). That sense of "pipelined" is unrelated to what you're talking about: pipelining the fetch / decode / etc. of other instructions to happen in parallel with the execution unit.


    Flynn's taxonomy does get a bit fuzzy when applied to modern CPUs with multiple execution units, where each execution unit might internally have multiple SIMD lanes.

    An instruction like addps xmm0, xmm1 provides a control input to one of the SIMD FPUs, which internally activates four FP adders, one for each lane (element) of the vector. So there's your SIMD. The execution unit is pipelined, able to start new independent work every cycle but only producing each result after 4 cycles, for example.
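
    In scalar C terms, the per-lane work of that one instruction is roughly the following (a model of the semantics, not of how the hardware is built):

        /* One addps xmm0, xmm1 modeled in scalar C: a single instruction's worth
           of control, applied to four independent data elements (xmm0 += xmm1). */
        void addps_model(float xmm0[4], const float xmm1[4]) {
            for (int lane = 0; lane < 4; lane++)
                xmm0[lane] += xmm1[lane];   /* all four lanes from one instruction */
        }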

    But this might be happening in the same clock cycle as a mulps xmm2, xmm3 in another execution unit, which is also SIMD.

    Traditionally, we don't call this MIMD when both SIMD FPUs are part of the same CPU core and those addps and mulps instructions are part of the same single-threaded program, or the same thread of execution in a multi-threaded program.

    At least I think that's the case. Really, the term MIMD basically stops being useful at this level of CPU complexity, so it's better to just describe what the hardware is actually doing than to figure out which abstract label applies.