Many related questions say, for example, that some instructions can be executed in one clock cycle.
However, as far as I know (from the book Computer Systems: A Programmer's Perspective), a pipeline has many stages such as fetch, decode, execute, store, etc., and each stage costs at least one cycle. If so, how can any instruction be executed in one clock cycle?
The linked question makes a distinction between throughput and latency. For example, after a `dec eax`, how soon can another `dec eax` execute? It only needs the EAX value to be ready when it reaches the EXEC stage of a simple in-order pipeline. Keeping the latency of the execution unit itself down to 1 cycle is what enables back-to-back execution of dependent instructions.
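To make the latency point concrete, here is a toy model (my own sketch, not a description of real hardware, and `simulate_chain` is a made-up name) of how long a chain of dependent ops takes when each one can enter EXEC only after the previous result is ready:

```python
def simulate_chain(n, exec_latency):
    """Cycle at which the last result of a chain of n dependent ops is ready.

    Toy in-order model: each op stalls until the previous result is
    available, then produces its own result exec_latency cycles later.
    """
    ready = 0  # cycle when the previous result becomes available
    for _ in range(n):
        start = ready                # stall until the input is ready
        ready = start + exec_latency
    return ready

# dec eax has 1-cycle execution latency: 100 chained decs finish in
# 100 cycles, i.e. one result per cycle, back to back.
print(simulate_chain(100, 1))   # 100
# If EXEC took 3 cycles instead, the same chain would take 300 cycles.
print(simulate_chain(100, 3))   # 300
```

The point: the *execution-unit* latency, not the total pipeline depth, is what limits a dependency chain.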
Total latency of the pipeline from fetch to exec only matters for mispredicted branches.
Having multiple instructions in the pipeline is the entire point of pipelining; you wouldn't call it a pipeline if you were going to require one instruction to make it all the way through the pipeline before you started fetching another one.
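The overlap is easy to see in a timing diagram of an idealized classic 5-stage pipeline with no stalls (stage names follow the classic RISC pipeline; the helper functions are my own illustrative names):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipelined_cycles(n):
    # Instruction i enters IF one cycle after instruction i-1, so the
    # last of n instructions leaves WB after n + (depth - 1) cycles.
    return n + len(STAGES) - 1

def unpipelined_cycles(n):
    # If each instruction had to drain fully before the next fetch:
    return n * len(STAGES)

def diagram(n):
    """ASCII timing diagram: one row per instruction, one column per cycle."""
    total = pipelined_cycles(n)
    rows = []
    for i in range(n):
        row = ["  . "] * total
        for s, name in enumerate(STAGES):
            row[i + s] = f"{name:>4}"   # instruction i is in stage s at cycle i+s
        rows.append("".join(row))
    return "\n".join(rows)

print(diagram(4))
print(pipelined_cycles(1000))    # 1004
print(unpipelined_cycles(1000))  # 5000
```

Note how 1000 instructions finish in 1004 cycles rather than 5000: the per-instruction cost approaches one cycle even though every individual instruction still spends 5 cycles in flight, and that fetch-to-exec depth is what you pay back on a mispredicted branch.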
See also https://en.wikipedia.org/wiki/Classic_RISC_pipeline and Modern Microprocessors: A 90-Minute Guide! (https://www.lighterra.com/papers/modernmicroprocessors/).
Or keep reading your CS:APP textbook.
Also related, for modern CPUs like current x86 and high-end ARM (superscalar out-of-order):