I am reading the ARM Cortex-A8 data sheet. In it, ARM states that a load that misses in L2 takes at least 28 core cycles to complete. What I cannot picture is what happens during those 28 cycles: does the CPU stall and put bubbles in the pipeline, or does it execute other instructions until the load completes? What if there is a branch that depends on the result of this load? What if another load right after it also misses in L2?
Even on a cache miss, the pipeline keeps going until a RAW (read-after-write) dependency bites.
ldr r12, [r0], #4
subs r12, r12, r1
beq end_loop
The subs instruction cannot be executed at the same time as ldr due to the RAW dependency on r12.
The beq instruction cannot be executed at the same time as subs due to the RAW dependency on the CPSR flags.
All in all, the sequence above takes 6 cycles in the best case (3 cycles of instruction execution plus 3 cycles of L1 hit latency) and 3 + 28 = 31 cycles in the worst case (a complete cache miss).
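The arithmetic above can be written as a tiny model. This is only a sketch of the reasoning in this answer, not a simulation of the Cortex-A8 pipeline: the function name and constants are made up for illustration, and it assumes every instruction depends on the previous one, so nothing overlaps with the load latency.

```python
# Toy model of the cycle estimate above (hypothetical names, not an ARM tool).
L1_HIT_LATENCY = 3    # cycles, the L1 hit figure used in this answer
L2_MISS_LATENCY = 28  # cycles, the minimum L2 miss figure quoted in the question


def sequence_cycles(num_instructions: int, load_latency: int) -> int:
    """Cycles for a fully dependent chain: one cycle per instruction plus the load latency."""
    return num_instructions + load_latency


best_case = sequence_cycles(3, L1_HIT_LATENCY)    # ldr, subs, beq with an L1 hit
worst_case = sequence_cycles(3, L2_MISS_LATENCY)  # same sequence, load misses in L2
print(best_case, worst_case)
```

Independent instructions placed between the ldr and the subs would not add to these totals in this simple picture, since they can run while the load is outstanding.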