Until now I have used 3 NOPs to "clean" the pipeline. Recently I came across the ISB instruction, which does that for me. Looking at the ARM Information Center, I noticed that this instruction takes 4 cycles (on the Cortex-M0), while the 3 NOPs take only 3.
Why should I use this instruction? How is it different from the 3 NOPs?
The reason the ISB instruction takes 4 cycles is simple. The Cortex-M instruction set is a mixture of 16-bit and 32-bit instructions. ARMv6-M cores such as the Cortex-M0 support only six 32-bit instructions: BL, MSR, MRS, ISB, DMB and DSB.
All six can be freely mixed with 16-bit instructions.
The question is how the processor knows which instructions are 16-bit and which are 32-bit. To find out, it reads the first 16 bits and decodes them (1 cycle). If the opcode matches a 32-bit instruction, the processor knows that the next 16 bits are not a separate instruction but the second half of the 32-bit one, and it goes on to execute the whole instruction (3 cycles).
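That is why every 32-bit instruction on the Cortex-M0 takes 1 + 3 = 4 cycles. Other Cortex-M cores can have different timings, so check the Technical Reference Manual for your part.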
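To make the 16-bit versus 32-bit decision concrete, here is a minimal sketch in C of the check on the first halfword. The function name `is_thumb32_first_halfword` is made up for illustration. It relies on the Thumb encoding rule that bits [15:11] equal to 0b11101, 0b11110 or 0b11111 mark the first half of a 32-bit instruction. It models only the decode decision, not the cycle timing.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if `halfword` is the first half of a
 * 32-bit Thumb instruction. Bits [15:11] equal to 0b11101, 0b11110 or
 * 0b11111 mark a 32-bit instruction; any other value means the halfword
 * is a complete 16-bit instruction. */
static bool is_thumb32_first_halfword(uint16_t halfword)
{
    uint16_t top5 = (uint16_t)(halfword >> 11);   /* bits [15:11] */
    return top5 == 0x1D || top5 == 0x1E || top5 == 0x1F;
}
```

For example, the first halfword of ISB (0xF3BF) is classified as the start of a 32-bit instruction, while NOP (0xBF00) is a complete 16-bit instruction.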
To flush the pipeline you can use 3 NOPs, but only if you are sure about the core implementation. You must be certain that the core has no branch prediction and no on-the-fly instruction optimization that removes consecutive NOPs. If you are sure these features are absent, use 3 NOP instructions and you will save 1 cycle. But if you are not sure, or if you want your code to be portable to other architectures such as ARMv7, then you must use the ISB instruction, which is a 32-bit instruction and takes 4 cycles.
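If you go the portable route, one way to wrap it is sketched below. The name `pipeline_flush` is hypothetical, the `__asm__ volatile` syntax assumes GCC or Clang, and the preprocessor test is only one plausible way to detect a target that has ISB. On any other target the function falls back to a plain compiler barrier so the file still builds, for example in host-side unit tests. If you already use CMSIS, its `__ISB()` intrinsic does the same job.

```c
/* Hypothetical helper: flush the pipeline with ISB where the target has it.
 * Assumes GCC/Clang-style inline assembly and predefined macros. */
static inline void pipeline_flush(void)
{
#if defined(__arm__) && (defined(__ARM_ARCH_6M__) || \
    (defined(__ARM_ARCH) && __ARM_ARCH >= 7))
    /* ARMv6-M, ARMv7 or later: real instruction synchronization barrier. */
    __asm__ volatile ("isb" ::: "memory");
#else
    /* Not an ISB-capable ARM target: compiler barrier only, no instruction. */
    __asm__ volatile ("" ::: "memory");
#endif
}
```

Unlike a hand-counted run of NOPs, this does not depend on the pipeline depth or on whether the core discards NOPs.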