I wrote some NEON code in assembly and am aiming for maximum optimization. Although I reduced the latency caused by register conflicts and pipeline stalls, the counter shows a difference of only 1 cycle: before, n.70-0; after, n.69-0. I don't understand why. Here is my sample code:
Before optimization: http://pulsar.webshaker.net/ccc/sample-6b7ba7c2
After optimization: http://pulsar.webshaker.net/ccc/sample-d59091b4
I also have several questions about the Pulsar calculator:
1. In "n.16-0 1c d0:1", what does "n" stand for?
2. In "a.23-0 2c q6l:1 VMLA.I16 q6, q9, D0[2]", what does "a" stand for? What does "l:1" mean? Is 23 the cycle count?
3. Does "Count time" mean the total execution time of the code?
I would appreciate any help with these questions.
This is what I can remember about this cycle counter:
"n" stands for Neon pipeline, "a" stands for ARM pipeline. In fact you are mixing ARM and NEON instructions.
Regarding "q6l:1": q6l is the register that causes the current instruction to wait, and 1 is the number of extra half-cycles needed for that register's result to become available, i.e. the number of half-cycles the instruction has to wait for its input. I'm not sure, but I suppose "q6l" is the lower half of the q6 register.
The number "23" in your example is the cycle in which the instruction can start executing.
Count time has nothing to do with your code. Parse time is the time the tool took to interpret the instructions you provided. Count time is the time the tool took to analyze your instructions and produce the cycle information.
I'll explain the results in more detail, for example:
n.18-0 1c n0 q10:8
"n" stands for the execution unit (n = neon, a = arm, v = vfp).
"18" is the cycle in which the instruction can start executing.
"0" is the number of the pipeline.
"1c" is the number of execution cycles for the instruction. Note that this is different from the number of cycles required until the result of the instruction is available to later instructions.
"n0" is the pipeline whose result the current instruction is waiting for. n0 = NEON pipeline number 0.
"q10" is the register whose result the instruction is waiting for.
"8" is related to the time the instruction has to wait for the result. It is the number of half-cycles, if I remember correctly.
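As a rough illustration of the field breakdown above, here is a small Python sketch that splits a counter line into named fields. The layout it expects is inferred only from the examples in this thread, not from the tool's documentation, and the function name parse_counter_line and the dictionary keys are made up for this example. The wait value is kept as the raw number, since its unit (half-cycles) is only recalled from memory above.

```python
import re

# Layout assumed from the examples in this thread (not from the tool's docs):
#   <unit>.<start>-<pipeline> <N>c [<blocking pipeline>] [<register>:<wait>] [instruction]
LINE_RE = re.compile(
    r"^(?P<unit>[nav])\.(?P<start>\d+)-(?P<pipe>\d+)\s+"
    r"(?P<cycles>\d+)c"
    r"(?:\s+(?P<blocker>[nav]\d+)(?=\s|$))?"   # e.g. "n0", optional
    r"(?:\s+(?P<reg>\w+):(?P<wait>\d+))?"      # e.g. "q10:8", optional
    r"\s*(?P<rest>.*)$"                        # instruction text, if present
)

UNITS = {"n": "neon", "a": "arm", "v": "vfp"}


def parse_counter_line(line):
    """Split one counter line into named fields (hypothetical helper)."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized counter line: {line!r}")
    wait = m.group("wait")
    return {
        "unit": UNITS[m.group("unit")],
        "start_cycle": int(m.group("start")),
        "pipeline": int(m.group("pipe")),
        "exec_cycles": int(m.group("cycles")),
        "blocking_pipeline": m.group("blocker"),   # None if absent
        "blocking_register": m.group("reg"),       # None if absent
        "wait": int(wait) if wait is not None else None,  # raw value
        "instruction": m.group("rest") or None,
    }


if __name__ == "__main__":
    print(parse_counter_line("n.18-0 1c n0 q10:8"))
    print(parse_counter_line("a.23-0 2c q6l:1 VMLA.I16 q6, q9, D0[2]"))
```

For the line from the question, this reports the ARM unit, start cycle 23, 2 execution cycles, and q6l as the register being waited on with a wait value of 1.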
This counter does not take into account that a compiler can re-arrange instructions, e.g. by postponing an instruction that is waiting for a result. But if you force your compiler not to re-arrange the assembly instructions, then when an instruction has to wait for a result, no later instruction can start executing, even if it does not depend on that result. This causes a stall during which the CPU cannot execute any instruction.
Moreover, I would not use this type of counter for code with loops. I suggest you split your code into separate parts and optimize each loop separately.
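To make the stall described above more concrete, here is a toy Python model of a single-issue, in-order pipeline. It is only a sketch: the register names and latencies are invented for illustration and are not real NEON timings, and the function issue_cycles is hypothetical. An instruction issues no earlier than one cycle after the previous one and no earlier than the cycle in which all of its source registers are ready.

```python
def issue_cycles(program):
    """Return the issue cycle of each instruction in a toy in-order model.

    program: list of (dest, sources, latency) tuples in program order.
    latency is the number of cycles until dest becomes available
    (made-up numbers, not taken from any real core).
    """
    ready = {}       # register -> cycle in which its value is available
    cycles = []
    next_free = 0    # earliest cycle the next instruction may issue
    for dest, sources, latency in program:
        # In-order: wait for the issue slot AND for every source register.
        start = max([next_free] + [ready.get(s, 0) for s in sources])
        cycles.append(start)
        ready[dest] = start + latency
        next_free = start + 1  # later instructions cannot overtake this one
    return cycles


# The dependent instruction directly follows its producer: it stalls,
# and the two independent instructions behind it are delayed as well.
original = [
    ("q0", [], 4),      # producer with a long (invented) latency
    ("q1", ["q0"], 1),  # consumer of q0
    ("q2", [], 1),      # independent
    ("q3", [], 1),      # independent
]

# Same work, with the independent instructions moved into the gap.
reordered = [
    ("q0", [], 4),
    ("q2", [], 1),
    ("q3", [], 1),
    ("q1", ["q0"], 1),
]

if __name__ == "__main__":
    print(issue_cycles(original))
    print(issue_cycles(reordered))
```

In this model the original order issues its last instruction in cycle 6, while the re-ordered version issues its last instruction in cycle 4, because the independent instructions fill the cycles in which the consumer would otherwise be waiting.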