I'm currently using an OMAP L138 processor, whose ARM side does not have a hardware FPU. We will be processing spectral data using algorithms that are floating point intensive, so the ARM side won't be adequate. I'm not the algorithm person, but one of them is "Dynamic Time Warping" (and no, I don't know what it means). The initial performance numbers are:
Core i7 laptop @ 2.9GHz: 1 second
Raspberry Pi ARM1176 @ 700MHz: 12 seconds
OMAP L138 ARM926 @ 300MHz: 193 seconds
Worse, the Pi is about 30% of the price of the board I'm using!
I do have a TI C674x, which is the other processor in the OMAP L138. The question is whether I would be best served by spending many weeks trying to:
(When I look at FPU performance on the Cortex-A8, it isn't an improvement over the Raspberry Pi, but the Cortex-A9 seems to be.)
I understand the answer is "it depends". Others here have said that "you unlock an incredible fast DSP that can easily outperform the Cortex-A8 if assigned the right job" but for a defined job set would I be better off skipping to the A9, even if I had to buy an external DSP later?
That question can't be answered without knowing the clock rates of the DSP and the ARM.
Here is some background:
I just checked the cycle counts for a floating point multiplication on the C674x DSP:
It can issue two multiplications per cycle, and each multiplication has a result latency of three cycles (that means you have to wait three additional cycles before the result appears in the destination register).
You can, however, start two multiplications each cycle, because the DSP will not wait for the result. The compiler/assembler will do the required scheduling for you.
That only uses two of the eight available functional units of the DSP, so in the same cycle as the two multiplications you can also do:
Loop control and branching are free and cost you nothing on the DSP.
That makes a total of six floating point operations per cycle with parallel loads/stores and loop control.
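To make the scheduling point concrete, here is a minimal C sketch of a loop whose multiplications are independent of each other, which is the shape of code a pipelining compiler can overlap. The function name `dot_product` and the use of two accumulators are my own illustration, not something taken from a TI manual, and whether a particular compiler actually reaches two multiplies per cycle on it is an assumption to verify in the compiler's output.

```c
#include <stddef.h>

/* Dot product written so that consecutive multiplies do not depend on
 * each other. `restrict` promises the compiler that a and b do not
 * alias, and the two separate accumulators mean neither multiply has to
 * wait for the other's result. (Illustrative sketch; how well it is
 * pipelined depends on the compiler and its options.) */
float dot_product(const float *restrict a, const float *restrict b, size_t n)
{
    float acc0 = 0.0f;
    float acc1 = 0.0f;
    size_t i;

    /* Two independent multiply-accumulates per iteration. */
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += a[i] * b[i];
        acc1 += a[i + 1] * b[i + 1];
    }

    /* Handle the last element when n is odd. */
    if (i < n) {
        acc0 += a[i] * b[i];
    }

    return acc0 + acc1;
}
```

Note that splitting the sum across two accumulators changes the order of the floating point additions, so results can differ in the last bits from a single-accumulator loop.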
ARM-NEON on the other hand can, in floating point mode:
Issue two multiplications per cycle. Latency is comparable, and the instructions can be pipelined just like on the DSP. Loads and stores take extra time, as do additions and subtractions. Loop control and branching will very likely come for free in well-written code.
So in summary, the DSP does three times as much work per cycle (six floating point operations versus two) as the Cortex-A9 NEON unit.
Now you can check the clock rates of the DSP and the ARM and see which is faster for your job.
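The comparison boils down to operations per cycle multiplied by clock rate. A small sketch of that arithmetic in C, using the per-cycle figures from above (six for the DSP, two for NEON); the function names are made up for illustration, and any clock rates you plug in, such as the 300 MHz used in the usage note, are placeholders you should replace with the values from your own datasheets. These are peak numbers only and say nothing about memory stalls or how well real code is scheduled.

```c
/* Peak floating point throughput in MFLOPS:
 * operations per cycle times clock rate in MHz. */
double peak_mflops(double flops_per_cycle, double clock_mhz)
{
    return flops_per_cycle * clock_mhz;
}

/* Clock rate (in MHz) the ARM would need to match the DSP's peak
 * throughput, given each side's operations per cycle. */
double break_even_clock_mhz(double dsp_flops_per_cycle,
                            double dsp_clock_mhz,
                            double arm_flops_per_cycle)
{
    return dsp_flops_per_cycle * dsp_clock_mhz / arm_flops_per_cycle;
}
```

For example, assuming a DSP clocked at 300 MHz, `break_even_clock_mhz(6.0, 300.0, 2.0)` gives 900, so at peak the NEON unit would need to run at roughly 900 MHz to keep up.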
Oh, one thing: With well-written DSP code you will almost never see a cache miss during loads because you move the data from RAM to the cache using DMA before you access the data. This gives impressive speed advantages for the DSP as well.
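One common way to arrange this is "ping-pong" buffering: while the CPU works on one block in fast memory, the next block is already being fetched into a second buffer. The sketch below only shows the structure in portable C. `fetch_block`, `BLOCK`, and `sum_of_squares_blocked` are hypothetical names, and `memcpy` stands in for the DMA transfer; on the real DSP you would start a transfer through TI's DMA facilities and wait for its completion before using the buffer, which this sketch does not attempt to show.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK 256  /* hypothetical block size, in floats */

/* Stand-in for starting a DMA transfer into fast on-chip memory.
 * memcpy is synchronous; a real DMA engine would copy in the background
 * while the CPU keeps computing. */
static void fetch_block(float *fast_buf, const float *src, size_t count)
{
    memcpy(fast_buf, src, count * sizeof(float));
}

/* Sum of squares over n floats, processed block by block from two
 * alternating ("ping-pong") buffers. */
float sum_of_squares_blocked(const float *data, size_t n)
{
    static float buf[2][BLOCK];
    float acc = 0.0f;
    size_t offset = 0;                       /* start of the current block */
    size_t cur_len = n < BLOCK ? n : BLOCK;  /* length of the current block */
    int cur = 0;                             /* buffer holding the current block */

    if (n == 0) {
        return 0.0f;
    }

    /* Prime the first buffer. */
    fetch_block(buf[cur], data, cur_len);

    while (cur_len > 0) {
        size_t next_off = offset + cur_len;
        size_t remaining = n - next_off;
        size_t next_len = remaining < BLOCK ? remaining : BLOCK;

        /* Start fetching the next block before touching the current one. */
        if (next_len > 0) {
            fetch_block(buf[1 - cur], data + next_off, next_len);
        }

        /* Compute on the block that is already in fast memory. */
        for (size_t i = 0; i < cur_len; i++) {
            acc += buf[cur][i] * buf[cur][i];
        }

        /* Swap buffers and move on. */
        offset = next_off;
        cur_len = next_len;
        cur = 1 - cur;
    }

    return acc;
}
```

With a real DMA engine the fetch and the compute loop overlap in time, which is where the speed advantage comes from; with `memcpy` they simply run one after the other.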