Memory copying: ARM STM vs. ARM NEON

I need to copy large amounts of memory (on the order of 47k), for example from a USB buffer to a more permanent buffer.

This is using an ARM Cortex A8.

(This processor has the NEON extension.)

The ARM NEON instruction can copy 4 32-bit elements at a time (per instruction).
The ARM LDM and STM instructions can load and store (copy) more than 4 registers at a time (per instruction).

Questions:

Which is more efficient for copying large amounts (e.g. 47k) of memory, the ARM NEON instruction or the ARM LDM and STM instructions? (I don't have benchmarking tools available; this is on an embedded system).
What is the advantage of the ARM NEON instructions for copying memory?
The project is primarily C language, but also has some assembly language. Is there a method to suggest to the compiler to use ARM NEON or the LDM/STM instructions, without optimizations? (We are launching code without optimizations so there are no differences when the product is returned. There is a possibility that optimization can be responsible for issues in the product.)

Tools:
ARM Cortex A8 processor
IAR Embedded Workbench IDE & Compiler.
Development on Windows 10 PC, to remote embedded ARM processor (via JTAG).

Solution

NEON has the advantage of supporting unaligned loads and stores, but it consumes more power.

And since you are copying from the USB buffer to a permanent one, where you have full control over alignment and size, you would be better off without NEON, because the memory speed is the same either way.

The standard memcpy most probably already uses NEON (it depends on the BSP), so I'd write a mini version using ldrd and strd, which are slightly faster than ldm and stm.

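@ In: r0 = destination, r1 = source, r2 = byte count (nonzero multiple of 32)
@ Both pointers are biased by -8 so the last access of each group can write back.
.balign 64
    push    {r4-r11}                @ save callee-saved registers
    sub     r1, r1, #8
    sub     r0, r0, #8
    b       1f

.balign 64
1:
    ldrd    r4, r5, [r1, #8]        @ load 32 bytes
    ldrd    r6, r7, [r1, #16]
    ldrd    r8, r9, [r1, #24]
    ldrd    r10, r11, [r1, #32]!    @ writeback: r1 += 32
    subs    r2, r2, #32             @ update the count; strd leaves the flags alone

    strd    r4, r5, [r0, #8]        @ store 32 bytes
    strd    r6, r7, [r0, #16]
    strd    r8, r9, [r0, #24]
    strd    r10, r11, [r0, #32]!    @ writeback: r0 += 32
    bgt     1b                      @ loop while bytes remain

.balign 16
    pop     {r4-r11}
    bx      lr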
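
To make the loop's behavior at the edges easier to see, here is a rough C model of it. This is only a sketch for reasoning on a PC, not the routine itself: the name copy32_model is made up, and the eight-word array merely stands in for r4-r11.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of the assembly loop above (illustration only).
 * Returns the number of bytes actually moved. */
static size_t copy32_model(uint32_t *dst, const uint32_t *src, long size)
{
    size_t copied = 0;

    assert(dst != NULL && src != NULL);

    do {
        uint32_t r[8];                  /* stands in for r4-r11 */
        int i;

        for (i = 0; i < 8; i++)         /* four ldrd = 32 bytes in */
            r[i] = src[i];
        size -= 32;                     /* subs r2, r2, #32 */
        for (i = 0; i < 8; i++)         /* four strd = 32 bytes out */
            dst[i] = r[i];

        src += 8;                       /* pointer writeback */
        dst += 8;
        copied += 32;
    } while (size > 0);                 /* bgt 1b */

    return copied;
}
```

Because the test sits at the bottom of the loop, the model always moves at least one 32-byte block and always rounds up: a request for 40 bytes moves 64, and a request for 0 bytes moves 32. That is why the size should be a nonzero multiple of 32.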

You should have no problem making the buffer size a multiple of 32 and aligning both buffers to 64 bytes (the cache line length) or, even better, to 4096 bytes (the page size).
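
On the C side, the size and alignment requirements could be set up along these lines. This is a sketch under assumptions: the buffer names, the macros, and copy_args_ok are all made up for illustration, "47k" is taken as 47 * 1024 bytes, and C11 alignas is assumed to be available. If your IAR setup is not in C11 mode, its `#pragma data_alignment` directive is probably the equivalent to look at.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define COPY_BLOCK 32u   /* bytes moved per loop iteration */
#define CACHE_LINE 64u   /* cache line length stated above */

/* Round a byte count up to the next multiple of COPY_BLOCK. */
#define ROUND_UP_32(n) (((n) + COPY_BLOCK - 1u) & ~(COPY_BLOCK - 1u))

/* Hypothetical buffers: size rounded up, start aligned to a cache line. */
#define BUF_SIZE ROUND_UP_32(47u * 1024u)
static alignas(CACHE_LINE) uint8_t usb_buf[BUF_SIZE];
static alignas(CACHE_LINE) uint8_t keep_buf[BUF_SIZE];

/* Returns nonzero when the arguments meet the copy loop's preconditions:
 * a nonzero size that is a multiple of 32, and cache-line-aligned pointers. */
static int copy_args_ok(const void *dst, const void *src, size_t size)
{
    return size != 0
        && size % COPY_BLOCK == 0
        && (uintptr_t)dst % CACHE_LINE == 0
        && (uintptr_t)src % CACHE_LINE == 0;
}
```

A debug build could call copy_args_ok(keep_buf, usb_buf, BUF_SIZE) inside an assert before invoking the copy routine, so a misaligned pointer or an odd size is caught early instead of silently overrunning a buffer.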