Basically, I need to make this piece of code more efficient either by reducing the size of the overall code to reduce memory size or make it more efficient in how it runs. I am using Thumb 2 as well as the Cortex-M3.
I have tried reducing the amount of MOV Functions used but while that does reduce the overall code size, because of the way the code works, it needs every individual piece to take and store the results in the registers, so I am kind of stumped for how I can improve it. The code is at it's default state at the moment.
THUMB
AREA RESET, CODE, READONLY
EXPORT __Vectors
EXPORT Reset_Handler
__Vectors
DCD 0x00180000 ; top of the stack
DCD Reset_Handler ; reset vector - where the program starts
AREA 2a_Code, CODE, READONLY
Reset_Handler
ENTRY
num_words EQU (end_source-source)/4 ; number of words to copy
start
LDR r0,=source ; point to the start of the area of memory to copy from
LDR r1,=dest ; point to the start of the area of memory to copy to
MOV r2,#num_words ; get the number of words to copy
; find out how many blocks of 8 words need to be copied - it is assumed
; that it is faster to load 8 data items at a time, rather than load
; individually
block
MOVS r3,r2,LSR #3 ; find the number of blocks of 8 words
BEQ individ ; if no blocks to copy, just copy individual words
; copy and process blocks of 8 words
block_loop
LDMIA r0!,{r5-r12} ; get 8 words to copy as a block
MOV r4,r5 ; get first item
BL data_processing ; process first item
MOV r5,r4 ; keep first item
MOV r4,r6 ; get second item
BL data_processing ; process second item
MOV r6,r4 ; keep second item
MOV r4,r7 ; get third item
BL data_processing ; process third item
MOV r7,r4 ; keep third item
MOV r4,r8 ; get fourth item
BL data_processing ; process fourth item
MOV r8,r4 ; keep fourth item
MOV r4,r9 ; get fifth item
BL data_processing ; process fifth item
MOV r9,r4 ; keep fifth item
MOV r4,r10 ; get sixth item
BL data_processing ; process sixth item
MOV r10,r4 ; keep sixth item
MOV r4,r11 ; get seventh item
BL data_processing ; process seventh item
MOV r11,r4 ; keep seventh item
MOV r4,r12 ; get eighth item
BL data_processing ; process eighth item
MOV r12,r4 ; keep eighth item
STMIA r1!,{r5-r12} ; copy the 8 words
SUBS r3,r3,#1 ; move on to the next block
BNE block_loop ; continue until last block reached
; there may now be some data items available (fewer than 8)
; find out how many of these individual words need to be copied
individ
ANDS r3,r2,#7 ; find the number of words that remain to copy individually
BEQ exit ; skip individual copying if none remains
; copy the excess of words
individ_loop
LDR r4,[r0],#4 ; get next word to copy
BL data_processing ; process the item read
STR r4,[r1],#4 ; copy the word
SUBS r3,r3,#1 ; move on to the next word
BNE individ_loop ; continue until the last word reached
; languish in an endless loop once all is done
exit
B exit
; subroutine to scale a value by 0.5 and then saturate values to a maximum of 5
data_processing
CMP r4,#10 ; check whether saturation is needed
BLT divide_by_two ; if not, just divide by 2
MOV r4,#5 ; saturate to 5
BX lr
divide_by_two
MOV r4,r4,LSR #1 ; perform scaling
BX lr
AREA 2a_ROData, DATA, READONLY
source ; some data to copy
DCD 1,2,3,4,5,6,7,8,9,10,11,0,4,6,12,15,13,8,5,4,3,2,1,6,23,11,9,10
end_source
AREA 2a_RWData, DATA, READWRITE
dest ; copy to this area of memory
SPACE end_source-source
end_dest
END
Basically I need the code to store the results in every register, while also being either reduced in size or made to execute quicker. Thanks for your help.
Here is a slightly optimised version of your main loop. Given that you are programming for a Cortex M3, there is no real possibility for super-scalar or SIMD processing as your CPU does not support it. The main differences between this and your code are:
This code runs in 10 cycles per table entry plus a few instructions for the initial branch plus eventual branch mispredictions.
.syntax unified
.thumb
@ r0: source
@ r1: destination
@ r2: number of words to copy
@ the number in front of the comment is the number
@ of cycles needed to execute the instruction
block: cbz r2, .Lbxlr @ 2 return if nothing to copy
.Loop: ldmia r0!, {r3} @ 2 load one item from source
cmp r3, #10 @ 1 need to scale?
ite lt @ 1 if r3 < 10:
lsrlt r3, r3, #1 @ 1 then r3 >>= 1
movge r3, #5 @ 1 else r3 = 5
stmia r1!, {r3} @ 2 store to destination
subs r2, r2, #1 @ 1 decrement #words
bne .Loop @ 1 continue if not done yet
.Lbxlr: bx lr
You can get this down to 16 cycles for two entries (8 cycles per entry) by unrolling the loop once. Note that this almost triplicates the code length for only a small performance gain.
.syntax unified
.thumb
@ r0: source
@ r1: destination
@ r2: number of words to copy
@ the number in front of the comment is the number
@ of cycles needed to execute the instruction
@ first check if the number of elements is even or odd
@ leave this out if it's know to be even
block: tst r2, #1 @ 1 odd number of entries to copy?
beq .Leven @ 2 if not, proceed with eveness check
ldmia r0!, {r3} @ 2 load one item from source
cmp r3, #10 @ 1 need to scale?
ite lt @ 1 if r3 < 10:
lsrlt r3, r3, #1 @ 1 then r3 >>= 1
movge r3, #5 @ 1 else r3 = 5
stmia r1!, {r3} @ 2 store to destination
subs r2, r2, #1 @ 1 decrement #words
@ check if any elements are left
@ remove if you know that at least two elements are present
.Leven: cbz r2, .Lbxlr @ 2 return if no entries left.
.Loop: ldmia r0!, {r3, r4} @ 3 load two items from source
cmp r3, #10 @ 1 need to scale?
ite lt @ 1 if r3 < 10:
lsrlt r3, r3, #1 @ 1 then r3 >>= 1
movge r3, #5 @ 1 else r3 = 5
cmp r4, #10 @ 1 need to scale?
ite lt @ 1 if r5 < 10:
lsrlt r4, r4, #1 @ 1 then r4 >>= 1
movge r4, #5 @ 1 else r4 = 5
stmia r1!, {r3, r4} @ 3 store to destination
subs r2, r2, #2 @ 1 decrement #words twice
bne .Loop @ 1 continue if not done yet
.Lbxlr: bx lr
7 cycles per element could be achieved by unrolling the loop four times, but I believe that is excessive.
Note that this code is in GNU as syntax. It should be trivial to modify it for your assembler.