Suppose I have a short array v
of say 8 int64_t
. I have an algorithm that needs to access different elements of that array, which are not compile-time constants, e.g. something like v[(i + j)/2] += ...
in which i
and j
are variables not subject to any kind of constant propagation.
Ordinarily I’d keep the array I memory, calculate the array index, load the array from memory in that position, and then store the result.
But suppose that, for valid reasons which I won’t go into, I want to keep the full array in registers -- the array is size-limited and fits the register bank.
If I were just reading from, and not writing to, the array, I could use (in ARMv8 NEON) the TBL instruction to perform table lookups. But what about the case of writing?
All I can think of is self-modifying code, encoding the array index directly into the instructions and executing it. I know this carries performance penalties when first running, but it might even work if the same code were executed over and over again.
Other than that, any ideas? Is it even possible? I reviewed the parts relevant to the instruction set and encoding of the ARMv8 architecture reference manual, and so far I’m inclined to say no, but maybe someone knows an obscure instruction or addressing mode that would help here.
If you want to access x8
, then there is no other way than to have an instruction that encodes x8
as a source register. So outside of emitting instructions at runtime, the only index-based solution I can come up with is to have a stub for each register, and branch based on the index like a switch-case. Assuming your array spans x8 through x15:
.p2align 2 // maybe change this to align to cacheline size?
read_reg:
adr x2, 1f
add x2, x0, lsl 3
br x2
1:
mov x0, x8
ret
mov x0, x9
ret
mov x0, x10
ret
mov x0, x11
ret
mov x0, x12
ret
mov x0, x13
ret
mov x0, x14
ret
mov x0, x15
ret
Writing would work the same way. This of course has the chance of messing up branch predictions. One other "hack" I can think of to not use branches at all is to combine csel
with a direct move to the nzcv
system register:
.p2align 2
read_reg:
lsl x0, x0, 28
msr nzcv, x0
csel x4, x8, x9, vc // bit 0
csel x5, x10, x11, vc
csel x6, x12, x13, vc
csel x7, x14, x15, vc
csel x4, x4, x5, lo // bit 1
csel x5, x6, x7, lo
csel x0, x4, x5, ne // bit 2
ret
This could be extended to a maximum of 16 registers. I'm not too certain on the performance constraints of the msr
or whether it requires an isb
on some architectures though - on an Apple M1 at least, it doesn't. And the case for writing wouldn't be as compact, as you need at the very least 8 instructions to target each register. :/