Is there a way to treat the register file as an array in ARMv8 (scalar or Neon)?

Suppose I have a short array v of say 8 int64_t. I have an algorithm that needs to access different elements of that array, which are not compile-time constants, e.g. something like v[(i + j)/2] += ... in which i and j are variables not subject to any kind of constant propagation.

Ordinarily I’d keep the array in memory, calculate the array index, load the element at that position, update it, and store the result back.

But suppose that, for valid reasons which I won’t go into, I want to keep the full array in registers -- the array is size-limited and fits the register bank.

If I were just reading from, and not writing to, the array, I could use (in ARMv8 NEON) the TBL instruction to perform table lookups. But what about the case of writing?

All I can think of is self-modifying code: encode the array index directly into the instructions as a register number, then execute them. I know this carries a performance penalty the first time it runs, but it might be workable if the same code were executed over and over again.

Other than that, any ideas? Is it even possible? I reviewed the instruction set and encoding sections of the ARMv8 Architecture Reference Manual, and so far I’m inclined to say no, but maybe someone knows an obscure instruction or addressing mode that would help here.

Solution

If you want to access x8, then there is no other way than to have an instruction that encodes x8 as a source register. So outside of emitting instructions at runtime, the only index-based solution I can come up with is to have a stub for each register, and branch based on the index like a switch-case. Assuming your array spans x8 through x15:

.p2align 2 // maybe change this to align to cacheline size?
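read_reg:                       // in: x0 = index (0..7), out: x0 = value
    adr x2, 1f                  // x2 = address of the first stub
    add x2, x2, x0, lsl 3       // each stub is 2 instructions = 8 bytes
    br x2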
1:
    mov x0, x8
    ret
    mov x0, x9
    ret
    mov x0, x10
    ret
    mov x0, x11
    ret
    mov x0, x12
    ret
    mov x0, x13
    ret
    mov x0, x14
    ret
    mov x0, x15
    ret
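
A call site for the stub table above might look like the following sketch. The register assignments (i in x19, j in x20) are assumptions made up for illustration, and there is no bounds check, so the index has to be in the range 0..7 already:

```asm
    // Hypothetical call site: i lives in x19, j in x20 (assumed, not from the question).
    add x0, x19, x20        // i + j
    lsr x0, x0, 1           // (i + j) / 2, assuming the sum is non-negative
    bl read_reg             // x0 = v[(i + j) / 2]; clobbers x2 and x30
```

Since bl overwrites the link register, a caller that is not a leaf function needs to have saved x30 beforehand.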

Writing would work the same way. This does, of course, risk hurting branch prediction, since the target of the indirect branch depends on the index. One other "hack" I can think of that avoids branches entirely is to combine csel with a direct move to the nzcv system register:

.p2align 2
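read_reg:                   // in: x0 = index (0..7), out: x0 = value; clobbers x4-x7 and the flags
    lsl x0, x0, 28          // move index bits 0..3 up to bits 28..31
    msr nzcv, x0            // V = bit 0, C = bit 1, Z = bit 2 (N = bit 3, unused here)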
    csel x4, x8, x9, vc // bit 0
    csel x5, x10, x11, vc
    csel x6, x12, x13, vc
    csel x7, x14, x15, vc
    csel x4, x4, x5, lo // bit 1
    csel x5, x6, x7, lo
    csel x0, x4, x5, ne // bit 2
    ret

This could be extended to a maximum of 16 registers, using the N flag for a fourth index bit. I'm not too certain about the performance cost of the msr, or whether it requires an isb on some implementations though - on an Apple M1 at least, it doesn't. And the case for writing wouldn't be as compact, as you need at the very least 8 instructions, one per target register. :/
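
For reference, here is a rough sketch of what the two write variants could look like, under the same assumption that the array spans x8 through x15. The names write_reg_branch and write_reg_csel, and the convention of passing the index in x0 and the value in x1, are made up for this example. The branchless version uses a plain cmp per register rather than the nzcv trick, so it takes 16 instructions instead of the minimum of 8; I haven't measured either one.

```asm
// Hypothetical write stubs. In: x0 = index (0..7), x1 = value to store.

.p2align 2
write_reg_branch:               // stub table, mirrors the read version
    adr x2, 1f                  // x2 = address of the first stub
    add x2, x2, x0, lsl 3       // each stub is 2 instructions = 8 bytes
    br x2
1:
    mov x8, x1
    ret
    mov x9, x1
    ret
    mov x10, x1
    ret
    mov x11, x1
    ret
    mov x12, x1
    ret
    mov x13, x1
    ret
    mov x14, x1
    ret
    mov x15, x1
    ret

.p2align 2
write_reg_csel:                 // branchless: every register is rewritten,
    cmp x0, 0                   // but only the one matching the index changes
    csel x8, x1, x8, eq
    cmp x0, 1
    csel x9, x1, x9, eq
    cmp x0, 2
    csel x10, x1, x10, eq
    cmp x0, 3
    csel x11, x1, x11, eq
    cmp x0, 4
    csel x12, x1, x12, eq
    cmp x0, 5
    csel x13, x1, x13, eq
    cmp x0, 6
    csel x14, x1, x14, eq
    cmp x0, 7
    csel x15, x1, x15, eq
    ret
```

With write_reg_csel an out-of-range index simply leaves all eight registers unchanged, whereas write_reg_branch would jump past the end of the table, so the latter needs the index to be validated by the caller.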