I'm looking for a way to load elements from an 8-bit source array (uint8_t*) into an AArch64 NEON / ASIMD register with the data format uint16x8_t or, even better, uint16x8x3_t. So basically, each byte in the source array has to be loaded as a short into the register.
In a for-loop, I have to do the load in each iteration with a new batch of values.
I cannot find any ASIMD intrinsics to do this, but perhaps I am missing something. My current approach is to first load the elements as uint8x8x3_t and then perform a widening move on each channel (using vmovl_u8, so that the elements turn into uint16x8_t), but this seems very inefficient:
uint8x8x3_t bgrChunk = vld3_u8(bgr);
uint16x8_t b = vmovl_u8(bgrChunk.val[0]);
uint16x8_t g = vmovl_u8(bgrChunk.val[1]);
uint16x8_t r = vmovl_u8(bgrChunk.val[2]);
bgr += 24; // Required for next iteration
I have also tried the following, but this performs even worse than the above:
uint16_t bgrValues[] = { *bgr++, *bgr++, *bgr++, ... repeat up to 24 elements ..., *bgr++, *bgr++ };
uint16x8x3_t bgrChunk = vld3q_u16(bgrValues);
Is there a more efficient way to do this, or am I missing some intrinsic that will make this easier for me?
Edit: Extended example of what I want
Let's say I have an array uint8_t* with values { 5, 33, 102, 153... }. Is there a way that I can directly load each individual 8-bit element into a register as a 16-bit value, so that the register will contain the 16-bit values { 5, 33, 102, 153... }?
void foo(uint8_t* bgr, uint16_t width, uint16_t height) {
    for (uint16_t y = 0; y < height; y++) {
        for (uint16_t x = 0; x < width; x += 8) {
            // I want to load 8-bit values as 16-bit values here.
            // Is there a more efficient way to do this than the code below?
            uint8x8x3_t bgrChunk = vld3_u8(bgr);
            uint16x8_t b = vmovl_u8(bgrChunk.val[0]);
            uint16x8_t g = vmovl_u8(bgrChunk.val[1]);
            uint16x8_t r = vmovl_u8(bgrChunk.val[2]);
            bgr += 24;
            // ... Some operations working on the loaded data
        }
    }
}
The instruction set keeps loads/stores orthogonal to data processing, so there is no widening load: you'll need to load the 8-bit values into registers and then widen them to 16 bits as a second operation.
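As a minimal sketch of that two-step pattern for plain (non-interleaved) data, the following loads 16 bytes with a single load and then widens each half separately. The helper name `load_widen_16` is hypothetical, and the scalar branch is there only so the snippet also builds on non-AArch64 hosts.

```c
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: zero-extend 16 bytes from src into 16 uint16_t in dst. */
static void load_widen_16(const uint8_t *src, uint16_t *dst)
{
#if defined(__aarch64__)
    uint8x16_t bytes = vld1q_u8(src);               /* step 1: plain 8-bit load      */
    uint16x8_t lo = vmovl_u8(vget_low_u8(bytes));   /* step 2: widen the low 8 bytes */
    uint16x8_t hi = vmovl_high_u8(bytes);           /* widen the high 8 bytes (A64)  */
    vst1q_u16(dst, lo);
    vst1q_u16(dst + 8, hi);
#else
    /* Scalar fallback with the same result, for hosts without AArch64 NEON. */
    for (int i = 0; i < 16; i++)
        dst[i] = src[i];
#endif
}
```

The widening here costs two register-to-register instructions per 16 bytes, which is usually cheap next to the load itself.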
Depending on what you do next, it is often possible for this second operation to be a useful arithmetic operation, not just a move. For example, vmull_s8(), vaddl_s8(), and vsubl_s8() all return a 16-bit result (unsigned _u8 variants exist as well). There are similar narrowing equivalents if you want to go the other way.
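To illustrate folding the widening into the arithmetic, here is a hedged sketch that assumes the "next operation" is a per-pixel channel sum b + g + r, which is not something the question specifies. It uses the unsigned variants vaddl_u8 and vaddw_u8 so that no separate vmovl_u8 is needed. The function name `sum_bgr_8px` is hypothetical, and the scalar branch only exists so the snippet builds on non-AArch64 hosts.

```c
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical example: out[i] = b[i] + g[i] + r[i] for 8 interleaved BGR pixels. */
static void sum_bgr_8px(const uint8_t *bgr, uint16_t *out)
{
#if defined(__aarch64__)
    uint8x8x3_t px = vld3_u8(bgr);                    /* de-interleave 24 bytes      */
    uint16x8_t sum = vaddl_u8(px.val[0], px.val[1]);  /* widen and add: b + g        */
    sum = vaddw_u8(sum, px.val[2]);                   /* add r, widened as it is used */
    vst1q_u16(out, sum);
#else
    /* Scalar fallback with the same result, for hosts without AArch64 NEON. */
    for (int i = 0; i < 8; i++)
        out[i] = (uint16_t)(bgr[3 * i] + bgr[3 * i + 1] + bgr[3 * i + 2]);
#endif
}
```

Compared with three vmovl_u8 calls followed by two 16-bit adds, this does the same work in two arithmetic instructions. Whether a similar fusion is available depends on what the real loop body computes.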