c++ c simd arm64 neon

ARM64 ASIMD intrinsic to load uint8_t* into uint16x8(x3)?


I'm looking for a way to load elements from an 8-bit source array (uint8_t*) into an AArch64 NEON / ASIMD register of type uint16x8_t or, even better, uint16x8x3_t. In other words, each byte in the source array has to end up in the register as a 16-bit value.

In a for-loop, I have to do the load in each iteration with a new batch of values.

I cannot find any ASIMD intrinsic that does this, but perhaps I am missing something. My current approach is to load the elements as uint8x8x3_t and then widen each vector with vmovl_u8 (so that the elements become uint16x8_t), but this seems very inefficient:

uint8x8x3_t bgrChunk = vld3_u8(bgr);
uint16x8_t b = vmovl_u8(bgrChunk.val[0]);
uint16x8_t g = vmovl_u8(bgrChunk.val[1]);
uint16x8_t r = vmovl_u8(bgrChunk.val[2]);
bgr += 24; // Required for next iteration

I have also tried the following, but it performs even worse than the above:

uint16_t bgrValues[24];
for (int i = 0; i < 24; i++) bgrValues[i] = *bgr++; // widen each byte on the scalar side
uint16x8x3_t bgrChunk = vld3q_u16(bgrValues);

Is there a more efficient way to do this, or am I missing some intrinsic that will make this easier for me?

Edit: Extended example of what I want

Let's say I have an array uint8_t* with values { 5, 33, 102, 153... }

Is there a way to load each individual 8-bit element directly into a register as a 16-bit value, so that the register contains the 16-bit values { 5, 33, 102, 153... }?

void foo(uint8_t* bgr, uint16_t width, uint16_t height) {
  for (uint16_t y = 0; y < height; y++) {
    for (uint16_t x = 0; x < width; x += 8) {
      // I want to load 8-bit values as 16-bit values here. Is there a more efficient way to do this than the code below?
      uint8x8x3_t bgrChunk = vld3_u8(bgr);
      uint16x8_t b = vmovl_u8(bgrChunk.val[0]);
      uint16x8_t g = vmovl_u8(bgrChunk.val[1]);
      uint16x8_t r = vmovl_u8(bgrChunk.val[2]);
      bgr += 24;
      // ... Some operations working on the loaded data
    }
  }
}

Solution

  • The load/store instructions are orthogonal to data conversion: there is no widening load, so you need to load the 8-bit values into registers and then widen them to 16 bits as a second operation.

    Depending on what you do next, it is often possible for this second operation to be a useful arithmetic operation rather than just a move. For example, vmull_s8(), vaddl_s8() and vsubl_s8() all return a 16-bit result (for unsigned data such as yours, use the _u8 variants vmull_u8(), vaddl_u8() and vsubl_u8()). There are similar narrowing equivalents if you want to go the other way.
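
    As a rough sketch of that idea: if the "operations working on the loaded data" happened to be a weighted sum of the three channels, the widening could be folded into vmull_u8 / vmlal_u8, and the result narrowed again with vshrn_n_u16, so no separate vmovl_u8 is needed. The function names bgr_to_gray and bgr_to_gray_scalar and the weights 29/150/77 are assumptions made up for this illustration; they are not taken from the question. The scalar path is only there so the snippet also builds on targets without NEON.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) || defined(__ARM_NEON)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical weights, chosen only for illustration. They sum to 256,
 * so 255 * 256 = 65280 still fits in a 16-bit lane. */
enum { W_B = 29, W_G = 150, W_R = 77 };

/* Scalar reference: gray = (29*b + 150*g + 77*r) >> 8 */
static void bgr_to_gray_scalar(const uint8_t *bgr, uint8_t *gray, size_t pixels) {
  for (size_t i = 0; i < pixels; i++) {
    uint16_t acc = (uint16_t)(W_B * bgr[3 * i] + W_G * bgr[3 * i + 1] +
                              W_R * bgr[3 * i + 2]);
    gray[i] = (uint8_t)(acc >> 8);
  }
}

/* Same computation, 8 pixels per iteration when NEON is available. */
static void bgr_to_gray(const uint8_t *bgr, uint8_t *gray, size_t pixels) {
  size_t i = 0;
#if HAVE_NEON
  const uint8x8_t wb = vdup_n_u8(W_B);
  const uint8x8_t wg = vdup_n_u8(W_G);
  const uint8x8_t wr = vdup_n_u8(W_R);
  for (; i + 8 <= pixels; i += 8) {
    /* De-interleaving load: val[0] = B, val[1] = G, val[2] = R (8 x u8 each). */
    uint8x8x3_t px = vld3_u8(bgr + 3 * i);
    /* The widening happens inside the multiply: u8 * u8 -> u16. */
    uint16x8_t acc = vmull_u8(px.val[0], wb);
    acc = vmlal_u8(acc, px.val[1], wg); /* acc += g * wg, widened */
    acc = vmlal_u8(acc, px.val[2], wr); /* acc += r * wr, widened */
    /* Narrowing counterpart: shift right by 8 and pack back to u8. */
    vst1_u8(gray + i, vshrn_n_u16(acc, 8));
  }
#endif
  /* Remaining pixels (or all of them without NEON) use the scalar path. */
  bgr_to_gray_scalar(bgr + 3 * i, gray + i, pixels - i);
}
```

    If the follow-up work really does need plain 16-bit copies of b, g and r, the vld3_u8 + vmovl_u8 sequence from the question is already the usual way to get them.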