I need to copy all the odd-numbered bytes from one memory location to another, i.e. copy the first, third, fifth, etc. Specifically, I'm copying from the text-mode screen area at 0xB8000, which contains 2000 character/attribute words. I want to skip the attribute bytes and end up with just the characters. The following code works fine:
mov eax, ecx ; eax = number of bytes (1 to 2000)
mov rsi, rdi ; rsi = source
mov rdi, CMD_BLOCK ; rdi = destination
@@: movsb ; copy 1 byte
inc rsi ; skip the next source byte
dec eax
jnz @b
The number of characters to be copied is anywhere from 1 to 2000. I've recently started playing with SSE2, SSE3 and SSE4.2, but I can't find an instruction (or instructions) that would reduce the looping. Ideally I would love to cut the loop count from 2000 to, say, 250, which would be possible if there were an instruction that could skip every second byte after loading 128 bits at a time.
I would do something like this, processing 32 input bytes to 16 output bytes per loop iteration (this assumes n is a multiple of 16; any remainder needs to be handled separately):
const __m128i vmask = _mm_set1_epi16(0x00ff);
for (i = 0; i < n; i += 16)
{
    __m128i v0 = _mm_loadu_si128((const __m128i *)&a[2 * i]);      // load 2 x 16 input bytes (MOVDQU)
    __m128i v1 = _mm_loadu_si128((const __m128i *)&a[2 * i + 16]);
    v0 = _mm_and_si128(v0, vmask);                                 // mask unwanted bytes (PAND)
    v1 = _mm_and_si128(v1, vmask);
    __m128i v = _mm_packus_epi16(v0, v1);                          // pack low bytes (PACKUSWB)
    _mm_storeu_si128((__m128i *)&b[i], v);                         // store 16 output bytes (MOVDQU)
}
This is C with intrinsics of course - if you really want to do this in assembler then you can just convert each intrinsic above into its corresponding instruction.
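Since the count in the question can be anything from 1 to 2000, here is a minimal sketch of the same idea wrapped in a function that also copies the leftover characters one at a time. The function name `copy_even_bytes` and its parameters are hypothetical, chosen only for illustration; it assumes `src` points at `n` character/attribute pairs (`2 * n` readable bytes) and `dst` has room for `n` bytes.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy the low (character) byte of each of n
   character/attribute pairs from src to dst. */
void copy_even_bytes(uint8_t *dst, const uint8_t *src, size_t n)
{
    const __m128i vmask = _mm_set1_epi16(0x00ff);  /* keep the low byte of each word */
    size_t i = 0;

    /* Main loop: 32 input bytes -> 16 output bytes per iteration. */
    for (; i + 16 <= n; i += 16)
    {
        __m128i v0 = _mm_loadu_si128((const __m128i *)(src + 2 * i));
        __m128i v1 = _mm_loadu_si128((const __m128i *)(src + 2 * i + 16));
        v0 = _mm_and_si128(v0, vmask);             /* clear attribute bytes (PAND) */
        v1 = _mm_and_si128(v1, vmask);
        /* After masking, every word is 0..255, so the saturating pack
           (PACKUSWB) passes the values through unchanged. */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packus_epi16(v0, v1));
    }

    /* Scalar tail: the remaining 0..15 characters. */
    for (; i < n; i++)
        dst[i] = src[2 * i];
}
```

With n = 2000 the vector loop runs 125 times and the tail loop does nothing; for smaller or odd counts the tail loop avoids reading or writing past the requested range.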