Suppose I have an object comprising many small integer types. For example:
uint16_t values[8];
Or as part of a union:
union Data {
uint16_t values[8];
// Other members
};
I would like to assign them all to a single value. The obvious solution is to use a loop:
for (int i = 0; i < 8; ++i)
values[i] = new_value;
Or to individually assign each value:
values[0] = new_value;
// ...
values[7] = new_value;
I was under the assumption that the compiler with -Os
enabled would recognize these and generate optimal code for them but instead they both compiled into the naive solution:
mov WORD PTR [rdi], si
mov WORD PTR [rdi+2], si
mov WORD PTR [rdi+4], si
mov WORD PTR [rdi+6], si
mov WORD PTR [rdi+8], si
mov WORD PTR [rdi+10], si
mov WORD PTR [rdi+12], si
mov WORD PTR [rdi+14], si
Does anyone know of a clever SWAR solution or similar trick to perform this task with fewer operations?
If the smaller type in question is a unsigned char
or type with the same representation, one option is to simply use memset()
:
uint8_t values[8];
memset(values, new_value, 8);
Performance will depend on the compiler and implementation. Any reasonable compiler should recognize library calls such as memset()
and generate optimal code for them.
If the type in question is larger than unsigned char
, that rules out memset()
but a couple of other ways come to mind.
You can use type-punning to, rather than assign a value every time, assign the first 2 individually, then reinterpret the first 2 as a larger type, to assign the first 2 to the second 2 in one step, and so on:
union Components {
uint16_t u16[8];
uint32_t u32[4];
uint64_t u64[2];
} values;
values.u16[1] = values.u16[0] = new_value;
values.u32[1] = values.u32[0];
values.u64[1] = values.u64[0];
A similar effect can be used with bitwise operators:
uint64_t value = new_value;
value |= value << 16;
value |= value << 32;
In a sense, the above has a logarithmic number of steps if the number of copies made fits within a single register.
The final way is to simply use multiplication by a magic number. Binary multiplication is equivalent to, for each bit in a number, adding either 0 if the bit is 0, otherwise the other number shifted by the bit position, to the result. As such, we can exploit this by producing masks which comprise 0 bits for all bits except the start of wherever we want a copy of our original number.
uint8_t new_value;
uint64_t values = 0x0101010101010101U * new_value; // 8 copies of new_value concatenated together
uint16_t new_value;
uint64_t values = 0x0001000100010001U * new_value; // 4 copies of new_value concatenated together
The actual 16-bit values can be individually accessed in array form using a union
similar to the second example:
union Components values;
values.u64[1] = values.u64[0] = 0x0001000100010001U * new_value; // This can be made even better if you have a futuristic machine with 128-bit registers