I am trying to convert an existing OpenCL kernel to an HLSL compute shader.
The OpenCL kernel samples each pixel in an RGBA texture and writes each color channel to a tightly packed array.
So basically, I need to write to a tightly packed uchar array in a pattern that goes somewhat like this:
r r r ... r g g g ... g b b b ... b a a a ... a
where each letter stands for a single byte (red / green / blue / alpha) that originates from a pixel channel.
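For reference, the target layout can be written down as a small CPU-side routine. This is only a sketch in C++, not part of the kernel being ported: the helper name to_planar and the assumption that the input is interleaved 8-bit RGBA in row-major order are hypothetical and chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference: converts interleaved RGBA8 pixels
// (r g b a r g b a ...) into the planar layout (r...r g...g b...b a...a).
std::vector<std::uint8_t> to_planar(const std::vector<std::uint8_t>& rgba,
                                    std::size_t width, std::size_t height) {
    const std::size_t num_pixels = width * height;
    assert(rgba.size() == num_pixels * 4);  // four bytes per pixel

    std::vector<std::uint8_t> planar(num_pixels * 4);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            const std::size_t pixel = y * width + x;
            for (std::size_t c = 0; c < 4; ++c) {
                // Channel c of this pixel lands in plane c at offset `pixel`.
                planar[c * num_pixels + pixel] = rgba[pixel * 4 + c];
            }
        }
    }
    return planar;
}
```

A routine like this could also serve as a baseline to compare the compute shader's output against after reading the buffer back.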
Going through the documentation for the RWByteAddressBuffer Store method, it clearly states:
void Store(
in uint address,
in uint value
);
address [in]
Type: uint
The input address in bytes, which must be a multiple of 4.
In order to write the correct pattern to the buffer, I must be able to write a single byte to a non-aligned address. In OpenCL / CUDA this is pretty trivial.
As far as I know, it is not possible to write directly to a non-aligned address in this scenario. You can, however, use a little trick to achieve what you want. Below you can see the code of the entire compute shader, which does exactly what you want. The function StoreValueAtByte in particular is what you are looking for.
Texture2D<float4> Input;
RWByteAddressBuffer Output;
void StoreValueAtByte(in uint index_of_byte, in uint value) {
// Calculate the address of the 4-byte-slot in which index_of_byte resides
// (integer masking avoids float precision loss for indices above 2^24)
uint addr_align4 = index_of_byte & ~3u;
// Calculate which byte within the 4-byte-slot it is
uint location = index_of_byte % 4;
// Shift the byte to its proper location within its 4-byte-slot
// (little-endian: byte 0 of the slot is the least significant byte)
value = (value & 0xFF) << (location * 8);
// Write value to buffer
Output.InterlockedOr(addr_align4, value);
}
[numthreads(20, 20, 1)]
void CSMAIN(uint3 ID : SV_DispatchThreadID) {
// Get width and height of texture
uint tex_width, tex_height;
Input.GetDimensions(tex_width, tex_height);
// Make sure thread does not operate outside the texture
if(tex_width > ID.x && tex_height > ID.y) {
uint num_pixels = tex_width * tex_height;
// Calculate address of where to write color channel data of pixel
uint addr_red = 0 * num_pixels + ID.y * tex_width + ID.x;
uint addr_green = 1 * num_pixels + ID.y * tex_width + ID.x;
uint addr_blue = 2 * num_pixels + ID.y * tex_width + ID.x;
uint addr_alpha = 3 * num_pixels + ID.y * tex_width + ID.x;
// Get color of pixel, clamp to [0,1] and convert to [0,255]
// (clamping keeps each channel within a single byte)
float4 color = Input[ID.xy];
uint4 color_final = uint4(round(saturate(color) * 255.0f));
// Store color channel values in output buffer
StoreValueAtByte(addr_red, color_final.x);
StoreValueAtByte(addr_green, color_final.y);
StoreValueAtByte(addr_blue, color_final.z);
StoreValueAtByte(addr_alpha, color_final.w);
}
}
I hope the code is self-explanatory, since it is hard to explain, but I'll try anyway.
The first thing the function StoreValueAtByte does is calculate the address of the 4-byte-slot enclosing the byte you want to write to. After that, the position of the byte inside the 4-byte-slot is calculated (is it the first, second, third or fourth byte in the slot). Since the byte you want to write is already inside a 4-byte variable (namely value) and occupies the rightmost byte, you then just have to shift the byte to its proper position inside the 4-byte variable. After that, you just have to write the variable value to the buffer at the 4-byte-aligned address. This is done using an atomic bitwise OR (InterlockedOr) because multiple threads write to the same address; a plain Store would overwrite the bytes written by the other threads, leading to write-after-write hazards. This of course only works if you initialize the entire output buffer with zeros before issuing the dispatch call.
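The slot arithmetic described above can be checked on the CPU with a small model of the buffer. This is a hedged sketch in C++, not the shader itself: SlotBuffer, store_byte and load_byte are hypothetical names, the model is single-threaded (so a plain |= stands in for the atomic OR), and it assumes a little-endian layout in which byte 0 of each slot is the least significant byte, so it shifts by location * 8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU model of the output buffer: a zero-initialized array of
// 32-bit slots standing in for the RWByteAddressBuffer.
struct SlotBuffer {
    std::vector<std::uint32_t> slots;

    explicit SlotBuffer(std::size_t num_bytes)
        : slots((num_bytes + 3) / 4, 0u) {}  // round up to whole slots, all zero

    // Mirrors the idea of StoreValueAtByte: OR one byte into its slot.
    void store_byte(std::uint32_t index_of_byte, std::uint32_t value) {
        const std::uint32_t slot = index_of_byte / 4;      // enclosing 4-byte slot
        const std::uint32_t location = index_of_byte % 4;  // byte within the slot
        assert(slot < slots.size());
        // Assumption: little-endian, byte 0 is the least significant byte.
        slots[slot] |= (value & 0xFFu) << (location * 8);
    }

    // Reads a byte back under the same little-endian assumption.
    std::uint8_t load_byte(std::uint32_t index_of_byte) const {
        const std::uint32_t word = slots[index_of_byte / 4];
        return static_cast<std::uint8_t>((word >> ((index_of_byte % 4) * 8)) & 0xFFu);
    }
};
```

Because the write is an OR, any bits already set in a slot survive the store. That is why the model zero-fills its slots in the constructor, and why the real buffer has to be cleared before the dispatch.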