
Is it possible to write to a non-4-byte-aligned address with an HLSL compute shader?


I am trying to convert an existing OpenCL kernel to an HLSL compute shader.

The OpenCL kernel samples each pixel in an RGBA texture and writes each color channel to a tightly packed array.

So basically, I need to write to a tightly packed uchar array in a pattern that goes somewhat like this:

r r r ... r g g g ... g b b b ... b a a a ... a

where each letter stands for a single byte (red / green / blue / alpha) that originates from a pixel channel.

The documentation for the RWByteAddressBuffer Store method clearly states:

void Store(
  in uint address,
  in uint value
);

address [in]

Type: uint

The input address in bytes, which must be a multiple of 4.

To write the correct pattern to the buffer, I must be able to write a single byte to an unaligned address. In OpenCL / CUDA this is pretty trivial.


Solution

  • As far as I know, it is not possible to write directly to an unaligned address in this scenario. You can, however, use a small trick to get the same result. Below is the complete compute shader; the function StoreValueAtByte is the part you are looking for.

    Texture2D<float4> Input;
    RWByteAddressBuffer Output;
    
    void StoreValueAtByte(in uint index_of_byte, in uint value) {
    
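        // Calculate the address of the 4-byte-slot in which index_of_byte resides.
        // Clearing the two lowest bits rounds down to a multiple of 4 and, unlike
        // a float division, stays exact for indices above 2^24.
        uint addr_align4 = index_of_byte & ~3u;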
    
        // Calculate which byte within the 4-byte-slot it is
        uint location = index_of_byte % 4;
    
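        // Shift bits to their proper location within the 4-byte-slot.
        // Buffer words are little-endian, so byte 0 of the slot is the lowest byte.
        // The mask keeps a value above 255 from spilling into neighboring bytes.
        value = (value & 0xFF) << (location * 8);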
    
        // Write value to buffer
        Output.InterlockedOr(addr_align4, value);
    }
    
    [numthreads(20, 20, 1)]
    void CSMAIN(uint3 ID : SV_DispatchThreadID) {
    
        // Get width and height of texture
        uint tex_width, tex_height;
        Input.GetDimensions(tex_width, tex_height);
    
        // Make sure thread does not operate outside the texture
        if(tex_width > ID.x && tex_height > ID.y) {
    
            uint num_pixels = tex_width * tex_height;
    
            // Calculate address of where to write color channel data of pixel
            uint addr_red = 0 * num_pixels + ID.y * tex_width + ID.x;
            uint addr_green = 1 * num_pixels + ID.y * tex_width + ID.x;
            uint addr_blue = 2 * num_pixels + ID.y * tex_width + ID.x;
            uint addr_alpha = 3 * num_pixels + ID.y * tex_width + ID.x;
    
            // Get color of pixel, clamp it to [0,1] and convert to [0,255]
            float4 color = Input[ID.xy];
            uint4 color_final = uint4(round(saturate(color) * 255.0f));
    
            // Store color channel values in output buffer
            StoreValueAtByte(addr_red, color_final.x);
            StoreValueAtByte(addr_green, color_final.y);
            StoreValueAtByte(addr_blue, color_final.z);
            StoreValueAtByte(addr_alpha, color_final.w);
        }
    }
    

    The code should be mostly self-explanatory, but here is a walkthrough.

    The first thing StoreValueAtByte does is calculate the address of the 4-byte slot that contains the byte you want to write. It then calculates the position of that byte inside the slot (first, second, third or fourth). Since the byte to write already sits in the lowest byte of a 4-byte variable (namely value), it only has to be shifted to its proper position inside that variable. Finally, value is written to the buffer at the 4-byte-aligned address.

    The write uses an atomic bitwise OR because four different threads each contribute one byte to the same 4-byte slot. A plain Store would overwrite the bytes written by the other threads (a write-after-write hazard), whereas InterlockedOr merges them. This only works if the entire output buffer is initialized with zeros before the Dispatch call.
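
    If the zero-initialization and the atomics are a concern, one possible alternative (not the method above, only a sketch) is to let each thread gather four consecutive pixels and write one complete 32-bit word per channel with a plain Store. It assumes the pixel count is a multiple of 4, so that every channel plane starts on a 4-byte boundary. The names CSPackFour and ToByte are made up for this illustration, and the code has not been run against real hardware.

    ```hlsl
    Texture2D<float4> Input;
    RWByteAddressBuffer Output;

    // Hypothetical helper: convert a channel from [0,1] to an integer in [0,255]
    uint ToByte(float channel) {
        return (uint)round(saturate(channel) * 255.0f);
    }

    [numthreads(64, 1, 1)]
    void CSPackFour(uint3 ID : SV_DispatchThreadID) {

        uint tex_width, tex_height;
        Input.GetDimensions(tex_width, tex_height);
        uint num_pixels = tex_width * tex_height;

        // Each thread handles pixels first_pixel .. first_pixel + 3 (row-major).
        // Assumption: num_pixels is a multiple of 4.
        uint first_pixel = ID.x * 4;
        if (first_pixel >= num_pixels) {
            return;
        }

        // One 32-bit word per channel plane (x = red, y = green, z = blue, w = alpha)
        uint4 words = uint4(0, 0, 0, 0);

        for (uint k = 0; k < 4; ++k) {
            uint pixel = first_pixel + k;
            float4 color = Input[uint2(pixel % tex_width, pixel / tex_width)];

            // Byte k of a little-endian word occupies bits 8k .. 8k + 7
            uint shift = k * 8;
            words.x |= ToByte(color.x) << shift;
            words.y |= ToByte(color.y) << shift;
            words.z |= ToByte(color.z) << shift;
            words.w |= ToByte(color.w) << shift;
        }

        // first_pixel and num_pixels are multiples of 4, so every address is aligned
        Output.Store(0 * num_pixels + first_pixel, words.x);
        Output.Store(1 * num_pixels + first_pixel, words.y);
        Output.Store(2 * num_pixels + first_pixel, words.z);
        Output.Store(3 * num_pixels + first_pixel, words.w);
    }
    ```

    Dispatched with ceil(num_pixels / 256) thread groups in x (64 threads times 4 pixels each), every byte of the output is written exactly once by a single thread, so the buffer does not need to be cleared first. The trade-off is that each thread reads four texels instead of one.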