
CUDA kernel to convert P010LE into ARGB


I have frames extracted from a 10-bit movie, using ffmpeg to write raw P010LE frames, and I'm loading each frame into a vector of uint8_t. I've verified the frame data by viewing it with rawpixels.net. I'm trying to write a CUDA kernel that converts these P010LE frames into ARGB format. Apparently this is beyond my mental horsepower.
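
For reference, here's a sketch of how I extract and load the frames (file name and dimensions are placeholders; a P010LE frame is width * height * 3 bytes: 2 bytes per luma sample, plus width * height bytes for the interleaved 16-bit chroma at quarter resolution):

// Frames extracted as one concatenated raw stream with:
//   ffmpeg -i input.mp4 -pix_fmt p010le -f rawvideo frames.raw

#include <cstdint>
#include <fstream>
#include <vector>

std::vector<uint8_t> ReadP010leFrame(std::ifstream & file, int width, int height)
{
    // Luma plane: width * height 16-bit samples, followed by the
    // chroma plane: (width / 2) * (height / 2) interleaved 16-bit U/V pairs.
    const size_t frameBytes = static_cast<size_t>(width) * height * 3;

    std::vector<uint8_t> frame(frameBytes);
    file.read(reinterpret_cast<char *>(frame.data()), frameBytes);
    return frame;
}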

Anyway, to simplify things I thought I'd just show the luma, then maybe ask a question about the chroma once I've failed at that. My kernel looks like this:

__global__ void P010leToArgbKernel(const uint8_t * P010le, uint8_t * argb, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height)
    {
        int yIndexP010 = (y * width + x) * 2;

        uint16_t Y = *reinterpret_cast<const uint16_t*>(P010le + yIndexP010) & 0x3FF;
        uint8_t Y8 = static_cast<uint8_t>((Y * 255 + 511) / 1023); // Proper rounding

        int yIndex = (y * width + x) * 4;

        argb[yIndex + 0] = Y8;
        argb[yIndex + 1] = Y8;
        argb[yIndex + 2] = Y8;
        argb[yIndex + 3] = 255; // Alpha
    }
}
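
For completeness, this is roughly how I launch it (a minimal sketch; d_p010 and d_argb are device buffers assumed to be allocated and filled elsewhere):

// One thread per output pixel.
dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);

P010leToArgbKernel<<<grid, block>>>(d_p010, d_argb, width, height);
cudaDeviceSynchronize();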

The result is as follows, which I don't believe is correct (original image on the left, converted image on the right). I'm expecting a smooth greyscale image. What mistake did I make?

[Screenshot: original image (left), converted output (right)]


Solution

  • Following a hint from Christoph, I realised the P010LE format actually stores its 10 bits in the high 10 bits of each 16-bit word, not the low. For example, the maximum sample value 1023 is stored as 1023 << 6 = 0xFFC0, so reading the low 10 bits gives garbage. Shifting right by 6 brings the data into the low 10 bits. Apparently it's good practice to mask off the bits you want in any case, so I have done that here:

    Y = (Y >> 6) & 0x3FF;
    

    The final kernel (probably not the fastest possible, but it works) is below, followed by a sketch of the host code that drives it. When I used it with GDI+ to get a very quick-to-code visualisation, I also realised that PixelFormat32bppARGB is actually BGRA in byte order on the underlying bitmap, so I've renamed the kernel from P010le_To_Argb_Kernel to P010le_To_Bgra_Kernel.

    __global__ void P010le_To_Bgra_Kernel(const uint8_t * P010le, uint8_t * bgra, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
    
        if (x < width && y < height)
        {
            int yIndexP010 = (y * width * 2) + (x * 2);
    
            uint16_t Y = *reinterpret_cast<const uint16_t *>(P010le + yIndexP010);
            
            // Important note: the data is in the high ten bits, not the low!
    
            Y = (Y >> 6) & 0x3FF;
            
            // The chroma plane starts after the luma plane (width * height * 2 bytes);
            // it is quarter resolution, with U and V interleaved as 16-bit samples,
            // so each chroma row is width * 2 bytes and each U/V pair is 4 bytes.
    
            int uvIndexP010 = (width * height * 2) + ((y / 2) * (width * 2)) + (x / 2) * 4;
    
            uint16_t U = (*reinterpret_cast<const uint16_t *>(P010le + uvIndexP010) >> 6) & 0x3FF;
            uint16_t V = (*reinterpret_cast<const uint16_t *>(P010le + uvIndexP010 + 2) >> 6) & 0x3FF;
    
            uint8_t Y8 = static_cast<uint8_t>((Y * 255 + 511) / 1023); // Scale to 8-bit with rounding.
            uint8_t U8 = static_cast<uint8_t>((U * 255 + 511) / 1023);
            uint8_t V8 = static_cast<uint8_t>((V * 255 + 511) / 1023);
    
            int C = Y8 - 16;
            int D = U8 - 128;
            int E = V8 - 128;
    
            int bgraIndex = (y * width + x) * 4;
    
            // GDI+ bitmap format PixelFormat32bppARGB claims to be ARGB, but in byte order it's BGRA.
    
            bgra[bgraIndex + 3] = 255; // Alpha
    
            bgra[bgraIndex + 2] = min(max((298 * C + 409 * E + 128) >> 8, 0), 255); // Red
            bgra[bgraIndex + 1] = min(max((298 * C - 100 * D - 208 * E + 128) >> 8, 0), 255); // Green
            bgra[bgraIndex + 0] = min(max((298 * C + 516 * D + 128) >> 8, 0), 255); // Blue
        }
    }
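
    And here is a rough sketch of the host side I drive it with (error checking omitted; frame is the std::vector<uint8_t> holding one raw P010LE frame):

    size_t inBytes = static_cast<size_t>(width) * height * 3;  // P010LE frame size
    size_t outBytes = static_cast<size_t>(width) * height * 4; // BGRA frame size

    uint8_t * d_p010 = nullptr;
    uint8_t * d_bgra = nullptr;
    cudaMalloc(&d_p010, inBytes);
    cudaMalloc(&d_bgra, outBytes);

    cudaMemcpy(d_p010, frame.data(), inBytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    P010le_To_Bgra_Kernel<<<grid, block>>>(d_p010, d_bgra, width, height);

    std::vector<uint8_t> bgra(outBytes);
    cudaMemcpy(bgra.data(), d_bgra, outBytes, cudaMemcpyDeviceToHost);

    // The bgra buffer can now be copied into the Scan0 of a GDI+ bitmap
    // locked as PixelFormat32bppARGB.

    cudaFree(d_p010);
    cudaFree(d_bgra);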