I have frames extracted from a 10-bit movie using ffmpeg, written out as P010LE frames. I'm loading each frame into a vector of uint8_t, and I've verified the frame data by viewing it with rawpixels.net. I'm trying to write a CUDA kernel that converts these P010LE frames into ARGB format. Apparently this is beyond my mental horsepower.
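For context, the extraction and loading are along these lines (the file names and the helper function are placeholders, not my exact code):

// Roughly how the raw P010LE data is produced and loaded (illustrative only):
//   ffmpeg -i input.mkv -frames:v 1 -pix_fmt p010le -f rawvideo frame.raw
#include <cstdint>
#include <fstream>
#include <vector>

std::vector<uint8_t> LoadP010Frame(const char* path, int width, int height)
{
    // P010: 16 bits per luma sample plus 16 bits per pixel for the interleaved,
    // 4:2:0 subsampled U/V plane -> width * height * 3 bytes in total.
    std::vector<uint8_t> frame(static_cast<size_t>(width) * height * 3);
    std::ifstream file(path, std::ios::binary);
    file.read(reinterpret_cast<char*>(frame.data()), frame.size());
    return frame;
}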
Anyway, to simplify things I thought I'd just show the luma, then maybe ask a question about the chroma once I've failed at that. My kernel looks like this:
__global__ void P010leToArgbKernel(const uint8_t * P010le, uint8_t * argb, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
    {
        int yIndexP010 = (y * width + x) * 2;
        uint16_t Y = *reinterpret_cast<const uint16_t*>(P010le + yIndexP010) & 0x3FF;
        uint8_t Y8 = static_cast<uint8_t>((Y * 255 + 511) / 1023); // Proper rounding
        int yIndex = (y * width + x) * 4;
        argb[yIndex + 0] = Y8;
        argb[yIndex + 1] = Y8;
        argb[yIndex + 2] = Y8;
        argb[yIndex + 3] = 255; // Alpha
    }
}
The result is shown below, and I do not believe it is correct (original image on the left, converted image on the right). I'm expecting a smooth greyscale image. What mistake did I make?
Following a hint from Christoph, I realised that the P010LE format actually stores its 10 bits of data in the high 10 bits of each 16-bit word, not the low 10. Shifting right by 6 brings the data down into the low 10 bits. Apparently it's good practice to mask off the bits you want in any case, so I have done that here as well:
Y = (Y >> 6) & 0x3FF;
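To make that concrete, a full-scale 10-bit luma value sits at the top of the 16-bit word, with the low 6 bits as padding (the values below are just an illustration):

// Bit layout of a P010LE sample: data bits occupy bits 15..6, low 6 bits are padding.
uint16_t sample = 0xFFC0;                // full-scale 10-bit value (1023), as stored
uint16_t value  = (sample >> 6) & 0x3FF; // == 1023, the actual luma level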
The final kernel (probably not the fastest possible, but it works) is below. When I used it with GDI+ to get a very quick-to-code visualisation, I also realised that PixelFormat32bppARGB is actually BGRA in the underlying bitmap's byte order, so I've renamed the kernel from P010le_To_Argb_Kernel to P010le_To_Bgra_Kernel.
__global__ void P010le_To_Bgra_Kernel(const uint8_t * P010le, uint8_t * argb, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
    {
        // Y plane: one 16-bit sample per pixel.
        int yIndexP010 = (y * width * 2) + (x * 2);
        uint16_t Y = *reinterpret_cast<const uint16_t*>(P010le + yIndexP010);
        // Important note: The data is in the high ten bits, not the low!
        Y = (Y >> 6) & 0x3FF;

        // UV plane: starts after the Y plane; 4:2:0 subsampled, interleaved U/V pairs.
        int uvIndexP010 = (width * height * 2) + ((y / 2) * (width * 2)) + (x / 2) * 4;
        uint16_t U = (*reinterpret_cast<const uint16_t*>(P010le + uvIndexP010) >> 6) & 0x3FF;
        uint16_t V = (*reinterpret_cast<const uint16_t*>(P010le + uvIndexP010 + 2) >> 6) & 0x3FF;

        uint8_t Y8 = static_cast<uint8_t>((Y * 255 + 511) / 1023); // Scale to 8-bit.
        uint8_t U8 = static_cast<uint8_t>((U * 255 + 511) / 1023);
        uint8_t V8 = static_cast<uint8_t>((V * 255 + 511) / 1023);

        int C = Y8 - 16;
        int D = U8 - 128;
        int E = V8 - 128;

        int argbIndex = (y * width + x) * 4;
        // GDI+ bitmap format PixelFormat32bppARGB claims to be ARGB but in byte order it's BGRA.
        argb[argbIndex + 3] = 255;                                                        // Alpha
        argb[argbIndex + 2] = min(max((298 * C + 409 * E + 128) >> 8, 0), 255);           // Red
        argb[argbIndex + 1] = min(max((298 * C - 100 * D - 208 * E + 128) >> 8, 0), 255); // Green
        argb[argbIndex + 0] = min(max((298 * C + 516 * D + 128) >> 8, 0), 255);           // Blue
    }
}
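For completeness, here's roughly how I launch it. The buffer names, the helper function, and the 16x16 block size are just what I happened to use, nothing special (error checking omitted):

// Host-side sketch; d_p010, d_bgra and ConvertFrame are illustrative names.
void ConvertFrame(const std::vector<uint8_t>& p010, std::vector<uint8_t>& bgra, int width, int height)
{
    uint8_t *d_p010 = nullptr, *d_bgra = nullptr;
    size_t p010Size = static_cast<size_t>(width) * height * 3; // Y plane + interleaved UV plane
    size_t bgraSize = static_cast<size_t>(width) * height * 4; // 4 bytes per output pixel

    cudaMalloc(&d_p010, p010Size);
    cudaMalloc(&d_bgra, bgraSize);
    cudaMemcpy(d_p010, p010.data(), p010Size, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    P010le_To_Bgra_Kernel<<<grid, block>>>(d_p010, d_bgra, width, height);

    bgra.resize(bgraSize);
    cudaMemcpy(bgra.data(), d_bgra, bgraSize, cudaMemcpyDeviceToHost);
    cudaFree(d_p010);
    cudaFree(d_bgra);
}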