By using one of the following command lines, it is possible to convert a video stream to an RGB buffer:
ffmpeg -i video.mp4 -frames 1 -color_range pc -f rawvideo -pix_fmt rgb24 output.rgb24
ffmpeg -i video.mp4 -frames 1 -color_range pc -f rawvideo -pix_fmt gbrp output.gbrp
These RGB buffers can then be read, for example using Python and NumPy:
import numpy as np

def load_buffer_gbrp(path, width=1920, height=1080):
    """Load a gbrp 8-bit raw buffer from a file"""
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
    # gbrp is planar, with the planes stored in G, B, R order
    data_gbrp = data.reshape((3, height, width))
    img_rgb = np.empty((height, width, 3), dtype=np.uint8)
    img_rgb[..., 0] = data_gbrp[2, ...]  # R is the third plane
    img_rgb[..., 1] = data_gbrp[0, ...]  # G is the first plane
    img_rgb[..., 2] = data_gbrp[1, ...]  # B is the second plane
    return img_rgb

def load_buffer_rgb24(path, width=1920, height=1080):
    """Load an rgb24 8-bit raw buffer from a file"""
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
    # rgb24 is packed: R, G, B bytes interleaved per pixel
    img_rgb = data.reshape((height, width, 3))
    return img_rgb

buffer_rgb24 = load_buffer_rgb24("output.rgb24")
buffer_gbrp = load_buffer_gbrp("output.gbrp")
Theoretically, the two outputs should have the same RGB values (only the layout in memory should differ); in the real world, this is not the case:
import matplotlib.pyplot as plt
diff = buffer_rgb24.astype(float) - buffer_gbrp.astype(float)
fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, constrained_layout=True, figsize=(12, 2.5))
ax1.imshow(buffer_rgb24)
ax1.set_title("rgb24")
ax2.imshow(buffer_gbrp)
ax2.set_title("gbrp")
im = ax3.imshow(diff[..., 1], vmin=-5, vmax=+5, cmap="seismic")
ax3.set_title("difference (green channel)")
plt.colorbar(im, ax=ax3)
plt.show()
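The difference can also be summarized numerically, reusing diff from the snippet above:

print("mean abs diff per channel:", np.abs(diff).mean(axis=(0, 1)))
print("max abs diff per channel:", np.abs(diff).max(axis=(0, 1)))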
The two converted frames differ by more than what chroma subsampling or rounding errors could explain (the difference is around 2-3 code values, while rounding errors would be less than 1), and, what's worse, the difference appears as a uniform bias across the whole image.
Why is that so, and what ffmpeg parameters affect this behavior?
Good analysis so far. Let me try to add some perspective from the swscale side; I hope that helps further in explaining the differences you're seeing and where they technically originate.
The differences you see are indeed caused by different rounding. These differences are not because rgb24/gbrp are fundamentally different (they are different layouts of the same fundamental data type), but because the implementations were written for different use cases at different times by different people.
yuv420p-to-rgb24 (and the other way around) are very, very old implementations that predate swscale being part of FFmpeg. These implementations have MMX (!) optimizations and are tuned for fast conversion on Pentium machines (!). This is mid-90s technology or so. The idea here was to convert JPEG and MPEG-1 to/from monitor-compatible output before YUV output was a thing. The MMX optimizations are actually pretty well-tuned for their time.
You can imagine that speed was critically important here (at that time, YUV-to-rgb24 conversion was slow and a major component of the overall display pipeline). YUV-to-RGB is a simple matrix multiplication (with coefficients depending on what the exact YUV colorspace is). However, the resolution of the UV planes differs from that of the Y and RGB planes. In the simple (non-exact) yuv-to-rgb24 conversion, UV is upsampled using nearest-neighbour conversion, so each RGB[x,y] uses Y[x,y] and UV[x/2,y/2] as input; in other words, each UV input sample is re-used for a 2x2 block of output RGB pixels. The flag full_chroma_int "undoes" this optimization/shortcut. This means the chroma plane is upsampled using actual scaling conversions before the YUV-to-RGB conversion is initiated, and this upsampling can use filters such as bilinear, bicubic or even more advanced/expensive kernels (e.g. lanczos, sinc or spline).
bitexact is a generic term in FFmpeg for disabling SIMD optimizations that don't generate the exact same output as the C function. I'll ignore that for now beyond just stating what it means.
Lastly, accurate_rnd: if I remember correctly, the idea here is that in matrix multiplications (independent of whether you use chroma plane upsampling or not), the typical way to do the integer equivalent of the floating-point r = v*coef1 + y in a given precision (e.g. using 15-bit coefficients) is r = y + ((v*coef1 + 0x4000) >> 15). However, in x86 SIMD, this requires the instruction pmulhrsw, which is only available in SSSE3, not in MMX. Also, it means that for g = u*coef2 + v*coef3 + y you need pmaddwd and then round/shift using separate instructions. So, instead, the MMX SIMD uses pmulhw (an unrounded version of pmulhrsw), which basically makes it r = y + (v*coef1 >> 16) (using 16-bit coefficients). This is mathematically very close, but not as precise, especially not for the G pixel, since it turns g = ((u*coef2 + v*coef3 + 0x8000) >> 16) + y into g = (u*coef2 >> 16) + (v*coef3 >> 16) + y. accurate_rnd "undoes" this optimization/shortcut.
Now, YUV-to-gbrp. GBR-planar was added for H264 RGB support, since H264 codes RGB as "just another" YUV variant, but with G in the Y plane, etc. You can imagine that speed was much less of an issue, as was MMX support. So here, the math was done correctly. In fact, if I remember correctly, accurate_rnd was only added afterwards so that YUV-to-rgb24 could output pixels identical to YUV-to-gbrp and make the two outputs equivalent, at the cost of not being able to use the (old) MMX optimizations that were inherited when swscale was merged into FFmpeg. This path upsamples chroma correctly, with a user-configurable scaling kernel, by default, because the planar conversion is only done once all YUV planes have the same size; that is, it strictly does only the matrix multiplication. This was added in something like 2015 or so, so we're talking about an eternity in computer programming terms.
Nowadays, the performance gain from "imprecise" implementations such as YUV-to-rgb24 is not considered worth the quality lost to the imprecise rounding and the lack of configurable scaling for the chroma planes. This is why most people will recommend using -sws_flags accurate_rnd+full_chroma_int. Also, nowadays there are x86 SIMD (SSSE3 and AVX2) implementations for the "slower" conversion path, whereas around 2010 that was all straight C code, with nobody wanting to invest time to optimize it. I'm guessing that -sws_flags accurate_rnd+full_chroma_int will perform slightly worse than the "fast" YUV-to-rgb24 conversion, because it does chroma upsampling and matrix multiplication in two steps instead of one. But on modern x86 hardware, the performance penalty of this should be minimal and acceptable unless you're actually severely resource-constrained.
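Applied to the first command from your question, that would look like this (both flags are standard swscale options; with them, the rgb24 output should match the gbrp one):

ffmpeg -i video.mp4 -frames 1 -color_range pc -sws_flags accurate_rnd+full_chroma_int -f rawvideo -pix_fmt rgb24 output.rgb24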
Hope that all makes sense.