I have a use-case where I'm extracting I-Frames from videos and turning them into perceptual hashes for later analysis.

I'm currently using ffmpeg to do this with a command akin to:

ffmpeg -skip_frame nokey -i 'in%~1.mkv' -vsync vfr -frame_pts true 'keyframes/_Y/out%~1/%%06d.bmp'

and then reading in the data from the resulting images.

This is a bit wasteful as, to my understanding, ffmpeg
does an implicit YUV -> RGB colour-space conversion, and I'm also needlessly saving intermediate data to disk.
Most modern video codecs utilise chroma subsampling and encode frames in a Y'CbCr colour-space, where Y' is the luma component and Cb/Cr are the blue-difference and red-difference chroma components. In a format like YUV420p, as used with the h.264/h.265 video codecs, each Y' value is 8 bits long and corresponds to exactly one pixel, while the Cb and Cr planes are subsampled to a quarter of the luma resolution.
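To illustrate the layout: in a raw yuv420p frame the full-resolution Y' plane comes first, followed by the quarter-resolution Cb and Cr planes, so pulling the luma out of a packed frame buffer is just a copy of the first w*h bytes (a minimal sketch; `extract_luma` is a name I made up):

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// A packed planar yuv420p frame is laid out as: Y plane (w*h bytes),
// then Cb ((w/2)*(h/2) bytes), then Cr ((w/2)*(h/2) bytes).
// The full-resolution Y' samples are simply the first w*h bytes.
std::vector<uint8_t> extract_luma(const std::vector<uint8_t>& frame,
                                  std::size_t w, std::size_t h) {
    return std::vector<uint8_t>(frame.begin(), frame.begin() + w * h);
}
```

For example, a 4x2 frame occupies 4\*2 + 2\*(2\*1) = 12 bytes in total, of which the first 8 are the luma plane.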
As I use grayscale data for generating the perceptual hashes anyway (the luma component is essentially equivalent to the grayscale data I need), I was wondering if there is a way to simply grab just the raw Y' values of any given I-Frame into an array and skip all of the unnecessary conversions and extra steps?
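For context, the hashing side really only consumes luma: e.g. an average hash over an already-downscaled 8x8 grayscale tile could look like this (an illustrative sketch, not my actual implementation; the downscaling step is omitted):

```cpp
#include <array>
#include <cstdint>

// Average hash (aHash) over an 8x8 grayscale tile: bit i is set if
// pixel i is brighter than the tile's mean brightness. Assumes the
// luma plane has already been downscaled to 8x8.
uint64_t ahash8x8(const std::array<uint8_t, 64>& px) {
    uint32_t sum = 0;
    for (uint8_t v : px) sum += v;
    const uint8_t mean = static_cast<uint8_t>(sum / 64);
    uint64_t hash = 0;
    for (int i = 0; i < 64; ++i)
        if (px[i] > mean) hash |= uint64_t{1} << i;
    return hash;
}
```

aHash is the simplest of the perceptual-hash family; a DCT-based pHash would consume exactly the same grayscale input.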
I came across the -vf 'extractplanes=y' filter in ffmpeg that seems like it might do just that, but according to a source:

"...what is extracted by 'extractplanes' is not raw data of the (for example) Y plane. Each extracted is converted to grayscale. That is, the converted video data has YUV (or RGB) which is different from the input."

which makes it seem like it's touching the chroma components and doing some conversion anyway. In testing, applying this filter didn't affect the processing time of the I-Frame extraction either.
My script is currently written in Python, but I am in the process of migrating it to C++, so I would prefer solutions pertaining to the latter. ffmpeg seems like the ideal candidate for this task, but I am really looking for whichever solution ingests the data fastest, preferably keeping it in RAM, as I'll be processing a large number of video files and discarding each I-Frame's luma pixel data once its hash has been generated. I would also like to associate each I-Frame with its corresponding frame number in the video.
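What I'm currently considering for the C++ version is using the FFmpeg libraries (libavformat/libavcodec) directly, so the decoder hands me the planar frame and I can copy the Y' plane straight out of it. A rough, untested sketch, assuming an 8-bit planar YUV stream and a constant frame rate for the frame-number calculation (error handling and decoder flushing at EOF omitted):

```cpp
extern "C" {
#include <libavformat/avformat.h>
#include <libavcodec/avcodec.h>
#include <libavutil/rational.h>
}
#include <cstdint>
#include <cstring>
#include <vector>

int main(int argc, char** argv) {
    AVFormatContext* fmt = nullptr;
    if (argc < 2 || avformat_open_input(&fmt, argv[1], nullptr, nullptr) < 0)
        return 1;
    avformat_find_stream_info(fmt, nullptr);
    int vid = av_find_best_stream(fmt, AVMEDIA_TYPE_VIDEO, -1, -1, nullptr, 0);

    const AVCodec* dec =
        avcodec_find_decoder(fmt->streams[vid]->codecpar->codec_id);
    AVCodecContext* ctx = avcodec_alloc_context3(dec);
    avcodec_parameters_to_context(ctx, fmt->streams[vid]->codecpar);
    ctx->skip_frame = AVDISCARD_NONKEY;  // decoder-level '-skip_frame nokey'
    avcodec_open2(ctx, dec, nullptr);

    AVPacket* pkt = av_packet_alloc();
    AVFrame* frame = av_frame_alloc();
    while (av_read_frame(fmt, pkt) >= 0) {
        if (pkt->stream_index == vid && avcodec_send_packet(ctx, pkt) >= 0) {
            while (avcodec_receive_frame(ctx, frame) >= 0) {
                // For planar YUV formats data[0] is the Y' plane; copy row by
                // row because linesize[0] may include alignment padding.
                std::vector<uint8_t> luma(frame->width * frame->height);
                for (int y = 0; y < frame->height; ++y)
                    std::memcpy(luma.data() + y * frame->width,
                                frame->data[0] + y * frame->linesize[0],
                                frame->width);
                // Frame number from the timestamp (constant frame rate assumed).
                int64_t n = av_rescale_q(
                    frame->pts, fmt->streams[vid]->time_base,
                    av_inv_q(fmt->streams[vid]->avg_frame_rate));
                // ... hash `luma`, keyed by frame number `n`, then discard it.
                (void)n;
            }
        }
        av_packet_unref(pkt);
    }
    av_frame_free(&frame);
    av_packet_free(&pkt);
    avcodec_free_context(&ctx);
    avformat_close_input(&fmt);
    return 0;
}
```

This keeps everything in RAM and skips both the BMP round-trip and any pixel-format conversion, at the cost of linking against the FFmpeg development libraries.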
The linked page is not official documentation. "Each extracted is converted to grayscale." Well, yes: the Cb plane on its own would be just grayscale, and the same goes for the Y and Cr planes, and also for the R, G, B, A planes if they exist. Nothing is converted to grayscale; it is merely tagged as grayscale, because a single extracted plane is, by definition, grayscale.
"That is, the converted video data has YUV (or RGB) which is different from the input." It is different from the YCbCr-to-RGB converted source, but the data is the actual underlying limited- or full-range data, even for 30-bit or 48/64-bit files.
"Since the example input is yuv420p format, that is, the chrominance components are thinned out." Well, yes. For 4:2:0 the Y plane is full size while the Cb and Cr planes are each just 1/4 of that size.
See https://ffmpeg.org/ffmpeg-filters.html#extractplanes
Also, see this bug (fixed): https://trac.ffmpeg.org/ticket/9575