I have a use-case where I'm extracting I-Frames from videos and turning them into perceptual hashes for later analysis.

I'm currently using ffmpeg to do this with a command akin to:

ffmpeg -skip_frame nokey -i 'in%~1.mkv' -vsync vfr -frame_pts true 'keyframes/_Y/out%~1/%%06d.bmp'

and then reading in the data from the resulting images.

This is a bit wasteful as, to my understanding, ffmpeg
does an implicit YUV -> RGB colour-space conversion, and I'm also needlessly saving intermediate data to disk.
Most modern video codecs utilise chroma subsampling and encode frames in a Y'CbCr colour-space, where Y' is the luma component and Cb/Cr are the blue-difference and red-difference chroma components. In a format like YUV420p, as used with the h.264/h.265 video codecs, each Y' value is 8 bits long and corresponds to exactly one pixel, while the Cb and Cr planes are subsampled to a quarter of the luma resolution.
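To illustrate the layout: in a raw yuv420p frame the full-resolution Y' plane comes first, followed by the quarter-resolution Cb and Cr planes, so pulling the luma out of a packed frame buffer is just a copy of the first w*h bytes (a minimal sketch; `extract_luma` is a name I made up):

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// A packed planar yuv420p frame is laid out as: Y plane (w*h bytes),
// then Cb ((w/2)*(h/2) bytes), then Cr ((w/2)*(h/2) bytes).
// The full-resolution Y' samples are simply the first w*h bytes.
std::vector<uint8_t> extract_luma(const std::vector<uint8_t>& frame,
                                  std::size_t w, std::size_t h) {
    return std::vector<uint8_t>(frame.begin(), frame.begin() + w * h);
}
```

For example, a 4x2 frame occupies 4\*2 + 2\*(2\*1) = 12 bytes in total, of which the first 8 are the luma plane.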
As I use grayscale data for generating the perceptual hashes anyway (the luma component is essentially equivalent to the grayscale data I need), I was wondering if there is a way to simply grab just the raw Y' values of any given I-Frame into an array and skip all of the unnecessary conversions and extra steps?
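For context, the hashing side really only consumes luma: e.g. an average hash over an already-downscaled 8x8 grayscale tile could look like this (an illustrative sketch, not my actual implementation; the downscaling step is omitted):

```cpp
#include <array>
#include <cstdint>

// Average hash (aHash) over an 8x8 grayscale tile: bit i is set if
// pixel i is brighter than the tile's mean brightness. Assumes the
// luma plane has already been downscaled to 8x8.
uint64_t ahash8x8(const std::array<uint8_t, 64>& px) {
    uint32_t sum = 0;
    for (uint8_t v : px) sum += v;
    const uint8_t mean = static_cast<uint8_t>(sum / 64);
    uint64_t hash = 0;
    for (int i = 0; i < 64; ++i)
        if (px[i] > mean) hash |= uint64_t{1} << i;
    return hash;
}
```

aHash is the simplest of the perceptual-hash family; a DCT-based pHash would consume exactly the same grayscale input.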
I came across the -vf 'extractplanes=y' filter in ffmpeg that seems like it might do just that, but according to a source:

"...what is extracted by 'extractplanes' is not raw data of the (for example) Y plane. Each extracted is converted to grayscale. That is, the converted video data has YUV (or RGB) which is different from the input."

which makes it seem like it's touching the chroma components and doing some conversion anyway. In testing, applying this filter didn't affect the processing time of the I-Frame extraction either.
My script is currently written in Python, but I am in the process of migrating it to C++, so I would prefer solutions pertaining to the latter. ffmpeg seems like the ideal candidate for this task, but I am really looking for whichever solution ingests the data fastest, preferably keeping it in RAM, as I'll be processing a large number of video files and discarding each I-Frame's luma pixel data once its hash has been generated. I would also like to associate each I-Frame with its corresponding frame number in the video.
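What I'm currently considering for the C++ version is using the FFmpeg libraries (libavformat/libavcodec) directly, so the decoder hands me the planar frame and I can copy the Y' plane straight out of it. A rough, untested sketch, assuming an 8-bit planar YUV stream and a constant frame rate for the frame-number calculation (error handling and decoder flushing at EOF omitted):

```cpp
extern "C" {
#include <libavformat/avformat.h>
#include <libavcodec/avcodec.h>
#include <libavutil/rational.h>
}
#include <cstdint>
#include <cstring>
#include <vector>

int main(int argc, char** argv) {
    AVFormatContext* fmt = nullptr;
    if (argc < 2 || avformat_open_input(&fmt, argv[1], nullptr, nullptr) < 0)
        return 1;
    avformat_find_stream_info(fmt, nullptr);
    int vid = av_find_best_stream(fmt, AVMEDIA_TYPE_VIDEO, -1, -1, nullptr, 0);

    const AVCodec* dec =
        avcodec_find_decoder(fmt->streams[vid]->codecpar->codec_id);
    AVCodecContext* ctx = avcodec_alloc_context3(dec);
    avcodec_parameters_to_context(ctx, fmt->streams[vid]->codecpar);
    ctx->skip_frame = AVDISCARD_NONKEY;  // decoder-level '-skip_frame nokey'
    avcodec_open2(ctx, dec, nullptr);

    AVPacket* pkt = av_packet_alloc();
    AVFrame* frame = av_frame_alloc();
    while (av_read_frame(fmt, pkt) >= 0) {
        if (pkt->stream_index == vid && avcodec_send_packet(ctx, pkt) >= 0) {
            while (avcodec_receive_frame(ctx, frame) >= 0) {
                // For planar YUV formats data[0] is the Y' plane; copy row by
                // row because linesize[0] may include alignment padding.
                std::vector<uint8_t> luma(frame->width * frame->height);
                for (int y = 0; y < frame->height; ++y)
                    std::memcpy(luma.data() + y * frame->width,
                                frame->data[0] + y * frame->linesize[0],
                                frame->width);
                // Frame number from the timestamp (constant frame rate assumed).
                int64_t n = av_rescale_q(
                    frame->pts, fmt->streams[vid]->time_base,
                    av_inv_q(fmt->streams[vid]->avg_frame_rate));
                // ... hash `luma`, keyed by frame number `n`, then discard it.
                (void)n;
            }
        }
        av_packet_unref(pkt);
    }
    av_frame_free(&frame);
    av_packet_free(&pkt);
    avcodec_free_context(&ctx);
    avformat_close_input(&fmt);
    return 0;
}
```

This keeps everything in RAM and skips both the BMP round-trip and any pixel-format conversion, at the cost of linking against the FFmpeg development libraries.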
The linked page is not official documentation. "Each extracted is converted to grayscale." Well, yes: the Cb plane on its own would be just grayscale, and the same goes for the Y and Cr planes, and also for the R, G, B, A planes if they exist. Nothing is converted to grayscale; it is merely tagged as grayscale, because a single extracted plane is, by definition, grayscale.
"That is, the converted video data has YUV (or RGB) which is different from the input." It is different from the YCbCr-to-RGB converted source, but the data is the actual underlying limited- or full-range data, even for 30-bit or 48/64-bit files.
"Since the example input is yuv420p format, that is, the chrominance components are thinned out." Well, yes. For 4:2:0 the Y plane is full size while the Cb and Cr planes are each just 1/4 of that size.
See https://ffmpeg.org/ffmpeg-filters.html#extractplanes
Also, see this bug (fixed): https://trac.ffmpeg.org/ticket/9575