language-agnosticvideocomparisonfingerprintaudio-fingerprinting

Finding duplicate video files by database (millions), fingerprint? Pattern recognition?


In the following scenario:

I got a project having a catalog of currently some ten thousand video files, the number is going to increase dramatically.

However lots of them are duplicates. With every video file I have associated semantic and descriptive information which I want to merge duplicates to achive better results for every one.

Now I need some sort of procedure where I index metadata in a database, and whenever a new video enters the catalog the same data is calculated and matched against in the database.

Problem is the videos aren't exact duplicates. They can have different quality, are amby cropped, watermarked or have a sequel/prequel. Or are cut off at the beginning and/or end.

Unfortunately the better the comparision the more cpu and memory intensive it gets so I plan on implementing several layers of comparision that begin with very graceful but fast comparision (maby video lengh with a tolerance of 10%) and end with the final comparision that decides whether its really a duplicate (that would be a community vote).

So as I have a community to verify the results it suffices to deliver "good guesses" with a low miss ratio.

So now my question is what layers can you guys think of or do you have a better approach?

I don't care the effort to create the metadata, I have enough slaves to do that. Just the comparision should be fast. So if it helps I can convert the video 100 times as well...

Here are my current ideas:

I would resample the picture to a thumbnail size and get the average rgb values then serialize pixel by pixel if the color at this pixel is greater/smaller than the average represented by 0 or 1. So I get a binary string which I can store into mysql and do a boolean bit-sum (supported by mysql internally) and count the remaining uneval bits (as well supported internally, that would then be the Levenshtein distance of the bianry strings)

I would transcode the video into a vbr videofile with the exact same settings. then I would look at the bitrate at certain points of time (percentage of the video completed or absolute seconds.. then we would only analyze a portion of the video). same thing as with the picture. Iif the bitrate is greater the average its 1 else its 0. we make a binary string and store it in db and calculate the Levenshtein distance later

Image comarision just like the first and last frame but at keyframe positions? We would use the same source files we used for bitrate calcluiations because keyframes are heavy depended on the codec and settings.

Maybe let's take one or more areas/pixels inside the image and see how they develope over time. As well the change abov/below average. black/white would suffice I think.

Or am I going the completely wrong way? I think I can't be the first one having this problem but I have not had any luck finding solutions.


Solution

  • This is a huge problem, so I've chosen to write a rather lengthy reply to try to decompose the problem into parts that may be easier to solve.

    It is important that the comparisons be performed using the compute and time resources available: I doubt a solution that takes months to run will be very useful in a dynamic video database. And the size of the database likely makes the use of cloud computing resources unfeasible. So we really care about the local cost of each comparison in several different domains: 1) Data storage, 2) compute resources, and 3) time.

    One key cost to consider is that of extracting the data needed from each video for whatever comparison metrics are to be used. Once the extracted data is available, then the cost of performing a comparison must be considered. Finally, the comparisons needed to match all videos to each other must be performed.

    The cost of the first two steps is O(1) on the number of videos. The cost of the last step must be worse than O(1), potentially much worse. So our primary goal should be minimizing the costs of the last step, even if it means adding many early, simple steps.

    The optimal algorithms for this process will greatly depend on the characteristics of the database, the level to which single and multiple matches exist. If 100% of the videos match some other video, then we will want to minimize the cost of a successful match. However, the more likely case is that matches will be rare, so we will want to minimize the cost of an unsuccessful match. That is to say, if there is a quick and dirty way to say "these two videos can't be matches', then we should use it first, before we even start to confirm a match.

    To characterize the database, first do some sampling and hand-matching to estimnate the degree of matching within the database. This experiment should show how the redundant videos "clumped": If a given video had a match, how likely was it to have more than a single match? What percentage of all matches were also part of a multiple match? This process will yield a 'model' of the database (a statistical distribution) that will be used to aid algorithm selection and tune the system.

    Going forward I will assume matches are relatively rare. After all, if there are lots of matches, the videos will "clump", effectively making the database smaller, and thus making the problem simpler. Let's assume the problem stays as hard as possible.

    I'd advocate a multi-level categorization approach, where we'd build a sequence of algorithms that repeatedly perform the binary decision of "these two videos do not match" / "these two videos may possibly match". Only the very last algorithm in the chain needs to output the answer "These two videos match."

    Classification/matching algorithms can fail in either or both of two ways: False Positive (non-matching videos are mislabled as matching) and False Negative (matching videos are mislabeled as non-matching). Each of these wrong decisions has a range of probabilities associated with it, and we want to minimize both.

    Since we are building an algorithm pipeline, we want algorithms that are very good at identifying non-matches without error, meaning they must have an extremely low False Reject rate, and we don't much care about the False Accept rate. For example, Wierd Al's clone of a video may look and sound very much like the original, and we may not be able to show it is not a match to the original until later in the algorithm pipeline.

    The simplest, fastest, most reliable algorithms should be run first, since the overwhelmingly vast majority of tests will yield the "do not match" result. The simplest check would be to search for identical files within the database, something done by many fast and simple filesystem and database maintenance utilities. After this scan is run, we can assume we will actually need to open and read the video files to detect differences.

    Since video comparison is relatively tough, let's start with the audio. Think of the database as first being an MP3 collection that may contain duplicates. After all, if we get a good audio match, it is very likely we will have a video match, and vice-versa. We can safely say the audio is a 'fair' representative for the video. Fortunately, a quick web search will yield many audio fingerprinting and comparison packages that are reliable, fast and mature. The audio fingerprint would need to be generated for every video in the database. Videos lacking an audio track would automatically fall into the "could match" set.

    But there is a 'gotcha' here: What about voice-overs? If a given video is encoded twice, with and without a voice-over, are they a match or not? What about the French audio vs the Spanish or English? If these should all be considered to be a match, then audio testing may need to be skipped.

    At this point, we know the filesystem entries are all "different enough", and we know the audio tracks are all "different enough" (if tested), which means we can't put off looking at the video data any longer. Fortunately, this should need to be done for only a small fraction of the video database, so we can tolerate some cost. As before, we will still want to first try to quickly eliminate more non-matches before we try to positively label a match.

    Since we need to take resolution changes into account (e.g., from 1080p to iPod), we will need a way to characterize video information that is not only resolution-independent, but also tolerant of the noise added and/or data lost as part of changing the resolution. We must tolerate frame rate changes (say, from a movie's 24 fps to video's 30 fps). There are also aspect ratio changes to consider, such as from 4:3 NTSC to 16:9 HD. We would want to handle color-space changes, such as from color to monochrome.

    Then there are transformations that affect all these at once, such as transcoding between HD and PAL, which can simultaneously affect color-space, frame-rate, aspect ratio, and resolution. The characterization should also be tolerant of some degree of cropping and/or filling, such as would happen from a switch back and forth between 4:3 and 16:9 aspect ratios (letterboxing, but not pan & scan). We also should handle videos that have been truncated, such as removing the credits from the end of a feature movie. And, obviously, we must also handle the differences created by different encoders that were fed an identical video stream.

    That's quite a list! Let's consider some things we may choose not to account for: I suspect it is OK to fail to find a match when image warping is present, despite the fact that anamorphic warping isn't uncommon, especially in 35mm wide-screen movies that were directly scanned without anamorphic reconstruction (tall-skinny people). We may also choose to fail when large watermarks are present in the middle of the frame, though we will want to tolerate smaller watermarks in the corners. And finally, it is OK to fail to match videos that have been temporally distorted or spatially flipped, such as when one is a slo-mo of the other, or has been flipped left-to-right.

    Does that just about cover the video space? Hopefully it is clear why it is important to start with the filesystem and the audio! That is, first think of your database more like an MP3 collection before considering it as a video collection.

    Ignoring the audio, video is just an ordered sequence of still images. So we're actually looking for one or more image comparison algorithms combined with one or more time-series comparison algorithms. This could be either pairs of separate algorithms (characterize each frame, then characterize the sequence of frames), or it could be merged into a single algorithm (look at the differences between frames).

    The images themselves may be decomposed further, into a monochrome 'structural' image and a color 'overlay'. I believe we can safely ignore the color information, if it is computationally convenient to do so.

    From the above, it may sound like I've assumed we'll have to fully decode a video in order to perform any comparisons on it. That is not necessarily the case, though the comparison of encoded data has many difficulties that limit its usefulness. The one significant exception to this is for object-level video encodings such as MP4, where very high-level multi-frame comparisons have been performed. Unfortunately, object comparisons between MP4 streams has not seen much research, and I am aware of no packages able to perform this function. But if you find one, use it!

    Most other digital video streams use encoding schemes such as MPEG2, Quicktime, or something similar. These schemes all use the concept of key frames and difference frames, though each implements it differently. When different videos are being compared (ones that are not the same size), it is unlikely the key frames and difference frames will match to any useful degree. However, this does not mean it is impossible, and packages exist that attempt to extract useful information from such streams without performing full decoding. If you find one that is fast, it may fall into a "why not try it" category of tests.

    The one trick I will use is instead of decoding frames completely, I would instead decode them only into separate component channels (HSV, HSL, YUV, whatever) and not all the way to the RGB framebuffer (unless that's what's been encoded, of course). From here, I'd next create separate luminance and chrominance (color) frames so comparisons may be performed in related domains. Decoding all the way to an RGB framebuffer may introduce errors that may make finding matches more difficult.

    Next, I'd discard the color information. Since a monochrome video should match its color original, we simply don't care about color!

    How may the resulting sequence of monochrome frames best be compared to another sequence that may appear very different, yet still may possibly be a match? There have been literally decades of research in this area, much of it categorized under "scale-invariant match detection". Unfortunately, very little of this research has been directly applied to determining when videos do or do not match.

    For our purposes, we can approach this issue from several directions. First, we must know for ourselves what is and is not a match in the monochrome domain. For example, we do not care about pixel-level differences, since even if two matching-but-different videos had the same resolution, we must tolerate some level of noise due to things like encoder differences.

    A simple (but slow) way forward is to transform each image into a form that is independent of both resolution and aspect ratio. One such transformation is into the spatial frequency domain, and the 2D FFT is ideal for this. After discarding the imaginary component, the real component may be truncated at high frequencies to remove noise and at low frequencies to remove aspect ratio effects, then normalized to a standard scale eliminate resolution differences. The resulting data looks like an odd tiny image that may be directly compared across video streams.

    There are many other possible frame transformation strategies, many vastly more efficient than the FFT, and a literature search should highlight them. Unfortunately, I know of few that have been implemented in software libraries that are as easy to use as the FFT.

    Once we have transformed the monochrome frames into a smaller and more useful domain, we still must perform the comparison to another such stream from another video. And that video is almost certain not to be a frame-to-frame match, so a simple comparison will certainly fail. We need a comparison that will take onto account differences in the time domain, including added/removed frames and differences in frame rate.

    If you look at the sequence of FFT frames, you will notice some very distinct behavior. Scene fades are abrupt and extremely easy to spot, cuts can also be distinguished, and there are typically only slow changes seen in the FFT between cuts. From the sequence of FFTs we can label each frame as being the first after a cut/fade, or as a frame between cuts/fades. What's important is the time between each cut/fade, independent of the number frames between them, which creates a signature or fingerprint which is largely independent of the frame rate.

    Taking this fingerprint of an entire video yields data that is massively smaller than the video itself. It is also a linear sequence of numbers, a simple time-series vector, much like audio, and can be analyzed using many of the same tools.

    The first tool is to perform a correlation, to determine if the pattern of cuts in one video is a close match to that in another video. If there are significant differences, then the videos are different. If they are a close match, then the only a few FFTs after each correlated cut need be compared to determine if the frames are similar enough to be a match.

    I'll not go into the comparison of 2D FFTs here, since there are abundant references that do the job far better than I can.

    Note: There are many other manipulations (beyond a 2D FFT) that may be applied to monochrome frames to obtain additional fingerprints. Representations of actual image content may be created by extracting the interior edges of the image (literally like an FBI fingerprint), or by selectively thresholding the image and performing a 'blobbing' operation (creating a linked list of related region descriptors). Tracking the evolution of the edges and/or blobs between frames can be used not only to generate cut lists, but can also be used to extract additional high-level image characteristics that would be lost using a 2D FFT.

    We have constructed a sequence of comparison algorithms that should be very fast at finding non-matches, and not require too much time to conclusively determine matches. Alas, having algorithms does not a solution make! We must consider several issues related to how these algorithms should best be implemented.

    First, we don't want to open and read each video file any more times than necessary, else the CPU could stall waiting for data from the disk. We also don't want to read any further into a file than needed, though we don't want to stop reading too soon and potentially miss a later match. Should the information that characterizes each video be saved, or should it be recomputed when needed? Addressing these issues will permit an efficient and effective video comparison system to be developed, tested and deployed.

    We have shown it is possible to compare videos with some hope of finding matches under highly variable conditions, with computational efficiency.

    The rest has been left as an exercise for the reader. ;^)