Calculating Binary Data Similarity

I've seen a few questions here related to determining the similarity of files, but they are all linked to a particular domain (images, sounds, text, etc). The techniques offered as solutions require knowledge of the underlying file format of the files being compared. What I am looking for is a method without this requirement, where arbitrary binary files could be compared without needing to understand what type of data they contain. That is, I am looking to determine the similarity percentage of two files' binary data.

To give a little more detail for you to work with, even though this is potentially applicable to many things, I do have a specific problem that I'm working on. I also currently have a working solution, but I don't think that it is ideal. There are probably many optimizations in terms of the comparison method, and storing the results. Hopefully some people here will be able to give me some new ideas. I will probably edit in some information about my current method after a couple of days, but I don't want to bias peoples' thoughts about the problem by telling you how I'm already doing it.

The problem I'm working on is clone detection for video game ROM images. For those that don't have experience with emulation, ROMs are dumps of the data on game cartridges. A ROM "clone" is typically a modified version of the same game, the most common type being a translated version. For example, the Japanese and English versions of the original Final Fantasy for the NES are clones. The games share almost all of their assets (sprites, music, etc), but the text has been translated.

There are currently several groups that work on maintaining lists of clones for the various systems, but as far as I can tell, this is all done manually. What I am attempting to do is find a method to detect similar ROM images automatically and objectively, based on data similarity instead of "these seem like the same game". There are several reasons for detecting clones, but one of the major motivations is to be used with Solid compression. This allows compression of all game clones together into the same archive, with the entire compressed clone set often taking up only slightly more space than one of the individual ROMs.

Some concerns to consider when coming up with potential approaches:

ROMs vary highly in size, depending on the system. Some are small, but modern systems may have large ones, 256MB or more. Some (all?) systems only have powers of 2 as possible sizes, a 130MB game on one of these systems would have a 256MB rom, largely empty. Note that because of this, some clones may have wildly different sizes, if a game version crosses the threshold and has to use a cartridge that is twice the size.
There are currently thousands of known ROMs on many systems, with most systems still having new ones released constantly. Even for older systems, there is a major ROM-hacking community that produces modified ROMs often.
Storing similarity data for every possible pair of ROMs would result in millions of rows of data for any of the more popular systems. A system with 5000 ROMs would require 25 million rows of similarity data, with a single new game adding another 5000 rows.
State of the processing must be recoverable, so that if it is interrupted it can pick up where it left off. With any method, a lot of processing will be required, and assuming that the whole thing will run in one batch is not safe.
New ROMs could be added at any time, so the method should not assume that it already has a "complete" set. That is, even after you have already figured out similarity for all existing ROMs, if a new one is added (and this could also occur before previous processing was entirely finished) there must be a method for comparing it to all previous ones, to determine which (if any) it is a clone of.
Higher processing speed should be given priority over accuracy (to a point). Knowing whether two ROMs are 94% or 96% similar is not particularly important, but if it takes a day of processing to compare a new ROM to all the previous ones, the program would probably never truly complete.

It's been an interesting problem to work on, I look forward to seeing what other people can come up with. Let me know in the comments if you want any more details, and I'll try to supply them.

Solution

It sounds like you want a binary delta or perhaps an index derived from the application of a binary delta (like it's size). You could then compare this index to some baseline that you determine experimentally to decide if it's a "clone" or not.

There are a lot of similarities between compression and delta creation, so I'd say you aren't far off with your current implementation.

That being said, pairwise comparison of every binary file in your database is probably prohibitively expensive (O(n²), I think). I would try to find a simple hash for identifying possible candidates for comparison. Something conceptually similar to what spdenne and Eduard are suggesting. That is, find a hash which can be applied to every item once, sort that list and then use a finer grained comparison on items whose hashes are close together in the list.

Constructing hashes useful for the general case has been an actively pursued research topic in CS for several years. The LSHKit software library implements some algorithms of this sort. The internet accessible paper FINDING SIMILAR FILES IN A LARGE FILE SYSTEM seems like it might be targeted more at comparing text files but might be useful to you. The more recent paper Multi-resolution similarity hashing describes a more powerful algorithm. It doesn't appear to be accessible without a subscription, though. You probably want to keep the wikipedia article on Locality Sensitive Hashing handy as you browse the other resources. They all get pretty technical and the wikipedia entry itself is pretty math heavy. As a more user-friendly alternative you might be able to apply some ideas (or even executables) from the field of Acoustic Fingerprinting.

If you're willing to abandon the general case it's likely that you can find a much simpler (and faster) domain-specific hash function that works just for your ROMs. Possibly something involving the placement of standard, or common, byte sequences and the value of select bits near them. I don't really know much about your binary format but I'm imagining things that signal the start of sections in the file like regions for sound, images or text. Binary formats frequently store the addresses of these sorts of sections near the beginning of the file. Some also use a chaining mechanism that stores the address of the first section at a known location along with it's size. This allows you to move to the next section which also contains a size, etc. A little investigation will probably allow you to discover any relevant formatting, if you're not already aware of it, and should put you well on your way to constructing a useful hash.

If the hash functions don't get you all the way (or they require input of some sort to define a metric/distance) then there are several binary delta algorithms and implementations available on the web. The one I'm most familiar with is used by the subversion version control system. It uses a binary delta algorithm called xdelta to efficiently store binary file revisions. Here's a link directly to the file in their repository that implements it: xdelta.c. There's probably a tool on the web that makes this more accessible as well.