compare diff codebase

Compare multiple files for common code


I have two projects, each with a massive code base. I'd like to run a tool that goes through all the files in both projects and shows me which files across the projects contain similar code. I'm not even sure anything like this exists, but I remember that when I was in school, teachers had a tool they ran on code from multiple students to identify how similar their code was (to catch cheaters).


Solution

  • What you want is a clone detection tool. These tools find code that is duplicated across any set of files. For your task, you'd take the files from both projects and run clone detection across that combined set.

    [EDIT 2019 based on real experience doing exactly what OP wants to do].

    If a clone found in a file from one project corresponds to a clone found in a file from the other project, you've found what the two have in common.

    One drawback of running straight clone detection across all files from both projects is that you will also find many clones that lie entirely within a single project. Those aren't interesting for your question; in effect, they are false positives.

    My company provides a commercial clone detector called CloneDR. It is (IMHO) an extremely good detector and will find clones that other detectors cannot (e.g. it isn't fooled by changes to comments, code layout or number radixes, by variable renaming, or even by insertion or deletion of code fragments). But it has one other very nice property: it has an option to detect clones only across two project code bases. You won't get the false positives you'd get by treating the two projects as one.
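
To make the "only across two code bases" idea concrete, here is a minimal Python sketch. It is not how CloneDR works; it is a deliberately naive stand-in that hashes every run of `k` consecutive non-blank lines (ignoring whitespace only) and reports file pairs where the same run appears in both projects. The names `window_hashes`, `cross_project_clones` and `load_project` are made up for illustration, and unlike a real clone detector this approach is fooled by renamed variables, changed comments, and inserted or deleted lines.

```python
import hashlib
import re
from collections import defaultdict
from pathlib import Path


def window_hashes(text, k=5):
    """Yield one hash per run of k consecutive non-blank lines.

    Whitespace is removed before hashing, so layout changes inside a
    line do not matter; everything else (names, comments) still does.
    """
    lines = [re.sub(r"\s+", "", line) for line in text.splitlines()]
    lines = [line for line in lines if line]  # drop blank lines
    for i in range(len(lines) - k + 1):
        chunk = "\n".join(lines[i:i + k])
        yield hashlib.sha1(chunk.encode("utf-8")).hexdigest()


def cross_project_clones(project_a, project_b, k=5):
    """Return the set of (file_a, file_b) pairs sharing a k-line run.

    project_a and project_b map file names to source text. Only
    project A is indexed, and only project B is looked up against it,
    so clones that live entirely inside one project are never reported.
    """
    index = defaultdict(set)
    for name_a, text in project_a.items():
        for h in window_hashes(text, k):
            index[h].add(name_a)

    pairs = set()
    for name_b, text in project_b.items():
        for h in window_hashes(text, k):
            for name_a in index.get(h, ()):
                pairs.add((name_a, name_b))
    return pairs


def load_project(root, pattern="*.py"):
    """Hypothetical helper: read every matching file under root."""
    return {str(p): p.read_text(errors="ignore")
            for p in Path(root).rglob(pattern)}
```

Typical use would be something like `cross_project_clones(load_project("projA"), load_project("projB"))`, where the directory names are placeholders. Indexing only one project and querying with the other is what removes the within-project false positives described above.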