algorithmdifflcs

Myers diff algorithm vs Hunt–McIlroy algorithm


The longest common subsequence problem is a classic computer science problem, and algorithms to solve it are the root of version control systems and wiki engines.

Two basic algorithms are the Hunt–McIlroy algorithm which was used to create the original version of diff, and the Myers diff algorithm which is used by the GNU diff utility currently. Both seem to work more or less by finding the shortest path through a graph that represents the edit space between the two strings or text files. The edit space is the number of inserts or deletes necessary to transform one sequence into the other. So what exactly is the difference between Myers' diff algorithm and the Hunt–McIlroy algorithm?


Solution

  • The Myers algorithm mostly wins because it does not memorize the paths while working, and does not need to "foresee" where to go, doing so it can concentrate at any time only on the furthest positions it could reach with an edit script of smallest complexity. This "smallest complexity" ensures that what is found is the LCS, and not memorizing through which path it went until it found a match uses less memory. Not trying to compute all matches ahead of time avoids their storage, and make them a benefit at match time rather than a memory hog.