I've been working on a project developing a more extensible way to create synopses of ancient texts (essentially side-by-side comparisons with the diffs highlighted, but with options to focus only on semantically-useful diffs.) This means that it makes the most sense to use a character diff (primarily because some of these diffs are one extra vowel letter, which is not 'important', so to speak.) I've implemented the Myers diff algorithm, as described in the wonderful blog post in https://blog.jcoglan.com/2017/02/12/the-myers-diff-algorithm-part-1/ (and subsequent two parts) in Javascript, which I attach at the bottom of this post.
Coglan notes that the Myers diff algorithm is greedy - 'trying to consume as many lines that are the same before making a change' - and thus avoids the so-called "wrong end" problem:
Good: class Foo Bad: class Foo
def initialize(name) def initialize(name)
@name = name @name = name
end + end
+ +
+ def inspect + def inspect
+ @name + @name
+ end end
end end
I'm wondering if there is a way to avoid this situation when comparing letter-by-letter. The implementation Coglan describes leads to the following result in the case of "letter-by-letter", using the example above: the instance of "end" that the code thinks I have added, which should be the last three letters, are instead the "e" in def, "n" in inspect, and "d" in the first end.
In my use case, I have run into this problem when attempting to skip diffs which emerge from the usage of Aramaic acronyms. Said acronyms are of the form LLL"L, where the each L(etter) is taken from the first letter of the word it represents, and the double quote is placed between the penultimate and last letter of the acronym. Note that there are no capital letters in Aramaic (which would have been a natural first thing to test for potential acronyms in English.) I have highlighted in yellow tokens which contain acronyms (which are fully-spelled-out in the other text) in an example below. Spaces which are removed by the diff show up as arrows pointing down: As you might be able to see, the non-highlighted diffs work like a charm. It's pretty easy to tell where the "extra" or "switched" letters are, data that is important for me. And it's also possible to show that all acronym/plain-text pair - except for the first highlighted pair - demonstrate a regular pattern. (Shared letter in black, then removed letters until the end of a word - i.e. until an arrow in red - then another shared letter in black, etc...) Acronyms like these are the parallel to diffing "C.G.I" with "computer generated imagery", where not a single letter in the balance of the full-text "omputer enerated magery" is a letter in the acronym. But when the acronym is something like "A.C.M" and "association [for] computing machinery", where "computing" contains an M/m (again under the constraint in this case that there is no capitalization), the diff will use that m. This is what happens in the case of א״ר and אמר רבי. Even though the former is an acronym for the latter, the analysis is that אמר alone becomes א״ר; again, due to the wrong end problem.
I understand that the preferred method to solve this problem alone would likely simply be to just go back to diffing words, but then I lose much of the other functionality I've developed... Anyhow, here's the code (in Javascript, again; based on the blog post above.)
function myersDiff(oldTokens, newTokens) {
const m = oldTokens.length;
const n = newTokens.length;
const max = m + n;
//max amount of moves we may need to make
const v = new Map();
//map contains trace in order
v.set(1, 0);
let path = [];
//get shortest edit graph using the 45-degree k-lines at each depth d
for (let d = 0; d <= max; d++) {
path[d] = new Map();
for (let k = -d; k <= d; k += 2) {
/* e.g. if depth is 6, consider all options from -12 to 12 since the trace is a sparse binary tree
if k = -d, we're on the edge of the graph. We can only go downward (i.e. left). If k = d, we can only go up-right (i.e. right.)
Otherwise, check the elements to the right and left (k+1, k-1) and take the highest x-value
*/
let x;
if (k === -d || (k !== d && v.get(k - 1) < v.get(k + 1))) {
x = v.get(k + 1);
} else {
x = v.get(k - 1) + 1; //when moving rightward, the k-line index is higher
}
let y = x - k;
//edit graph should consider diagonals to add 0 cost
while (x < m && y < n && oldTokens[x].letter === newTokens[y].letter) {
x++;
y++;
}
v.set(k, x);
path[d].set(k, { x, y });
//check for end
if (x >= m && y >= n) {
return buildChanges(path, oldTokens, newTokens);
}
}
}
}
function buildChanges(path, oldTokens, newTokens) {
const changes = [];
let d = path.length - 1;
let x = oldTokens.length;
let y = newTokens.length;
while (d >= 0) {
const k = x - y;
const step = path[d].get(k);
let prevK;
if (k === -d || (k !== d && path[d - 1] && path[d - 1].has(k - 1) && path[d - 1].get(k - 1).x < path[d - 1].get(k + 1)?.x)) {
prevK = k + 1;
} else {
prevK = k - 1;
}
const prevStep = path[d - 1]?.get(prevK);
//backtrack to construct the list, privileging diagonals
while (x > (prevStep?.x || 0) && y > (prevStep?.y || 0)) {
changes.unshift({ type: "unchanged", token: oldTokens[x - 1].letter, capital: oldTokens[x-1].capital });
x--;
y--;
}
if (x > (prevStep?.x || 0)) {
changes.unshift({ type: "removed", token: oldTokens[x - 1].letter, capital: oldTokens[x-1].capital });
x--;
} else if (y > (prevStep?.y || 0)) {
changes.unshift({ type: "added", token: newTokens[y - 1].letter, capital: newTokens[y-1].capital });
y--;
}
d--;
}
return changes;
}
I usually do this sort of thing by utilizing the more common longest-common-subsequence algorithm, but then add a tie-breaking penalty to the path cost for starting or ending a match (diagonal sequence).
The size of the penalty should vary according to how desirable the start or end position is.
A small penalty everywhere will cause the algorithm to favor contiguous matches over broken matches of the same total size, which is usually nice.
You would make the penalty higher for starting or ending in the middle of a word, which would favor whole word matches.
Note that this will require tracking 3 numbers for each position in the matrix: The total edit cost, the best penalty when you get there via a match (diagonal move), and the best penalty when you get there via a non-match.