Why does the KMP (Knuth-Morris-Pratt) algorithm use length = prefixTable[length - 1]
instead of simply length = length - 1
in the following code? I'm confused because I initially thought that both assignments would yield the same result. I understand that I'm mistaken, but I'm not sure where I went wrong. Could someone please explain this difference with examples?
private int[] buildPrefixTable(String s) {
int[] prefixTable = new int[s.length()];
int length = 0;
for (int i = 1; i < s.length(); i++) {
while (length > 0 && s.charAt(i) != s.charAt(length)) {
length = prefixTable[length - 1];
}
if (s.charAt(i) == s.charAt(length)) {
length++;
}
prefixTable[i] = length;
}
return prefixTable;
}
At each index, we need to compute the length of the longest proper suffix at this point that is also a prefix of the whole string. If the suffix from the previous index cannot be extended further, it is incorrect to simply decrement the length by one as we do not know if shifting the suffix over will still match with a prefix.
Consider the string "ababb". At the second last character, we have prefixTable[3] = 2
with "ab"
being the longest proper suffix that matches a prefix. Moving on to the last character, we see that 'a'
does not match 'b'
. If at this point we simply decremented the length, we would be assuming that the suffix of length 1 ending at index 3 is also a prefix of the whole string, i.e. "b" = "a"
. But we would not be checking this incorrect assumption, as we only compare the one new character.
By looking up prefixTable[length - 1]
instead, we get the longest proper suffix of the first length
characters (which is a prefix of the entire string) that is also a prefix. Note that the prefix with length
characters has an occurrence starting at the beginning of the string and one ending at the previous index, so a proper suffix of it would also be a proper suffix of the prefix up to the previous index. This ensures that we jump to a point where the suffix at the previous index still matches a prefix.