Looking for a practical implementation of suffix arrays, I came across this paper. It outlines a O(n (log n * log n))
approach, where n
is the length of the string. While there are faster algorithms available, IMO, none is suitable in a programming contest or interview setting. But I found the explanation of the algorithm really lacking. I quote:
The algorithm is mainly based on maintaining the order of the string’s suffixes sorted by their 2k long prefixes. We shall execute m = [log2 n] (ceil) steps, computing the order of the prefixes of length 2k at the kth step. It is used an m x n sized matrix. Let’s denote by Aik the subsequence of A of length 2k starting at position i. The position of Aik in the sorted array of Ajk subsequences (j = 1, n) is kept in P(k, i).
When passing from step k to step k + 1, all the pairs of subsequences Aik and Ai+2^kk are concatenated, therefore obtaining the substrings of length 2k+1. For establishing the new order relation among these, the information computed at the previous step must be used. For each index i it is kept a pair formed by P(k, i) and P(k, i+2^k). The fact that i + 2k may exceed the string’s bounds must not bother us, because we shall fill the string with the "$" character, about which we shall consider that it’s lexicographically smaller than any other character. After sorting, the pairs will be arranged conforming to the lexicographic order of the strings of length 2k+1. One last thing that must be remembered is that at a certain step k there may be several substrings Aik = Ajk, and these must be labeled identically (P(k, i) must be equal to P(k, j)).
The paper provides a C++ implementation, but it's very difficult to glean any insight from the code when the variables are named L
, P
, and nr
, so, I made a simplified implementation in Python. Note that the variable names are to the best of my understanding, which is what I'm hoping to get more clarification on.
def suffix_array(text: str) -> list[int]:
@dataclasses.dataclass(order=True)
class Entry:
prefix: list[int] = dataclasses.field(default_factory=lambda: [0, 0])
start: int = dataclasses.field(default_factory=int, compare=False)
n = len(text)
m = math.ceil(math.log2(n)) + 2
pos = [ord(x) for x in text]
suffixes = [Entry() for _ in range(n)]
for step in range(1, m):
cnt = 2 ** (step - 1)
for i in range(n):
suffixes[i].prefix[0] = pos[i]
suffixes[i].prefix[1] = pos[j] if (j := i + cnt) < n else -1
suffixes[i].start = i
suffixes.sort()
for i in range(n):
if i > 0 and suffixes[i] == suffixes[i - 1]:
pos[suffixes[i].start] = pos[suffixes[i - 1].start]
else:
pos[suffixes[i].start] = i
return [i.start for i in suffixes]
Now, my questions:
suffixes
, pos
and Entry.prefix
in the code above? These are named L
, P
, and nr
, respectively, in the C++ code.step=1
), suffixes[i].prefix
(L[i].nr
in C++) becomes the start and the end of a prefix of length 2 starting at text[i]
. suffixes
(L
in C++) is sorted, and positions are stored in pos
(P
in C++). At the next iteration (step=2
), suffixes[i].prefix
array is populated based on pos[1]
, so, it is no longer the start and end of a prefix of length 4. What is suffixes[i].prefix
then?P
", but this appears to be incorrect. The suffix array is actually given by the start
(p
in C++) values of the array suffixes
(L
in C++).OP here. A detailed explanation of this algorithm is given in the book Competitive Programming 3 by Steven Halim - Chapter 6.6.4 Suffix Array. A visualization of the algorithm is available here.
I got it now, here are my notes.
Basically, the algorithm works sort of like Radix Sort, whereas parts of the input are sorted first, then the parts after that. By sorting integers and not strings, the algorithm avoids the string comparisons that make the naive algorithm quadratic. At each iteration, the algorithm does the following:
Example: text="GATAGACA".
Iteration 1, suffixes of length 2 are considered. Before and after sorting:
+---+--------+--------------+ +---+--------+--------------+------+
| i | Suffix | ranking_pair | | i | Suffix | ranking_pair | Rank |
+---+--------+--------------+ +---+--------+--------------+------+
| 0 | GA | [71, 65] | | 7 | A | [65, -1] | 0 |
| 1 | AT | [65, 84] | | 5 | AC | [65, 67] | 1 |
| 2 | TA | [84, 65] | | 3 | AG | [65, 71] | 2 |
| 3 | AG | [65, 71] | | 1 | AT | [65, 84] | 3 |
| 4 | GA | [71, 65] | | 6 | CA | [67, 65] | 4 |
| 5 | AC | [65, 67] | | 0 | GA | [71, 65] | 5 |
| 6 | CA | [67, 65] | | 4 | GA | [71, 65] | 5 |
| 7 | A | [65, -1] | | 2 | TA | [84, 65] | 7 |
+---+--------+--------------+ +---+--------+--------------+------+
Iteration 2, suffixes of length 4 are considered. Before and after sorting:
+---+--------+--------------+ +---+--------+--------------+------+
| i | Suffix | ranking_pair | | i | Suffix | ranking_pair | Rank |
+---+--------+--------------+ +---+--------+--------------+------+
| 7 | A | [0, 2] | | 7 | A | [0, 2] | 0 |
| 5 | AC | [1, 3] | | 5 | AC | [1, 3] | 1 |
| 3 | AGAC | [2, 4] | | 3 | AGAC | [2, 4] | 2 |
| 1 | ATAG | [3, 5] | | 1 | ATAG | [3, 5] | 3 |
| 6 | CA | [4, 5] | | 6 | CA | [4, 5] | 4 |
| 0 | GATA | [5, 7] | | 4 | GACA | [5, 6] | 5 |
| 4 | GACA | [5, -1] | | 0 | GATA | [5, -1] | 6 |
| 2 | TAGA | [7, -1] | | 2 | TAGA | [6, -1] | 7 |
+---+--------+--------------+ +---+--------+--------------+------+
Iteration 3 is not shown, suffixes of length 8 are considered.