arraysstringalgorithmsuffix-treesuffix-array

Practical implementation of suffix array


Looking for a practical implementation of suffix arrays, I came across this paper. It outlines a O(n (log n * log n)) approach, where n is the length of the string. While there are faster algorithms available, IMO, none is suitable in a programming contest or interview setting. But I found the explanation of the algorithm really lacking. I quote:

The algorithm is mainly based on maintaining the order of the string’s suffixes sorted by their 2k long prefixes. We shall execute m = [log2 n] (ceil) steps, computing the order of the prefixes of length 2k at the kth step. It is used an m x n sized matrix. Let’s denote by Aik the subsequence of A of length 2k starting at position i. The position of Aik in the sorted array of Ajk subsequences (j = 1, n) is kept in P(k, i).

When passing from step k to step k + 1, all the pairs of subsequences Aik and Ai+2^kk are concatenated, therefore obtaining the substrings of length 2k+1. For establishing the new order relation among these, the information computed at the previous step must be used. For each index i it is kept a pair formed by P(k, i) and P(k, i+2^k). The fact that i + 2k may exceed the string’s bounds must not bother us, because we shall fill the string with the "$" character, about which we shall consider that it’s lexicographically smaller than any other character. After sorting, the pairs will be arranged conforming to the lexicographic order of the strings of length 2k+1. One last thing that must be remembered is that at a certain step k there may be several substrings Aik = Ajk, and these must be labeled identically (P(k, i) must be equal to P(k, j)).

The paper provides a C++ implementation, but it's very difficult to glean any insight from the code when the variables are named L, P, and nr, so, I made a simplified implementation in Python. Note that the variable names are to the best of my understanding, which is what I'm hoping to get more clarification on.

def suffix_array(text: str) -> list[int]:
    @dataclasses.dataclass(order=True)
    class Entry:
        prefix: list[int] = dataclasses.field(default_factory=lambda: [0, 0])
        start: int = dataclasses.field(default_factory=int, compare=False)

    n = len(text)
    m = math.ceil(math.log2(n)) + 2
    pos = [ord(x) for x in text]
    suffixes = [Entry() for _ in range(n)]
    for step in range(1, m):
        cnt = 2 ** (step - 1)
        for i in range(n):
            suffixes[i].prefix[0] = pos[i]
            suffixes[i].prefix[1] = pos[j] if (j := i + cnt) < n else -1
            suffixes[i].start = i
        suffixes.sort()
        for i in range(n):
            if i > 0 and suffixes[i] == suffixes[i - 1]:
                pos[suffixes[i].start] = pos[suffixes[i - 1].start]
            else:
                pos[suffixes[i].start] = i

    return [i.start for i in suffixes]

Now, my questions:

  1. How do we know the algorithm works correctly? The paper provides no proof, and I don't understand how sorting the relative positions of the prefixes is also keeping the prefixes sorted.
  2. In plain English, what are the purposes of the variables suffixes, pos and Entry.prefix in the code above? These are named L, P, and nr, respectively, in the C++ code.
  3. At the first iteration (step=1), suffixes[i].prefix (L[i].nr in C++) becomes the start and the end of a prefix of length 2 starting at text[i]. suffixes (L in C++) is sorted, and positions are stored in pos (P in C++). At the next iteration (step=2), suffixes[i].prefix array is populated based on pos[1], so, it is no longer the start and end of a prefix of length 4. What is suffixes[i].prefix then?
  4. On page 6, it is said "The suffix array will be found on the last row of matrix P", but this appears to be incorrect. The suffix array is actually given by the start (p in C++) values of the array suffixes (L in C++).

Solution

  • OP here. A detailed explanation of this algorithm is given in the book Competitive Programming 3 by Steven Halim - Chapter 6.6.4 Suffix Array. A visualization of the algorithm is available here.

    I got it now, here are my notes.

    Basically, the algorithm works sort of like Radix Sort, whereas parts of the input are sorted first, then the parts after that. By sorting integers and not strings, the algorithm avoids the string comparisons that make the naive algorithm quadratic. At each iteration, the algorithm does the following:

    1. Update the ranking pair based on the ranks computed in the previous iteration.
    2. Sort the suffixes based on the updated ranking pairs.
    3. Recompute the ranks of the suffixes. To calculate the new ranks, we set the first one to have new rank r = 0. Then, we iterate from i = [1..n-1]. If the ranking pair of suffix 'i' is different from the ranking pair of the previous suffix in sorted order, we set the rank r = i. Otherwise, the rank stays at r.

    Example: text="GATAGACA".

    Iteration 1, suffixes of length 2 are considered. Before and after sorting:

    +---+--------+--------------+     +---+--------+--------------+------+
    | i | Suffix | ranking_pair |     | i | Suffix | ranking_pair | Rank |
    +---+--------+--------------+     +---+--------+--------------+------+
    | 0 | GA     | [71, 65]     |     | 7 | A      | [65, -1]     |    0 |
    | 1 | AT     | [65, 84]     |     | 5 | AC     | [65, 67]     |    1 |
    | 2 | TA     | [84, 65]     |     | 3 | AG     | [65, 71]     |    2 |
    | 3 | AG     | [65, 71]     |     | 1 | AT     | [65, 84]     |    3 |
    | 4 | GA     | [71, 65]     |     | 6 | CA     | [67, 65]     |    4 |
    | 5 | AC     | [65, 67]     |     | 0 | GA     | [71, 65]     |    5 |
    | 6 | CA     | [67, 65]     |     | 4 | GA     | [71, 65]     |    5 |
    | 7 | A      | [65, -1]     |     | 2 | TA     | [84, 65]     |    7 |
    +---+--------+--------------+     +---+--------+--------------+------+
    

    Iteration 2, suffixes of length 4 are considered. Before and after sorting:

    +---+--------+--------------+     +---+--------+--------------+------+
    | i | Suffix | ranking_pair |     | i | Suffix | ranking_pair | Rank |
    +---+--------+--------------+     +---+--------+--------------+------+
    | 7 | A      | [0, 2]       |     | 7 | A      | [0, 2]       |    0 |
    | 5 | AC     | [1, 3]       |     | 5 | AC     | [1, 3]       |    1 |
    | 3 | AGAC   | [2, 4]       |     | 3 | AGAC   | [2, 4]       |    2 |
    | 1 | ATAG   | [3, 5]       |     | 1 | ATAG   | [3, 5]       |    3 |
    | 6 | CA     | [4, 5]       |     | 6 | CA     | [4, 5]       |    4 |
    | 0 | GATA   | [5, 7]       |     | 4 | GACA   | [5, 6]       |    5 |
    | 4 | GACA   | [5, -1]      |     | 0 | GATA   | [5, -1]      |    6 |
    | 2 | TAGA   | [7, -1]      |     | 2 | TAGA   | [6, -1]      |    7 |
    +---+--------+--------------+     +---+--------+--------------+------+
    

    Iteration 3 is not shown, suffixes of length 8 are considered.