finite-automataperfect-hash

In this minimal perfect hashing function, what is meant by FirstLetter and Predecessor?


I'm implementing a Minimalistic Acyclic Finite State Automaton (MA-FSA; a specific kind of DAG) in Go, and would like to associate some extra data with nodes that indicate EOW (end-of-word). With MA-FSA, the traditional approach is not possible because there are multiple words that might end at that node. So I'm looking into minimal perfect hashing functions as an alternative.

In the "Correction" box at the top of his blog post, Steve Hanov says that he used the method described in this paper by Lucchesi and Kowaltowski. In looking at Figure 12 (page 19), it describes the hashing function.

On line 8, it refers to FirstLetter and Predecessor(), but it doesn't describe what they are. Or I'm not seeing it. What are they?

All I can figure out is that it's just traversing the tree, adding up Number from each node as it goes, but that can't possibly be right. It produces numbers that are too large and it's not one-to-one, like the paper says. Am I misreading something?


Solution

  • The paper says:

    Let us assume that the representation of our automaton includes, for each state, an integer which gives the number of words that would be accepted by the automaton starting from that state.

    So I believe this: for C <- FirstLetter to Predecessor(Word[I ]) do

    Means: for (c = 'a'; c < word[i]; c++)

    (They're just trying to be alphabet-independent.)

    Think of it this way: enumerate all accepted words. Sort them. Find your word in the list. Its index is the word's hash value.

    Their algorithm avoids storing the complete list by keeping track of how many words are reachable from a given node. So you get to a node, and check all the outgoing edges to other nodes that involve a letter of the alphabet before your next letter. All of the words reachable from those nodes must be on the list before your word, so you can calculate what position your word must occupy in the list.