c++data-structuresdawg

Can a DAWG be used to store word related information?


Can a DAWG be used to store auxiliary information relating to each path, e.g. a word's frequency in the English language? If yes, then how can i do that?


Solution

  • Typically, you cannot store per-word information in a DAWG the same way that you could in a trie or other data structure. The reason for this is that multiple different words in a DAWG might all share nodes, so there is a risk that information for one word "leaks" into information for other words.

    As a simple example, suppose that we have a DAWG for the words "is", "as", "i", and "a". In that case, the DAWG would look like this:

                         START
                      a /     \ i
                       ACC    ACC
                     s  \     / s                        
                          ACC
    

    Note that the node representing the words "as" and "is" are exactly the same node. Therefore, if you were to try to annotate the word "as" with information, the node holding that information would also be the same as the node for "is," meaning that "as" and 'is" would both pick up the same set of information.

    You could try to get around this by storing a map in the node for "as" and "is" that maps from the word ending at that node to the extra information about that word, but this dramatically increases the memory usage of the DAWG. You're now storing each character in the word, so your memory usage will go way up (remember that the whole point of the DAWG is to reduce the memory usage required to store a set of words). You'd be better off just storing a hash table that maps from words to information.

    Another option you might try to store this information would be to expand out each path through the DAWG into its own branch, so that the nodes for different words are always different. The problem with this approach, though, is that you're effectively converting the DAWG back into a trie, which dramatically increases the memory usage involved.

    In short, there is no straightforward way to annotate words in a DAWG with metainformation without greatly increasing the memory usage. If you have to do this, you're better off using a different data structure.

    Hope this helps!