javaalgorithmdata-structurestrieaho-corasick

Java implementation of Aho-Corasick string matching algorithm?


Now I know there have been previous questions regarding this algorithm, however I honestly haven't come across a simple java implementation. Many people have copied and pasted the same code in their GitHub profiles, and its irritating me.

So for the purpose of interview exercise, I've planned to set out and implement the algorithm using a different approach.

The algorithm tend out to be very very challenging. I honestly am lost on how to go about it. The logic just doesn't make sense. I've nearly spent 4 days straight sketching the approach, but to no avail.

Therefore please enlighten us with your wisdom.

I'm primarily doing the algorithm based on this information Intuition behind the Aho-Corasick string matching algorithm

It would be a big bonus if one can implement their own solution here.

But here's the following incomplete and not working solution which I got really stuck at:

If your overwhelemed with the code, the main problem lies at the main algorithm of Aho-Corasick. We already have created the trie tree of dictionaries well.

But the issue is, that now that we have the trie strcuture, how do we actually start implementing.

None of the resources were helpful.

public class DeterminingDNAHealth {
  private Trie tree;
  private String[] dictionary;
  private Node FailedNode;


  private DeterminingDNAHealth() {

  }

  private void buildMatchingMachine(String[] dictionary) {
    this.tree = new Trie();
    this.dictionary = dictionary;

    Arrays.stream(dictionary).forEach(tree::insert);

  }

  private void searchWords(String word, String[] dictionary) {

    buildMatchingMachine(dictionary);

    HashMap < Character, Node > children = tree.parent.getChildren();

    String matchedString = "";

    for (int i = 0; i < 3; i++) {
      char C = word.charAt(i);

      matchedString += C;

      matchedChar(C, matchedString);

    }

  }

  private void matchedChar(char C, String matchedString) {


    if (tree.parent.getChildren().containsKey(C) && dictionaryContains(matchedString)) {

      tree.parent = tree.parent.getChildren().get(C);

    } else {

      char suffix = matchedString.charAt(matchedString.length() - 2);

      if (!tree.parent.getParent().getChildren().containsKey(suffix)) {
        tree.parent = tree.parent.getParent();

      }


    }
  }

  private boolean dictionaryContains(String word) {

    return Arrays.asList(dictionary).contains(word);

  }


  public static void main(String[] args) {

    DeterminingDNAHealth DNA = new DeterminingDNAHealth();

    DNA.searchWords("abccab", new String[] {
      "a",
      "ab",
      "bc",
      "bca",
      "c",
      "caa"
    });


  }
}

I have setup a trie data structure which works fine. So no problem here

trie.java

public class Trie {
  public Node parent;
  public Node fall;

  public Trie() {
    parent = new Node('⍜');
    parent.setParent(new Node());
  }

  public void insert(String word) {...}

  private boolean delete(String word) {...}

  public boolean search(String word) {...}

  public Node searchNode(String word) {...}

  public void printLevelOrderDFS(Node root) {...}

  public static void printLevel(Node node, int level) {...}

  public static int maxHeight(Node root) {...}

  public void printTrie() {...}

}

Same thing for Node.

Node.java

public class Node {

  private char character;
  private Node parent;
  private HashMap<Character, Node> children = new HashMap<Character, Node>();
  private boolean leaf;

  // default case
  public Node() {}

  // constructor accepting the character
  public Node(char character) {
    this.character = character;
  }

  public void setCharacter(char character) {...}

  public char getCharacter() {...}

  public void setParent(Node parent) {...}

  public Node getParent() {...}

  public HashMap<Character, Node> getChildren() {...}

  public void setChildren(HashMap<Character, Node> children) {...}

  public void resetChildren() {...}

  public boolean isLeaf() {...}

  public void setLeaf(boolean leaf) {...}
}


Solution

  • I usually teach a course on advanced data structures every other year, and we cover Aho-Corasick automata when exploring string data structures. There are slides available here that show how to develop the algorithm by optimizing several slower ones.

    Generally speaking, I’d break the implementation down into four steps:

    1. Build the trie. At its core, an Aho-Corasick automaton is a trie with some extra arrows tossed in. The first step in the algorithm is to construct this trie, and the good news is that this proceeds just like a regular trie construction. In fact, I’d recommend just implementing this step by pretending you’re just making a trie and without doing anything to anticipate the later steps.

    2. Add suffix (failure) links. This step in the algorithm adds in the important failure links, which the matcher uses whenever it encounters a character that it can’t use to follow a trie edge. The best explanation I have for how these work is in the linked lecture slides. This step of the algorithm is implemented as a breadth-first search walk over the trie. Before you code this one up, I’d recommend working through a few examples by hand to make sure you get the general pattern. Once you do, this isn’t particularly tricky to code up. However, trying to code this up without fully getting how it works is going to make debugging a nightmare!

    3. Add output links. This step is where you add in the links that are used to report all the strings that match at a given node in the trie. It’s implemsnted through a second breadth-first search over the trie, and again, the best explanation I have for how it works is in the slides. The good news is that this step is actually a lot easier to implement than suffix link construction, both because you’ll be more familiar with how to do the BFS and how to walk down and up the trie. Again, don’t attempt to code this up unless you can comfortably do this by hand! You don’t need min code, but you don’t want to get stuck debugging code whose high-level behavior you don’t understand.

    4. Implement the matcher. This step isn’t too bad! You just walk down the trie reading characters from the input, outputting all matches at each step and using the failure links whenever you get stuck and can’t advance downward.

    I hope this gives you a more modular task breakdown and a reference about how the whole process is supposed to work. Good luck!