algorithmsearchtree

Algorithm for searching for patterns in trees


I am working on a project that heavily uses a tree structure for data processing. I am looking for a method to find matching patterns in the tree. for example, consider a tree like:

    (1:a) ----- (2:b) ---- (4:c) ---- (5:e) ---- (8:c) ---- (9:f)
          |---- (3:d)            |--- (6:f)            |--- (10:g) 
                                 |--- (7:g)

( 1 has two children 2 and 3, and 4 has children 5,6,7, and 8 has children 9 and 10 ) and the letters are the values of each node.

i need to find all the occurrences of something like

    c ---- f
      |--- g

which should return 4 and 8 as the indexes of the parent nodes. What is a good algorithm for that? It probably is BFS, but is there a more specialized search algorithm for this kind of searches?


Solution

  • This is some of my theory crafting, so feel free to correct me when I am wrong.

    It is influenced by a prefix/suffix trie structure, which enables one to find matching substrings in a string. Although the Data Structure I will choose will be more tree-like, it will also be very graph-like in nature, by connecting references of nodes.

    The output will ultimately (hopefully) show all indexes of the sub-tree roots that contains the pattern in a fast time.

    The data structure I will decide to use is similar to a tree node, which contains the string value, indexes of every location of where this occurs, indexes of all possible parents of nodes containing the common value, and childs are stored as a Map for O(1) best case searching.

    All following codes are done in C#.

        public class Node
        {
            public String value;                       //This will be the value. ie: “a”
            public Dictionary<int, int> connections;   //connections will hold {int reference (key), int parent (value)} pairs
            public Dictionary<String, Node> childs;    //This will contain all childs, with it’s value
                                                       //as the key.
            public Node()
            {
                connections = new Dictionary<int, int>();
                childs = new Dictionary<String, Node>();
            }
        }
    

    Second, we assume that your base data is a very traditional tree structure, although there may be few differences.

        public class TreeNode
        {
            public int index;
            public String value;
            public List<TreeNode> childs;
            public TreeNode()
            {
                childs = new List<TreeNode>();
            }
            public TreeNode(String value)
            {
                childs = new List<TreeNode>();
                this.value = value;
            }
            public void add(String name)
            {
                TreeNode child = new TreeNode(name);
                childs.Add(child);
            }
        }
    

    Finally, the base TreeNode structure's nodes are all indexed (in your example, you have used a 1 based index, but the following is done in a 0 based index)

            int index = 0;
            Queue<TreeNode> tempQ = new Queue<TreeNode>();
            tempQ.Enqueue(root);
            while (tempQ.Count > 0)
            {
                temp = tempQ.Dequeue();
                temp.index = index;
                index++;
                foreach (TreeNode tn in temp.childs)
                {
                    tempQ.Enqueue(tn);
                }
            }
            return root;
    

    After we initialize our structure, assuming that the base data is stored in a traditional type of TreeNode structure, we will try to do three things:

    1. Build a graph-like structure using the base TreeNode

    2. One biggest property is that unique values will only be represented in ONE node. For example, {C}, {F}, and {G} from your example will each be represented with only ONE node, instead of two. (Simply stated, all nodes with common values will be grouped together into one.)

    3. All unique nodes (from step 2) will be attached to the root element, and we will "rebuild" the tree by connecting references to references. (Graphic representation is soon shown below)

    Here is the code in C# to build the structure, done in O(n):

        private Node convert(TreeNode root)
        {
            Node comparisonRoot = new Node();   //root of our new comparison data structure.
                                                //this main root will contain no data except
                                                //for childs inside its child map, which will
                                                //contain all childs with unique values.
    
            TreeNode dataNode = root;           //Root of base data.
    
            Node workingNode = new Node();      //workingNode is our data structure's
                                                //copy of the base data tree's root.
            workingNode.value = root.value;
            workingNode.connections.Add(0, -1);
            // add workingNode to our data structure, because workingNode.value
            // is currently unique to the empty map of the root's child.
            comparisonRoot.childs.Add(workingNode.value, workingNode); 
    
            Stack<TreeNode> s = new Stack<TreeNode>();
            s.Push(dataNode);                   //Initialize stack with root.
    
            while (s.Count > 0) {               //Iteratively traverse the tree using a stack
                TreeNode temp = s.Pop();
                foreach(TreeNode tn in temp.childs) {
                    //fill stack with childs
                    s.Push(tn);
                }
                //update workingNode to be the "parent" of the upcoming childs.
                workingNode = comparisonRoot.childs[temp.value];
                foreach(TreeNode child in temp.childs) {
                   if(!comparisonRoot.childs.ContainsKey(child.value)) {
                       //if value of the node is unique
                       //create a new node for the unique value
                       Node tempChild = new Node();
                       tempChild.value = child.value;
    
                       //store the reference/parent pair
                       tempChild.connections.Add(child.index, temp.index);
    
                       //because we are working with a unique value that first appeared,
                       //add the node to the parent AND the root.
                       workingNode.childs.Add(tempChild.value, tempChild);
                       comparisonRoot.childs.Add(tempChild.value, tempChild);
                   } else {
                       //if value of node is not unique (it already exists within our structure)
                       //update values, no need to create a new node.
                       Node tempChild = comparisonRoot.childs[child.value];
                       tempChild.connections.Add(child.index, temp.index);
                       if (!workingNode.childs.ContainsKey(tempChild.value)) {
                           workingNode.childs.Add(tempChild.value, tempChild);
                       }
                   }
                }
            }
            return comparisonRoot;
        }
    

    All unique values are attached to a non-valued root, just for the purposes of using this root node as a map to quickly jump to any reference. (Shown below)

    enter image description here

    Here, you can see that all connections are made based on the original example tree, except that there are only one instance of nodes for each unique value. Finally, you can see that all of the nodes are also connected to the root.

    The whole point is that there is only 1 real Node object for each unique copy, and points to all possible connections by having references to other nodes as childs. It's kind of like a graph structure with a root.

    Each Node will contain all pairs of {[index], [parent index]}.

    Here is a string representation of this data structure:

    Childs { A, B, D, C, E, F, G } 
    Connections { A=[0, -1]; B=[1, 0]; D=[2, 0]; C=[3, 1][7, 4]; 
                  E=[4, 3]; F=[5, 3][8, 7]; G=[6, 3][9, 7] }
    

    Here, the first thing you may notice is that node A, which has no true parent in your example, has a -1 for its parent index. It's just simply stating that Node A has no more parent and is the root.

    Other things you may notice is that C has index values of 3 and 7, which respectively is connected to 1 and 4, which you can see is Node B and Node E (check your example if this doesn't make sense)

    So hopefully, this was a good explanation of the structure.

    So why would I decide to use this structure, and how will this help find out the index of the nodes when matched up with a certain pattern?

    Similar to suffix tries, I thought that the most elegant solution would return all "successful searches" in a single operation, rather than getting traversing through all nodes to see if each node is a successful search (brute force).

    So here is how the search will work.

    Say we have the pattern

    c ---- f
      |--- g
    

    from the example.

    In a recursive approach, leaves simply return all possible parentIndex (retrieved from our [index, parentIndex] pairs).

    Afterwards, in a natural DFS type of traversal, C will receive both return values of F and G.

    Here, we do an intersection operation (AND operation) to all the childs and see which parentIndex the sets share in common.

    Next, we do another AND operation, this time between the result from the previous step and all possible C's (our current branch's) index.

    By doing so, we now have a set of all possible C's indexes that contains both G and F.

    Although that pattern is only 2 levels deep, if we are looking at a pattern with a deeper level, we simply take the result set of C indexes, find all parent pairs of the result indexes utilizing our [index, parentIndex] map, and return that set of parentIndexes and return to step 2 of this method. (See the recursion?)

    Here is the C# implementation of what was just explained.

        private HashSet<int> search(TreeNode pattern, Node graph, bool isRoot)
        {
            if (pattern.childs.Count == 0)
            {
                //We are at a leaf, return the set of parents values.
                HashSet<int> set = new HashSet<int>();
                if (!isRoot)
                {
                    //If we are not at the root of the pattern, we return the possible
                    //index of parents that can hold this leaf.
                    foreach (int i in graph.connections.Keys)
                    {
                        set.Add(graph.connections[i]);
                    }
                }
                else
                {
                    //However if we are at the root of the pattern, we don't want to
                    //return the index of parents. We simply return all indexes of this leaf.
                    foreach (int i in graph.connections.Keys)
                    {
                        set.Add(i);
                    }
                }
                return set;
            }
            else
            {
                //We are at a branch. We recursively call this method to the
                //leaves.
                HashSet<int> temp = null;
                foreach(TreeNode tn in pattern.childs) {
                    String value = tn.value;
                    //check if our structure has a possible connection with the next node down the pattern.
                    //return empty set if connection not found (pattern does not exist)
                    if (!graph.childs.ContainsKey(value)){
                        temp = new HashSet<int>();
                        return temp;
                    }
                    Node n = graph.childs[value];
                    //Simply recursively call this method to the leaves, and
                    //we do an intersection operation to the results of the
                    //recursive calls.
                    if (temp == null)
                    {
                        temp = search(tn, n, false);
                    }
                    else
                    {
                        temp.IntersectWith(search(tn, n, false));
                    }
                }
                //Now that we have the result of the intersection of all the leaves,
                //we do a final intersection with the result and the current branch's
                //index set.
                temp.IntersectWith(graph.connections.Keys);
                //Now we have all possible indexes. we have to return the possible
                //parent indexes.
                if (isRoot)
                {
                    //However if we are at the root of the pattern, we don't want to
                    //return the parent index. We return the result of the intersection.
                    return temp;
                }
                else
                {
                    //But if we are not at the root of the pattern, we return the possible
                    //index of parents.
                    HashSet<int> returnTemp = new HashSet<int>();
                    foreach (int i in temp)
                    {
                        returnTemp.Add(graph.connections[i]);
                    }
                    return returnTemp;
                }
            }
        }
    

    To call this method, simply

    //pattern - root of the pattern, TreeNode object
    //root    - root of our generated structure, which was made with the compare() method
    //boolean - a helper boolean just so the final calculation will return its
    //          own index as a result instead of its parent's indices
    HashSet<int> answers = search(pattern, root.childs[pattern.value], true);
    

    Phew, that was a long answer, and I'm not even sure if this is as efficient as other algorithms out there! I am also sure that there may be more efficient and elegant ways to search for a subtree inside a larger tree, but this was a method that came into my head! Feel free to leave any criticism, advice, edit, or optimize my solution :)