indexing data-structures tree s-expression

Fast lookup of tree with placeholders?

For an application I'm considering, there would be a large (100,000+) 'database' of trees (think expressions in a programming language, or S-expressions), and I would need to query that database for expressions that match a specific given expression.

Before giving the details of what I'd like to have, note that I'd appreciate any information related to indexing a large set of trees for optimizing lookup by a subtree.

In my specific situation (which would be for a backend to be used by Metamath proof assistants), expressions have the following structure (in Haskell-like notation):

data Expression = Placeholder Id | VarName Id | ConstName Id [Expression]

or as a BNF for an S-expression form:

Expression = '?' Id | Id | '(' Id Expression* ')'

where Id is some kind of identifier.

For example, I could have a database with expressions like

(equiv ?ph ?ps)
(not (in (appl (sqrt) (2)) (Q)))
(equiv (eq ?A ?B) (forall ?x (equiv (in ?x ?A) (in ?x ?B))))

In this context, two expressions match if they can be made equal by substitution of expressions for placeholders. So looking up (equiv (eq A (emptyset)) ?ph) in the above mini-database would result in the first and last expressions.

So again: how would I implement fast lookups in a large set of (expression) trees with placeholders? What kind of index data structure could I use?

Solution

I would implement the lookup with a trie. Each key would consist of one of the following:

ConstName Identifier
Variable w/ context info
ConstValue
Placeholder

These should be ordered in some fashion- possibly Placeholder, then all ConstNames (alphabetical), then variables (scope ordering, then argument order), then ConstValues (numerical order). As long as there's a concrete ordering for usage in the trie, you're fine.

Traverse the expression's tree, injecting the appropriate keys into the trie as they are encountered. Do this for all the expressions you want to insert into your data structure. When it comes time to query it, you can traverse the trie in a similar fashion, but with a few new rules.

Everything matches a placeholder node. If it matches some other key as well, then you'll need to explore both branches (easily done via a recursive DFS-like approach).
A placeholder matches everything. This is not equivalent to the previous point- we are talking about placeholders in the query here, the previous bullet is regarding placeholders as trie keys.

Now, this does mean that the search space can somewhat "explode" as you encounter placeholders, but there is one thing you can do to try to mitigate this in practice. Traverse the expression's tree in a breadth-first fashion (both in construction of the trie, and querying). This means if one of the arguments is a placeholder, you won't have to full-depth search every single subtree that matches that expression so far- instead you jump ahead to the next argument- which may not be a placeholder, and will thus greatly prune the search space (compared to matching "everything").

For completeness sake, lets take one of your examples

(not (in (appl (sqrt) (2)) (Q)))

and make a trie entry from that-

not -> in -> apply -> "Q" -> sqrt -> 2

adding (not (in ?ph E)) to this would result in-

not -> in -> apply -> "Q" -> sqrt -> 2
         \-> ?ph   -> "E"

Continue in this fashion injecting expressions into the trie. Also traverse in this fashion for querying until you reach the ends of your searches into the trie, and return those that matched.

Note- the uniqueness of these entries is based on the assumption you do not have to support variadic functions. If you do, attach to each key some context info (read the next paragraphs for info on how to do this) to distinguish which arguments go to which functions

There is one detail I glossed over- variables. If you only want it to match if they are the exact same variable name, then no work is necessary. But this likely isn't what you want; you probably want it to match generic variables as long as they are "consistent" with each other. The way to do this is to assign each variable an identifier that represents the scope of which it was first defined.

The easiest way to do this is just compose an identifier from the concatenation of the argument ordering of its ancestors. That is, if a variable is first defined as the second argument to a function which is the fifth argument to the root function, then we might label it as (5, 2) or (2, 5), whichever makes more sense intuitively. Either way, this will ensure the variable is given a consistent identifier regardless of other variables / functions elsewhere. Then proceed as normal with this new variable name.