recursioncodeql

How is this recursive CodeQL predicate is evaluated?


I'm in the process of trying to learn CodeQL and I'm a little confused about how certain CodeQL code is evaluated. I'm hoping someone can help me with a more simplistic explanation.

Take the following CodeQL code:

string getANeighbor(string country){
  country = "France" and result = "Belgium"
  or
  country = "France" and result = "Germany"
  or
  country = "Germany" and result = "Austria"
  or
  country = "Germany" and result = "Belgium"
  or
  country = getANeighbor(result)
}

select getANeighbor("Germany")

The result I get back is:

France
Austria
Belgium

I understand why Belgium and Austria are returned. However I'm confused as to how CodeQL determines that France is to be returned as a result. My imperative programming intuition tells me that for France to be returned, I would need an additional line that looks something like country = "Germany" and result = "France", but I'm really confused here how France is being returned without that line of code.

Also, how does this line work exactly?:

country = getANeighbor(result)

With some of the simple examples that are given in the CodeQL handbook, they make it seem like that the 'result' keyword almost acts like 'return' in other languages. I feel that I have a fundamental misunderstanding of what 'result' does and how it works. I can't seem to find a good explanation after googling.

Thanks in advance!


Solution

  • I found the question from a a user who was kind enough to answer my question at a different site. Source (https://github.com/github/securitylab/discussions/85):

    Answer (Provided by: Pavel Avgustinov from Semmle):

    This is a great question. I'll give the full details on recursive predicates below, but let's first deal with result.


    Also, how does this line work exactly?:

    country = getANeighbor(result)

    With some of the simple examples that are given in the CodeQL handbook, they make it seem like that the 'result' keyword almost acts like 'return' in other languages. I feel that I have a fundamental misunderstanding of what 'result' does and how it works. I can't seem to find a good explanation after googling.


    You're right to observe that in many examples result = f() in CodeQL seems to act like return f() in other languages. However, a return statement (with its usual meaning of actually interrupting the evaluation of a function and returning to the caller) does not make sense in a logic language, where it's up to the CodeQL evaluator how and in what order to evaluate the different parts of a predicate you define. Instead, when a predicate declares a return type (string in your example), this implicitly declares the special variable result in its body, whose type is the declared return type of the predicate. You can use this variable just like any other, in particular to constrain it as part of the predicate.

    Note that this is quite similar to how you implicitly get a special variable called this in a member predicate (or method, in traditional languages).

    Other than its implicit declaration, there is nothing special about the result variable. In particular, logically you could write the getANeighbor predicate like this, without its use:

    predicate getANeighbor(string country, string neighbor) {   
    country = "France" and neighbor = "Belgium"
    or
    country = "France" and neighbor = "Germany" 
    or
    country = "Germany" and neighbor = "Austria" 
    or 
    country = "Germany" and neighbor = "Belgium" 
    or  
    getANeighbor(neighbor, country)
    }
    

    You do have to invoke the predicate differently if you declare it without a result type: getANeighbor(neighbor, country) rather than country = getANeighbor(result). Other than this syntactic detail, the two are entirely equivalent.

    If you get used to the idea that result is just an implicitly declared variable, not some special way of signaling the returned value of a predicate, then code like country = getANeighbor(result) (or more complex cases, where result is used multiple times to place multiple conditions on it) will make more sense. Contrast it to the result-free version I gave above, which explicitly flips the order of arguments in the recursive call to getANeighbor: it's exactly the same thing, we're invoking getANeighbor with "flipped" arguments, namely result as the first parameter, and country equated to the "returned" value.

    As a final observation, you can of course get something more similar to the return-like style of many examples by introducing an intermediate variable: Instead of country = getANeighbor(result), we can write exists(string neighbor | country = getANeighbor(neighbor) and result = neighbor), but you can see how this ends up being more verbose.


    The general intuition behind recursion in CodeQL is given here, but at a high level you can think of each recursive call as representing the "current" set of value tuples in the called predicate, and all predicate definitions keep being evaluated until there is no change. Let's see how this works in your example predicate:

    string getANeighbor(string country)
    {   
    country = "France" and result = "Belgium"   
    or
    country = "France" and result = "Germany"
    or
    country = "Germany" and result = "Austria"
    or
    country = "Germany" and result = "Belgium"
    or
    country = getANeighbor(result) 
    }
    

    The first time we try to evaluate getANeighbour, we do not know anything about it, and so we treat a call to it as a call to the empty set of values. This means we deduce the following values for it:

    +-----------+----------+---------+
    | iteration | country  | result  |
    +-----------+----------+---------+
    |    1      | France   | Belgium |
    |    1      | France   | Germany |
    |    1      | Germany  | Austria |
    |    1      | Germany  | Belgium |
    +-----------+----------+---------+
    

    However, after deducing these values, we know that there was a recursive call to getANeighbor, and that the values we have for it changed. Thus, we need to re-evaluate the definition of getANeighbor, now using our expanded set of values for the recursive call. This means that we deduce the following:

    +-----------+----------+---------+
    | iteration | country  | result  |
    +-----------+----------+---------+
    |    2      | France   | Belgium |
    |    2      | France   | Germany |
    |    2      | Germany  | Austria |
    |    2      | Germany  | Belgium |
    |           |          |         |
    |    2      | Belgium  | France  |
    |    2      | Germany  | France  |
    |    2      | Austria  | Germany |
    |    2      | Belgium  | Germany |
    +-----------+----------+---------+
    

    Note that we re-deduced the first four lines, which do not depend on a recursive call, just like we did in iteration 1. This is fine -- we already knew that "Belgium" = getANeighbor("France") was true, and thus we can simply ignore the fact that we deduced it in another way. However, the last four lines are new; they arise from the recursive call.

    As before, we see that iteration 2 of evaluating getANeighbor depended on a recursive predicate for which we have deduced new values, and so we need to go around again. In the third iteration, we deduce:

    • (a) the first four lines of iteration 2 (and iteration 1), just like before, from the non-recursive parts of the predicate definition;
    • (b) the last four lines of iteration 2, in the same way as for iteration 2 itself; and
    • (c) all 8 lines from iteration 2, just by applying the recursive case to the results of iteration 2. However, we didn't deduce any new values, and so we know that at this point we can stop evaluation of getANeighbor. We say that it has "converged" (meaning "stopped changing"), and for any other references to it, we are going to use the final set of values which we computed.

    This is why in your query, France is reported as a neighbor of Germany.

    [I've omitted some details of how to make such a recursive evaluation efficient -- in practice there are some ways to avoid re-deducing the same values again and again on each recursive iteration --, because they are not necessary to understand how recursion works in the first place.]

    Thanks again to Pavel Avgustinov from Semmle for this detailed response!