I'm doing an inter-procedrual analysis project in Java at the moment and I'm looking into using an IFDS solver to compute the control flow graph of a program. I'm finding it hard to follow the maths involved in the description of the IFDS framework and graph reachability. I've read in several places that its not possible to compute the points-to sets of a program using this solver as "pointer analysis is known to be a non-distributive problem." [1] Other sources have said that this is often specifically with regard to 'strong updates', which from what I can gather are field write statements.
I think I can basically follow how the solver computes edges and works out the dataflow facts. But I don't quite follow what this: f(A ∪ B) = f(A) ∪ f(B) means in practical terms as a definition of a distributive function, and therefore what it means to say that points-to analysis deals with non-distributive functions.
The linked source [1] gives an example specific to field write statements:
A a = new A();
A b = a;
A c = new C();
b.f = c;
It claims that in order to reason about the assignment to b.f one must also take into account all aliases of the base b. I can follow this. But what I don't understand is what are the properties of this action that make it non-distributive.
A similar (I think) example from [2]:
x = y.n
Where before the statement there are points-to edges y-->obj1 and obj1.n-->obj2 (where obj1 and 2 are heap objects). They claim
it is not possible to correctly deduce that the edge x-->obj2 should be generated after the statement if we consider each input edge independently. The flow function for this statement is a function of the points-to graph as a whole and cannot be decomposed into independent functions of each edge and then merged to get a correct result.
I think I almost understand what, at least the first, example is saying but that I am not getting the concept of distributive functions which is blocking me getting the full picture. Can anyone explain what a distributive or non-distributive function is on a practical basis with regards to pointer analysis, without using set theory which I am having difficulty following?
[1] http://karimali.ca/resources/pubs/conf/ecoop/SpaethNAB16.pdf
[2] http://dl.acm.org/citation.cfm?doid=2487568.2487569 (paywall, sorry)
The distributiveness of a flow function is defined as: f(a Π b) = f(a) Π f(b), with Π being the merge function. In IFDS, Π is defined as the set union ∪. What this means is that it doesn't matter whether or not you apply the merge function before or after the flow function, you will get the same result in the end.
In a traditional data-flow analysis, you go through the statements of your CFG and propagate sets of data-flow facts. So with a flow function f, for each statement, you compute f(in, stmt) = out, with in and out the sets of information you want to keep (e.g.: for an in-set {(a, allocA), (b, allocA)} -denoting that the allocation site of objects a and b is allocA, and the statement "b.f = new X();" -which we will name allocX, you would likely get the out-set {(a, allocA), (b, allocA), (a.f, allocX), (b.f, allocX)} because a and b are aliased).
IFDS explodes the in-set into its individual data-flow facts. So for each fact, instead of running your flow-function once with your entire in-set, you run it on each element of the in-set: ∀ d ∈ in, f(d, stmt) = out_d. The framework then merges all out_d together into the final out-set. The issue here is that for each flow function, you don't have access to the entire in-set, meaning that for the example we presented above, running the flow-function f((a, allocA)) on the statement would yield a first out-set {(a, allocA)}, f((b, allocA)) would yield a second out-set {(b, allocA)}, and f(0) would yield a third out-set {(0), (b.f, allocX)}. So the global out-set after you merge the results would be {(a, allocA), (b, allocA), (b.f, allocX)}. We are missing the fact {(a.f, allocX)} because when running the flow function f(0), we only know that the in-fact is 0 and that the statement is "b.f = new X();". Because we don't know that a and b refer to the allocation site allocA, we don't know that they are aliased, and we therefore cannot know that a.f should also point to allocX after the statement.
IFDS runs on the assumption of distributiveness: merging the out-sets after running the flow-function should yield the same results as merging the in-sets before running the flow-function. In other words, if you need to combine information from multiple elements on the in-set to create a certain data-flow fact in your out-set, then you are not distributive, and should not express your problem in IFDS (unless you do something to handle those combination cases, like the authors of the paper you refer to as [1] did).