linux-kernel static-analysis llvm-ir call-graph

Static analysis of the Linux kernel: on source code or on LLVM IR?


In https://www.usenix.org/system/files/sec21-tan.pdf the authors run their static analysis on the LLVM IR of the Linux kernel (one pass for call graph construction, another for data-flow and alias analysis, and so on). I have seen several other papers that also analyze LLVM IR rather than the source code. My question is: why do they run their static analysis on LLVM IR? Why don't they analyze the source code of the Linux kernel instead? For example, they could construct the call graph by analyzing the source code, but they construct it by analyzing the LLVM IR.


Solution

  • Analyzing the LLVM IR simplifies analysis of the semantics of the program, while analyzing the source code is needed to see what the program does in terms of the programming language. What I mean is that the C expression *x is definitely "performing an indirection", but it may or may not load from or store to memory. For instance, the larger expression &*x does neither, even though it contains *x. This sort of ambiguity doesn't exist in LLVM IR: every memory access is either a load or a store instruction, or it happens inside a called function reached through a call instruction. The flip side is that some questions can only be answered at the source level. Whether *x on a null x is undefined behaviour is decided by the source language's rules (C, for instance, specifies that &*x is evaluated as if both operators were omitted, while C++ has no such carve-out), and because the IR for &*x contains no memory access at all, you can't check that kind of rule by looking only at the LLVM IR.

    LLVM also has a number of analyses built in; for instance, it can already build a call graph. Sometimes the call graph isn't obvious from the source code, and you need to run some optimizations to see what the callee is (or to remove dead code, eliminating the function calls inside it). LLVM performs those optimizations quite well too.
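
To make both points concrete, here is a small C file you could compile to LLVM IR yourself. It is only an illustrative sketch: every function name in it is made up for this example and has nothing to do with the kernel or the paper.

```c
#include <assert.h>

/* Indirection in the source, but no memory access through p:
   &*p just yields p again. */
int *same_pointer(int *p) { return &*p; }

/* *p used as a value: a load through p. */
int read_value(const int *p) { return *p; }

/* *p used as an assignment target: a store through p. */
void write_value(int *p, int v) { *p = v; }

int add_one(int v) { return v + 1; }

/* In the source this is a call through a function pointer. */
int apply(int (*fn)(int), int v) { return fn(v); }

/* The pointer passed to apply is a known function, so an optimizer
   that inlines apply can see that the callee is add_one. */
int call_add_one(int v) { return apply(add_one, v); }

/* Hypothetical driver: exercises the helpers and returns 42. */
int run_examples(void) {
    int value = 0;
    assert(same_pointer(&value) == &value);
    write_value(&value, 41);
    assert(read_value(&value) == 41);
    return call_add_one(read_value(&value));
}
```

Assuming the file is saved as example.c, `clang -O2 -S -emit-llvm example.c -o -` prints the optimized IR and `clang -O0 -S -emit-llvm example.c -o -` prints the unoptimized one. The exact output depends on the clang version, but in the optimized IR you should expect same_pointer to contain no load or store through its argument, read_value and write_value to contain one load and one store respectively, and call_add_one to no longer go through a function pointer. At -O0 the picture is noisier, because clang also spills the parameters themselves to stack slots.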