java javac javacompiler

Internal Architecture of Java Compiler


I have been working with Java for more than 8 years.

Last week, in a small meeting at my company, one of my colleagues asked me how exactly the Java compiler works. I had no answer.

I tried explaining that the Java compiler takes statements one by one and converts them to bytecode that is targeted not at any OS but at the JVM.

No one was satisfied with that answer, not even me.

Now the main question is how exactly the Java compiler works, i.e. how many steps, stages, or phases does the compiler go through when compiling a Java file?

What exactly is the Java compiler's architecture?

What if there are multiple Java classes in the same .java file? How many classes will be compiled?

What if there are imports pointing to uncompiled Java classes? Will the uncompiled classes be compiled or ignored?

I googled for more than half a day, and every result gives the same answer I gave my colleagues.

But finally I found a useful tutorial here.

But the tutorial does not go into much depth either, and I could not visualize the process from it.

I am still not satisfied and am eager to learn more about this from you.

So if anyone knows more than me and the above blog, something that would help me visualize the internal architecture of the Java compiler, please explain it to me.


Solution

  • Some basic steps:

    1. parse: Reads a set of *.java source files and maps the resulting token sequence into AST (Abstract Syntax Tree) nodes.
    2. enter: Enters symbols for the definitions into the symbol table.
    3. process annotations: If requested, processes annotations found in the specified compilation units.
    4. attribute: Attributes the syntax trees. This step includes name resolution, type checking and constant folding.
    5. flow: Performs dataflow analysis on the trees from the previous step. This includes checks for assignments and reachability.
    6. desugar: Rewrites the AST and translates away some syntactic sugar.
    7. generate: Generates source files or class files.
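
    The steps above can be driven one group at a time through the JDK's own compiler API. The following is a minimal sketch, not javac's internal structure: it assumes a JDK (not a bare JRE) is running the code, and the names JavacPhasesDemo, StringSource and compile are made up for illustration. JavacTask exposes parse(), analyze() and generate(), which correspond only roughly to the steps listed (analyze covers enter, attribute and flow; generate covers desugar and class-file output).

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.util.JavacTask;

import javax.lang.model.element.Element;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class JavacPhasesDemo {

    /** Hypothetical helper: an in-memory source file, so no .java file is needed on disk. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String fileName, String code) {
            super(URI.create("string:///" + fileName + ".java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Runs parse, analyze and generate separately; returns the sorted names of the class files written. */
    static List<String> compile(String fileName, String code, Path outDir) throws IOException {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler available (a JDK is required)");
        }
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, null,
                List.of("-d", outDir.toString()),   // write class files into outDir
                null,
                List.of(new StringSource(fileName, code)));

        // Step 1 (parse): source text -> syntax trees.
        for (CompilationUnitTree unit : task.parse()) {
            System.out.println("parsed " + unit.getTypeDecls().size() + " type declaration(s)");
        }

        // Roughly steps 2, 4 and 5 (enter, attribute, flow).
        for (Element element : task.analyze()) {
            System.out.println("analyzed " + element);
        }

        // Roughly steps 6 and 7 (desugar, generate): one class file per class.
        List<String> classFiles = new ArrayList<>();
        for (JavaFileObject file : task.generate()) {
            classFiles.add(Path.of(file.toUri()).getFileName().toString());
        }
        Collections.sort(classFiles);
        return classFiles;
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("javac-demo");
        // One source file that declares two top-level classes.
        System.out.println(compile("Demo", "class A { int x = 1 + 2; } class B {}", out));
    }
}
```

    In this sketch a single source with two top-level classes yields two class files, A.class and B.class, one per declared class.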

    In more detail, the phases of a typical compiler:

    1. Lex - Break the source file into individual words, or tokens.
    2. Parse - Analyze the phrase structure of the program.
    3. Semantic Actions - Build a piece of abstract syntax tree corresponding to each phrase.
    4. Semantic Analysis - Determine what each phrase means, relate uses of variables to their definitions, check types of expressions, request translation of each phrase.
    5. Frame Layout - Place variables, function-parameters, etc. into activation records (stack frames) in a machine-dependent way.
    6. Translate - Produce intermediate representation trees (IR trees), a notation that is not tied to any particular source language or target-machine architecture.
    7. Canonicalize - Hoist side effects out of expressions, and clean up conditional branches, for the convenience of the next phases.
    8. Instruction Selection - Group the IR-tree nodes into clumps that correspond to the actions of target-machine instructions.
    9. Control Flow Analysis - Analyze the sequence of instructions into a control flow graph that shows all the possible flows of control the program might follow when it executes.
    10. Dataflow Analysis - Gather information about the flow of information through variables of the program; for example, liveness analysis calculates the places where each program variable holds a still-needed value (is live).
    11. Register Allocation - Choose a register to hold each of the variables and temporary values used by the program; variables not live at the same time can share the same register.
    12. Code Emission - Replace the temporary names in each machine instruction with machine registers.
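
    To make the first few phases concrete, here is a toy sketch for arithmetic expressions only. It is not how javac is implemented: the class TinyCompiler, its grammar, and the instruction names push, add and mul are all invented for illustration. It lexes a string into tokens, parses them into an AST with a recursive-descent parser, and emits instructions for a hypothetical stack machine. Because the target is a stack machine (as the JVM is), there is no register allocation step.

```java
import java.util.ArrayList;
import java.util.List;

public class TinyCompiler {

    /** Lex: break the source text into tokens (integers, '+', '*', parentheses). */
    static List<String> lex(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                tokens.add(src.substring(start, i));
            } else if ("+*()".indexOf(c) >= 0) {
                tokens.add(String.valueOf(c));
                i++;
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
        return tokens;
    }

    /** AST node; emit() appends stack-machine instructions for the subtree. */
    interface Node { void emit(List<String> out); }

    static final class Num implements Node {
        final int value;
        Num(int value) { this.value = value; }
        public void emit(List<String> out) { out.add("push " + value); }
    }

    static final class BinOp implements Node {
        final char op; final Node left; final Node right;
        BinOp(char op, Node left, Node right) { this.op = op; this.left = left; this.right = right; }
        public void emit(List<String> out) {
            left.emit(out);                       // operands first (post-order walk)
            right.emit(out);
            out.add(op == '+' ? "add" : "mul");   // then the operator
        }
    }

    /**
     * Parse + semantic actions: build AST nodes while recognizing the grammar
     *   expr := term ('+' term)* ; term := factor ('*' factor)* ; factor := NUMBER | '(' expr ')'
     */
    static final class Parser {
        final List<String> tokens;
        int pos = 0;
        Parser(List<String> tokens) { this.tokens = tokens; }

        String peek() { return pos < tokens.size() ? tokens.get(pos) : ""; }

        Node parseExpr() {
            Node node = parseTerm();
            while (peek().equals("+")) { pos++; node = new BinOp('+', node, parseTerm()); }
            return node;
        }

        Node parseTerm() {
            Node node = parseFactor();
            while (peek().equals("*")) { pos++; node = new BinOp('*', node, parseFactor()); }
            return node;
        }

        Node parseFactor() {
            String t = peek();
            if (t.equals("(")) {
                pos++;
                Node inner = parseExpr();
                if (!peek().equals(")")) throw new IllegalArgumentException("expected ')'");
                pos++;
                return inner;
            }
            if (!t.isEmpty() && Character.isDigit(t.charAt(0))) {
                pos++;
                return new Num(Integer.parseInt(t));
            }
            throw new IllegalArgumentException("unexpected token: '" + t + "'");
        }
    }

    /** Lex, parse, then emit code for the whole expression. */
    static List<String> compile(String src) {
        Parser parser = new Parser(lex(src));
        Node ast = parser.parseExpr();
        if (parser.pos != parser.tokens.size()) {
            throw new IllegalArgumentException("trailing input: '" + parser.peek() + "'");
        }
        List<String> code = new ArrayList<>();
        ast.emit(code);
        return code;
    }

    public static void main(String[] args) {
        System.out.println(compile("1 + 2 * 3"));   // [push 1, push 2, push 3, mul, add]
    }
}
```

    Operator precedence falls out of the grammar: term binds tighter than expr, so the multiplication is emitted before the addition.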

    There is a nice book:

    Modern Compiler Implementation in Java

    You may want to look inside javac code:

    Javac Documentation

    OpenJDK source code

    Hacker's guide to javac

    Don't Panic! To help newcomers to javac navigate their way around the code base

    JVM specification and JLS (Java Language Specification)