c++ clang llvm

Parsing headers with Clang


I'm creating a programming language and, like Zig, I want to allow importing / including C headers in it. I ask Clang to parse the header into an ASTUnit using this function:

clang::ASTUnit *ClangLoadFromCommandLine(
        const char **args_begin,
        const char **args_end,
        struct ErrorMsg **errors_ptr,
        unsigned long *errors_len,
        const char *resources_path,
        clang::IntrusiveRefCntPtr<clang::DiagnosticsEngine> diags
) {

    std::shared_ptr<clang::PCHContainerOperations> pch_container_ops = std::make_shared<clang::PCHContainerOperations>();

    bool only_local_decls = true;
    bool user_files_are_volatile = true;
    bool allow_pch_with_compiler_errors = false;
    bool single_file_parse = false;
    bool for_serialization = false;
    bool retain_excluded_conditional_blocks = false;
    bool store_preambles_in_memory = false;
    llvm::StringRef preamble_storage_path = llvm::StringRef();
    clang::ArrayRef<clang::ASTUnit::RemappedFile> remapped_files = std::nullopt;
    std::unique_ptr<clang::ASTUnit> err_unit;
    llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem> VFS = nullptr;
    std::optional<llvm::StringRef> ModuleFormat = std::nullopt;
    std::unique_ptr<clang::ASTUnit> ast_unit_unique_ptr = clang::ASTUnit::LoadFromCommandLine(
            args_begin, args_end,
            pch_container_ops,
            diags,
            resources_path,
            store_preambles_in_memory,
            preamble_storage_path,
            only_local_decls,
            clang::CaptureDiagsKind::All,
            remapped_files,
            true, // remapped files keep original name
            0, // precompiled preamble after n parses
            clang::TU_Complete,
            false, // cache code completion results
            false, // include brief comments in code completion
            allow_pch_with_compiler_errors,
            clang::SkipFunctionBodiesScope::None,
            single_file_parse,
            user_files_are_volatile,
            for_serialization,
            retain_excluded_conditional_blocks,
            ModuleFormat,
            &err_unit,
            VFS);
    clang::ASTUnit *ast_unit = ast_unit_unique_ptr.release();

    *errors_len = 0;

    // Early failures in LoadFromCommandLine may return with ErrUnit unset.
    if (!ast_unit && !err_unit) {
        return nullptr;
    }

    if (diags->hasErrorOccurred()) {
        // Take ownership of the err_unit ASTUnit object so that it won't be
        // free'd when we return, invalidating the error message pointers
        clang::ASTUnit *unit = ast_unit ? ast_unit : err_unit.release();
        take_clang_diagnostics(errors_ptr, errors_len, unit->stored_diag_begin(), unit->stored_diag_end());
        return nullptr;
    }

    return ast_unit;
}

The problem appears when the user imports two headers that have a third header in common. For example, the user's code is:

import "a.h"
import "b.h"

However, a.h and b.h both include c.h. Because I call that function once per import, the ASTUnit for a.h and the ASTUnit for b.h each contain the declarations from c.h, so c.h ends up being translated twice.

To avoid this, I have tried the following:

PPCallbacks has an InclusionDirective callback. I tried implementing it, but it offers no way to skip the #include or to track it properly.

What I want is this. The user writes:

import "a.h"
import "b.h"

I ask Clang using a function like this:

ASTUnit* heyClangPleaseGiveAstUnitFor(clang_state, a.h)

ASTUnit* heyClangPleaseGiveAstUnitFor(clang_state, b.h)

The ASTUnit for a.h should contain the declarations from c.h, but the ASTUnit for b.h shouldn't contain them again, because c.h is protected by include guards or #pragma once.


Solution

  • What you're asking for is not possible.

    First, to clarify the question: you want to first ask Clang to parse a.h and create a translation unit (represented as an ASTUnit). Then you want Clang to parse b.h as some sort of continuation, where you get a distinct ASTUnit object that nevertheless excludes the declarations that came from c.h, since the ASTUnit for a.h already contains them.

    This is not possible for at least two reasons:

    1. A translation unit (TU) in C++ cannot be finished and then re-opened. At the end of TU processing, certain activities like template instantiation take place, and those must happen last. Consequently, Clang has no API to "add on" to an existing ASTUnit with more source code (whether or not you want a new ASTUnit object). This might be possible with a C-only parser, but that isn't Clang.

    2. The Clang TU representation requires (to a good first approximation) that everything a declaration refers to also be declared in that same TU. When b.h refers to something in c.h, there is no way for its TU to point at a declaration in a different TU such as the one for a.h.

    You suggest intercepting the act of b.h including c.h using PPCallbacks, but even if you could use that mechanism to prevent processing of c.h (which I don't think is possible with reasonable effort), you would only succeed in causing Clang to choke on any declarations in b.h that require declarations from c.h since the latter is unavailable, and (again) there is no way for Clang to "borrow" declarations from another TU.

    You might think you could leverage the AST serialization mechanism, but that doesn't help. If you, say, parse a.h and then serialize that before finalization, then when you add b.h to it, the TU will still contain everything from c.h. Serialization shifts the computation in time and space but doesn't change the fact that, in the end, each TU needs to be self-contained.

    What you can do instead is deduplicate in your own code after Clang has finished parsing: during AST traversal, recognize the entities from c.h by their source location and skip the ones you have already processed.

    For example, parse a.h and b.h as separate TUs (as you are already doing), then build a map from source location to translated entity. Then, when you see the declarations in c.h for a second time, you know to ignore them. However, be aware that the Clang SourceLocation object cannot be directly compared across TUs, so you have to turn it into a file/line/column representation (using the SourceManager API) for cross-TU comparison (including use as a map key).
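
    The bookkeeping side of that idea can be sketched without any Clang types. This is a minimal sketch, not a definitive implementation: LocationKey and TranslatedDecls are hypothetical names, and the sketch assumes the translator has already decoded each declaration's location into file, line, and column (for instance via SourceManager::getPresumedLoc on the declaration's location, which is one way to do it).

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <tuple>

// Hypothetical cross-TU key: a decoded location instead of a raw
// clang::SourceLocation, which is only meaningful inside its own TU.
struct LocationKey {
    std::string file;   // assumed to come from the decoded location
    unsigned line;
    unsigned column;

    // Lexicographic ordering so the key can be stored in a std::set.
    bool operator<(const LocationKey &other) const {
        return std::tie(file, line, column) <
               std::tie(other.file, other.line, other.column);
    }
};

// Hypothetical registry of declarations that were already translated.
class TranslatedDecls {
public:
    // Returns true the first time a location is seen, false on repeats.
    bool mark_if_new(const LocationKey &key) {
        return seen_.insert(key).second;
    }

    std::size_t size() const { return seen_.size(); }

private:
    std::set<LocationKey> seen_;
};
```

    While walking the top-level declarations of each ASTUnit, the translator would call mark_if_new and skip the declaration when it returns false. Declarations produced by the same macro expansion may decode to the same location, so a real key may also need the declaration's name.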

    Note: The Clang source location mechanism is heavily optimized and quite fast. The approach I'm suggesting should not run into major performance problems due to using locations if reasonable care is taken in the implementation. Furthermore, if you only care about the file granularity, it is even faster because decoding just the FileID is very fast.
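
    If file granularity is enough, the check can be reduced to "was this file already covered by an earlier TU?". The sketch below is one possible shape for that, under stated assumptions: TranslatedFiles is a hypothetical helper, the paths are assumed to be already canonical (so the same header always yields the same string), and obtaining the path from a declaration's FileID is left out. Files only count as covered once the TU that saw them is finished, so every declaration from c.h is still translated while processing a.h.

```cpp
#include <set>
#include <string>

// Hypothetical helper: remembers which files were translated by earlier
// translation units so that later units can skip their declarations.
class TranslatedFiles {
public:
    // True if declarations from `path` still need translating in the
    // current unit, i.e. no earlier unit has already covered that file.
    bool should_translate(const std::string &path) {
        if (done_.count(path) != 0) {
            return false;
        }
        current_.insert(path);
        return true;
    }

    // Call after finishing one unit: the files it saw are now covered.
    void finish_unit() {
        done_.insert(current_.begin(), current_.end());
        current_.clear();
    }

private:
    std::set<std::string> done_;     // files covered by finished units
    std::set<std::string> current_;  // files seen in the unit in progress
};
```

    With the question's example, the a.h unit would translate declarations from a.h and c.h and then call finish_unit; the b.h unit would then translate only the declarations located in b.h.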