I'm creating a programming language and, like Zig, I want to allow importing C headers into it. I ask Clang to parse each header into an ASTUnit using this function:
clang::ASTUnit *ClangLoadFromCommandLine(
    const char **args_begin,
    const char **args_end,
    struct ErrorMsg **errors_ptr,
    unsigned long *errors_len,
    const char *resources_path,
    clang::IntrusiveRefCntPtr<clang::DiagnosticsEngine> diags
) {
    std::shared_ptr<clang::PCHContainerOperations> pch_container_ops = std::make_shared<clang::PCHContainerOperations>();
    bool only_local_decls = true;
    bool user_files_are_volatile = true;
    bool allow_pch_with_compiler_errors = false;
    bool single_file_parse = false;
    bool for_serialization = false;
    bool retain_excluded_conditional_blocks = false;
    bool store_preambles_in_memory = false;
    llvm::StringRef preamble_storage_path = llvm::StringRef();
    clang::ArrayRef<clang::ASTUnit::RemappedFile> remapped_files = std::nullopt;
    std::unique_ptr<clang::ASTUnit> err_unit;
    llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem> VFS = nullptr;
    std::optional<llvm::StringRef> ModuleFormat = std::nullopt;
    std::unique_ptr<clang::ASTUnit> ast_unit_unique_ptr = clang::ASTUnit::LoadFromCommandLine(
        args_begin, args_end,
        pch_container_ops,
        diags,
        resources_path,
        store_preambles_in_memory,
        preamble_storage_path,
        only_local_decls,
        clang::CaptureDiagsKind::All,
        remapped_files,
        true, // remapped files keep original name
        0, // precompiled preamble after n parses
        clang::TU_Complete,
        false, // cache code completion results
        false, // include brief comments in code completion
        allow_pch_with_compiler_errors,
        clang::SkipFunctionBodiesScope::None,
        single_file_parse,
        user_files_are_volatile,
        for_serialization,
        retain_excluded_conditional_blocks,
        ModuleFormat,
        &err_unit,
        VFS);
    clang::ASTUnit *ast_unit = ast_unit_unique_ptr.release();
    *errors_len = 0;
    // Early failures in LoadFromCommandLine may return with ErrUnit unset.
    if (!ast_unit && !err_unit) {
        return nullptr;
    }
    if (diags->hasErrorOccurred()) {
        // Take ownership of the err_unit ASTUnit object so that it won't be
        // freed when we return, invalidating the error message pointers
        clang::ASTUnit *unit = ast_unit ? ast_unit : err_unit.release();
        take_clang_diagnostics(errors_ptr, errors_len, unit->stored_diag_begin(), unit->stored_diag_end());
        return nullptr;
    }
    return ast_unit;
}
The problem arises when the user imports two headers that both include a third header. For example, the user's code is:
import "a.h"
import "b.h"
However, a.h and b.h both include c.h. Because I call that function once per import, the declarations from c.h end up in both the ASTUnit for a.h and the ASTUnit for b.h, so c.h gets translated twice.
To avoid this, I have tried the following:
There is also the InclusionDirective callback in PPCallbacks. I tried implementing it, but it offers no way to skip the #include or to track it properly.
What I want is this. The user writes:
import "a.h"
import "b.h"
and I ask Clang using a function like this:
ASTUnit* heyClangPleaseGiveAstUnitFor(clang_state, a.h)
ASTUnit* heyClangPleaseGiveAstUnitFor(clang_state, b.h)
The ASTUnit for a.h should contain the declarations from c.h, which a.h includes, but the ASTUnit for b.h should not contain them again, because the headers use include guards or #pragma once.
What you're asking for is not possible.
First, to clarify the question: you want to first ask Clang to parse a.h and create a translation unit (represented as an ASTUnit). Then you want Clang to parse b.h as some sort of continuation, where you get a distinct ASTUnit object, but it nevertheless excludes the definitions that came from c.h since the ASTUnit for a.h already contains them.
This is not possible for at least two reasons:
1. A translation unit (TU) in C++ cannot be finished and then re-opened. At the end of TU processing, certain activities like template instantiation take place, and those must happen last. Consequently, Clang has no API to "add on" to an existing ASTUnit with more source code (whether or not you want a new ASTUnit object). This might be possible with a C-only parser, but that isn't Clang.
2. The Clang TU representation requires (to a good first approximation) that everything a declaration refers to also be declared in that same TU. When b.h refers to something in c.h, there is no way for its TU to point at a declaration in a different TU such as the one for a.h.
You suggest intercepting the act of b.h including c.h using PPCallbacks, but even if you could use that mechanism to prevent processing of c.h (which I don't think is possible with reasonable effort), you would only succeed in causing Clang to choke on any declarations in b.h that require declarations from c.h, since the latter is unavailable, and (again) there is no way for Clang to "borrow" declarations from another TU.
You might think you could leverage the AST serialization mechanism, but that doesn't help. If you, say, parse a.h and then serialize that before finalization, then when you add b.h to it, the TU will still contain everything from c.h. Serialization shifts the computation in time and space but doesn't change the fact that, in the end, each TU needs to be self-contained.
What you can do instead is, after parsing using Clang, in your own code, exclude duplicate processing of the entities in c.h by recognizing them during AST traversal using their source location.
For example, parse a.h and b.h as separate TUs (as you are already doing), then build a map from source location to translated entity. Then, when you see the declarations in c.h for a second time, you know to ignore them. However, be aware that the Clang SourceLocation object cannot be directly compared across TUs, so you have to turn it into a file/line/column representation (using the SourceManager API) for cross-TU comparison (including use as a map key).
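The location-keyed map described above could be sketched roughly as follows. This is a minimal standalone illustration, not Clang code: DeclKey, TranslatedDecls, and translateOnce are hypothetical names, and the key fields are assumed to have already been extracted from each declaration's SourceLocation through that TU's SourceManager (for example via getPresumedLoc, which yields a file name, line, and column).

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>

// Hypothetical TU-independent key for a declaration site. In a real
// translator the fields would come from the SourceManager of the TU that
// owns the declaration; here they are plain values.
struct DeclKey {
    std::string file;   // assumed to be a canonical path
    unsigned line;
    unsigned column;

    // Lexicographic ordering so the key can be used in std::map.
    bool operator<(const DeclKey &other) const {
        return std::tie(file, line, column) <
               std::tie(other.file, other.line, other.column);
    }
};

// Maps each declaration site to the entity it was translated to.
// std::string stands in for whatever the translator actually produces.
using TranslatedDecls = std::map<DeclKey, std::string>;

// Records `translated` for `key` unless that site was already translated.
// Returns true if this call inserted the entry, false for a duplicate.
bool translateOnce(TranslatedDecls &seen, const DeclKey &key,
                   const std::string &translated) {
    assert(!key.file.empty() && "a usable key needs a file name");
    return seen.emplace(key, translated).second;
}
```

With something like this, declarations from c.h reached through a.h are translated, and the same declarations reached again through b.h are skipped because their key is already in the map. The file names should be canonicalized before being used as keys, otherwise the same header reached through two different relative paths would produce two different keys.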
Note: The Clang source location mechanism is heavily optimized and quite fast. The approach I'm suggesting should not run into major performance problems due to using locations if reasonable care is taken in the implementation. Furthermore, if you only care about the file granularity, it is even faster because decoding just the FileID is very fast.
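If file granularity is enough, the bookkeeping could look something like the sketch below. It is again standalone C++ rather than Clang code: FileDeduper and its methods are hypothetical names, and each declaration's file is assumed to have been resolved to a canonical path string, since a FileID, like a SourceLocation, is only meaningful within its own TU.

```cpp
#include <set>
#include <string>

// Hypothetical helper that skips whole header files already translated
// while processing an earlier TU.
class FileDeduper {
public:
    // Returns true if declarations from `file` should be translated in the
    // TU currently being traversed. A file first seen in this TU keeps
    // returning true until finishUnit() is called.
    bool shouldTranslate(const std::string &file) {
        if (finished_.count(file) != 0) {
            return false;  // fully handled by an earlier TU
        }
        current_.insert(file);
        return true;
    }

    // Call once the traversal of a TU is complete: every file seen in it is
    // marked as done for all later TUs.
    void finishUnit() {
        finished_.insert(current_.begin(), current_.end());
        current_.clear();
    }

private:
    std::set<std::string> finished_;  // files translated in earlier TUs
    std::set<std::string> current_;   // files seen in the current TU
};
```

The two sets keep every declaration of c.h translatable while a.h's TU is being traversed, and only suppress c.h once that TU is finished. This assumes each header contributes the same declarations every time it is included, which holds for headers protected by include guards or #pragma once but not for headers that are deliberately included several times with different macro settings.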