Implicit null-checks for function calls

Implicit null checks are a technique for removing explicit null-pointer/reference checks from a high-level language's native code and instead relying on the processor raising an access violation, which is then caught (e.g. via SEH) and translated into a managed exception. It is mainly used where exception-handling overhead is secondary, for example when we know that null exceptions are rare.

In all examples I found, those checks are done for statements that access the pointer in question:

int m(Foo foo) {
  return foo.x;
}

Here, we could simply emit the asm-code:

mov rax,[rcx]

And have the native exception-handling mechanism deal with generating a NullReferenceException, instead of crashing.
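As a rough illustration of the fault-to-exception translation described above, here is a minimal C sketch. It is not the SEH mechanism from the question: it assumes a POSIX system and uses a SIGSEGV/SIGBUS handler with sigsetjmp/siglongjmp as a stand-in, and the names Foo, install_fault_handler, read_x and demo are hypothetical. Dereferencing a null pointer is undefined behaviour in standard C, so this only shows the idea on platforms where the access traps.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

typedef struct Foo { int x; } Foo;   /* hypothetical managed object */

static sigjmp_buf fault_env;         /* recovery point for a faulting load */

static void on_fault(int sig) {
    (void)sig;
    siglongjmp(fault_env, 1);        /* abandon the faulting instruction */
}

/* Installs the handler for SIGSEGV and SIGBUS; returns 0 on success. */
int install_fault_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0) return -1;
    return sigaction(SIGBUS, &sa, NULL);
}

/* Reads foo->x without testing foo first. Returns 0 and stores the value
   on success, or -1 if the load faulted (the "NullReferenceException"). */
int read_x(const Foo *foo, int *out) {
    if (sigsetjmp(fault_env, 1) != 0)
        return -1;                           /* reached only via on_fault */
    *out = ((const volatile Foo *)foo)->x;   /* the implicit null check */
    return 0;
}

/* Usage sketch: a valid object reads normally, a null one is reported. */
void demo(void) {
    Foo f = { 42 };
    int v = 0;
    int rc = install_fault_handler();
    assert(rc == 0);
    (void)rc;
    assert(read_x(&f, &v) == 0 && v == 42);
    assert(read_x(NULL, &v) == -1);
}
```

The sigsetjmp call here is itself a per-call cost and a branch, so this sketch only demonstrates the recovery path. A real runtime would more likely map the faulting instruction address to an exception handler through side tables, keeping the hot path to the single load.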

However, what about function-calls?

int m(Foo foo) {
  return foo.MemberFunction();
}

Is it possible to use implicit null-checks there too? I'm specifically interested in x64-asm. It seems more difficult there. Let's look at an example non-virtual function-call in asm (the code does not match the function 1:1; it contains one "mov" just to show that the object is loaded into the register used for a member-function call on Windows):

mov     rcx,[rsp+20h]           // load target-object from stack-local (Foo*)
call    Foo::MemberFunction     // call Foo::MemberFunction, can be represented with an address w/o fixups of the ptr

Here, we do not have any access to the memory that "rcx" points to. So if the language definition requires such a call to throw a NullReferenceException at the call-site, we would need to use explicit checks:

mov     rcx,[rsp+20h]           // load target-object from stack-local (Foo*)
test    rcx,rcx
je      .L0                     // exception-handler already moved out of hot-path
call    Foo::MemberFunction     // call Foo::MemberFunction, can be represented with an address w/o fixups of the ptr 

...
.L0:
call    throwNullReferenceException     // cold path: raise the managed exception

Or is there a more efficient way, replacing the test+je pair with an instruction that generates an access-violation? I was thinking I could do

mov     rcx,[rsp+20h]           // load target-object from stack-local (Foo*)
mov     rax,[rcx]               // mov into unused reg, to trigger access-violation
call    Foo::MemberFunction     // call Foo::MemberFunction, can be represented with an address w/o fixups of the ptr

This would use no branch and would not require an additional call to an exception-throwing routine. However, it would potentially need to read the memory of [rcx], which is not needed in the other method. How does that perform compared to the branch? If it is worse, is there any way that is better? See below for further explanation of the full use-case.

Background

I have a custom high-level language, which is compiled to bytecode, and then to native ASM. The language handles null dereferences gracefully by throwing NullReference-exceptions. Exceptions are still always errors that need addressing, and not something that occurs in normal operation. Thus, the code for dealing with exceptions can be inefficient. What's important is that the code runs as fast as possible in the common case of no exceptions (so no null-references). That's why implicit null-checks seem appealing: removing all the branches and the additional code needed to handle the exception for calls could be beneficial. Even the existing checks should already be fast, though. The branch should be well predicted as never taken, and I already arranged the code so that this case requires no jmp at all and executes linearly (which I've read is more optimal).

So given that, is my attempt to get rid of those checks in the case I mentioned foolish, or is there some way to achieve it optimally?

Solution

"However, it would potentially need to read the memory of [rcx]"

That's cheap unless it misses in cache. And the callee would pay that cost later anyway, unless it never touches its *this.

If it does, then the earlier load was basically a prefetch... unless the first "real" access was a write in which case we could have gone straight to MESI Exclusive / Modified state, without first getting a copy that's only in Shared state from a read-only access, if multiple threads access this object. If no other cores owned the cache line before, a simple load will normally get the cache line into Exclusive state, which can transition to Modified without another off-core transaction (a Read For Ownership = RFO).

If the callee will also do a read as its first access, there's no downside to reading here.
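The cache-coherence argument above can be made concrete with a deliberately simplified toy model in C. It only counts off-core transactions for a single cache line under the MESI transitions described in the answer; the names Line, do_load, do_store and offcore_cost are hypothetical, and real hardware has more states and corner cases than this sketch covers.

```c
#include <assert.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } LineState;

typedef struct {
    LineState state;     /* this core's copy of the line */
    int other_has_copy;  /* does another core currently hold the line? */
    int offcore;         /* off-core transactions issued so far */
} Line;

static void do_load(Line *l) {
    if (l->state == INVALID) {
        l->offcore++;    /* plain read request */
        l->state = l->other_has_copy ? SHARED : EXCLUSIVE;
    }
}

static void do_store(Line *l) {
    if (l->state == INVALID || l->state == SHARED) {
        l->offcore++;    /* RFO: gain ownership, invalidate other copies */
        l->other_has_copy = 0;
    }
    l->state = MODIFIED; /* Exclusive -> Modified needs no off-core traffic */
}

/* Off-core transactions when the callee's first real access is a write,
   with or without the null-check load in front of it. */
int offcore_cost(int other_has_copy, int probe_first) {
    Line l = { INVALID, other_has_copy, 0 };
    if (probe_first)
        do_load(&l);     /* the implicit null-check load */
    do_store(&l);        /* callee writes to *this */
    return l.offcore;
}

/* Usage sketch: the probe only costs extra when the line is shared. */
void check_costs(void) {
    assert(offcore_cost(1, 1) == 2);  /* Shared first, then an RFO */
    assert(offcore_cost(1, 0) == 1);  /* straight to Modified */
    assert(offcore_cost(0, 1) == 1);  /* Exclusive upgrades silently */
}
```

In this model the extra load costs an additional transaction only when another core holds the line and the callee writes first, which is the one downside case the paragraph above describes.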

With a large object, if the member function would only touch members that live in later cache lines, touching the first cache line would pollute the cache.

Letting out-of-order exec handle a load is great, probably cheaper than a test/branch assuming it never actually faults. A test/branch would be subject to branch mispredicts. Like for every instruction, the pipeline strongly assumes loads won't fault, only actually doing anything if a faulting instruction reaches retirement (becomes non-speculative).

But branches are always predicted one way or the other, take up branch-prediction resources, and can alias other branches and so get mispredicted even if they're always strongly not-taken.

What exactly happens when a skylake CPU mispredicts a branch?
Can a speculatively executed CPU branch contain opcodes that access RAM? (my answer there mostly talks about store buffers being needed to make speculative exec of stores possible, since unlike loads they would otherwise modify cache.)
Out-of-order execution vs. speculative execution - all instructions are treated as speculative until they retire.

Modern x86 CPUs have very good load port throughput, like 2 or 3 per clock cycle (with L1d cache hits for naturally-aligned loads), and fairly generous numbers of load buffers to track outstanding loads. For example Haswell from over 10 years ago (https://www.realworldtech.com/haswell-cpu/5/) has 72 load buffer entries for a ROB (ReOrder Buffer) of 192 entries.

A load into a 32-bit register with mov eax, [rcx] (2 bytes of machine code), or movzx eax, byte ptr [rcx] (3 bytes) for structs smaller than 4 bytes, is probably your best bet, even cheaper than @user555045's suggestion of test with an immediate 0. Store-forwarding from 8-byte stores to 4-byte loads of the first 8 bytes is efficient on x86 CPUs even many years old, and 32-bit operand size avoids a REX prefix.

test [rcx], cl would save code-size compared to test with an immediate and still not have a false dependency, but mov also avoids an ALU uop for the back-end execution units. The test should be only 1 micro-op (uop) for the front-end and issue/rename stages on any CPUs that do any kind of micro-fusion. (Or AMD simply decoding it as 1 uop in the first place).

Both major x86-64 calling conventions have at least one purely call-clobbered reg it's always safe to load into before a call: EAX for Win x64, R11D for AMD64 SysV. (In SysV, variadic functions use AL to pass the number of XMM args, although you could simply set it to a constant after the mov into EAX, unless this is a shim/trampoline that passes on variadic args to another function.)

Writing a GPR and/or FLAGS is equivalent in terms of register-file limits on how far out-of-order exec can see ahead: a physical register file entry has room for a 64-bit integer plus a FLAGS result, so instructions like add rax, rcx can be treated as writing only one result by the out-of-order exec machinery.

test cl, [rcx] or test eax, [rcx] are only slightly worse than mov eax, [rcx], so don't worry too much about using them if for some reason you can't easily pick a register to write.
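To make the code-size comparisons above easy to check, here is a small C table of the machine-code bytes for each candidate, hand-encoded from the x86-64 instruction formats. The Encoding type and the encoded_size and check_sizes helpers are hypothetical names for this sketch, and the branch displacements are zero placeholders, since the real values depend on where the cold handler ends up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *text;        /* assembly, Intel syntax */
    unsigned char bytes[9];  /* machine code; unused tail is zero */
    size_t len;              /* number of encoded bytes */
} Encoding;

static const Encoding encodings[] = {
    { "mov eax, [rcx]",            { 0x8B, 0x01 },       2 },
    { "mov rax, [rcx]",            { 0x48, 0x8B, 0x01 }, 3 },  /* REX.W prefix */
    { "movzx eax, byte ptr [rcx]", { 0x0F, 0xB6, 0x01 }, 3 },
    { "test [rcx], cl",            { 0x84, 0x09 },       2 },
    { "test [rcx], eax",           { 0x85, 0x01 },       2 },
    { "test byte ptr [rcx], 0",    { 0xF6, 0x01, 0x00 }, 3 },
    /* explicit check with a short branch (rel8 placeholder) */
    { "test rcx, rcx ; je rel8",   { 0x48, 0x85, 0xC9, 0x74, 0x00 }, 5 },
    /* explicit check with a near branch (rel32 placeholder) */
    { "test rcx, rcx ; je rel32",
      { 0x48, 0x85, 0xC9, 0x0F, 0x84, 0x00, 0x00, 0x00, 0x00 }, 9 },
};

/* Returns the encoded size in bytes, or 0 if the text is not in the table. */
size_t encoded_size(const char *text) {
    for (size_t i = 0; i < sizeof encodings / sizeof encodings[0]; i++)
        if (strcmp(encodings[i].text, text) == 0)
            return encodings[i].len;
    return 0;
}

/* Usage sketch: the 32-bit load is no larger than any alternative. */
void check_sizes(void) {
    assert(encoded_size("mov eax, [rcx]") == 2);
    assert(encoded_size("mov rax, [rcx]") == 3);
    assert(encoded_size("test rcx, rcx ; je rel8") == 5);
}
```

Size is only one factor here. The uop-count and dependency arguments above still decide between the 2-byte forms.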