assembly arm64 cpu-cache memory-barriers self-modifying

Synchronizing caches for JIT/self-modifying code on ARM


The general, more abstract procedure for writing and later executing JIT or self-modifying code is, to my understanding, something like the following.

From what I can tell from this post about self-modifying code on x86, manual cache management is apparently not necessary. I imagined that a clflushopt would be necessary, but x86¹ apparently automatically handles cache invalidation upon loading from a location with new instructions, such that instruction fetches are never stale. My question is not about x86, but I wanted to include this for comparison.

The situation in AArch64 is a little more complicated, as it distinguishes between shareability domains and how "visible" a cache operation should be. From just the official documentation for ARMv8/ARMv9, I first came up with this guess.

But the documentation for DMB/DSB/ISB says that "instructions following the ISB are fetched from cache or memory". That gives me the impression that cache control operations are indeed necessary. My new guess is thus this.

But I couldn't help but feel that even this is not quite right. A little while later, I found something in the documentation that I had missed, and something much the same in a paper. Both of them give an example that looks like this.

dc cvau, Xn ; Clean cache to PoU, so the newly written code will be visible
dsb ish     ; Wait for cleaning to finish
ic ivau, Xn ; Invalidate cache to PoU, so the newly written code will be fetched
dsb ish     ; Wait for invalidation to finish
isb sy      ; Make sure new instructions are fetched from cache or memory

For a big block of code, this would probably be a loop of cleaning, dsb ish, a loop of invalidation, dsb ish, then an isb sy. Please correct me if this is incorrect. In any case, this example makes sense, and I guess the only thing I missed was that dsb ish alone does not synchronize the I-cache and D-cache, and that the new data must be manually cleaned and invalidated. My actual questions for this post are thus as follows.
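
To make that concrete, here is a minimal C sketch of that loop structure (my own illustration, not taken from the documentation). It assumes a 64-byte cache line purely for simplicity; real code should derive the D- and I-cache line sizes from CTR_EL0, or just call a compiler builtin such as __builtin___clear_cache, which expands to essentially this sequence.

#include <stddef.h>
#include <stdint.h>

/* Sketch: clean a range of newly written code to the PoU, then invalidate
   the corresponding I-cache lines, with one dsb ish after each loop and a
   final isb sy. The 64-byte line size is for illustration only. */
static void sync_new_code(void *start, size_t len)
{
    const uintptr_t line = 64;
    uintptr_t begin = (uintptr_t)start & ~(line - 1);
    uintptr_t end = (uintptr_t)start + len;
    uintptr_t p;

    for (p = begin; p < end; p += line)                 /* loop of cleaning */
        __asm__ volatile("dc cvau, %0" : : "r"(p) : "memory");
    __asm__ volatile("dsb ish" : : : "memory");         /* wait for the cleans */

    for (p = begin; p < end; p += line)                 /* loop of invalidation */
        __asm__ volatile("ic ivau, %0" : : "r"(p) : "memory");
    __asm__ volatile("dsb ish" : : : "memory");         /* wait for the invalidates */

    __asm__ volatile("isb sy" : : : "memory");          /* resynchronize instruction fetch */
}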


⁰ Only to the extent that all the cores that are supposed to see it will see it.
¹ At least, all the reasonably modern ones should.


Solution

  • (Disclaimer: this answer is based on reading specs and some tests, but not on previous experience.)

    First of all, there is an explanation and example code for this exact case (one core writes code for another core to execute) in B2.2.5 of the Architecture Reference Manual (version G.b). The only difference from the examples you've shown is that the final isb needs to be executed in the thread that will execute the new code (which I guess is your "consumer"), after the cache invalidation has finished.


    I found it helpful to try to understand abstract constructs from the architecture reference, like "inner shareable domain" and "point of unification", in more concrete terms.

    Let's think about a system with several cores. Their L1d caches are coherent, but their L1i caches need not be unified with L1d, nor coherent with each other. However, the L2 cache is unified.

    The system does not have any way for L1d and L1i to talk to each other directly; the only path between them is through L2. So once we have written our new code to L1d, we have to write it back to L2 (dc cvau), then invalidate L1i (ic ivau) so that it repopulates from the new code in L2.

    In this setting, PoU is the L2 cache, and that's exactly where we want to clean / invalidate to.

    There's some explanation of these terms on page D4-2646. In particular:

    The PoU for an Inner Shareable shareability domain is the point by which the instruction and data caches and the translation table walks of all the PEs in that Inner Shareable shareability domain are guaranteed to see the same copy of a memory location.

    Here, the Inner Shareable domain is going to contain all the cores that could run the threads of our program; indeed, it is supposed to contain all the cores running the same kernel as us (page B2-166). And because the memory we are dc cvauing is presumably marked with the Inner Shareable attribute or better, as any reasonable OS should do for us, it cleans to the PoU of the domain, not merely the PoU of our core (PE). So that's just what we want: a cache level that all instruction cache fills from all cores would see.

    The Point of Coherency is further down; it is the level that everything on the system sees, including DMA hardware and such. Most likely this is main memory, below all the caches. We don't need to get down to that level; it would just slow everything down for no benefit.

    Hopefully that helps with your question 1.


    Note that the cache clean and invalidate instructions run "in the background" as it were, so that you can execute a long string of them (like a loop over all affected cache lines) without waiting for them to complete one by one. dsb ish is used once at the end to wait for them all to finish.

    Some commentary about dsb, towards your questions #2 and #3. Its main purpose is as a barrier; it makes sure that all the pending data accesses within our core (in store buffers, etc) get flushed out to L1d cache, so that all other cores can see them. This is the kind of barrier you need for general inter-thread memory ordering. (Or for most purposes, the weaker dmb suffices; it enforces ordering but doesn't actually wait for everything to be flushed.) But it doesn't do anything else to the caches themselves, nor say anything about what should happen to that data beyond L1d. So by itself, it would not be anywhere near strong enough for what we need here.
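
    To illustrate that ordinary case with a sketch of my own (not from the manual): plain data handed between threads only needs ordering, which C11 acquire/release already gives you; no dc/ic maintenance is involved. The names below are made up.

    #include <stdatomic.h>

    int payload;          /* ordinary data produced by one thread */
    atomic_int ready;     /* flag the consumer spins on */

    void produce(void)
    {
        payload = 42;
        /* release store: orders the payload write before the flag becomes
           visible to other cores; on AArch64 this compiles to an stlr (or a
           dmb-based sequence), with no cache maintenance involved */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consume(void)
    {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;             /* spin until the producer publishes */
        return payload;   /* guaranteed to observe 42 */
    }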

    As far as I can tell, the "wait for cache maintenance to complete" effect is a sort of bonus feature of dsb ish. It seems orthogonal to the instruction's main purpose, and I'm not sure why they didn't provide a separate wcm instruction instead. But anyway, it is only dsb ish that has this bonus functionality; dsb ishst does not. D4-2658: "In all cases, where the text in this section refers to a DMB or a DSB, this means a DMB or DSB whose required access type is both loads and stores".

    I ran some tests of this on a Cortex-A72. Omitting either the dc cvau or the ic ivau usually results in the stale code being executed, even if a dsb ish is done in its place. On the other hand, with dc cvau ; ic ivau but no dsb ish, I didn't observe any failures; but that could be luck or a quirk of this particular implementation.


    To your #4, the sequence we've been discussing (dc cvau ; dsb ish ; ic ivau ; dsb ish ; isb) is intended for the case when you will run the code on the same core that wrote it. But it actually shouldn't matter which thread does the dc cvau ; dsb ish ; ic ivau ; dsb ish sequence, since the cache maintenance instructions cause all the cores to clean / invalidate as instructed, not just this one. See table D4-6. (But if the dc cvau is in a different thread than the writer, maybe the writer has to have completed a dsb ish beforehand, so that the written data really is in L1d and not still in the writer's store buffer? I'm not sure about that.)

    The part that does matter is the isb. After ic ivau is complete, the L1i caches are cleared of stale code, and further instruction fetches by any core will see the new code. However, the runner core might previously have fetched the old code from L1i, and still be holding it internally (decoded and in the pipeline, uop cache, speculative execution, etc). isb flushes these CPU-internal mechanisms, ensuring that all further instructions to be executed have actually been fetched from the L1i cache after it was invalidated.

    Thus, the isb needs to be executed in the thread that is going to run the newly written code. Moreover, you need to make sure that it is done after all the cache maintenance has fully completed, perhaps by having the writer thread notify it via a condition variable or the like.
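
    A sketch of that handoff (my own illustration; emit_code and clean_and_invalidate are hypothetical helpers standing in for the stores of the new instructions and the dc cvau / dsb ish / ic ivau / dsb ish loops). The point is that the isb runs in the runner thread, only after it has observed the writer's notification.

    #include <stdatomic.h>
    #include <stddef.h>

    void emit_code(void *buf, size_t len);            /* hypothetical: stores the new instructions */
    void clean_and_invalidate(void *buf, size_t len); /* hypothetical: dc/dsb/ic/dsb loops as above */

    static atomic_int code_ready;

    void writer(void *buf, size_t len)
    {
        emit_code(buf, len);
        clean_and_invalidate(buf, len);
        atomic_store_explicit(&code_ready, 1, memory_order_release);
    }

    void runner(void *buf)
    {
        while (!atomic_load_explicit(&code_ready, memory_order_acquire))
            ;                                         /* wait for the writer's notification */
        __asm__ volatile("isb sy" ::: "memory");      /* discard anything fetched before the invalidate */
        ((void (*)(void))buf)();                      /* jump to the freshly written code */
    }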

    I tested this too. If all the cache maintenance instructions, plus an isb, are done by the writer, but the runner doesn't isb, then once again it can execute the stale code. I was only able to reproduce this in a test where the writer patches an instruction in a loop that the runner is executing concurrently, which probably ensures that the runner had already fetched it. This is legal provided that the old and new instruction are, say, a branch and a nop respectively (see B2.2.5), which is what I did. (But it is not guaranteed to work for arbitrary old and new instructions.)

    I tried some other tests, trying to arrange things so that the instruction wasn't actually executed until it was patched, yet was the target of a branch that should have been predicted taken, in hopes that this would get it prefetched; but I couldn't get the stale version to execute in that case.


    One thing I wasn't quite sure about is this. A typical modern OS may well have W^X, where no virtual page can be simultaneously writable and executable. If, after writing the code, you call the equivalent of mprotect to make the page executable, then most likely the OS is going to take care of all the cache maintenance and synchronization for you (but I guess it doesn't hurt to do it yourself too).
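
    For the W^X route, here is a sketch under Linux-style APIs (mmap and mprotect are real; the instruction encodings and the 4096-byte page size are just for illustration, and error checking is omitted). Per the "doesn't hurt to do it yourself" remark, I do the maintenance explicitly via __builtin___clear_cache before flipping the protection.

    #include <string.h>
    #include <sys/mman.h>

    typedef long (*jit_fn)(void);

    static jit_fn make_ret42(void)
    {
        /* mov x0, #42 ; ret */
        static const unsigned int insns[] = { 0xd2800540, 0xd65f03c0 };

        unsigned char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        memcpy(buf, insns, sizeof insns);

        /* expands to the dc cvau / dsb ish / ic ivau / dsb ish / isb sequence */
        __builtin___clear_cache((char *)buf, (char *)buf + sizeof insns);

        mprotect(buf, 4096, PROT_READ | PROT_EXEC);   /* flip the page to R+X */
        return (jit_fn)buf;
    }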

    But another way to do it would be with an alias: you map the memory writable at one virtual address, and executable at another. The writer writes at the former address, and the runner jumps to the latter. In that case, I think you would simply dc cvau the writable address, and ic ivau the executable one, but I couldn't find confirmation of that. But I tested it, and it worked no matter which alias was passed to which cache maintenance instruction, while it failed if either instruction was omitted altogether. So it appears that the cache maintenance is done by physical address underneath.
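
    And a sketch of the alias approach, assuming Linux's memfd_create for the shared backing (same illustrative mov x0, #42 ; ret encodings, error checking omitted). The writer stores and cleans through wbuf, and the runner jumps through xbuf; per the test above, either alias can be handed to the cache maintenance.

    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = memfd_create("jit", 0);              /* anonymous file as backing store */
        ftruncate(fd, 4096);

        unsigned char *wbuf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);      /* writable alias */
        unsigned char *xbuf = mmap(NULL, 4096, PROT_READ | PROT_EXEC,
                                   MAP_SHARED, fd, 0);      /* executable alias */

        static const unsigned int insns[] = { 0xd2800540, 0xd65f03c0 }; /* mov x0, #42 ; ret */
        memcpy(wbuf, insns, sizeof insns);

        /* clean/invalidate through the writable alias; the maintenance appears
           to operate by physical address, so the executable alias sees it too */
        __builtin___clear_cache((char *)wbuf, (char *)wbuf + sizeof insns);

        return ((int (*)(void))xbuf)();               /* returns 42 through the executable alias */
    }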