Data corruption issue with DMA operations on ARM Cortex-M7 (STM32F7) MCU

I'm using an ARM Cortex-M7 microcontroller (specifically the STM32F767ZG) to communicate with external devices using 4 USARTs (configured as asynchronous transmitters/receivers, and using DMA to handle transfers). While testing the (bare metal) code, I noticed an issue with data corruption, possibly relating to the way ARM and/or the compiler deals with variables in cache and RAM. See the following test code:

volatile char buffer[3];

// USART & DMA initialization code
// ...

buffer[0] = 0x11;   //
buffer[1] = 0x22;   // Buffer initial values
buffer[2] = 0x33;   //

// Some other code
// ...

buffer[0] = 0xAA;   //
buffer[1] = 0xBB;   // Buffer updated values
buffer[2] = 0xCC;   //

// DMA stream starts here
// ...

Executing the above code, the data that comes out of the USART is the following:

0x11   (OLD value of buffer[0])
0x22   (OLD value of buffer[1])
0xCC   (NEW value of buffer[2])

I suspect this is related to how ARM and/or the compiler deals with variables and their storage in cache and RAM. It seems that the contents of buffer[] take some time to reach the actual RAM and, as a result, DMA picks up the old values. Note that, for the first two bytes, the USART Tx register is immediately free (due to the USART's internal buffering), so the first two bytes (buffer[0] and buffer[1]) are read almost instantly by DMA. For the third byte, there is a 1-byte transmission delay (which, at 9600 bps, is just over 1 ms), so in this case the MCU has plenty of time to update the RAM, hence the new value of buffer[2] is read by DMA.

This can be eliminated by simply adding a very small delay of just 1 microsecond before starting the DMA stream, like this:

...

Delay_us(1);

// DMA stream starts here
// ...

In this case, the USART sends the following (expected) data:

0xAA   (NEW value of buffer[0])
0xBB   (NEW value of buffer[1])
0xCC   (NEW value of buffer[2])

In fact, the above delay can be fine-tuned (in the nanosecond range), so that only the first byte is old, and the next two bytes are new (i.e., USART sends 0x11, 0xBB, 0xCC).

My question is, how can I be absolutely sure that the actual RAM contents (to be read by DMA) reflect the buffer values I set in code? Adding a delay before initiating the DMA stream seems like a very crude and uncertain solution. Is there a definite way (a technique in C, or even an Assembly command) to flush the MCU cache and transfer its contents to RAM, so that there is no corruption in the buffer data in RAM?

Solution

I'm posting an answer to my own question, after my investigation and lab experiments with the real hardware. It turns out that the DMA data corruption issue is indeed related to the MCU's Level 1 (L1) cache. The RAM region allocated to buffer[] by the compiler is, in general, cacheable, meaning that any read/write accesses from/to buffer[] go through the L1 cache. However, as expected, DMA always accesses physical RAM directly, not the cache, and this can lead to data corruption because the code and/or the peripheral may actually be accessing old versions of the buffer data. This situation is called loss of coherency.

There are three standard approaches that can be employed to solve this:

1. Use a cache maintenance operation to clean the cache before initiating the DMA Tx transfer, forcing the MCU to write the cache contents to RAM (so that they can then be read by DMA). Similarly, invalidate the cache before accessing the incoming data from a DMA Rx transfer, forcing the MCU to read the new data from RAM instead of the old data in the cache.

2. Use the Memory Protection Unit (MPU) of the MCU to configure a specific region of RAM as non-cacheable. This ensures that all accesses to this region will always be done directly to RAM and will never go through the cache. I find the MPU approach to be the most elegant, as it guarantees coherency without the need to call CleanDCache() / InvalidateDCache() before/after every DMA transaction. Alternatively, you can use the MPU to assign a write-through caching policy to the RAM region of interest. In this way, all cache writes are also immediately written to RAM, thus ensuring coherency between buffer data and DMA read accesses (Tx). Note that write-through does not help in the Rx direction: a buffer written by DMA still has to be invalidated before the code reads it.

3. Allocate the DMA buffer inside a RAM region that is already non-cacheable. In the ARM Cortex-M7 MCU I'm using (STM32F767xx), there is a region of SRAM called DTCM-RAM, on the TCM interface (Tightly Coupled Memory interface), mapped at address range 0x20000000 ~ 0x2001FFFF (128 KB). This region is non-cacheable. Please note that there is no guarantee that DMA can access TCM memory in all MCUs, so this needs to be checked by reading the datasheet for the specific MCU of interest. In my case (STM32F767xx), DMA can access the 128 KB of DTCM-RAM, which is exactly what I need. This last method is the quickest solution, which can be applied by simply allocating buffer[] anywhere inside the non-cacheable region, as follows:

volatile char buffer[100] absolute 0x20000000;

Adding this single line in my code has completely solved all my DMA data corruption issues!
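
For completeness, here is a minimal sketch of the first approach (explicit cache maintenance), which I did not end up using myself. It assumes the CMSIS-Core functions SCB_CleanDCache_by_Addr() and SCB_InvalidateDCache_by_Addr() are available through the device header, and that the Cortex-M7 data cache line is 32 bytes. The names dcache_range, dma_tx_prepare and dma_rx_finish are made up for illustration. Since the by-address operations work on whole cache lines, the helper first widens the buffer range to line boundaries:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DCACHE_LINE_SIZE 32u   /* Cortex-M7 L1 data cache line, in bytes */

typedef struct {
    uintptr_t start;   /* address rounded down to a cache line boundary */
    size_t    size;    /* length rounded up to whole cache lines */
} cache_range_t;

/* Hypothetical helper: expand [addr, addr + len) to the cache lines covering it. */
static cache_range_t dcache_range(uintptr_t addr, size_t len)
{
    const uintptr_t mask = (uintptr_t)DCACHE_LINE_SIZE - 1u;
    assert(len > 0);
    uintptr_t start = addr & ~mask;
    uintptr_t end   = (addr + len + mask) & ~mask;
    cache_range_t r = { start, (size_t)(end - start) };
    return r;
}

/* Compiled only on target, when the CMSIS device header (e.g. stm32f7xx.h)
 * has been included before this code. */
#ifdef __DCACHE_PRESENT
/* Call after filling the buffer and before starting the DMA Tx stream. */
static void dma_tx_prepare(const volatile void *buf, size_t len)
{
    cache_range_t r = dcache_range((uintptr_t)buf, len);
    SCB_CleanDCache_by_Addr((uint32_t *)r.start, (int32_t)r.size);
}

/* Call after the DMA Rx stream completes and before reading the buffer. */
static void dma_rx_finish(volatile void *buf, size_t len)
{
    cache_range_t r = dcache_range((uintptr_t)buf, len);
    SCB_InvalidateDCache_by_Addr((uint32_t *)r.start, (int32_t)r.size);
}
#endif
```

In the test code from the question, this would mean calling dma_tx_prepare(buffer, sizeof buffer) right where Delay_us(1) was. One caveat: invalidating a cache line that an Rx buffer shares with other variables can discard pending writes to those variables, so Rx buffers should be aligned to, and padded to a multiple of, 32 bytes.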

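The second approach (MPU) can be sketched roughly as follows. This is not the code I tested, just an illustration under some assumptions: the bit positions are those of the ARMv7-M MPU_RASR register, the region is marked Normal non-cacheable (TEX=0b001, C=0, B=0) with full access and no instruction execution, and the function names are made up. The region size must be a power of two (at least 32 bytes) and the base address must be aligned to that size.

```c
#include <assert.h>
#include <stdint.h>

/* ARMv7-M MPU_RASR fields used here */
#define RASR_ENABLE    (1u << 0)
#define RASR_SIZE(n)   ((uint32_t)(n) << 1)    /* region size = 2^(n+1) bytes */
#define RASR_TEX(n)    ((uint32_t)(n) << 19)
#define RASR_AP_FULL   (3u << 24)              /* read/write, all privilege levels */
#define RASR_XN        (1u << 28)              /* no instruction fetches */

/* Hypothetical helper: RASR value for a non-cacheable data region. */
static uint32_t mpu_rasr_noncacheable(uint32_t size_bytes)
{
    /* Size must be a power of two and at least 32 bytes. */
    assert(size_bytes >= 32u && (size_bytes & (size_bytes - 1u)) == 0u);
    uint32_t order = 0;
    while ((1u << order) < size_bytes) {
        order++;
    }
    /* C and B bits are left at 0; with TEX=1 this selects non-cacheable. */
    return RASR_XN | RASR_AP_FULL | RASR_TEX(1) | RASR_SIZE(order - 1u) | RASR_ENABLE;
}

/* Compiled only on target, when the CMSIS device header (e.g. stm32f7xx.h)
 * has been included before this code. */
#ifdef __MPU_PRESENT
static void mpu_make_noncacheable(uint32_t region, uint32_t base, uint32_t size_bytes)
{
    assert((base & (size_bytes - 1u)) == 0u);   /* base aligned to region size */
    __DMB();
    MPU->CTRL = 0;                              /* disable MPU while reconfiguring */
    MPU->RNR  = region;
    MPU->RBAR = base;
    MPU->RASR = mpu_rasr_noncacheable(size_bytes);
    MPU->CTRL = MPU_CTRL_PRIVDEFENA_Msk | MPU_CTRL_ENABLE_Msk;
    __DSB();
    __ISB();
}
#endif
```

The DMA buffers then have to be placed inside that region (for example via a dedicated linker section), and the region should be set up before the data cache is enabled or before the buffers are first used.
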
For more detailed information, please see this post, and the links therein. Also, see the following resources by STMicroelectronics:

PM0253 — STM32F7 Series and STM32H7 Series Cortex®-M7 processor programming manual

AN4839 — Level 1 cache on STM32F7 Series and STM32H7 Series

AN4838 — Introduction to memory protection unit management on STM32 MCUs