
Optimize a weighted moving average


Environment: STM32H7 and GCC
I am working with a stream of data: one sample is received over SPI every 250 µs.
I compute a "triangle" weighted moving average over 256 samples, like this, but the middle sample is weighted 1 and the weights form a triangle around it.
My samples are stored in a circular buffer, uint32_t val[256], which is indexed by a uint8_t write_idx.
The samples are 24 bits wide, so the maximum value of a sample is 0x00FFFFFF.

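#include <stdint.h>

uint8_t write_idx =0;
uint32_t val[256];
float coef[256];
uint32_t new_sample; // latest sample from SPI (declaration assumed, it is not shown in the original)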

void init(void)
{
  uint8_t counter=0;
  // I calculate my triangle coefs
  for(uint16_t c=0;c<256;c++) 
  {
    coef[c]=(c>127)?--counter:++counter;
    coef[c]/=128;
  }
}

void ACQ_Complete(void)
{
  uint32_t moy=0;
  // write_idx is meant to wrap
  val[write_idx++]= new_sample;
  // calc moving average (uint8_t)(c-write_idx) is meant to wrap
  for(uint16_t c=0;c<256;c++)
    moy += (uint32_t)(val[c]*coef[(uint8_t)(c-write_idx)]);
  moy/=128;
}

I have to finish the calculation within the 250 µs window, but measuring with a debug GPIO pin shows that the "moy" part takes 252 µs.
Code is simulated here
Interesting fact: if I remove the (uint32_t) cast near the end, it takes 274 µs instead of 252 µs.

How can I make it faster?

I was thinking of using uint32_t instead of float for coef (multiplying by 1000, for example), but my uint32_t would overflow.


Solution

  • This should unquestionably be done in integer arithmetic. It will be both faster and more accurate.

    This processor can do a 32x32+64=64 multiply-accumulate in a single cycle!

    Multiply all your coefficients by a power of 2 (not the 1000 mentioned in the comments), and then shift down at the end rather than divide.

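    uint32_t coef[256];   // integer coefficients, pre-scaled by a power of 2
    
    uint64_t moy = 0;     // 64-bit accumulator, so the sum cannot overflow
    
    for(unsigned int c = 0; c < 256; c++)
    {
       // widen to 64 bits before multiplying; the mask wraps the index to 0..255
       moy += (val[c] * (uint64_t)coef[(c - write_idx) & 0xFFu]);
    }
    
    moy >>= N;            // N = log2 of the sum of the scaled coefficients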
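
To make the scaling concrete, here is a minimal sketch of the whole filter in integer form, assuming the same triangle as in the question. If each float coefficient is multiplied by 128, the weights become the plain integers 1..128..0, and their sum is 16384 = 2^14, so a shift by 14 replaces both the per-coefficient division and the final division by 128. With 24-bit samples the accumulator stays below 2^38, which overflows uint32_t but fits easily in uint64_t. The names init_coefs, filter_sample, TAPS and COEF_SHIFT are made up for this illustration, and the timing of this version has not been measured.

```c
#include <assert.h>
#include <stdint.h>

#define TAPS       256u
#define COEF_SHIFT 14u  /* the integer triangle weights sum to 16384 = 2^14 */

static uint32_t val[TAPS];   /* circular buffer of samples (assumed <= 0x00FFFFFF) */
static uint32_t coef[TAPS];  /* integer triangle weights: 1..128, then 127..0 */
static uint8_t  write_idx;   /* wraps naturally at 256 */

/* Build the same triangle as the question's init(), scaled by 128. */
void init_coefs(void)
{
    uint32_t sum = 0;
    for (unsigned int c = 0; c < TAPS; c++) {
        coef[c] = (c < 128u) ? (c + 1u) : (255u - c);
        sum += coef[c];
    }
    /* Sanity check: the shift only normalizes if the sum is exactly 2^14. */
    assert(sum == (1u << COEF_SHIFT));
}

/* Store one sample and return the weighted moving average. */
uint32_t filter_sample(uint32_t new_sample)
{
    uint64_t acc = 0;

    val[write_idx++] = new_sample;  /* write_idx is meant to wrap */

    for (unsigned int c = 0; c < TAPS; c++) {
        /* 32x32 -> 64-bit multiply-accumulate; the mask wraps the index */
        acc += (uint64_t)val[c] * coef[(c - write_idx) & 0xFFu];
    }

    return (uint32_t)(acc >> COEF_SHIFT);
}
```

In this sketch init_coefs() would be called once at start-up and filter_sample() from ACQ_Complete() for each new sample. The assert is only a development-time check and can be dropped on the target.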