Tags: c, signal-processing, reshape, stm32f4, cmsis

elegant (and fast!) way to rearrange columns and rows in an ADC buffer


Abstract: I am looking for an elegant and fast way to "rearrange" the values in my ADC Buffer for further processing.

Introduction: On an ARM Cortex-M4 processor I am using 3 ADCs to sample analog values, with DMA and the "double buffer" technique. When I get a "half buffer complete" interrupt, the data in the 1D array are arranged like this:

Ch1S1, Ch2S1, Ch3S1, Ch1S2, Ch2S2, Ch3S2, Ch1S3 ..... Ch1Sn-1, Ch2Sn-1, Ch3Sn-1, Ch1Sn, Ch2Sn, Ch3Sn

where Sn stands for sample number n and Chn for channel number n. As I do 2x oversampling, n equals 16; the channel count is 9 in reality, but in the example above it is 3.

Or, written in 2D form:

Ch1S1, Ch2S1, Ch3S1,
Ch1S2, Ch2S2, Ch3S2,
Ch1S3 ...
Ch1Sn-1, Ch2Sn-1, Ch3Sn-1,
Ch1Sn, Ch2Sn, Ch3Sn

where the rows represent the n samples and the columns represent the channels ...
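In index terms this means the "reshape" is really just strided access. A tiny illustration (adc_idx is a hypothetical helper, just to pin down the layout; MEAS_ADC_CHANNELS is from my code below):

    #include <stdint.h>
    
    /* Hypothetical helper, illustration only: 0-based index of sample s of
       channel c in the interleaved DMA buffer. */
    static inline uint32_t adc_idx(uint32_t s, uint32_t c)
    {
        return s * MEAS_ADC_CHANNELS + c;
    }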

I am using CMSIS-DSP to calculate all the vector stuff, like shifting, scaling, and multiplication, once I have "sorted out" the channels. This part is pretty fast.
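(For context, this is the kind of per-channel call I mean, once a channel sits in its own contiguous vector; the q15 types and the function name here are just illustrative, not my actual code:)

    #include "arm_math.h"
    
    /* Illustration only: shift then scale one channel's samples, assuming
       q15_t data and at most 16 samples per channel, as in my setup. */
    void process_channel(const q15_t *chan, q15_t *out, uint32_t n)
    {
        q15_t tmp[16];
        arm_shift_q15(chan, 1, tmp, n);         /* x2 via saturating left shift */
        arm_scale_q15(tmp, 0x4000, 0, out, n);  /* x0.5: 0x4000 = 0.5 in q15 */
    }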

Issue: But the code I am using for "reshaping" the 1D buffer array into an accumulated value per channel is pretty poor and slow:

    bP = 0;                                              // buffer read position, must start at 0
    for(i = 0; i < ADC_BUFFER_SZ; i++) {                 // outer loop: samples
       for(j = 0; j < MEAS_ADC_CHANNELS; j++) {          // inner loop: channels
          if(i) *(ADC_acc + j) += *(ADC_DMABuffer + bP); // sum up all elements
          else  *(ADC_acc + j)  = *(ADC_DMABuffer + bP); // initialize on first sample
          bP++;
       }
    }

After this procedure I get a 1D array with one accumulated U32 value per channel. But this code is pretty slow: ~4000 clock cycles for 16 samples per channel × 9 channels, i.e. ~27 clock cycles per sample (4000 / 144 ≈ 27). In order to achieve higher sample rates, this needs to be many times faster than it is right now.

Question(s): What I am looking for is some elegant way, using the CMSIS-DSP functions, to achieve the same result as above, but much faster. My gut says that I am thinking in the wrong direction and that there must be a solution within the CMSIS-DSP lib, as I am most probably not the first guy who stumbles upon this topic, and I most probably won't be the last. So I'm asking for a little push in the right direction, as I guess this could be a severe case of "work-blindness" ...

I was thinking about using the dot-product function arm_dot_prod_q31 together with an array filled with ones for the accumulation task, because I could not find a CMSIS function that simply sums up a 1D array. But this would not solve the "reshaping" issue: I would still have to copy data around and create new buffers to prepare the vectors for the arm_dot_prod_q31 call. Besides that, it feels somehow awkward to use a dot product where I just want to sum up array elements …
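For illustration, something like this untested sketch is what I had in mind (MEAS_ADC_SAMPLES and chanBuf are placeholder names, and chanBuf would already have to hold one channel's samples contiguously, i.e. exactly the copying I want to avoid). One more caveat: if I read the docs right, arm_dot_prod_q31 truncates each intermediate product by 14 bits, so the "ones" would need to be pre-scaled:

    #include "arm_math.h"
    
    /* Untested sketch of the dot-product-with-ones idea. */
    q63_t sum_channel(const q31_t *chanBuf)
    {
        q31_t ones[MEAS_ADC_SAMPLES];
        q63_t sum;
    
        /* Pre-scaled "ones": the q31 dot product drops the low 14 bits of
           each 2.62 product, so a value of 1 << 14 survives as plain 1. */
        arm_fill_q31(1 << 14, ones, MEAS_ADC_SAMPLES);
        arm_dot_prod_q31(chanBuf, ones, MEAS_ADC_SAMPLES, &sum);
        return sum;
    }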

I also thought about transforming the ADC buffer into a 16 × 9 or 9 × 16 matrix, but then I could not find anything that would let me easily (= fast & elegantly) access rows or columns. That would leave me with another issue to solve, which would eventually require creating new buffers and copying data around, as I am missing a function with which I could multiply a matrix by a vector ...
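The only cheap "reshape" I found so far is reinterpreting the buffer in place as a pointer to rows, which costs nothing but only makes the rows (samples) contiguous; the columns (channels) stay strided, which is exactly my problem (element type assumed to be uint16_t here):

    #include <stdint.h>
    
    /* In-place reinterpretation -- no data is moved. Rows (samples) are
       contiguous, columns (channels) remain strided. */
    uint16_t (*mat)[MEAS_ADC_CHANNELS] =
        (uint16_t (*)[MEAS_ADC_CHANNELS])ADC_DMABuffer;
    
    uint16_t ch3_s1 = mat[0][2];   /* row 0 = sample 1, column 2 = Ch3 */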

Maybe someone has a hint for me that points me in the right direction? Thanks a lot, and cheers!


Solution

  • ARM is a RISC device, so 27 cycles is roughly equal to 27 instructions, IIRC. You may find that you're going to need a higher clock rate to meet your timing requirements. What OS are you running? Do you have access to the cache controller? You may need to lock data buffers into the cache to get high enough performance. Also, keep your sums and raw data as physically close in memory as your system allows.

    I am not convinced your perf issue is entirely the consequence of how you are stepping through your data array, but here's a more streamlined approach than what you are using:

    int raw[ADC_BUFFER_SZ];
    int sums[MEAS_ADC_CHANNELS] = { 0 };   // accumulators must start at zero
    
    for (int idxRaw = 0, idxSum = 0; idxRaw < ADC_BUFFER_SZ; idxRaw++)
    {
        sums[idxSum++] += raw[idxRaw];                 // add sample to its channel's sum
        if (idxSum == MEAS_ADC_CHANNELS) idxSum = 0;   // wrap back to channel 0
    }
    

    Note that I have not tested the above code, nor even tried to compile it. The algorithm is simple enough that you should be able to get it working quickly.

    Writing pointer math in your code will not make it any faster. The compiler will convert array notation to efficient pointer math for you. You definitely don't need two loops.

    That said, I often use a pointer for iteration:

    int raw[ADC_BUFFER_SZ];
    int sums[MEAS_ADC_CHANNELS] = { 0 };   // accumulators must start at zero
    
    int *itRaw = raw;
    int *itRawEnd = raw + ADC_BUFFER_SZ;
    int *itSums = sums;
    int *itSumsEnd = sums + MEAS_ADC_CHANNELS;
    
    while (itRaw != itRawEnd)
    {
        *itSums += *itRaw;                      // add sample to its channel's sum
        itRaw++;
        itSums++;
        if (itSums == itSumsEnd) itSums = sums; // wrap back to channel 0
    }
    

    But almost never when I am working with a mathematician or scientist, which is often the case in measurement/metrology device development: it's easier to explain array notation to non-C reviewers than the iterator form.

    Also, if I have an algorithm description that uses the phrase "for each ...", I tend to prefer the for-loop form; when the description uses "while ...", then of course I will probably use the while form, unless I can skip one or more variable assignments by rearranging it into a do..while. But I often stick as close as possible to the original description until I've passed all the testing criteria, and only then rearrange loops for code-hygiene purposes. It's easier to argue with a domain expert that their math is wrong when you can easily convince them that you implemented what they described.

    Always get it right first, then measure and determine whether to hone the code further. Decades ago, some C compilers for embedded systems could do a better job of optimizing one kind of loop than another. We used to have to keep a wary eye on the machine code they generated, and we often developed habits that avoided those worst-case scenarios. That is uncommon today, and almost certainly not the case for your ARM toolchain. But you may have to look into how your compiler's optimization features work and try something different.
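    For example, one rearrangement worth benchmarking (untested; names from the question, with ADC_BUFFER_SZ taken as the per-channel sample count as in the original loop, and 32-bit elements assumed) iterates channel-major, so each channel's running sum can stay in a register instead of being re-read and re-written through memory on every sample:

    for (int ch = 0; ch < MEAS_ADC_CHANNELS; ch++)
    {
        uint32_t acc = 0;                      // stays in a register
        for (int s = 0; s < ADC_BUFFER_SZ; s++)
            acc += ADC_DMABuffer[s * MEAS_ADC_CHANNELS + ch];
        ADC_acc[ch] = acc;                     // one store per channel
    }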

    Do try to avoid doing value math on the same line as your pointer math. It's just confusing:

   *(p1 + offset1) += *(p2 + offset2); // Can and should be avoided.
   *(p1++) = *(p2++); // Reasonable, especially for experienced coders/reviewers.
   p1[offset1] += p2[offset2]; // Okay. Doesn't mix math notation with pointer notation.
   p1[offset1 + A*B/C] += p2...; // Very bad.
   // But...
   int offset1 = A*B/C; // Especially helpful when stepping in the debugger.
   p1[offset1]... ; // Much better.
    

    Hence the iterator form mentioned earlier. Mixing value math with pointer math may reduce the number of lines of code, but it does not reduce the complexity, and it definitely increases the odds of introducing a bug at some point.

    A purist could argue that p1[x] is in fact pointer notation in C, but array notation has almost, if not completely, universal binding rules across languages. Intentions are obvious, even to non-programmers. While the examples above are pretty trivial, and most C programmers would have no problem reading any of them, it's when the number of variables involved and the complexity of the math increase that mixing your value math with pointer math quickly becomes problematic. You'll almost never do it for anything non-trivial, so for consistency's sake, just get in the habit of avoiding it altogether.