c++, optimization, microbenchmark

Is this the fastest way to reformat a buffer of 64b values to 16b?


I have a datastream which outputs what are physically 64-bit values to a buffer. When the buffer reaches a certain fill level, it needs to be reformatted into consecutive 16-bit values. The real data never occupies more than 24 of the 64 bits of each value the datastream produces, so this amounts to truncating a 24-bit value to 16 bits and compacting the buffer so the values are consecutive. I believe I have found the fastest way to do this, but I am not sure whether there are optimizations I am missing or faster approaches provided by the C++ standard library. Below is an MRE showing my reformatting function as well as a test harness that produces data similar to what I am encountering and times the reformatting.

#include <iostream>
#include <chrono>
#include <cstdint>
#include <cstdlib>

int num_samples = 160000;

void fill_buffer(uint8_t** buffer){
  // Room for num_samples 64-bit samples; calloc zeroes the upper bytes
  // that the loop below leaves untouched.
  *buffer = (uint8_t*)calloc(num_samples, sizeof(uint64_t));
  // Each sample occupies 8 bytes; only its low 3 bytes carry real data.
  for (size_t i = 0; i < num_samples * sizeof(uint64_t); i += 8){
    (*buffer)[i] = rand() & 0xFF;
    (*buffer)[i + 1] = rand() & 0xFF;
    (*buffer)[i + 2] = rand() & 0xFF;
  }
}

void reformat_1(uint8_t* buf){
  uint64_t* p_8byte = (uint64_t*)buf;
  uint16_t* p_2byte = (uint16_t*)buf;

  // Compact in place: the i-th 16-bit result lands at or before the position
  // of the i-th 64-bit sample, so reads always stay ahead of writes.
  // (value >> 8) truncated to 16 bits keeps bits 8..23 of each sample.
  for (int i = 0; i < num_samples; i++){
    p_2byte[i] = p_8byte[i] >> 8;
  }
}

int main(int argc, char const* argv[]){
  uint8_t* buffer = NULL;

  fill_buffer(&buffer);
  auto start = std::chrono::high_resolution_clock::now();
  reformat_1(buffer);
  auto stop = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
  std::cout << "Time taken by function one: " << duration.count() << " microseconds" << std::endl;

  return 0;
}

I am also willing to hear feedback on my benchmarking setup. I find it interesting that with -O3 I get ~130 µs on my actual sample data read from a file, while with randomly generated data I see closer to 1800 µs, so this is apparently not a perfectly representative example.
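
One common way to make a single-shot timing like this less noisy is to run the function several times on freshly filled buffers and keep the best (or median) result instead of one measurement. Below is a minimal sketch of that idea, meant to slot into the same file as the MRE above; the helper name benchmark_reformat and the choice to reuse fill_buffer/reformat_1 per run are my own assumptions, not part of the original harness.

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: time several runs on freshly filled buffers and keep
// the fastest, which filters out one-off effects such as cold caches or
// first-touch page faults on the newly allocated buffer.
long long benchmark_reformat(int runs){
  long long best_us = -1;
  for (int r = 0; r < runs; r++){
    uint8_t* buffer = NULL;
    fill_buffer(&buffer);
    auto start = std::chrono::high_resolution_clock::now();
    reformat_1(buffer);
    auto stop = std::chrono::high_resolution_clock::now();
    long long us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    best_us = (best_us < 0) ? us : std::min(best_us, us);
    free(buffer);
  }
  return best_us;
}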

One other thing I will note, which I would have expected to work against my real-data times (versus the synthetic ones) but apparently does not: while num_samples is a magic number here, in practice it is calculated at runtime and is usually (but not always) the same value, so it is not something the compiler could replace with a compile-time constant in order to unroll loops, etc. (I think).


Solution

  • This micro-improvement increases performance by ~10%:

    void reformat_2(uint8_t* buf){
      uint32_t* p_4byte = (uint32_t*)buf;   // a 32-bit load is enough: the wanted bits sit in the low 4 bytes
      uint16_t* p_2byte = (uint16_t*)buf;
      uint16_t* p_2end  = p_2byte + num_samples;
    
      while(p_2byte < p_2end){
        // On a little-endian target, bits 8..23 of each sample are the wanted 16-bit value.
        *p_2byte++ = *p_4byte >> 8;
        p_4byte += 2;                       // advance 8 bytes to the next 64-bit sample
      }
    }
    

    To see clearer numbers, I increased the buffer size 100x, to 16M entries.
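
  • If even more throughput is needed, the same compaction maps naturally onto a SIMD byte shuffle. The following is only a sketch, not a drop-in replacement: it assumes a little-endian x86 target built with SSSE3 support (e.g. -mssse3), and the name reformat_simd with an explicit sample-count parameter is my own addition. Each 16-byte load covers two samples, a byte shuffle picks out bytes 1-2 of each, two shuffled loads are combined into one 8-byte store, and a scalar loop handles leftover samples. Whether it actually beats the scalar loop should be measured, since the loop may already be memory-bound.

    #include <immintrin.h>   // SSSE3 intrinsics (_mm_shuffle_epi8)
    #include <cstdint>
    #include <cstring>

    void reformat_simd(uint8_t* buf, int n){
      // Shuffle mask: keep bytes 1-2 of each of the two samples in a 16-byte
      // block (the wanted 16 bits of each), zero everything else (-1 lanes).
      const __m128i pick = _mm_setr_epi8(1, 2, 9, 10,
                                         -1, -1, -1, -1,
                                         -1, -1, -1, -1,
                                         -1, -1, -1, -1);
      int i = 0;
      for (; i + 4 <= n; i += 4){
        __m128i a  = _mm_loadu_si128((const __m128i*)(buf + (size_t)i * 8));      // samples i, i+1
        __m128i b  = _mm_loadu_si128((const __m128i*)(buf + (size_t)i * 8 + 16)); // samples i+2, i+3
        __m128i lo = _mm_shuffle_epi8(a, pick);    // 4 result bytes in lanes 0-3
        __m128i hi = _mm_shuffle_epi8(b, pick);    // 4 result bytes in lanes 0-3
        __m128i out = _mm_unpacklo_epi32(lo, hi);  // low 64 bits = 8 result bytes
        _mm_storel_epi64((__m128i*)(buf + (size_t)i * 2), out);
      }
      // Scalar tail for a sample count that is not a multiple of 4.
      for (; i < n; i++){
        uint64_t v;
        std::memcpy(&v, buf + (size_t)i * 8, sizeof v);
        uint16_t s = (uint16_t)(v >> 8);
        std::memcpy(buf + (size_t)i * 2, &s, sizeof s);
      }
    }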