cunixgcc9

No performance difference in different variations of the same program


I copied glibc's implementation of binary search algorithm, then modified it a little bit to suit my needs. I decided to test it and other things I have learned about GCC (attributes and built-ins). The code looks as:

int main() {
  uint_fast16_t a[61] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61 };
  uint64_t t1 = Time(0);
  for(register uint_fast16_t i = 0; i < 10000000; ++i) {
    binary_search(rand() % 62, a, 61);
  }
  printf("%ld\n", Time(0) - t1);
  return 0;
}

Now, this program runs just fine. The problem begins when I add more lines of code, for instance:

uint_fast16_t a[61] __attribute__ ((aligned (64) )) = /* ... */

In this case I would expect faster code, yet performance has not changed after multiple tests (tens of tests). I also tested the program with alignment of 8 and 1 - no changes. I even expected gcc to throw an error/warning, because using alignment less than type size (in my case 64bit machine, uint_fast16_t is 8 bytes), but there was none. Then another change, which was adding caching (introduced in GCC 9). I added the following code before the for loop:

caches(a, uint_fast16_t, uint_fast16_t, 61, 0, 3);
// where "caches" is:
#define caches(x, type, data_type, size, rw, l) ({ \
  for(type Q2W0 = 0; Q2W0 < size; Q2W0 += 64 / sizeof(data_type)) { \
    __builtin_prefetch(x + Q2W0, rw, l); \
  } \
})

No change in performance as well. I figured out maybe my CPU is caching the array automatically after first binary_search so I eliminated the for loop and measure a few times again with and without the caching line, but I have not noticed any change in performance as well.
More information:

  1. Using CentOS8 64bit latest kernel
  2. Using GCC 9.2.1 20191120
  3. Compiling with -O3 -Wall -pthread -lanl -Wno-missing-braces -Wmissing-field-initializers, no errors / warnings during compilation
  4. Things are not optimised away (checked asm output)

I am pretty sure I don't know about something / I am doing something wrong.

Full code available here.


Solution