Let A
be a properly aligned array of 32-bit integers in shared memory.
If a single warp tries to fetch elements of A
at random, what is the expected number of bank conflicts?
In other words:
__shared__ int A[N]; //N is some big constant integer
...
int v = A[ random(0..N-1) ]; // <-- expected number of bank conflicts here?
Please assume Tesla or Fermi architecture. I don't want to dwell into 32-bit vs 64-bit bank configurations of Kepler. Also, for simplicity, let us assume that all the random numbers are different (thus no broadcast mechanism).
My gut feeling suggests a number somewhere between 4 and 6, but I would like to find some mathematical evaluation of it.
I believe the problem can be abstracted out from CUDA and presented as a math problem. I searched it as an extension to Birthday Paradox, but I found really scary formulas there and didn't find a final formula. I hope there is a simpler way...
I assume fermi 32-bank shared memory where each 4 consequent bytes are stored in consequent banks. Using following code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define NBANK 32
#define N 7823
#define WARPSIZE 32
#define NSAMPLE 10000
int main(){
srand ( time(NULL) );
int i=0,j=0;
int *conflictCheck=NULL;
int *randomNumber=NULL;
int *statisticCheck=NULL;
conflictCheck=(int*)malloc(sizeof(int)*NBANK);
randomNumber=(int*)malloc(sizeof(int)*WARPSIZE);
statisticCheck=(int*)malloc(sizeof(int)*(NBANK+1));
while(i<NSAMPLE){
// generate a sample warp shared memory access
for(j=0; j<WARPSIZE; j++){
randomNumber[j]=rand()%NBANK;
}
// check the bank conflict
memset(conflictCheck, 0, sizeof(int)*NBANK);
int max_bank_conflict=0;
for(j=0; j<WARPSIZE; j++){
conflictCheck[randomNumber[j]]++;
max_bank_conflict = max_bank_conflict<conflictCheck[randomNumber[j]]? conflictCheck[randomNumber[j]]: max_bank_conflict;
}
// store statistic
statisticCheck[max_bank_conflict]++;
// next iter
i++;
}
// report statistic
printf("Over %d random shared memory access, there found following precentages of bank conflicts\n");
for(i=0; i<NBANK+1; i++){
//
printf("%d -> %6.4f\n",i,statisticCheck[i]/(float)NSAMPLE);
}
return 0;
}
I got following output:
Over 0 random shared memory access, there found following precentages of bank conflicts
0 -> 0.0000
1 -> 0.0000
2 -> 0.0281
3 -> 0.5205
4 -> 0.3605
5 -> 0.0780
6 -> 0.0106
7 -> 0.0022
8 -> 0.0001
9 -> 0.0000
10 -> 0.0000
11 -> 0.0000
12 -> 0.0000
13 -> 0.0000
14 -> 0.0000
15 -> 0.0000
16 -> 0.0000
17 -> 0.0000
18 -> 0.0000
19 -> 0.0000
20 -> 0.0000
21 -> 0.0000
22 -> 0.0000
23 -> 0.0000
24 -> 0.0000
25 -> 0.0000
26 -> 0.0000
27 -> 0.0000
28 -> 0.0000
29 -> 0.0000
30 -> 0.0000
31 -> 0.0000
32 -> 0.0000
We can come to conclude that 3 to 4 way conflict is the most likely with random access. You can tune the run with different N
(number of elements in array), NBANK
(number of banks in shared memory), WARPSIZE
(warp size of machine), and NSAMPLE
(number of random shared memory accesses generated to evaluate the model).