performancecudascatterstream-compaction

Improving the Efficiency of Compact/Scatter in CUDA


Summary:

Any ideas about how to further improve upon the basic scatter operation in CUDA? Especially if one knows it will only be used to compact a larger array into a smaller one? or why the below methods of vectorizing memory ops and shared memory didn't work? I feel like there may be something fundamental I am missing and any help would be appreciated.

EDIT 03/09/15: So I found this Parallel For All Blog post "Optimized Filtering with Warp-Aggregated Atomics". I had assumed atomics would be intrinsically slower for this purpose, however I was wrong - especially since I don't think I care about maintaining element order in the array during my simulation. I'll have to think about it some more and then implement it to see what happens!

EDIT 01/04/16: I realized I never wrote about my results. Unfortunately in that Parallel for All Blog post they compared the global atomic method for compact to the Thrust prefix-sum compact method, which is actually quite slow. CUB's Device::IF is much faster than Thrust's - as is the prefix-sum version I wrote using CUB's Device::Scan + custom code. The warp-aggregrate global atomic method is still faster by about 5-10%, but nowhere near the 3-4x faster I had been hoping for based on the results in the blog. I'm still using the prefix-sum method as while maintaining element order is not necessary, I prefer the consistency of the prefix-sum results and the advantage from the atomics is not very big. I still try various methods to improve compact, but so far only marginal improvements (2%) at best for dramatically increased code complexity.


Details:

I am writing a simulation in CUDA where I compact out elements I am no longer interested in simulating every 40-60 time steps. From profiling it seems that the scatter op takes up the most amount of time when compacting - more so than the filter kernel or the prefix sum. Right now I use a pretty basic scatter function:

    __global__ void scatter_arrays(float * new_freq, const float * const freq, const int * const flag, const int * const scan_Index, const int freq_Index){
            int myID =  blockIdx.x*blockDim.x + threadIdx.x;
            for(int id = myID; id < freq_Index; id+= blockDim.x*gridDim.x){
                 if(flag[id]){
                    new_freq[scan_Index[id]] = freq[id];
                 }
             } 
    }

freq_Index is the number of elements in the old array. The flag array is the result from the filter. Scan_ID is the result from the prefix sum on the flag array.

Attempts I've made to improve it are to read the flagged frequencies into shared memory first and then write from shared memory to global memory - the idea being that the writes to global memory would be more coalesced amongst the warps (e.g. instead of thread 0 writing to position 0 and thread 128 writing to position 1, thread 0 would write to 0 and thread 1 would write to 1). I also tried vectorizing the reads and the writes - instead of reading and writing floats/ints I read/wrote float4/int4 from the global arrays when possible, so four numbers at a time. This I thought might speed up the scatter by having fewer memory ops transferring larger amounts of memory. The "kitchen sink" code with both vectorized memory loads/stores and shared memory is below:

    const int compact_threads = 256;
    __global__ void scatter_arrays2(float * new_freq, const float * const freq, const int * const flag, const int * const scan_Index, const int freq_Index){
        int gID =  blockIdx.x*blockDim.x + threadIdx.x; //global ID
        int tID = threadIdx.x; //thread ID within block
        __shared__ float row[4*compact_threads];
        __shared__ int start_index[1];
        __shared__ int end_index[1];
        float4 myResult;
        int st_index;
        int4 myFlag;
        int4 index;
        for(int id = gID; id < freq_Index/4; id+= blockDim.x*gridDim.x){
            if(tID == 0){
                index = reinterpret_cast<const int4*>(scan_Index)[id];
                myFlag = reinterpret_cast<const int4*>(flag)[id];
                start_index[0] = index.x;
                st_index = index.x;
                myResult = reinterpret_cast<const float4*>(freq)[id];
                if(myFlag.x){ row[0] = myResult.x; }
                if(myFlag.y){ row[index.y-st_index] = myResult.y; }
                if(myFlag.z){ row[index.z-st_index] = myResult.z; }
                if(myFlag.w){ row[index.w-st_index] = myResult.w; }
            }
            __syncthreads();
            if(tID > 0){
                myFlag = reinterpret_cast<const int4*>(flag)[id];
                st_index = start_index[0];
                index = reinterpret_cast<const int4*>(scan_Index)[id];
                myResult = reinterpret_cast<const float4*>(freq)[id];
                if(myFlag.x){ row[index.x-st_index] = myResult.x; }
                if(myFlag.y){ row[index.y-st_index] = myResult.y; }
                if(myFlag.z){ row[index.z-st_index] = myResult.z; }
                if(myFlag.w){ row[index.w-st_index] = myResult.w; }
                if(tID == blockDim.x -1 || gID == mutations_Index/4 - 1){ end_index[0] = index.w + myFlag.w; }
            }
            __syncthreads();
            int count = end_index[0] - st_index;

            int rem = st_index & 0x3; //equivalent to modulo 4
            int offset = 0;
            if(rem){ offset = 4 - rem; }

            if(tID < offset && tID < count){
                new_mutations_freq[population*new_array_Length+st_index+tID] = row[tID];
            }

            int tempID = 4*tID+offset;
            if((tempID+3) < count){
                reinterpret_cast<float4*>(new_freq)[tID] = make_float4(row[tempID],row[tempID+1],row[tempID+2],row[tempID+3]);
            }

            tempID = tID + offset + (count-offset)/4*4;
            if(tempID < count){ new_freq[st_index+tempID] = row[tempID]; }
        }
        int id = gID + freq_Index/4 * 4; 
        if(id < freq_Index){
            if(flag[id]){
                new_freq[scan_Index[id]] = freq[id];
            }
        }
    }

Obviously it gets a bit more complicated. :) While the above kernel seems stable when there are hundreds of thousands of elements in the array, I've noticed a race condition when the array numbers in the tens of millions. I'm still trying to track the bug down.

But regardless, neither method (shared memory or vectorization) together or alone improved performance. I was especially surprised by the lack of benefit from vectorizing the memory ops. It had helped in other functions I had written, though now I am wondering if maybe it helped because it increased Instruction-Level-Parallelism in the calculation steps of those other functions rather than the fewer memory ops.


Solution

  • I found the algorithm mentioned in this poster (similar algorithm also discussed in this paper) works pretty well, especially for compacting large arrays. It uses less memory to do it and is slightly faster than my previous method (5-10%). I put in a few tweaks to the poster's algorithm: 1) eliminating the final warp shuffle reduction in phase 1, can simply sum the elements as they are calculated, 2) giving the function the ability to work over more than just arrays sized as a multiple of 1024 + adding grid-strided loops, and 3) allowing each thread to load their registers simultaneously in phase 3 instead of one at a time. I also use CUB instead of Thrust for Inclusive sum for faster scans. There may be more tweaks I can make, but for now this is good.

    //kernel phase 1
    int myID =  blockIdx.x*blockDim.x + threadIdx.x;
    //padded_length is nearest multiple of 1024 > true_length
    for(int id = myID; id < (padded_length >> 5); id+= blockDim.x*gridDim.x){
        int lnID = threadIdx.x % warp_size;
        int warpID = id >> 5;
    
        unsigned int mask;
        unsigned int cnt=0;//;//
    
        for(int j = 0; j < 32; j++){
            int index = (warpID<<10)+(j<<5)+lnID;
            
            bool pred;
            if(index > true_length) pred = false;
            else pred = predicate(input[index]);
            mask = __ballot(pred); 
    
            if(lnID == 0) {
                flag[(warpID<<5)+j] = mask;
                cnt += __popc(mask);
            }
        }
    
        if(lnID == 0) counter[warpID] = cnt; //store sum
    }
    
    //kernel phase 2 -> CUB Inclusive sum transforms counter array to scan_Index array
    
    //kernel phase 3
    int myID =  blockIdx.x*blockDim.x + threadIdx.x;
    
    for(int id = myID; id < (padded_length >> 5); id+= blockDim.x*gridDim.x){
        int lnID = threadIdx.x % warp_size;
        int warpID = id >> 5;
    
        unsigned int predmask;
        unsigned int cnt;
    
        predmask = flag[(warpID<<5)+lnID];
        cnt = __popc(predmask);
    
        //parallel prefix sum
    #pragma unroll
        for(int offset = 1; offset < 32; offset<<=1){
            unsigned int n = __shfl_up(cnt, offset);
            if(lnID >= offset) cnt += n;
        }
    
        unsigned int global_index = 0;
        if(warpID > 0) global_index = scan_Index[warpID - 1];
    
        for(int i = 0; i < 32; i++){
            unsigned int mask = __shfl(predmask, i); //broadcast from thread i
            unsigned int sub_group_index = 0;
            if(i > 0) sub_group_index = __shfl(cnt, i-1);
            if(mask & (1 << lnID)){
                compacted_array[global_index + sub_group_index + __popc(mask & ((1 << lnID) - 1))] = input[(warpID<<10)+(i<<5)+lnID]; 
            }
        }
    }
    

    }

    EDIT: There is a newer article by a subset of the poster authors where they examine a faster variation of compact than what is written above. However, their new version is not order preserving, so not useful for myself and I haven't implemented it to test it out. That said, if your project doesn't rely on object order, their newer compact version can probably speed up your algorithm.