There is AMD HIP C++, which is very similar to CUDA C++. AMD also created Hipify to convert CUDA C++ to HIP C++ (portable C++ code), which can be executed on both nVidia and AMD GPUs: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP
`shfl` operations on nVidia GPU: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/tree/master/samples/2_Cookbook/4_shfl#requirement-for-nvidia

> Requirement for nvidia: please make sure you have a 3.0 or higher compute capable device in order to use warp shfl operations and add `-gencode arch=compute_30,code=sm_30` nvcc flag in the Makefile while using this application.
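For clarity, the kind of kernel I mean is a plain warp shuffle like this (my own minimal sketch, not code from the linked sample):

```cpp
// My own minimal sketch (not code from the linked sample): a warp-level
// sum with __shfl_down, which needs compute capability >= 3.0 on nVidia.
// Compile with: nvcc -gencode arch=compute_30,code=sm_30 ...
__global__ void warpSum(const float* in, float* out) {
    float v = in[threadIdx.x];
    // Tree reduction inside one 32-thread warp, register-to-register.
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down(v, offset); // pre-CUDA-9 variant, no sync mask
    if (threadIdx.x == 0) *out = v;
}
```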
`shfl` for the 64 wavesize (warp size) on AMD: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/master/docs/markdown/hip_faq.md#why-use-hip-rather-than-supporting-cuda-directly

> In addition, HIP defines portable mechanisms to query architectural features, and supports a larger 64-bit wavesize which expands the return type for cross-lane functions like ballot and shuffle from 32-bit ints to 64-bit ints.
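So on AMD the same source code is expected to use a wide mask; for example (a sketch, assuming HIP's 64-bit `__ballot`):

```cpp
// Sketch: a wavefront on AMD has 64 lanes, so HIP's __ballot returns a
// 64-bit mask (uint64_t) instead of CUDA's 32-bit int.
__global__ void countVotes(const int* flags, int* result) {
    uint64_t mask = __ballot(flags[threadIdx.x] != 0); // one bit per lane
    if (threadIdx.x == 0)
        *result = __popcll(mask); // popcount of the 64-bit mask
}
```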
But which AMD GPUs support the `shfl` functions? Or does every AMD GPU support `shfl` because on AMD it is implemented through Local memory, without a hardware register-to-register instruction?

nVidia GPUs require compute capability 3.0 or higher (CUDA CC), but what are the requirements for using `shfl` operations on AMD GPUs with HIP C++?
Yes, GCN3 GPUs have new instructions such as `ds_bpermute` and `ds_permute`, which provide functionality like `__shfl()` and even more.
These `ds_bpermute` and `ds_permute` instructions use only the routing hardware of Local memory (LDS, 8.6 TB/s), but do not actually touch Local memory itself, which accelerates data exchange between threads: 8.6 TB/s < speed < 51.6 TB/s: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
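In HIP C++ this stays behind the usual shuffle intrinsic; a cross-lane broadcast such as the following should be lowered to `ds_bpermute` on GCN3 (a minimal sketch, assuming HIP's `__shfl` with wavefront width 64):

```cpp
#include <hip/hip_runtime.h>

// Sketch: broadcast the value held by lane 0 to all 64 lanes.
// On GCN3 the compiler can lower __shfl to ds_bpermute_b32, which uses
// the LDS routing network but never writes an LDS location.
__global__ void broadcastLane0(const int* in, int* out) {
    int v = in[threadIdx.x];
    out[threadIdx.x] = __shfl(v, 0, 64); // read lane 0, width 64
}
```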
> They use LDS hardware to route data between the 64 lanes of a wavefront, but they don’t actually write to an LDS location.
> Now, most of the vector instructions can do cross-lane reading at full throughput.

For example, the `wave_shr` instruction (wavefront shift right) can implement a step of the Scan algorithm (see the HIP sketch below).
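At the HIP C++ level the same shift-by-lane pattern is usually written with `__shfl_up`; here is a minimal sketch of a wavefront-wide inclusive scan (my code, not the article's assembly):

```cpp
// Sketch: inclusive plus-scan over a 64-lane wavefront via __shfl_up.
// Each __shfl_up step plays the role of the wave_shr lane shift.
__device__ int waveInclusiveScan(int v) {
    int lane = threadIdx.x & 63;            // lane id, assuming a 1-D block
    for (int offset = 1; offset < 64; offset *= 2) {
        int n = __shfl_up(v, offset, 64);   // value from `offset` lanes below
        if (lane >= offset) v += n;         // lower lanes keep their value
    }
    return v;
}
```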
More about GCN3: https://github.com/olvaffe/gpu-docs/raw/master/amd-open-gpu-docs/AMD_GCN3_Instruction_Set_Architecture.pdf
New Instructions
- “SDWA” – Sub Dword Addressing allows access to bytes and words of VGPRs in VALU instructions.
- “DPP” – Data Parallel Processing allows VALU instructions to access data from neighboring lanes.
- DS_PERMUTE_RTN_B32, DS_BPERMUTE_RTN_B32.
...
DS_PERMUTE_B32 Forward permute. Does not write any LDS memory.
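If you want to reach these instructions directly instead of going through `__shfl`, clang exposes builtins for them; the following is a sketch assuming your HIP/clang version provides `__builtin_amdgcn_ds_permute` / `__builtin_amdgcn_ds_bpermute`:

```cpp
// Sketch, assuming the amdgcn builtins are available in the compiler.
// Both take a BYTE address, so the lane index is multiplied by 4.

// Backward permute (DS_BPERMUTE_B32): "read the value of lane srcLane".
__device__ int readFromLane(int v, int srcLane) {
    return __builtin_amdgcn_ds_bpermute(srcLane << 2, v);
}

// Forward permute (DS_PERMUTE_B32): "push my value to lane dstLane".
__device__ int writeToLane(int v, int dstLane) {
    return __builtin_amdgcn_ds_permute(dstLane << 2, v);
}
```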