In the Kepler architecture whitepaper, NVIDIA states that there are 32 Special Function Units (SFUs) and 32 Load/Store Units (LD/ST) on an SMX.
The SFUs are for "fast approximate transcendental operations". Unfortunately, I don't understand what this is supposed to mean. On the other hand, in "Special CUDA Double Precision trig functions for SFU" it is said that they only work in single precision. Is this still correct on a K20Xm?
The LD/ST units are obviously for storing and loading. Is every memory load/write required to go through one of these? And are they used by a single warp at a time? In other words, can only one warp be reading or writing at any given moment?
Cheers, Andi
The SFU are for "fast approximate transcendental operations"
SFUs compute functions like __cosf(), __expf(), etc.
On the other hand, here it is said that they only work in single precision. Is this still correct on a K20Xm?
According to a recent version of the CUDA C Programming Guide (section G.5.1), they still only work in single precision.
This makes some sense: if you need double precision, you are unlikely to use inaccurate math functions. You can refer to this answer for suggestions on double-precision arithmetic optimizations.
The implementation details of double-precision operations can be found in /usr/local/cuda-5.5/include/math_functions_dbl_ptx3.h (or wherever your CUDA Toolkit is installed).
For example, sin and cos use Payne-Hanek argument reduction followed by a Taylor expansion (up to order 14).
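To make the "argument reduction followed by Taylor expansion" idea concrete, here is a minimal host-side C++ sketch. It is not the code from math_functions_dbl_ptx3.h: the names `taylor_sin`, `taylor_cos` and `sketch_sin` are made up for illustration, and the reduction step uses a plain `std::remquo` against a double-precision pi/2 instead of Payne-Hanek, so it is only reasonable for moderately sized arguments.

```cpp
#include <cassert>
#include <cmath>

// Illustrative sketch only, NOT the CUDA implementation.
// All function names here are hypothetical.

// Taylor polynomial of sin(r) up to the r^13 term, Horner form in r^2.
double taylor_sin(double r) {
    const double r2 = r * r;
    double p = 1.0 / 6227020800.0;   // +1/13!
    p = p * r2 - 1.0 / 39916800.0;   // -1/11!
    p = p * r2 + 1.0 / 362880.0;     // +1/9!
    p = p * r2 - 1.0 / 5040.0;       // -1/7!
    p = p * r2 + 1.0 / 120.0;        // +1/5!
    p = p * r2 - 1.0 / 6.0;          // -1/3!
    p = p * r2 + 1.0;
    return r * p;
}

// Taylor polynomial of cos(r) up to the r^14 term, Horner form in r^2.
double taylor_cos(double r) {
    const double r2 = r * r;
    double p = -1.0 / 87178291200.0; // -1/14!
    p = p * r2 + 1.0 / 479001600.0;  // +1/12!
    p = p * r2 - 1.0 / 3628800.0;    // -1/10!
    p = p * r2 + 1.0 / 40320.0;      // +1/8!
    p = p * r2 - 1.0 / 720.0;        // -1/6!
    p = p * r2 + 1.0 / 24.0;         // +1/4!
    p = p * r2 - 1.0 / 2.0;          // -1/2!
    p = p * r2 + 1.0;
    return p;
}

// Naive reduction (assumption: stands in for Payne-Hanek, which keeps
// far more bits of pi and stays accurate for huge arguments).
double sketch_sin(double x) {
    assert(std::isfinite(x));
    const double half_pi = 1.57079632679489661923;
    int q = 0;
    // r lies in [-pi/4, pi/4]; q carries the low bits of the quotient.
    const double r = std::remquo(x, half_pi, &q);
    switch (((q % 4) + 4) % 4) {     // quadrant, made non-negative
        case 0:  return  taylor_sin(r);
        case 1:  return  taylor_cos(r);
        case 2:  return -taylor_sin(r);
        default: return -taylor_cos(r);
    }
}
```

For small and moderate inputs such as `sketch_sin(10.0)`, the result should agree closely with `std::sin`. The gap widens for very large arguments, which is the case a careful reduction scheme like Payne-Hanek is meant to handle.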
For double-precision calculations, SFUs seem to be used only in __internal_fast_rcp and __internal_fast_rsqrt, which in turn are used in acos, log, cosh and several other functions (see math_functions_dbl_ptx3.h). So in double-precision code the SFUs sit idle most of the time, just as the LD/ST units sit idle when there are no ongoing memory transactions.
Is any memory load/write required to go through one of these?
Yes, each access to global memory.
And are they also used as a single warp? In other words can there be only one warp which is currently writing or reading?
The number of units constrains only the number of instructions issued each cycle. That is, in each clock cycle up to 32 read instructions can be issued and up to 32 results can be returned.
One instruction can read/write up to 128 bytes, so if each thread in a warp reads 4 bytes and the accesses are coalesced, the whole warp requires a single load/store instruction. If the accesses are uncoalesced, more instructions have to be issued.
Moreover, the units are pipelined, meaning that a single unit can have multiple read/store requests in flight concurrently.
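The 128-byte figure above can be turned into a rough host-side model: count how many distinct 128-byte-aligned segments the 32 threads of a warp touch. This C++ sketch is a simplification for intuition only (real hardware behaviour also depends on caching and the exact architecture); `segments_touched` and `warp_addresses` are hypothetical helper names, and the 128-byte segment size is taken from the statement above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Assumptions of this sketch: 32 threads per warp, 128 bytes per request.
constexpr std::uint64_t kSegmentBytes = 128;
constexpr int kWarpSize = 32;

// Byte address accessed by each lane: base + lane * stride_bytes.
std::vector<std::uint64_t> warp_addresses(std::uint64_t base,
                                          std::uint64_t stride_bytes) {
    std::vector<std::uint64_t> addrs;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        addrs.push_back(base + static_cast<std::uint64_t>(lane) * stride_bytes);
    }
    return addrs;
}

// Number of distinct 128-byte-aligned segments covered by the accesses.
// Under the model above this approximates how many requests are needed.
std::size_t segments_touched(const std::vector<std::uint64_t>& addrs,
                             std::uint64_t bytes_per_access) {
    assert(bytes_per_access > 0);
    std::set<std::uint64_t> segments;
    for (std::uint64_t addr : addrs) {
        const std::uint64_t first = addr / kSegmentBytes;
        const std::uint64_t last = (addr + bytes_per_access - 1) / kSegmentBytes;
        for (std::uint64_t s = first; s <= last; ++s) {
            segments.insert(s);
        }
    }
    return segments.size();
}
```

In this model, 32 consecutive 4-byte reads starting at an aligned address (`warp_addresses(0, 4)`) fall into one segment, the same reads shifted by 64 bytes straddle two, and a 128-byte stride puts every thread in its own segment, for 32 in total.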