I've read about coalesced memory access(In CUDA, what is memory coalescing, and how is it achieved?) and its performance importance. However I don't know what a typical GPU does when a non coalesced memory access occur. When a thread "asks" for a byte in position P and the other threads asks for something far away the GPU gets a complete block of 128 bytes for that thread? If the reading is aligned can I read the other 127 bytes for "free"?
General rules:
If a single thread generates an address that is not near any of the other addresses generated in the warp, then the memory controller will need to request a whole line/segment from DRAM, which may be either 32 bytes or 128 bytes, depending on device and which caches are involved (i.e. what type of "miss" occurred) just to satisfy that one address from that one thread. Therefore regardless of whether that thread is requesting a minimum of 1 byte or up to the maximum of 16 bytes possible in a single thread read transaction, the memory controller must read either 32 bytes or 128 bytes from DRAM to satisfy the read originating from that thread. Similar logic will apply to every other address emanating from that particular "warp read".
This type of scattered or isolated access pattern is "uncoalesced", because no other thread in the warp needs an address close enough so that it can fulfill its needs from the same segment/line.
When a thread "asks" for a byte in position P and the other threads asks for something far away the GPU gets a complete block of 128 bytes for that thread?
Yes, either 32 bytes or 128 bytes is the minimum granularity of request that can be made from DRAM.
If the reading is aligned can I read the other 127 bytes for "free"?
Whether you need it or not, and regardless of alignment of requests within the line/segment, you will get either 32 bytes or 128 bytes from any DRAM read transaction.
This doesn't cover every case, but a general breakdown of the 32byte/128byte difference is as follows:
int
or float
, for example) so ultimately four L2 This presentation will be instructive. In particular, slide 17 shows one example of "perfect" coalescing, whereas slide 25 shows an example of a "fully uncoalesced" load.