I am confused about the following statements in the CUDA Programming Guide 4.0, section 5.3.2.1, in the Performance Guidelines chapter.
Global memory resides in device memory and device memory is accessed via 32-, 64-, or 128-byte memory transactions.
These memory transactions must be naturally aligned: Only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e., whose first address is a multiple of their size) can be read or written by memory transactions.
Question Part 1:
My understanding of device memory was that accesses to device memory by threads are uncached: if a thread accesses memory location a[i], it will fetch only a[i] and none of the values around a[i]. So the first statement seems to contradict this. Or perhaps I am misunderstanding the usage of the phrase "memory transaction" here?
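To make the access pattern I have in mind concrete, here is a minimal kernel sketch (my own illustration, not taken from the guide), where each thread i reads a single element a[i] from global memory:

```cuda
// Each thread reads one 4-byte element from global memory.
// My question: does this fetch only the 4 bytes of a[i], or does the
// hardware issue a 32-/64-/128-byte memory transaction around it?
__global__ void readKernel(const float *a, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i];
}
```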
Question Part 2: The second sentence does not seem very clear. Can someone explain this?