Based on what I know, when threads of a warp access the same address in global memory, the requests get serialized, so it's better to use constant memory. Does this serialization of simultaneous global memory accesses still happen when the GPU is equipped with L1 and L2 caches (as in the Fermi and Kepler architectures)? In other words, when threads of a warp access the same global memory address, do the other 31 threads benefit from the cache because one thread has already requested that address? What happens when the access is a read, and what happens when it is a write?
Simultaneous global accesses to the same address by threads in the same warp do not get serialized on Fermi and Kepler. The warp read has a broadcast mechanism which satisfies all such reads from a single cacheline read with no performance impact. The performance is the same as if it were a fully coalesced read. This is true regardless of cache specifics; for example, it is true even if L1 caching is disabled.
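As an illustration of the access pattern being discussed, here is a minimal sketch in which every thread of every warp reads the same global address. The names `uniformRead` and `runUniformRead` are made up for this example, and the code only demonstrates the pattern and checks the result; it does not measure performance.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread reads src[0], so all 32 threads of a warp hit the same address.
__global__ void uniformRead(const float *src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[0];
}

// Hypothetical helper: returns the number of wrong outputs, or -1 on a CUDA error.
int runUniformRead(int n)
{
    const float value = 3.0f;
    float *src = nullptr, *dst = nullptr;
    if (cudaMalloc(&src, sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&dst, n * sizeof(float)) != cudaSuccess) { cudaFree(src); return -1; }
    cudaMemcpy(src, &value, sizeof(float), cudaMemcpyHostToDevice);

    uniformRead<<<(n + 255) / 256, 256>>>(src, dst, n);

    float *host = new float[n];
    int mismatches = 0;
    // This blocking copy also waits for the kernel to finish.
    if (cudaMemcpy(host, dst, n * sizeof(float), cudaMemcpyDeviceToHost) != cudaSuccess) {
        mismatches = -1;
    } else {
        for (int i = 0; i < n; ++i)
            if (host[i] != value) ++mismatches;
    }
    delete[] host;
    cudaFree(src);
    cudaFree(dst);
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", runUniformRead(1 << 20));
    return 0;
}
```

To compare this against a fully coalesced read (`dst[i] = src[i]`), you would time both kernels with a profiler or CUDA events on your own device.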
The performance of simultaneous writes is not specified (AFAIK) but behaviorally, simultaneous writes always get serialized, and the order is undefined.
EDIT responding to additional questions below:
1. Even if all threads in the warp write the same value into the same address, does it get serialized? Isn't there a write broadcast mechanism that recognizes such a situation?
There is no write broadcast mechanism that looks at all the simultaneous writes to see if they are all the same and then takes some action based on that. The correct answer is that the writes happen in unspecified order, and the performance characteristics are undefined. Obviously, if all the values being written are the same, you can be assured that the value that ends up in the location will be that value. But if you're asking whether the write activity is collapsed to a single cycle or requires multiple cycles to complete, that actual behavior is undefined (undocumented) and in fact may vary from one architecture to the next (for example, cc1.x may serialize in such a way that all the writes are performed, whereas cc2.x may "serialize" in such a way that one write "wins" and all the others are discarded, not consuming actual cycles). Again, the performance is undocumented/unspecified, but the program-observable behavior is defined.
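The program-observable behavior described above can be sketched as follows. The names `sameAddressWrite` and `runSameAddressWrite` are hypothetical, and the example says nothing about how many cycles the writes take. It only shows that a uniform value always ends up in the location, while with differing values the result is one of the written values, with no guarantee as to which.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// All 32 threads of one warp store to the same address.
// uniform != 0: every thread stores 42.
// uniform == 0: each thread stores its own lane index (0..31).
__global__ void sameAddressWrite(int *out, int uniform)
{
    *out = uniform ? 42 : (int)threadIdx.x;
}

// Hypothetical helper: returns the value left in memory, or -1 on a CUDA error.
int runSameAddressWrite(int uniform)
{
    int *out = nullptr;
    int result = -1;
    if (cudaMalloc(&out, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(out, 0xFF, sizeof(int));   // start from a value no thread writes

    sameAddressWrite<<<1, 32>>>(out, uniform);

    // This blocking copy also waits for the kernel to finish.
    if (cudaMemcpy(&result, out, sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess)
        result = -1;
    cudaFree(out);
    return result;
}

int main()
{
    // Same value from every thread: the outcome is that value.
    printf("uniform write   -> %d\n", runSameAddressWrite(1));
    // Different values: some lane's value survives, but which one is unspecified.
    printf("divergent write -> %d\n", runSameAddressWrite(0));
    return 0;
}
```

If the final value needs to be deterministic when threads write different values, the usual options are to let a single thread do the store or to use an atomic operation.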
2. With this broadcast mechanism you explained, the only difference between constant memory broadcast access and global memory broadcast access is that the first one may route the access all the way to global memory but the latter has dedicated hardware and is faster, right?
__constant__ memory uses the constant cache, which is a dedicated piece of hardware that is available on a per-SM basis and caches a particular section of global memory in a read-only fashion. This HW cache is physically and logically separate from the L1 cache (if it exists and is enabled) and the L2 cache. For Fermi and beyond, both mechanisms support broadcast on read, and for the constant cache this is the preferred access pattern, because the constant cache can only service one read access per cycle (i.e. it does not support a whole cacheline read by a warp). Either mechanism may "hit" in the cache (if present) or "miss" and trigger a global read. On the first read of a given location (or cacheline), neither cache will have the requested data, so the access will "miss" and trigger a global memory read to service it. Thereafter, in either case, subsequent reads will be serviced out of the cache, assuming the relevant data is not evicted in the interim.
For early cc1.x devices, the constant memory cache was pretty valuable, since those early devices did not have an L1 cache. For Fermi and beyond, the principal reason to use the constant cache is this: if you can identify suitable data (i.e. read-only) and suitable access patterns (same address per warp), then using the constant cache will prevent those reads from travelling through L1 and possibly evicting other data. In effect you are increasing the cacheable footprint somewhat, over what the L1 can support alone.
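A minimal sketch of the "read-only data, same address per warp" case: a small coefficient table placed in `__constant__` memory and read uniformly by every thread. The names `coeffs`, `applyCubic` and `evalCubic` are invented for this example, and the table size and polynomial are arbitrary assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read-only coefficients served through the per-SM constant cache.
__constant__ float coeffs[4];

// Evaluates c3*x^3 + c2*x^2 + c1*x + c0 per element (Horner form).
// At each step all threads of a warp read the same coeffs[k], which is the
// uniform access pattern the constant cache is designed for.
__global__ void applyCubic(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        y[i] = ((coeffs[3] * v + coeffs[2]) * v + coeffs[1]) * v + coeffs[0];
    }
}

// Hypothetical helper: evaluates the cubic with coefficients {1, 2, 3, 4}
// at xValue on the GPU and returns y[0], or -1.0f on a CUDA error.
float evalCubic(float xValue)
{
    const int n = 256;
    const float hostCoeffs[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float hostX[n], hostY[n];
    for (int i = 0; i < n; ++i) hostX[i] = xValue;

    float *x = nullptr, *y = nullptr;
    if (cudaMalloc(&x, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&y, n * sizeof(float)) != cudaSuccess) { cudaFree(x); return -1.0f; }

    // Copy the table into constant memory and the inputs into global memory.
    cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(hostCoeffs));
    cudaMemcpy(x, hostX, n * sizeof(float), cudaMemcpyHostToDevice);

    applyCubic<<<1, n>>>(x, y, n);

    cudaError_t err = cudaMemcpy(hostY, y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(x);
    cudaFree(y);
    return (err == cudaSuccess) ? hostY[0] : -1.0f;
}

int main()
{
    printf("p(2) = %f\n", evalCubic(2.0f));
    return 0;
}
```

Here the per-thread inputs `x[i]` travel through the normal global memory path while the shared coefficients do not, which is the footprint argument made above. If threads of a warp indexed `coeffs` with different indices, the constant cache would be a poor fit for that data.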