I was studying the access time for different cache configurations when I stumbled on a term in the CACTI interface: "Number of Banks".
The number of banks is the number of interleaved modules in a cache; it increases the cache's bandwidth and the number of parallel accesses it can serve.
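To make sure we are talking about the same thing, here is a minimal sketch of my mental model of bank interleaving; the bank count and addresses are made-up illustration values, not something I know about any real cache:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustration only: a hypothetical cache with 4 interleaved banks and
 * 64-byte lines. Consecutive cache lines map to different banks, so
 * accesses to neighbouring lines could, in principle, proceed in parallel. */
#define LINE_SIZE 64u
#define NUM_BANKS 4u   /* assumed value for the "Number of Banks" knob in CACTI */

static unsigned bank_of(uint64_t addr)
{
    uint64_t line = addr / LINE_SIZE;    /* which cache line the address falls in */
    return (unsigned)(line % NUM_BANKS); /* low-order line bits select the bank   */
}

int main(void)
{
    /* Four consecutive lines land in four different banks. */
    for (uint64_t a = 0x1000; a < 0x1000 + 4 * LINE_SIZE; a += LINE_SIZE)
        printf("address 0x%llx -> bank %u\n", (unsigned long long)a, bank_of(a));
    return 0;
}
```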
In this context, I wanted to find the number of banks in the caches of the Nehalem architecture. I googled for it but did not find anything useful.
My reasoning here is:
Is my intuition correct? Also, does the number of banks change the way the data/program should be structured? (Ideally it should not, but still ...)
The overview graphic in the Wikipedia article depicts Nehalem (the first CPU branded as "Core i7") as having 256 KByte of L2 cache per core.
I am not sure what you mean by the word "bank" here. Nehalem's L2 cache is 8-way associative with 64-byte cache lines.
That means data moves between the cache levels in units of a whole 64-byte line, while an individual load or store transfers up to 8 bytes, which corresponds well to a 64-bit architecture where registers and virtual addresses are 8 bytes wide. So every time a value has to be retrieved from or stored to memory, the natural unit at the instruction level is the 8-byte word, and a cache line simply groups eight such words together. (Other sizes make sense too, depending on the application, such as larger data caches for vector processing units.)
x-way associativity determines the relationship between a memory address and the places where the data at that address can be stored inside the cache. The term "8-way associativity" means that data at a given memory address can be held in any of 8 different cache lines (the 8 ways of one set). The cache has an address comparison mechanism to select the matching entry among those ways, and a replacement strategy to decide which of the x ways to reuse, possibly evicting a previously valid value.
Your use of the term "bank" probably refers to one of these 8 ways. Thus the answer to your question is probably "8". And again, that is one L2 cache per core, and each has that structure.
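To illustrate, here is a minimal sketch of a lookup in such an 8-way set-associative cache, using the sizes mentioned above (256 KByte, 64-byte lines, 8 ways, hence 512 sets); the bit layout is the generic textbook scheme, not Intel's documented implementation:

```c
#include <stdint.h>
#include <stdio.h>

/* Generic 8-way set-associative lookup, sized like Nehalem's per-core L2:
 * 256 KiB total, 64-byte lines, 8 ways  =>  256 KiB / (64 B * 8) = 512 sets. */
#define LINE_SIZE   64u
#define NUM_WAYS    8u
#define CACHE_SIZE  (256u * 1024u)
#define NUM_SETS    (CACHE_SIZE / (LINE_SIZE * NUM_WAYS))   /* 512 */

struct line { uint64_t tag; int valid; };
static struct line cache[NUM_SETS][NUM_WAYS];   /* one entry per (set, way) */

/* Split an address into offset (6 bits), set index (9 bits) and tag. */
static int lookup(uint64_t addr)
{
    uint64_t line_addr = addr / LINE_SIZE;      /* drop the 6 offset bits    */
    unsigned set       = line_addr % NUM_SETS;  /* next 9 bits pick the set  */
    uint64_t tag       = line_addr / NUM_SETS;  /* the rest is the tag       */

    /* The "8 ways": the tag is compared against all 8 entries of the set. */
    for (unsigned way = 0; way < NUM_WAYS; way++)
        if (cache[set][way].valid && cache[set][way].tag == tag)
            return 1;   /* hit */
    return 0;           /* miss: a replacement policy would now pick a way */
}

int main(void)
{
    printf("sets = %u, hit(0x1000) = %d\n", NUM_SETS, lookup(0x1000));
    return 0;
}
```

The loop over `way` is where the 8-fold associativity shows up: all 8 candidate entries of one set have to be checked for a match.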
Your assumption about simultaneous access is a valid one as well; such parallel access is documented e.g. for ARM's Cortex-A15. However, whether and how those ways or banks of the cache can be accessed independently on Nehalem is anyone's guess. The Wikipedia diagram shows a 256-bit bus between the L1 data cache and the L2 cache. This could imply that four 64-bit words can be moved in parallel (4 * 64 = 256), but more likely only one memory load/store is actually serviced at any given time, and the slower L2 cache simply feeds the faster L1 cache half of a 64-byte line per transfer, so a full line arrives in what one could call a two-transfer burst.
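As a quick back-of-the-envelope check of that burst arithmetic (the 256-bit bus width is taken from the Wikipedia diagram, the 64-byte line size from above):

```c
#include <stdio.h>

int main(void)
{
    const unsigned bus_bits  = 256;      /* L1<->L2 bus width from the diagram */
    const unsigned word_bits = 64;       /* one 64-bit load/store              */
    const unsigned line_bits = 64 * 8;   /* one 64-byte cache line = 512 bits  */

    printf("64-bit words per bus transfer: %u\n", bus_bits / word_bits);  /* 4 */
    printf("bus transfers per cache line : %u\n", line_bits / bus_bits);  /* 2 */
    return 0;
}
```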
This assumption is supported by the fact that the System Architecture Manual, which can be found on Intel's site, lists in chapter 2.2.6 the later Sandy Bridge improvements, emphasizing "Internal bandwidth of two loads and one store each cycle." Thus CPUs before Sandy Bridge should support fewer concurrent loads/stores.
Note that there is a difference between "in flight" loads/stores and data actually being transmitted. "In flight" operations are those currently being executed; in the case of a load, that may include waiting for memory to yield data after all caches have reported misses. So you can have many loads going on in parallel while the data bus between any two caches is still used by only one of them at any given time. The Sandy Bridge improvement above actually widens that path so that two loads and one store can transmit data at the same time, which Nehalem (one "tock", i.e. one architecture generation before Sandy Bridge) could not do.
So your intuition is not correct on all accounts: the bank/way count and the possibility of parallel access hold up, but the amount of data that actually moves in parallel on Nehalem is more limited than you might expect.
Regarding your point about software optimizations: worry about this only if you are a low-level hardware/firmware developer. Otherwise just follow the high-level ideas: if you can, keep the working set of your innermost loop of intense operations small enough to fit into the L3 cache, and do not start more threads doing intense computation on local data than you have cores. If you do start to worry about such speed implications, compile/optimize your code with the matching CPU switches and control what other tasks run on the machine (even infrastructure services).
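To make the "keep the hot working set small" idea concrete, here is a minimal sketch of loop blocking; the matrix size and tile size are made-up placeholders to illustrate the technique, not tuned values for any particular CPU:

```c
#include <stdio.h>
#include <stdlib.h>

#define N     1024   /* matrix dimension, made-up illustration value          */
#define BLOCK 64     /* tile size, a placeholder you would tune, not a given  */

/* Transpose src into dst in BLOCK x BLOCK tiles, so the source tile and the
 * destination tile being worked on both stay cache-resident while in use.
 * A naive element-by-element transpose would walk one of the two matrices
 * with a stride of N doubles and evict lines long before they are reused.  */
static void transpose_blocked(double *dst, const double *src)
{
    for (size_t ib = 0; ib < N; ib += BLOCK)
        for (size_t jb = 0; jb < N; jb += BLOCK)
            for (size_t i = ib; i < ib + BLOCK; i++)
                for (size_t j = jb; j < jb + BLOCK; j++)
                    dst[j * (size_t)N + i] = src[i * (size_t)N + j];
}

int main(void)
{
    double *src = malloc((size_t)N * N * sizeof *src);
    double *dst = malloc((size_t)N * N * sizeof *dst);
    if (!src || !dst)
        return 1;
    for (size_t i = 0; i < (size_t)N * N; i++)
        src[i] = (double)i;
    transpose_blocked(dst, src);
    printf("dst[1] = %.0f (should equal src[N] = %.0f)\n", dst[1], src[N]);
    free(src);
    free(dst);
    return 0;
}
```

With these placeholder numbers, one source tile plus one destination tile is roughly 64 KByte of data, which comfortably fits into a 256 KByte L2 cache; the point is simply to keep the data you are about to reuse resident in some cache level.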
In summary: the "banks" you are asking about most likely correspond to the 8 ways of the set-associative cache, parallel access to them is plausible but the amount of data actually transferred per cycle on Nehalem is limited, and unless you are doing low-level work the bank count should not change how you structure your programs or data.