c++ c performance optimization memory

Struggling to understand data alignment


So basically I am struggling to understand data alignment. I don't understand why, on a 64-bit architecture for example, it is important to store a 4-byte datum at an address that is a multiple of 4 (0x0, 0x4, 0x8, 0xC). Does the CPU start each fetch at a multiple of the word size (which is 8 bytes here)? And why does a 2-byte datum have to be stored at address 0x0, 0x2, 0x4, 0x6, 0x8, 0xA, 0xC, or 0xE? The CPU could load a 2-byte datum in one clock even if it is stored at 0x1, so why should it have to be at one of those even addresses?

Plus, if the CPU cache line is, for example, 64 bytes, why should I care about data alignment as long as the data stays within one cache line, that is, between the addresses 0x...00 and 0x...40? It's confusing...


Solution

  • For an object of n bytes that is smaller than the word size used to access memory, aligning the object to a multiple of n bytes ensures it will not straddle words (provided n is a factor of the word size).

    Suppose a machine has an eight-byte memory interface: every aligned sequence of eight bytes can be read from memory or written to memory with a single transfer operation. So all eight bytes from 0 to 7 can be read from memory in one transfer, all eight bytes from 8 to 15 can be read from memory in one transfer, and so on. But reading just the two bytes 7 and 8 would require two transfers, because the machine architecture cannot read just any eight bytes in a transfer; it can only read a sequence of eight bytes starting at a multiple of eight.

    Now consider a four-byte object type, say int. When somebody declares an array of these, int a[7];, what happens when a starts at address 2? The object a[0] is in bytes 2, 3, 4, and 5. The object a[1] is in 6, 7, 8, and 9. And so on.

    a[0] can be read in a single memory transfer. The CPU can get bytes 0-7 in one transfer and take a[0] out of bytes 2-5. However, a[1] cannot be read in a single memory transfer, because the CPU cannot read bytes 6, 7, 8, and 9 in one transfer. It needs to issue one transfer for bytes 0-7 (to get bytes 6 and 7) and another for bytes 8-15 (to get bytes 8 and 9).
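    The counting above can be sketched in a few lines of C++. This is only a model of the eight-byte interface assumed in this answer, not a description of any real CPU; the name transfers_needed and its bus_width parameter are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: how many aligned bus_width-byte blocks an object of
// `size` bytes touches when it starts at byte address `addr`.
// Assumes the simplified model from the text: one transfer per aligned block.
constexpr std::size_t transfers_needed(std::uintptr_t addr,
                                       std::size_t size,
                                       std::size_t bus_width = 8) {
    const std::uintptr_t first_block = addr / bus_width;
    const std::uintptr_t last_block = (addr + size - 1) / bus_width;
    return static_cast<std::size_t>(last_block - first_block + 1);
}

// The array from the text, int a[7] starting at address 2 (4-byte int assumed):
static_assert(transfers_needed(2, 4) == 1, "a[0] in bytes 2-5 needs one transfer");
static_assert(transfers_needed(6, 4) == 2, "a[1] in bytes 6-9 needs two transfers");
// The two bytes 7 and 8 also straddle a block boundary:
static_assert(transfers_needed(7, 2) == 2, "bytes 7-8 need two transfers");
```

    Under the same model, any four-byte object starting at a multiple of four touches exactly one block.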

    When we require a four-byte object type to have four-byte alignment, we prevent this. Then the array a of int a[7]; could not start at byte 2. It would have to start at byte 0, 4, 8, 12, and so on. If, for example, it starts at byte 4, then a[0] is in bytes 4-7, which is inside the eight-byte set 0-7. a[1] is in 8-11, which is inside 8-15. a[2] is in 12-15, which is also inside 8-15. And so every element of a is inside one eight-byte set of aligned bytes, and memory access will be more efficient than if a were not four-byte aligned.
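    As a rough way to see this on a real compiler, the sketch below checks that every element of an int array lies inside one aligned eight-byte block. The helper names are hypothetical, the eight-byte block size is the assumption used throughout this answer, and the check relies on int being four bytes with four-byte alignment, which is typical but not guaranteed by the language.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if the `size` bytes starting at `p` lie inside
// one aligned block of `block` bytes (8 models the answer's memory interface).
inline bool fits_in_one_block(const void* p, std::size_t size,
                              std::size_t block = 8) {
    // Converting a pointer to an integer is implementation-defined; on
    // mainstream platforms it yields the numeric address.
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    return addr / block == (addr + size - 1) / block;
}

// Hypothetical helper: checks every element of an int array.
inline bool all_elements_fit(const int* a, std::size_t n,
                             std::size_t block = 8) {
    for (std::size_t i = 0; i < n; ++i) {
        if (!fits_in_one_block(&a[i], sizeof(int), block)) {
            return false;
        }
    }
    return true;
}
```

    For an ordinary int a[7]; the compiler places a at a multiple of alignof(int), so all_elements_fit(a, 7) is expected to return true on such platforms, whereas a four-byte range starting at offset 6 of an eight-byte-aligned buffer would fail the check.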