I am using this ALU block diagram as learning material: http://www.righto.com/2013/09/the-z-80-has-4-bit-alu-heres-how-it.html
I am not familiar with electronics. My current understanding is that a clock cycle is needed to move data from one register or latch to another register or latch, possibly through a network of logic gates.
So here is my understanding of what happens for an ADD:
I think the operations in cycle 3 are done in parallel because there are two 4-bit buses (for the high and low nibbles) and the register bus seems to be 8 bits wide.
Per the Z80 data sheet:
The PC is placed on the address bus at the beginning of the M1 cycle. One half clock cycle later the MREQ signal goes active. At this time the address to the memory has had time to stabilize so that the falling edge of MREQ can be used directly as a chip enable clock to dynamic memories. The RD line also goes active to indicate that the memory read data should be enabled onto the CPU data bus. The CPU samples the data from the memory on the data bus with the rising edge of the clock of state T3 and this same edge is used by the CPU to turn off the RD and MREQ signals. Thus, the data has already been sampled by the CPU before the RD signal becomes inactive. Clock state T3 and T4 of a fetch cycle are used to refresh dynamic memories. The CPU uses this time to decode and execute the fetched instruction so that no other operation could be performed at this time.
So the M1 cycle appears to be mostly about memory interfacing to read the opcode rather than actually doing the addition; decode and execution occur entirely within clock states T3 and T4. Given that the Z80 has a 4-bit ALU, an 8-bit addition takes two passes through the ALU (low nibble first, then high nibble plus the carry), which likely explains the use of two cycles.
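To make the two-pass idea concrete, here is a minimal Python sketch of an 8-bit addition built from a 4-bit adder. It only models the arithmetic, not the real Z80 timing or latches, and the names `alu4` and `add8` are made up for illustration.

```python
def alu4(a, b, carry_in):
    """Hypothetical 4-bit adder: returns (4-bit sum, carry out)."""
    total = (a & 0xF) + (b & 0xF) + (carry_in & 1)
    return total & 0xF, total >> 4


def add8(a, b, carry_in=0):
    """Add two 8-bit values using two passes through the 4-bit adder."""
    # Pass 1: low nibbles. The carry out of this pass is the half-carry
    # (what the Z80 exposes as the H flag).
    low, half_carry = alu4(a, b, carry_in)
    # Pass 2: high nibbles, plus the half-carry from pass 1.
    high, carry_out = alu4(a >> 4, b >> 4, half_carry)
    return (high << 4) | low, half_carry, carry_out


result, half_carry, carry = add8(0x3A, 0xC8)
print(hex(result), half_carry, carry)  # prints: 0x2 1 1
```

The second pass cannot start until the first has produced its carry, which is why the two nibble additions have to happen one after the other rather than in parallel.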