On a general purpose CPU parallel processing is performed splitting calculation / problem into sub-problems, distributing them and running them in parallel on a number of cores on one or several sockets / servers.
What is the execution "flow" on a GPU from loading data to sending back results to the CPU ? What are key differences between execution on a GPU and execution on a CPU ?
Should we see a GPU as a "kind of CPU with a higher (huge) number smaller cores" or are there additionnal differences in nature ?
The fundamental difference in parallel processing between a CPU and a GPU is that CPUs are MIMD (Multi-Instruction-Multi-Data), while GPUs are SIMD (Single-Instruction-Multi-Data). In a multicore CPU, each core fetches its instructions and data independently of the others, whereas in a GPU there is only one instruction stream for a group of cores (typically 32 or 64). While there is only one instruction stream for the 32/64 cores, each of them is working on different data elements (typically located together in memory; more below). Such SIMD execution means that the GPU cores operate in a lock-stepped fashion.
For the above mentioned reason, a GPU can't be viewed as a "kind of CPU with a higher (huge) number smaller cores".
In order to support SIMD execution (also sometimes called wide-execution), we need wide fetch of input data. For a 32-wide execution, we fetch a contiguous 4B x 32 block = 128B that is consumed (typically) entirely by a 32-wide pipe. Contrast this to a MIMD multicore, where each of 32 CPU cores would fetch a separate instruction and then load from 32 different cachelines. This SIMD nature of (wide-) instruction/data fetch results in huge power savings compared to MIMD. As a result, for the same power budget, we can put more cores on a GPU (=> more HW parallelism) than a multicore-CPU.
The SIMD nature of GPUs is driven by applications that do exactly the same operation over very many input elements (e.g.; Image processing where we apply a filter on every pixel of say a 1024x768 image) so that wide instruction/data fetch works well. At the same time, applications where each core's computation is different (e.g., take if() when input data is zero, or else() if input data is 1) or each core needs to fetch data from a different page fail to take advantage of the SIMD nature of GPUs.
A partially related fact is that GPUs support applications (e.g., images/videos) that are streaming (almost zero data reuse) and have massive data-parallelism. Streaming means that we don't need huge caches like CPUs, and massive data-parallelism almost entirely cuts the need for HW coherence mechanisms.