Suppose I have a single C/C++ app running on the host. There are a few threads running on the host CPU and 50 threads running on the Xeon Phi cores.
How can I make sure that each of these 50 threads runs on its own Xeon Phi core and is never evicted from that core's cache (given the code is small enough)?
Could someone please outline a very general idea of how to do this, and which tool/API would be most suitable (for C/C++ code)?
What is the fastest way to exchange data between the host thread-aggregator and the 50 Phi threads?
Given that the actual parallelism will be very limited, this application is going to be more like a plain 51-thread application with some basic multithreaded data sync.
Can I use a conventional C/C++ compiler to create an app like this?
You have raised several questions:
Yes, you can take a conventional C program and compile it with the regular Intel C/C++/Fortran compilers (known as Intel Composer XE) to generate a binary that runs on the Intel Xeon Phi coprocessor in "native"/"symmetric" or "offload" mode. In the simplest case, you just recompile your C/C++ program with -mmic and run it "natively" on the Phi as is.
Which API to use? Use the OpenMP 4.0 standard or the Intel Cilk Plus programming model (in practice, a set of pragmas or keywords for C/C++). OpenCL, Intel TBB and likely OpenACC are also possible, but OpenMP and Cilk Plus can express threading, vectorization and offload (the three things essential for Xeon Phi programming) without refactoring or rewriting a conventional C/C++/Fortran program.
Thread pinning: this can be achieved via OpenMP affinity (see more details on MIC_KMP_AFFINITY below) or the Intel TBB affinity facilities.
The fastest way to exchange data between the host and the target Phi is to avoid any exchange, for example by using the MPI symmetric approach. However, you seem to be asking about the "offload" programming model specifically, so asynchronous offload will give you the best performance. Synchronous offload is simpler to program, but worse in terms of achievable performance.
Overall, your questions are quite general, so I would recommend starting from the very beginning, i.e. by looking at the following ~10-page Dr. Dobb's manual or the given Intel intro document.
Thread pinning is a more advanced topic, and it also seems to be the one most interesting to you, so I will explain it in more detail: