I am aware that the Intel Xeon Phi coprocessor SE10X has 61 cores and that it is suggested to use only 60 of them, since one core is used by the offload daemon. Since the Intel Xeon Phi coprocessor 5110P has 60 cores, is it likewise suggested to use only 59 cores?
I evaluated the performance of my test code on an Intel Xeon Phi 7120P card. I observed that performance was best when the number of threads was a multiple of (number of cores - 1). This is because one of the cores is busy running the Linux micro-OS services.
In general:
Number of threads to create >= K * T * (N - 1)
K = a positive integer (2 works fine)
T = number of hardware thread contexts per core (4 in my case)
N = number of cores present on the hardware
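
As a rough illustration of the rule above, here is a minimal Python sketch that evaluates K * T * (N - 1) for a given card. The function name min_threads and its parameter names are made up for this example, and the 61-core figure for the 7120P is an assumption on my part; the 60-core figure for the 5110P comes from the question.

```python
def min_threads(cores: int, contexts_per_core: int = 4, k: int = 2) -> int:
    """Return K * T * (N - 1), the suggested lower bound on thread count.

    cores             -- N, physical cores on the card
    contexts_per_core -- T, hardware thread contexts per core (4 in the answer)
    k                 -- K, positive integer multiplier (2 worked fine in the answer)
    """
    if cores < 2 or contexts_per_core < 1 or k < 1:
        raise ValueError("need cores >= 2, contexts_per_core >= 1, k >= 1")
    # One core is left free for the card's Linux micro-OS services.
    return k * contexts_per_core * (cores - 1)


if __name__ == "__main__":
    # Core counts: 7120P = 61 (assumed), 5110P = 60 (from the question).
    for name, cores in (("7120P", 61), ("5110P", 60)):
        print(name, min_threads(cores))
```

With the defaults this gives 480 threads for a 61-core card and 472 for a 60-core card. Treat these as starting points to benchmark around, not guaranteed optima.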