I have written a small test program in which I try to use the Windows API call SetThreadAffinityMask to lock the thread to a single NUMA node. I retrieve the CPU bitmask of a node with the GetNumaNodeProcessorMask API call, then pass that bitmask to SetThreadAffinityMask along with the thread handle returned by GetCurrentThread. Here is a greatly simplified version of my code:
// Inside a function called from a boost::thread
unsigned long long nodeMask = 0;
GetNumaNodeProcessorMask(1, &nodeMask);
HANDLE thread = GetCurrentThread();
SetThreadAffinityMask(thread, nodeMask);
DoWork(); // make-work function
I of course check whether the API calls return 0 in my code, and I've also printed out the NUMA node mask and it is exactly what I would expect. I've also followed advice given elsewhere and printed out the mask returned by a second identical call to SetThreadAffinityMask, and it matches the node mask.
However, from watching the resource monitor while the DoWork function executes, the work is split among all cores instead of only those the thread is ostensibly bound to. Are there any trip-ups I may have missed when using SetThreadAffinityMask? I am running Windows 7 Professional 64-bit, and the DoWork function contains a loop parallelized with OpenMP which performs operations on the elements of three very large arrays (which combined still fit in a single node's memory).
Edit: To expand on the answer given by David Schwartz, on Windows any threads spawned with OpenMP do NOT inherit the affinity of the thread which spawned them. The problem lies with that, not SetThreadAffinityMask.
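One possible workaround, given that the OpenMP worker threads do not pick up the caller's affinity, is to have each worker set its own affinity at the start of the parallel region. The following is only a sketch under assumptions: PinCurrentThreadToNode is a hypothetical helper, and this DoWork is a stand-in (element-wise add plus a sum) for the real make-work function. The helper is a no-op on non-Windows builds so the snippet still compiles there.

```cpp
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: pins the calling thread to the processors of `node`.
// Returns true on success; on non-Windows builds it does nothing and returns false.
bool PinCurrentThreadToNode(unsigned char node) {
#ifdef _WIN32
    ULONGLONG nodeMask = 0;
    if (!GetNumaNodeProcessorMask(node, &nodeMask) || nodeMask == 0)
        return false;
    // SetThreadAffinityMask returns the previous mask, or 0 on failure.
    return SetThreadAffinityMask(GetCurrentThread(),
                                 static_cast<DWORD_PTR>(nodeMask)) != 0;
#else
    (void)node;
    return false;
#endif
}

// Stand-in for DoWork (an assumption, not the original function):
// c[i] = a[i] + b[i], returning the sum of all c[i].
double DoWork(const std::vector<double>& a, const std::vector<double>& b,
              std::vector<double>& c, unsigned char node) {
    const long long n = static_cast<long long>(c.size());
    double sum = 0.0;
    #pragma omp parallel reduction(+:sum)
    {
        // Runs once per OpenMP thread, so every worker pins itself
        // rather than relying on the affinity of the spawning thread.
        PinCurrentThreadToNode(node);

        #pragma omp for
        for (long long i = 0; i < n; ++i) {
            c[i] = a[i] + b[i];
            sum += c[i];
        }
    }
    return sum;
}
```

Calling it as DoWork(a, b, c, 1) from the boost::thread would then bind the whole OpenMP team to node 1, assuming the runtime reuses the same worker threads for the loop.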
Did you confirm that the particular thread whose affinity mask you set was actually running on a core in another NUMA node? If not, it's working as intended. You are setting the processor mask on one thread and then observing the behavior of a group of threads.
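A rough way to check this per thread, rather than from the resource monitor, is to ask Windows which processor the calling thread is on and test that bit against the node mask. This is a sketch only: IsProcessorInMask and CurrentThreadIsOnNode are hypothetical names, and it assumes a single processor group (at most 64 logical processors), since GetCurrentProcessorNumber reports the number within the current group.

```cpp
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
#endif

// True if bit `processor` is set in `mask` (hypothetical helper).
bool IsProcessorInMask(unsigned processor, unsigned long long mask) {
    return processor < 64 && ((mask >> processor) & 1ULL) != 0;
}

// Reports whether the calling thread is currently running on a processor
// that belongs to `nodeMask`. Always returns false on non-Windows builds.
bool CurrentThreadIsOnNode(unsigned long long nodeMask) {
#ifdef _WIN32
    const DWORD cpu = GetCurrentProcessorNumber();
    const bool inside = IsProcessorInMask(cpu, nodeMask);
    std::printf("thread %lu on processor %lu: %s the node mask\n",
                static_cast<unsigned long>(GetCurrentThreadId()),
                static_cast<unsigned long>(cpu),
                inside ? "inside" : "OUTSIDE");
    return inside;
#else
    (void)nodeMask;
    return false;
#endif
}
```

Calling CurrentThreadIsOnNode(nodeMask) from the thread that called SetThreadAffinityMask, and again from inside the parallel loop, should show whether it is the pinned thread or the other worker threads that end up outside the node.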