How can I tell whether my CPU shares its vector registers among the cores, or whether each core has its own private set?
Where can I find this documented?
I hope to use multi-threading and SIMD to optimise my program's floating-point computation. Will they conflict with each other?
Judging from your comments on another answer, it appears you're referring to the Xeon Phi. If I understand correctly, you're asking whether each thread has its own private set of vector registers? The answer is yes!
To clarify a little further: the Xeon Phi has roughly 50 cores, each with its own register file. Internally, each core supports 4-way simultaneous multithreading (SMT), so the core's execution resources are shared among its hardware threads, but the architectural state is replicated per thread: each core holds 4 × 32 512-bit logical vector registers (32 per hardware thread). If you choose to use SMT, there won't be any conflicts over the registers themselves, although the threads on a core do compete for the shared vector functional units. The idea is that the core can switch to another of its threads when one is waiting on a cache miss or something similar.
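To make the "threads and SIMD compose without register conflicts" point concrete, here is a minimal sketch (my own illustration, not Knights Corner specific: the first-generation Xeon Phi uses its own 512-bit instruction set rather than AVX-512, so this assumes an AVX-512-capable compiler and CPU, built with something like gcc -O2 -fopenmp -mavx512f). Each OpenMP thread works on its own slice of the arrays, and the ZMM registers touched by its intrinsics belong to that hardware thread's own architectural state, so the threads never clobber each other's vector registers:

```c
#include <stddef.h>
#include <immintrin.h>   /* AVX-512 intrinsics */

/* y = a*x + y; hypothetical helper, assumes n is a multiple of 16 for brevity. */
void saxpy_simd_mt(float *restrict y, const float *restrict x, float a, size_t n)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; i += 16) {
        /* 16 floats fit in one 512-bit ZMM register. */
        __m512 vx = _mm512_loadu_ps(&x[i]);
        __m512 vy = _mm512_loadu_ps(&y[i]);
        vy = _mm512_fmadd_ps(_mm512_set1_ps(a), vx, vy);
        _mm512_storeu_ps(&y[i], vy);
    }
}
```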
Edit to answer your question: What is SMT?
The Xeon Phi has roughly 50 physical in-order cores. Each core has its own L1 instruction cache, L1 data cache, and two functional units. In a traditional design, the core would fetch two adjacent instructions of one thread from the i-cache and try to execute them on the two available functional units. A common problem that made this inefficient arose on load instructions whose data was not present in the d-cache: the processor would struggle to find instructions to send to its functional units, because the subsequent instructions very often depended on the data being loaded.
SMT is a technique to help alleviate this. It gives each core just enough extra structure to manage additional threads efficiently. In the Xeon Phi, the logical register file and program counter are replicated four times, while most other structures, such as the caches and functional units, remain more or less the same. Now, when there is a d-cache miss, the core starts fetching another thread's instructions and sends them to the functional units, where they operate on that thread's own set of registers. This way the core can find work to do while waiting on main memory, without the high overhead of a full context switch.
To summarise: you might see around 200 logical processors on your Xeon Phi, but in reality only about 50 of them (one per physical core) are executing in parallel at any given instant; the rest are hardware threads that the cores switch between very quickly.
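As a quick sanity check (again my own sketch, assuming an OpenMP toolchain), you can ask the runtime how many logical processors it sees; on a ~50-core, 4-way-SMT Xeon Phi the reported number is the hardware-thread count (around 200), not the physical core count:

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Reports logical processors (hardware threads), not physical cores. */
    printf("logical processors visible to OpenMP: %d\n", omp_get_num_procs());
    return 0;
}
```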