Out of Order Execution and Memory Fences

I know that modern CPUs can execute out of order, However they always retire the results in-order, as described by wikipedia.

"Out of Oder processors fill these "slots" in time with other instructions that are ready, then re-order the results at the end to make it appear that the instructions were processed as normal."

Now memory fences are said to be required when using multicore platforms, because owing to Out of Order execution, wrong value of x can be printed here.

Processor #1:
 while f == 0
  ;
 print x; // x might not be 42 here

Processor #2:
 x = 42;
 // Memory fence required here
 f = 1

Now my question is, since Out of Order Processors (Cores in case of MultiCore Processors I assume) always retire the results In-Order, then what is the necessity of Memory fences. Don't the cores of a multicore processor sees results retired from other cores only or they also see results which are in-flight?

I mean in the example I gave above, when Processor 2 will eventually retire the results, the result of x should come before f, right? I know that during out of order execution it might have modified f before x but it must have not retired it before x, right?

Now with In-Order retiring of results and cache coherence mechanism in place, why would you ever need memory fences in x86?

Solution

This tutorial explains the issues: http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf

FWIW, where memory ordering issues happen on modern x86 processors, the reason is that while the x86 memory consistency model offers quite strong consistency, explicit barriers are needed to handle read-after-write consistency. This is due to something called the "store buffer".

That is, x86 is sequentially consistent (nice and easy to reason about) except that loads may be reordered wrt earlier stores. That is, if the processor executes the sequence

store x
load y

then on the processor bus this may be seen as

load y
store x

The reason for this behavior is the afore-mentioned store buffer, which is a small buffer for writes before they go out on the system bus. Load latency is, OTOH, a critical issue for performance, and hence loads are permitted to "jump the queue".

See Section 8.2 in http://download.intel.com/design/processor/manuals/253668.pdf