What does multicore assembly language look like?

Once upon a time, to write x86 assembler, for example, you would have instructions stating "load the EDX register with the value 5", "increment the EDX" register, etc.

With modern CPUs that have 4 cores (or even more), at the machine code level does it just look like there are 4 separate CPUs (i.e. are there just 4 distinct "EDX" registers) ? If so, when you say "increment the EDX register", what determines which CPU's EDX register is incremented? Is there a "CPU context" or "thread" concept in x86 assembler now?

How does communication/synchronization between the cores work?

If you were writing an operating system, what mechanism is exposed via hardware to allow you to schedule execution on different cores? Is it some special priviledged instruction(s)?

If you were writing an optimizing compiler/bytecode VM for a multicore CPU, what would you need to know specifically about, say, x86 to make it generate code that runs efficiently across all the cores?

What changes have been made to x86 machine code to support multi-core functionality?

Solution

This isn't a direct answer to the question, but it's an answer to a question that appears in the comments. Essentially, the question is what support the hardware gives to multi-core operation, the ability to run multiple software threads at truly the same time, without software context-switching between them. (Sometimes called an SMP system).

Nicholas Flynt had it right, at least regarding x86. In a multi-core environment (Hyper-threading, multi-core or multi-processor), the Bootstrap core (usually hardware-thread (aka logical core) 0 in core 0 in processor 0) starts up fetching code from address 0xfffffff0. All the other cores (hardware threads) start up in a special sleep state called Wait-for-SIPI. As part of its initialization, the primary core sends a special inter-processor-interrupt (IPI) over the APIC called a SIPI (Startup IPI) to each core that is in WFS. The SIPI contains the address from which that core should start fetching code.

This mechanism allows each core to execute code from a different address. All that's needed is software support for each hardware core to set up its own tables and messaging queues.

The OS uses those to do the actual multi-threaded scheduling of software tasks. (A normal OS only has to bring up other cores once, at bootup, unless you're hot-plugging CPUs, e.g. in a virtual machine. This is separate from starting or migrating software threads onto those cores. Each core is running the kernel, which spends its time calling a sleep function to wait for an interrupt if there isn't anything else for it to be doing.)

As far as the actual assembly is concerned, as Nicholas wrote, there's no difference between the assemblies for a single threaded or multi threaded application. Each core has its own register set (execution context), so writing:

mov edx, 0

will only update EDX for the currently running thread. There's no way to modify EDX on another processor using a single assembly instruction. You need some sort of system call to ask the OS to tell another thread to run code that will update its own EDX.