How do the C and C++ compilers distinguish unary operators?

If I have the minus sign next to a variable like:

int a;
int b = -a; // UNARY OPERATOR
b = - a; // UNARY OPERATOR

The minus before the 'a' is treated as a unary operator, and the negated value of a is taken. However, in the following, the minus is treated as subtraction:

int a, b;
a -b; // SUBTRACTION
a - b; // SUBTRACTION

So from this I deduce that:

Whether or not the operator is separated from the operand by a space is irrelevant.
Whether the minus is a subtraction or a unary operator depends on whether an operand precedes it, so the decision is highly contextual.

Can someone give a simple summary of the rules of how the compiler decides this?

Solution

Parsing and formal-language theory fill multiple computer science courses, so a Stack Overflow answer can offer only a brief introduction. Compilers may use, for example, an LALR parser.

To give a brief view of this, there is essentially a list of patterns of things that may appear in the language being compiled. For example, in C, an additive-expression is one of:

multiplicative-expression
additive-expression + multiplicative-expression
additive-expression - multiplicative-expression

Each of the items in a rule may be the name of another pattern (a nonterminal), such as multiplicative-expression, or a terminal symbol, such as + or int.

As the compiler reads the source code, it tries to match the text it sees with patterns. When it finds a proper match, it reduces the items that match to the symbol whose pattern they matched. So a - b will be reduced to an additive-expression, and it will never match a unary-expression pattern, whereas - b with no additive-expression before it will be reduced to a unary-expression and will never match the subtraction pattern.
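To make the matching above concrete, here is a minimal sketch of a hand-written recursive-descent evaluator. It is not an LALR parser and not how any particular compiler is implemented; it only illustrates the same idea. The toy grammar, the function names (parse_additive, parse_unary, parse_primary, evaluate), and the lack of error handling are all assumptions made for illustration, and the toy has no -- token, so adjacent minus signs are read as two separate operators.

```c
#include <ctype.h>

/* Toy grammar (a hypothetical, much-reduced subset of C's expression rules):
 *   additive := unary { ('+' | '-') unary }
 *   unary    := '-' unary | primary
 *   primary  := integer | '(' additive ')'
 * No error handling: malformed input simply stops the parse early. */

static long parse_additive(const char **p);

static const char *skip_spaces(const char *p)
{
    while (isspace((unsigned char)*p))
        p++;
    return p;
}

static long parse_primary(const char **p)
{
    long value = 0;

    *p = skip_spaces(*p);
    if (**p == '(') {
        (*p)++;                          /* consume '(' */
        value = parse_additive(p);
        *p = skip_spaces(*p);
        if (**p == ')')
            (*p)++;                      /* consume ')' */
        return value;
    }
    while (isdigit((unsigned char)**p)) {
        value = value * 10 + (**p - '0');
        (*p)++;
    }
    return value;
}

static long parse_unary(const char **p)
{
    *p = skip_spaces(*p);
    if (**p == '-') {
        /* An operand is expected here, so this '-' can only be unary. */
        (*p)++;
        return -parse_unary(p);
    }
    return parse_primary(p);
}

static long parse_additive(const char **p)
{
    long value = parse_unary(p);         /* left operand */

    for (;;) {
        *p = skip_spaces(*p);
        if (**p == '-') {
            /* An operand was just parsed, so this '-' can only be binary. */
            (*p)++;
            value -= parse_unary(p);
        } else if (**p == '+') {
            (*p)++;
            value += parse_unary(p);
        } else {
            return value;
        }
    }
}

/* Evaluate an expression such as "5 - -2" under the toy grammar. */
long evaluate(const char *text)
{
    return parse_additive(&text);
}
```

Calling evaluate("5 -2") and evaluate("5 - 2") gives the same result, because whitespace is skipped before each symbol. In evaluate("5 - -2"), the first minus is consumed in parse_additive as subtraction and the second in parse_unary as negation. The only thing distinguishing the two is which rule is being matched when the minus is seen.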

Again, that is a simplified view of part of the process. Before the compiler even gets to processing tokens in the grammar, it performs a lexical analysis that clusters characters into grammatical tokens. At each point the lexer takes the longest run of characters that can form a token. So turning int foo; into “keyword int” and “identifier foo” and turning a---b into “identifier a”, “operator --”, “operator -”, and “identifier b” happens at an earlier/lower level of analysis. And there is also, conceptually at least, a separate preprocessor level.
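The a---b case can be reproduced with a very small lexer. The following is only a sketch under stated assumptions: it knows identifiers, - and -- and nothing else, and the names token_kind, TOK_IDENT, TOK_MINUS, TOK_DECREMENT and tokenize are hypothetical, not taken from any real compiler.

```c
#include <ctype.h>

enum token_kind { TOK_IDENT, TOK_MINUS, TOK_DECREMENT };

/* Toy lexer (hypothetical): recognizes identifiers, "-" and "--" only.
 * Stores up to max token kinds in out and returns the number of tokens
 * found, or -1 if it meets a character it does not know. */
int tokenize(const char *src, enum token_kind *out, int max)
{
    int count = 0;

    while (*src != '\0') {
        enum token_kind kind;

        if (isspace((unsigned char)*src)) {
            src++;                       /* whitespace only separates tokens */
            continue;
        }
        if (isalpha((unsigned char)*src) || *src == '_') {
            /* Identifier: take as many identifier characters as possible. */
            while (isalnum((unsigned char)*src) || *src == '_')
                src++;
            kind = TOK_IDENT;
        } else if (*src == '-') {
            /* Longest match: prefer "--" over "-" when both are possible. */
            src++;
            if (*src == '-') {
                src++;
                kind = TOK_DECREMENT;
            } else {
                kind = TOK_MINUS;
            }
        } else {
            return -1;                   /* character outside the toy language */
        }
        if (count < max)
            out[count] = kind;
        count++;
    }
    return count;
}
```

For "a---b" this sketch produces identifier, --, -, identifier, which is why real C reads that expression as (a--) - b. For "a- -b" it produces two separate - tokens, because the space ends the first token. So a space between an operator and its operand does not matter, but a space in the middle of what could be a longer operator does.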

Further, the compiler does not actually have the patterns listed, and it is not directly matching up input sequences to the patterns. Instead, the grammar rules are used to construct a conceptual “machine” that implements that pattern matching. That machine is called a parser, and it is built into the compiler. Every new input token changes the state of the machine, and the machine is designed in such a way that the states correspond to recognition of the patterns.
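As a rough illustration of "every token changes the state of the machine", the sketch below tracks a single bit of state: whether an operand is expected next. A real parser automaton has many more states, so treat this as a toy. The names minus_counts and classify_minus are hypothetical, and the input is assumed to contain only identifiers, numbers, parentheses, and the operators + - * / =, with no -- tokens.

```c
#include <ctype.h>

struct minus_counts {
    int unary;    /* minus signs seen where an operand was expected */
    int binary;   /* minus signs seen right after a complete operand */
};

/* Classify each '-' in src using one bit of parser-like state. */
struct minus_counts classify_minus(const char *src)
{
    struct minus_counts counts = {0, 0};
    int expect_operand = 1;   /* at the start, an operand must come next */

    for (const char *p = src; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;

        if (isspace(c))
            continue;                    /* whitespace never changes the state */

        if (c == '-') {
            if (expect_operand) {
                counts.unary++;          /* still waiting for the operand */
            } else {
                counts.binary++;
                expect_operand = 1;      /* right-hand operand comes next */
            }
        } else if (c == '+' || c == '*' || c == '/' || c == '=' || c == '(') {
            expect_operand = 1;          /* these are all followed by an operand */
        } else {
            /* Identifier or number character, or ')': an operand has been seen. */
            expect_operand = 0;
        }
    }
    return counts;
}
```

With this sketch, "a -b" and "a - b" both report one binary minus, while "b = - a" reports one unary minus. The character is the same in every case; only the state the machine is in when it arrives differs.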

There is formal mathematics that shows how formal grammars of this sort can be used to construct “machines” (mathematical models of computing) that perform the parsing. To fully understand this, one should take courses in discrete mathematics, finite state machines and automata, formal languages (parsing theory), and compiler construction.