It is often handy to have a canonical representation of a language (in my case they are usually domain specific languages); however, I believe there are strict limits on the expressiveness of the languages involved that determine whether a canonical form can be determined and/or created for an arbitrary program in that language. Unfortunately, I've been unable to find the references that I (vaguely) recall reading about this in.
On one hand, it seems reasonable that creating a canonical representation of a language is of comparable complexity to many hard graph problems (eg: graph isomorphism), but on the other hand, iirc, compilers like gcc, yhc, and ghc use intermediate representations to generate outputs in various formats (assembly, javascript, etc.), so this is at least in some forms, a solved problem.
When is it possible to determine / generate a canonical form for a given language? (How expressive can that language be, and how does language expressiveness impact the utility of the canonical forms?) Please provide references or proofs if at all possible.
Edit: For example, a Regular Language (eg: the 'pure' form of regular expressions) can not express many of the same things that a Turing-complete language can. In other words, you can't write a web server in a regular language, but you can with lambda calculus). My question is about the theoretical possibilities, and does have a specific answer relating to complexity theory. If I have a DSL that needs to be transmitted to another system, it will often be beneficial to generate a canonical form of that code before transmitting it, since that will decouple the independent representations used by the two different systems. However, if it is P-Space complete, or NP-Complete to translate a Turing-complete language into a canonical form, then you shouldn't waste time trying to build a canonical form -- either find another way to do it, or reduce the language complexity to something that can be canonicalized in polynomial time.
By "canonical representation" I assume you mean the following: Call programs P and Q equivalent if they "do the same thing" on the same inputs. "Doing the same thing" means that the programs have the same output, and either both programs halt after a finite time or both enter an infinite loop. This equivalence relation defines equivalence classes in the set of all programs. The "canonical representation" of a program P is a program P' belonging to the same equivalence class, and you require that all members of the same equivalence class have the same canonical representation.
For Turing-complete languages, a Turing-computable canonical representation would enable you to solve the Halting Problem as follows: first write a program consisting of an infinite loop and find its canonical representation Q. Then for any input program P, first transform it mechanically into a program P0 that does the same thing except that it produces no output, and then find the canonical representation P0' of this program. If the result is Q, you know that P0 does not halt, and therefore neither does P. Otherwise, P0 halts, and so does P.
For even more fun, read some of Gregory Chaitin's work on what he calls "elegant" programs.