I guess this is more of a math question than an OpenGL one, but anyway: if the whole purpose of the perspective divide is to get usable x and y coordinates, why bother dividing z by w as well? And how do I get w in the first place?
Actually, the explanation has much more to do with the limitations of the depth buffer than with math.
At its simplest, "the depth buffer is a texture in which each on-screen pixel is assigned a grayscale value depending on its distance from the camera. This allows visual effects to easily alter with distance." Source
More accurately, a depth buffer is a texture containing the value of z/w for each fragment, where:

- z is the fragment's depth after the projection transform (0 at the near clipping plane and f at the far clipping plane, in the convention used below), and
- w is the fragment's eye-space distance from the camera.
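As a rough sketch of where those two values come from: the projection matrix produces both of them, which also answers the "how do I get w" part of the question. The snippet below is a hypothetical helper, not gluPerspective itself, and it uses the 0-to-1 depth convention implied by this answer's numbers; gluPerspective actually sets w to the negated eye-space z and maps z/w to [-1, 1], which the viewport transform then remaps to [0, 1], but the idea is the same.

```c
#include <stdio.h>

/* Hypothetical sketch of a perspective projection's last two rows,
 * in the 0-to-1 depth convention used in this answer: z runs from
 * 0 at the near plane to f at the far plane, and w is simply the
 * fragment's eye-space distance from the camera. */
void project_depth(double zEye, double n, double f,
                   double *z, double *w)
{
    *z = (zEye - n) * f / (f - n); /* third row of the projection matrix  */
    *w = zEye;                     /* fourth row: eye-space depth, as-is   */
}

int main(void)
{
    double z, w;
    project_depth(10.0, 1.0, 1000.0, &z, &w); /* arbitrary example values */
    printf("z = %f, w = %f, z/w = %f\n", z, w, z / w);
    return 0;
}
```

With n = 1 and f = 1000, a point only 10 units from the camera already comes out at z/w ≈ 0.9009.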
In the following diagram, which illustrates the relationship between z, w, and z/w, n is equal to the zNear parameter passed to gluPerspective (or an equivalent function), and f is equal to the zFar parameter passed to the same function.
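If the diagram doesn't render for you, this small sketch (same convention as above, with the arbitrary example values n = 1 and f = 1000) tabulates the relationship it illustrates:

```c
#include <stdio.h>

int main(void)
{
    const double n = 1.0, f = 1000.0;
    /* Sample a few eye-space distances between the near and far
     * planes and print z, w, and z/w for each. */
    const double dist[] = { 1.0, 2.0, 10.0, 100.0, 500.5, 1000.0 };
    int i;
    for (i = 0; i < (int)(sizeof dist / sizeof dist[0]); ++i) {
        double z = (dist[i] - n) * f / (f - n);
        double w = dist[i];
        printf("zEye = %7.1f  ->  z = %8.3f, w = %8.3f, z/w = %.6f\n",
               dist[i], z, w, z / w);
    }
    return 0;
}
```

Note how quickly z/w climbs: it has already passed 0.9 a mere 10 units from the camera, and reaches about 0.999 at the halfway point between the two planes.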
At a glance, this system looks unintuitive. But as a result, z/w is always a floating-point value between 0 and 1 (0/n at the near plane and f/f at the far plane), and can therefore be represented as a single channel of a texture.
A second important note: the depth buffer is nonlinear. An object exactly halfway between the near and far clipping planes does not land anywhere near a value of 0.5 in the depth buffer; as shown above, it corresponds to a value of roughly 0.999. Depending on your use case, this could be good or bad: you may want the depth buffer to be more detailed close up (which it is), or to offer uniform precision throughout its range (which it doesn't).
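To put a number on "more detailed close up", you can invert the formula to find which eye-space distance lands at a given depth value. With the same example values (n = 1, f = 1000), the entire lower half of the depth range covers only about the first two units in front of the camera. A hypothetical sketch:

```c
#include <stdio.h>

/* Invert d = (zEye - n) * f / (zEye * (f - n)) to recover the
 * eye-space distance that maps to a given depth value d. */
double eye_distance(double d, double n, double f)
{
    return f * n / (f - d * (f - n));
}

int main(void)
{
    const double n = 1.0, f = 1000.0;
    printf("depth 0.5 is only %.3f units from the camera\n",
           eye_distance(0.5, n, f)); /* prints ~1.998 */
    return 0;
}
```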
TL;DR: the depth buffer stores z/w rather than raw z because z/w conveniently maps every visible depth into the 0-to-1 range of a single texture channel, and w itself comes from the projection matrix, which copies the fragment's eye-space depth into it. The price is nonlinear precision: lots of resolution near the camera, very little far away.