I understand that in order to project a point in the camera coordinate system (which is 3D) onto the image coordinate system (which is 2D), we need to use homogeneous coordinates for the image coordinates, so that we can use a linear matrix multiplication. But why do we also need to change the camera coordinates to homogeneous form? Specifically, what I see in a lot of notes is a perspective projection like this:

    [ux]   [fx  0  ox  0]   [x]
    [uy] ~ [ 0  fy  oy  0] * [y]
    [ 1]   [ 0  0   1   0]   [z]
                             [1]
(here, ux, uy are pixel coordinates, x,y,z are camera coordinates)
Sure, if you know for sure that the matrix is as in your example, and that the 3D homogeneous coordinates are (xc, yc, zc, 1), then obviously the result of your (3×3)×(3×1) matrix multiplication is the same as the (3×4)×(4×1) one.
But generally speaking, the 4 homogeneous coordinates of a 3D point in the camera system may not have the form (xc, yc, zc, 1).
For example, they can come from another projective transform, and not all projective transforms leave 1 as the 4th component.
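Here is a minimal numerical sketch of that equivalence (numpy; the intrinsics fx, fy, ox, oy and the test point are arbitrary example values, not anything from your notes): when the 4th homogeneous component is exactly 1, the 3×3 and 3×4 forms agree.

```python
import numpy as np

fx, fy, ox, oy = 800.0, 800.0, 320.0, 240.0   # arbitrary example intrinsics
K = np.array([[fx, 0., ox],
              [0., fy, oy],
              [0., 0., 1.]])
P = np.hstack([K, np.zeros((3, 1))])          # the 3x4 matrix [K | 0]

p3 = np.array([0.5, -0.2, 2.0])               # camera-frame point (xc, yc, zc)
p4 = np.append(p3, 1.0)                       # same point, homogeneous, w = 1

print(K @ p3)   # -> [1040. 320. 2.]
print(P @ p4)   # -> identical, because the last column of P is 0 and w is 1
```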
A lot of tutorials and documentation on the internet make it seem like the whole point of homogeneous coordinates is to allow combining a rotation and a translation into a single linear operation (in other words, to use linear operations to perform affine operations). But that is not all. Projective transformations allow for projections in general, not just the one onto the camera image. For example, you could need to use 2 cameras (stereoscopy). Or you could know that what you are watching through the camera happens to be a 2D image in your 3D world (e.g., a poster on a wall seen by your camera), and compute the 3D coordinates from some other projective transform.
So, in other words, your (xc, yc, zc, 1) could very well be (xc, yc, zc, 12). Which is not at all the same point as (xc, yc, zc, 1): it is the same point as (xc/12, yc/12, zc/12, 1).
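A tiny sketch of that normalization point (the numbers are arbitrary, standing in for the output of some other projective transform): dividing by w and dropping w pick two different Euclidean points.

```python
import numpy as np

q = np.array([6.0, 3.0, 24.0, 12.0])   # homogeneous point with w = 12

dropped    = q[:3]            # (6, 3, 24)      -- NOT the point q represents
normalized = q[:3] / q[3]     # (0.5, 0.25, 2)  -- the actual Euclidean point

print(dropped, normalized)    # two different 3D points
```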
Of course, in camera coordinates, for the projection you gave, the result happens to be the same. But dropping the last component just because you know the result happens to be the same is anticipating the result of the very operation you are trying to define.
One particular example: how would you project a point that is at infinity, in the direction (1, 2, 3)? That point has no non-homogeneous 3D coordinates (those cannot encode infinity), but it does have homogeneous 3D coordinates: (1, 2, 3, 0).
And sure, since the result of that is (fx+3ox, 2fy+3oy, 3), aka ((fx+3ox)/3, (2fy+3oy)/3, 1), which happens to be the same result as the projection of the point (1, 2, 3, 1), aka (1, 2, 3) in 3D, you may think that using homogeneous coordinates was an unnecessary complication. But that is only because you happen to know that (1, 2, 3, 1) and (1, 2, 3, 0), which are very different points (one is a point quite close to the camera, the other is a star in the sky), happen to have the same projection. So your shortcut is only valid because you already know the result.
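A sketch of that example (same assumed intrinsics as above): under P0 = [K | 0], the point at infinity and the finite point do land on the same pixel.

```python
import numpy as np

fx, fy, ox, oy = 800.0, 800.0, 320.0, 240.0
K  = np.array([[fx, 0., ox], [0., fy, oy], [0., 0., 1.]])
P0 = np.hstack([K, np.zeros((3, 1))])

star   = np.array([1.0, 2.0, 3.0, 0.0])   # direction (1,2,3), point at infinity
nearby = np.array([1.0, 2.0, 3.0, 1.0])   # the finite point (1,2,3)

for p in (star, nearby):
    u = P0 @ p
    print(u[:2] / u[2])   # same pixel for both: ((fx+3*ox)/3, (2*fy+3*oy)/3)
```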
But in real life, we never use that single operation alone anyway. In real life, you combine this matrix with other matrices to obtain an aggregated 3×4 matrix, which no longer has that column of zeros that made you think you could simply ignore the 4th component because the result was invariant to it.
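A sketch of that last point, still with the same assumed intrinsics and an arbitrary extrinsic translation: once K is composed with [R | t], the last column of the aggregated 3×4 matrix is no longer zero, and the 4th homogeneous component can no longer be ignored.

```python
import numpy as np

fx, fy, ox, oy = 800.0, 800.0, 320.0, 240.0
K = np.array([[fx, 0., ox], [0., fy, oy], [0., 0., 1.]])

R = np.eye(3)
t = np.array([[0.1], [0.0], [0.0]])       # camera shifted 0.1 sideways (arbitrary)
P = K @ np.hstack([R, t])                 # aggregated 3x4, last column != 0

star   = np.array([1.0, 2.0, 3.0, 0.0])
nearby = np.array([1.0, 2.0, 3.0, 1.0])

for p in (star, nearby):
    u = P @ p
    print(u[:2] / u[2])   # the star keeps its pixel; the nearby point moves
```

The translation only displaces the projection of the finite point: the point at infinity shows no parallax, which is exactly the distinction the (1, 2, 3, 0) vs (1, 2, 3, 1) coordinates encode.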