I have 2 images (left and right) of a scene captured by a single camera. I know the intrinsic matrices K_L and K_R for both images and the relative rotation R between the two cameras. How do I compute the precise relative translation t between the two cameras?
You can only recover it up to scale, unless you have a separate means to resolve the scale, for example by observing an object of known size, or by having a sensor (e.g. LIDAR) give you the distance to a ground plane or to an object visible in both views.
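To make the scale ambiguity concrete, here is a minimal sketch of the essential-matrix route using OpenCV. The names `pts_left` and `pts_right` (matched pixels as (N, 2) float arrays) are placeholders for illustration; note that `cv2.recoverPose` returns a unit-norm translation, i.e. only the direction of t:

```python
import numpy as np
import cv2

def relative_pose_up_to_scale(pts_left, pts_right, K_L, K_R):
    # Normalize pixels with each camera's own intrinsics, so a single
    # identity camera matrix can be passed to findEssentialMat.
    nl = cv2.undistortPoints(pts_left.reshape(-1, 1, 2).astype(np.float64), K_L, None)
    nr = cv2.undistortPoints(pts_right.reshape(-1, 1, 2).astype(np.float64), K_R, None)
    E, inliers = cv2.findEssentialMat(nl, nr, np.eye(3), method=cv2.RANSAC,
                                      prob=0.999, threshold=1e-3)
    # recoverPose picks the physically valid one of the four decompositions
    # of E by checking that triangulated points lie in front of both cameras.
    _, R, t, _ = cv2.recoverPose(E, nl, nr, np.eye(3), mask=inliers)
    return R, t  # t has unit norm: the scale is genuinely unrecoverable here
```

Mind the convention: OpenCV's recovered [R|t] maps points from the first camera's frame into the second's, which is the inverse of the right-to-left transform used in the derivation below, so invert it if you need that convention.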
That said, the solution is quite easy. You could do it by computing and then decomposing the essential matrix (as in the sketch above), but here is a more intuitive way. Let xl and xr be two matched pixels in the two views, in homogeneous image coordinates, and let X be their corresponding 3D point, expressed in left-camera coordinates. Let Kli and Kri be the inverses of the left and right camera matrices Kl and Kr respectively, and denote by R and t the transform from right-camera to left-camera coordinates. It is then:
X = sl * Kli * xl = t + sr * R * Kri * xr
where sl and sr are the scales of the rays back-projecting to X from the left and right cameras respectively.
The second equality above represents 3 scalar equations in 5 unknowns: the three components of t plus the two scales sl and sr. Depending on what additional information you have, you can solve it in different ways.
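With no additional information, you can still get the direction of t from the matches alone. Rearranging the equation above gives t = sl * Kli * xl - sr * R * Kri * xr, so t lies in the plane spanned by the two rays dl = Kli * xl and dr = R * Kri * xr, and each match contributes one linear constraint t . (dl x dr) = 0. A sketch, with all names hypothetical:

```python
import numpy as np

# Estimate the direction of t from pixel matches alone, using the
# coplanarity constraint t . (dl x dr) = 0 for each match.
def translation_direction(pts_left, pts_right, K_L, K_R, R):
    Kli, Kri = np.linalg.inv(K_L), np.linalg.inv(K_R)
    rows = []
    for (ul, vl), (ur, vr) in zip(pts_left, pts_right):
        dl = Kli @ np.array([ul, vl, 1.0])        # left ray, in left frame
        dr = R @ (Kri @ np.array([ur, vr, 1.0]))  # right ray, rotated into left frame
        rows.append(np.cross(dl, dr))             # t must be orthogonal to dl x dr
    # The null vector of the stacked constraints is t, up to scale and sign.
    _, _, Vt = np.linalg.svd(np.vstack(rows))
    return Vt[-1]
```

Two non-degenerate matches already determine the direction in the noise-free case; in practice you would use many matches (with RANSAC against outliers) and fix the sign by checking that triangulated points have positive depth in both cameras.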
For example, if you know (e.g. from LIDAR measurements) the distances from the cameras to X, you can fix the scale terms sl and sr and solve for t directly. If there is a segment [X1, X2] of known length visible in both images, you can write two equations like the one above and again solve directly.
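A sketch of the known-distance case, assuming hypothetical LIDAR distances d_left and d_right from each camera center to the same point X, observed at pixels xl and xr:

```python
import numpy as np

def translation_from_distances(xl, xr, d_left, d_right, K_L, K_R, R):
    # d_left, d_right: Euclidean distances from each camera center to X.
    dl = np.linalg.inv(K_L) @ np.array([xl[0], xl[1], 1.0])
    dr = R @ (np.linalg.inv(K_R) @ np.array([xr[0], xr[1], 1.0]))
    sl = d_left / np.linalg.norm(dl)   # scale placing X at distance d_left
    sr = d_right / np.linalg.norm(dr)
    # From X = sl*dl = t + sr*dr, the metric translation follows directly:
    return sl * dl - sr * dr
```

If your sensor gives depth along the optical axis rather than Euclidean distance, divide by the ray's z component instead of its norm when computing the scales.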