math, language-agnostic, coordinates, coordinate-transformation, transformation-matrix

How to find a transformation matrix given the measurements from two coordinate systems?


I am working on a task where a customer's gaze direction is calculated to determine whether they looked at the monitor or outside of it. I drew the following to get an understanding of what needs to be done:


[Figure: sketch of the camera, eye, and monitor positions]

The picture depicts the setup, with all measurements in mm.

So far, I have manually converted the distances into X, Y, Z coordinates. They are as follows:

The eye coordinates relative to the camera are:
Eye=(0,−325,−596.34)

Points on the Monitor Relative to the Camera

P1 (Top-left of the monitor):

Horizontal offset from the center: −265 mm
Vertical offset from the top center of the monitor: −50 mm
Depth: 0 mm (since it's on the same plane as the camera)

Coordinates: P1=(−265,−50,0)

P2 (Top-right of the monitor):

Horizontal offset from the center: 265 mm
Vertical offset from the top center of the monitor: −50 mm
Depth: 0 mm

Coordinates: P2=(265,−50,0)

P3 (Center of the monitor):

Horizontal offset: 0 mm
Vertical offset from the camera: −521 mm
Depth: 0 mm

Coordinates: P3=(0,−521,0)


Thus, I derived the following:
Eye to P1:
Vector=P1−Eye=(−265,−50−(−325),0−(−596.34))=(−265,275,596.34)

Eye to P2:
Vector=P2−Eye=(265,−50−(−325),0−(−596.34))=(265,275,596.34)

Eye to P3:
Vector=P3−Eye=(0,−521−(−325),0−(−596.34))=(0,−196,596.34)
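
For reference, the differences can be checked with a few lines of Python/numpy (just a quick sketch; the variable names are mine):

    import numpy as np

    # Measured positions relative to the camera, in mm
    eye = np.array([0.0, -325.0, -596.34])
    p1 = np.array([-265.0, -50.0, 0.0])  # top-left of the monitor
    p2 = np.array([265.0, -50.0, 0.0])   # top-right of the monitor
    p3 = np.array([0.0, -521.0, 0.0])    # center of the monitor

    for name, p in (("P1", p1), ("P2", p2), ("P3", p3)):
        print(name, p - eye)
    # P1 -> (-265, 275, 596.34)
    # P2 -> ( 265, 275, 596.34)
    # P3 -> (   0, -196, 596.34)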


Now, I would like to know whether I have got the gaze directions (from a person's eye to P1, P2 and P3, from the camera's PoV) right, based on the following method, where it states:

Please note that although the 3D gaze (gaze_dir) is defined as a difference between target's and subject's positions (target_pos3d - person_eyes3d) each of them is expressed in different coordinate system, i.e. gaze_dir = M * (target_pos3d - person_eyes3d) where M depends on a normal direction between eyes and the camera.

Also, how do I calculate the transformation matrix M, if the need ever arises?


Solution

  • According to the corresponding paper, P. Kellnhofer et al. (2019), Gaze360: Physically Unconstrained Gaze Estimation in the Wild, ICCV (PDF, ~17 MB), page 4, section Gaze Direction, the gaze vector is converted from the camera (there called ladybug) coordinate system L = [Lx, Ly, Lz] into the eye coordinate system E = [Ex, Ey, Ez] as follows.

    Gaze vector gL in ladybug coordinates:

      gL = pt − pe
    

    where pt is the position of the target cross, and pe that of the eye, both relative to the camera.

    The eye coordinate system E has its origin at pe. The basis vector Ez points, in camera coordinates, opposite to gL, i.e. not from pe towards pt but "backwards" from pe; that's why Ez is the negated gL. It sounds unintuitive, but it is convenient for operations involving view depth, which work on negative z-values in eye coordinates. Additionally, we normalize Ez by dividing it by its length.

      Ez = −gL / ||gL||
    

    The other basis vectors Ex and Ey have to be orthogonal to Ez. Ex is chosen to lie in the plane spanned by Lx and Lz, i.e. without a roll, without any rotation about the viewing (z) axis. In other words, we can temporarily assume that the yet unknown Ey runs parallel to Ly. That is usually not actually true, since we mostly don't gaze into the camera but at a target point elsewhere; but it's enough for now, because the roll actually performed when looking somewhere other than at the camera won't change Ex.

    Now, the vector created by the cross product of two vectors is orthogonal to both of them. So we calculate Ex, with normalization:

      Ex = (Ly × Ez) / ||Ly × Ez||
    

    Note that Ex is now orthogonal to the eye's YZ-plane, and Ez is orthogonal to the XY-plane by definition, because it's our anchor vector. The only remaining step is to calculate the actual Ey as the vector orthogonal to the XZ-plane, i.e. to account for the rotation of some angle about the now known Ex relative to the camera when looking somewhere (angle = 0 when looking straight at the camera). Again we use the cross product. No normalization is needed, as the cross product of two orthogonal unit vectors is a unit vector, too.

      Ey = Ez × Ex
    
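
    As a minimal numpy sketch of this basis construction (the helper name eye_basis is mine, and Ly = [0, 1, 0] is assumed as the default):

      import numpy as np

      def eye_basis(g_l, l_y=np.array([0.0, 1.0, 0.0])):
          # Ez = -gL / ||gL||: unit vector pointing "backwards" from the eye
          e_z = -g_l / np.linalg.norm(g_l)
          # Ex = (Ly x Ez) / ||Ly x Ez||: orthogonal to Ly and Ez (no roll)
          e_x = np.cross(l_y, e_z)
          e_x /= np.linalg.norm(e_x)
          # Ey = Ez x Ex: already unit length, as Ez and Ex are orthogonal unit vectors
          e_y = np.cross(e_z, e_x)
          return e_x, e_y, e_z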

    Then the gaze vector in eye coordinates gE is, as in the text, obtained by applying a view transformation to the normalized gL:

      gE = M ∙ gL / ||gL||

    Here M is the view transformation matrix whose rows, from top to bottom, are Ex, Ey and Ez; equivalently, M is the transpose of the matrix E that has Ex, Ey, Ez as its columns. (E maps eye coordinates to camera coordinates; since it is a rotation matrix, its inverse, which is the camera-to-eye transform we need, is simply its transpose.)

    When the subject looks directly into the camera, i.e. pt = [0, 0, 0], this is guaranteed to yield gE = [0, 0, −1].
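
    Continuing the sketch above, the transform and this check look as follows (pe is taken from the question; all names are mine):

      p_e = np.array([0.0, -325.0, -596.34])  # eye position from the question
      p_t = np.zeros(3)                       # target: the camera itself
      g_l = p_t - p_e

      e_x, e_y, e_z = eye_basis(g_l)
      M = np.vstack([e_x, e_y, e_z])          # rows Ex, Ey, Ez, i.e. the transposed basis matrix
      g_e = M @ (g_l / np.linalg.norm(g_l))
      print(g_e)                              # -> [ 0.  0. -1.] (up to numerical noise)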


    For P1, after all of the above calculations, arbitrarily assuming Ly = [0, 1, 0] (in practice, use the true Ly according to the camera's orientation), one gets:

      Ez = [0.3742, -0.3883, -0.8421]
      Ex = [-0.9138, 0, -0.4061]
      Ey = [0.1577, 0.9215, -0.3548]
    

    i.e. the view transformation matrix, with Ex, Ey, Ez as its rows:

          | -0.9138   0        -0.4061 |
      M = |  0.1577   0.9215   -0.3548 |
          |  0.3742  -0.3883   -0.8421 |
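
    The P1 numbers above can be reproduced with the same sketch:

      g_l = np.array([-265.0, 275.0, 596.34])  # Eye to P1 from the question
      e_x, e_y, e_z = eye_basis(g_l)
      M = np.vstack([e_x, e_y, e_z])
      print(np.round(e_z, 4))  # [ 0.3742 -0.3883 -0.8421]
      print(np.round(e_x, 4))  # [-0.9138  0.     -0.4061]
      print(np.round(e_y, 4))  # [ 0.1577  0.9215 -0.3549] (differs from above in the last digit by rounding)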