I am working on a task where a customer's gaze direction is calculated to determine whether they looked at the monitor or outside of it. I drew the following to get an understanding of what needs to be done:
The picture depicts the following (measurements in mm):
So far, I manually calculated the distances as coordinates of X,Y,Z. They are as follows:
The eye coordinates relative to the camera are:
Eye=(0,−325,−596.34)
P1 (Top-left of the monitor):
Horizontal offset from the center: −265 mm
Vertical offset from the top center of the monitor: −50 mm
Depth: 0 mm (since it's on the same plane as the camera)
Coordinates: P1=(−265,−50,0)
P2 (Top-right of the monitor):
Horizontal offset from the center: 265 mm
Vertical offset from the top center of the monitor: −50 mm
Depth: 0 mm
Coordinates: P2=(265,−50,0)
P3 (Center of the monitor):
Horizontal offset: 0 mm
Vertical offset from the camera: −521 mm
Depth: 0 mm
Coordinates: P3=(0,−521,0)
Thus, I derived the following:
Eye to P1:
Vector=P1−Eye=(−265,−50−(−325),0−(−596.34))=(−265,275,596.34)
Eye to P2:
Vector=P2−Eye=(265,−50−(−325),0−(−596.34))=(265,275,596.34)
Eye to P3:
Vector=P3−Eye=(0,−521−(−325),0−(−596.34))=(0,−196,596.34)
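The three subtractions above can be double-checked with a few lines of Python. This is only a sketch of the arithmetic in this post; the names EYE, TARGETS and subtract are made up for illustration, and the coordinates are the hand-measured values listed above (mm, relative to the camera).

```python
# Coordinates in mm, relative to the camera (values taken from this post).
EYE = (0.0, -325.0, -596.34)
TARGETS = {
    "P1": (-265.0, -50.0, 0.0),  # top-left of the monitor
    "P2": (265.0, -50.0, 0.0),   # top-right of the monitor
    "P3": (0.0, -521.0, 0.0),    # center of the monitor
}


def subtract(a, b):
    """Return the component-wise difference a - b of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))


# Eye-to-target vector in camera coordinates: target minus eye.
gaze_vectors = {name: subtract(p, EYE) for name, p in TARGETS.items()}

for name, g in gaze_vectors.items():
    print(name, tuple(round(c, 2) for c in g))
```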
Now, I would like to know whether I have derived the gaze directions (from a person's eye to P1, P2 and P3, from the camera's PoV) correctly, based on the following method, which states:
Please note that although the 3D gaze (gaze_dir) is defined as a difference between target's and subject's positions (target_pos3d - person_eyes3d) each of them is expressed in different coordinate system, i.e.
gaze_dir = M * (target_pos3d - person_eyes3d)
where M depends on a normal direction between eyes and the camera.
Also, how do I calculate the transformation matrix M, if that turns out to be necessary?
According to the corresponding paper, P. Kellnhofer et al. (2019), Gaze360: Physically Unconstrained Gaze Estimation in the Wild, ICCV (PDF file size ~17MB), page 4, section "Gaze Direction", the gaze vector is converted from the camera (there called ladybug) coordinate system L = [Lx, Ly, Lz] into the eye coordinate system E = [Ex, Ey, Ez] as follows.
Gaze vector gL in ladybug coordinates:
gL = pt − pe
where pt is the target cross point, and pe the eye point, relative to the camera.
The eye coordinate system E has its origin in pe. In world coordinates, the basis vector Ez points in the direction opposite to gL, i.e. it doesn't point from pe to pt but "backwards" from pe. That's why Ez is the negated gL. This sounds unintuitive, but it is more convenient for operations that consider view depth, because the target then has a negative z-value in eye coordinates. Additionally, we normalize Ez by dividing by its length.
Ez = −gL / ||gL||
The other basis vectors Ex and Ey have to be orthogonal to Ez. According to the text, Ex lies in the plane defined by Lx and Ly without a roll, i.e. without a rotation around the x-axis. In other words, we can temporarily assume that the yet unknown Ey runs parallel to Ly. That is usually not true, since we mostly don't gaze into the camera but at a target point elsewhere, but it is enough for now, because the roll actually performed when looking somewhere other than the camera won't change Ex.
Now, the cross product of two vectors is orthogonal to both of them. So we calculate Ex, with normalization:
Ex = (Ly × Ez) / ||Ly × Ez||
Note that Ex is now orthogonal to the eye's YZ-plane, and Ez is orthogonal to the XY-plane by definition, because it's our anchor vector. The only remaining step is to calculate the actual Ey as the vector orthogonal to the XZ-plane, i.e. to take into account that we actually make a roll of some angle relative to the camera about the now known Ex when looking somewhere (angle = 0 when looking straight at the camera). Again we use the cross product. No normalization is needed, as the cross product of two orthogonal unit vectors is itself a unit vector.
Ey = Ez × Ex
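The three steps above (negate and normalize gL, cross with Ly, cross again) can be sketched in plain Python as follows. The function name eye_basis and the helpers are hypothetical, and the default Ly = [0, 1, 0] is only a placeholder assumption; the true Ly depends on the camera's orientation. The example input is the Eye-to-P1 vector from the question.

```python
import math


def normalize(v):
    """Scale a 3D vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def cross(a, b):
    """Right-handed cross product a x b of two 3D vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def eye_basis(g_l, l_y=(0.0, 1.0, 0.0)):
    """Build Ex, Ey, Ez from the gaze vector g_l (camera coordinates).

    l_y is the camera's y-axis; (0, 1, 0) is an assumed placeholder.
    Fails with a division by zero if g_l is parallel to l_y.
    """
    e_z = normalize(tuple(-c for c in g_l))  # Ez = -gL / ||gL||
    e_x = normalize(cross(l_y, e_z))         # Ex = (Ly x Ez) / ||Ly x Ez||
    e_y = cross(e_z, e_x)                    # Ey = Ez x Ex (already unit length)
    return e_x, e_y, e_z


# Eye-to-P1 vector from the question, in mm.
e_x, e_y, e_z = eye_basis((-265.0, 275.0, 596.34))
for name, axis in (("Ex", e_x), ("Ey", e_y), ("Ez", e_z)):
    print(name, tuple(round(c, 4) for c in axis))
```

The construction breaks down when the gaze vector is parallel to Ly, because Ly × Ez is then the zero vector; a real implementation would have to handle that case separately.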
Then the gaze vector in eye coordinates, gE, is obtained, as in the text, by applying a view transformation to gL:
gE = E ∙ gL / ||gL||
Here E is nothing other than the view transformation matrix M with the columns, from left to right, Ex, Ey, Ez.
When the subject looks directly at the camera, i.e. pt = [0, 0, 0], gE is guaranteed to be [0, 0, −1].
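The looking-at-the-camera case can be checked numerically with the sketch below. It rests on one assumption about convention that the text here does not spell out: with Ex, Ey, Ez stored as the columns of M, the components of a camera-frame vector in the eye frame are its dot products with Ex, Ey and Ez, which in matrix form is the transpose of M applied to the vector (or M itself if the basis vectors are stored as rows). The function name gaze_in_eye_frame and the default Ly = [0, 1, 0] are hypothetical; the eye position is the one from the question.

```python
import math


def normalize(v):
    """Scale a 3D vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def cross(a, b):
    """Right-handed cross product a x b of two 3D vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dot(a, b):
    """Dot product of two 3D vectors."""
    return sum(x * y for x, y in zip(a, b))


def gaze_in_eye_frame(p_t, p_e, l_y=(0.0, 1.0, 0.0)):
    """Express the normalized gaze vector in the eye frame E.

    p_t and p_e are target and eye positions in camera coordinates;
    l_y is the camera's y-axis, where (0, 1, 0) is an assumed placeholder.
    """
    g_hat = normalize(tuple(t - e for t, e in zip(p_t, p_e)))  # gL / ||gL||
    e_z = tuple(-c for c in g_hat)                             # Ez = -gL / ||gL||
    e_x = normalize(cross(l_y, e_z))                           # Ex
    e_y = cross(e_z, e_x)                                      # Ey
    # Component of g_hat along each basis vector (assumed convention:
    # this equals M^T applied to g_hat when Ex, Ey, Ez are M's columns).
    return tuple(dot(axis, g_hat) for axis in (e_x, e_y, e_z))


# Subject looking straight at the camera: target is the camera origin.
g_e = gaze_in_eye_frame((0.0, 0.0, 0.0), (0.0, -325.0, -596.34))
print(tuple(round(c, 6) for c in g_e))
```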
For P1, after all the calculations above, and arbitrarily assuming Ly = [0, 1, 0] (in practice, use the true Ly according to the camera's orientation), one gets:
Ez = [0.3742, -0.3883, -0.8421]
Ex = [-0.9138, 0, -0.4061]
Ey = [0.1577, 0.9215, -0.3549]
i.e. the view transform matrix
    | -0.9138   0.1577   0.3742 |
M = |  0        0.9215  -0.3883 |
    | -0.4061  -0.3549  -0.8421 |