opencv computer-vision augmented-reality homography projective-geometry

Project coordinate system onto angled planar surface in image

I want to define a coordinate system projected onto a planar surface within an image. I mainly need hints and ideas for how to implement this, not necessarily code, but everything is appreciated! If code is provided, OpenCV with Python is preferred.

Example: I have an image of a table taken from some angle. I have taken N calibration images using a checkerboard pattern which was placed on the tabletop. We can imagine it being a flat snooker table. I have computed the intrinsic/extrinsic calibration matrices and a homography matrix, and I can project images onto my tabletop surface - all good. However, now I want to create a coordinate system on the tabletop. If I know the dimensions of the tabletop to be 1 meter by 2 meters, I would like a function where I can say: give me the 2D pixel coordinates in the image which correspond to the point (x=0.75 meters, y=1.65 meters) on the tabletop surface. I would like the origin to be at the top left of the tabletop, but that should simply be a matter of translation. How can I build this coordinate system and the corresponding function to "use" it?

I have come up with two potential approaches, but both seem very inefficient and I am convinced a more effective and robust method must exist.

1: From my calibration process, I have N different images of a calibration pattern placed on the tabletop. This gives me N local coordinate systems, one for each calibration pattern. Since each calibration pattern lies on the tabletop, the x- and y-axes will always be in the same plane. Using this OpenCV guide I can draw the axes which follow the planar surface of the tabletop. Each unit is matched to the size of a checkerboard square, so I have a working meters-to-pixels conversion. The big problem is that if the checkerboard edges are not parallel to the table edges, then my coordinate system becomes misaligned. Another problem is that with N small coordinate systems, I have to choose an arbitrary one, then manually find its translation to the top-left tabletop position AND rotate around the origin to re-align the axes. This makes the solution very manual and difficult to make dynamic. Ultimately it would be nice to just define the 4 manually identified table edges in pixel values, and then derive whatever values are needed, such as the homography matrix, rotation matrix, translation vector etc.

2: I can generate an image of a coordinate grid and then project this grid image onto my tabletop image. The technique would be something similar to this. Then I would have augmented my original tabletop image to contain a grid which has been transformed correctly (i.e. grid squares closer to the camera are larger than grid squares further away). However, this does not make it possible for me to extract coordinates from the drawn grid. If, instead of an image, I could project a grid, that would be a very simple solution. As far as I can see in the OpenCV documentation, this is not possible.

Solution

In recipe form:

Identify in an image, with whatever means necessary (e.g. click and gather mouse-click coordinates), the image location of the desired origin.
Identify in the same fashion a point on the desired X axis and one on the desired Y axis.
Back-project these pixels into rays in world coordinates, using as "world" any one of the coordinate transforms (a.k.a. extrinsic parameters) obtained during calibration. Let's call its rotation matrix and translation with respect to the camera Rc0 and Tc0 respectively.
Intersect the rays with the XY world plane (== the plane of the calibration target placed on the table), obtaining 3D points Ot, Xt and Yt.
Compute the vectors xt = (Xt - Ot) / np.linalg.norm(Xt - Ot) and yt = (Yt - Ot) / np.linalg.norm(Yt - Ot).
Orthogonalize them using the Gram-Schmidt method: yt = yt - yt.dot(xt) * xt; yt = yt / np.linalg.norm(yt).
Compute the third axis: zt = np.cross(xt, yt).
The triple (xt, yt, zt) is the coordinate frame you desire, centered at Ot. The matrix R0t = np.column_stack((xt, yt, zt)), whose columns are the three axes, is the rotation from that frame to the world frame.
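The coordinate transform from the new frame to the camera is Rct = Rc0.dot(R0t), Tct = Rc0.dot(Ot) + Tc0 (Ot is expressed in world coordinates, so it has to be mapped into the camera frame as well).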
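
The recipe above can be sketched in plain NumPy roughly as follows. This is only a minimal sketch under some assumptions: the function names (pixel_to_plane, table_frame, table_to_pixel) are made up for illustration, K is the 3x3 intrinsic matrix, Rc0 is a 3x3 rotation matrix (e.g. obtained from a calibration rvec via cv2.Rodrigues), Tc0 is the matching translation, and the clicked pixel coordinates are assumed to be free of lens distortion (e.g. undistorted beforehand). The units of the result are whatever units the calibration target was specified in, so the square size must have been given in meters for the metric query to work.

```python
import numpy as np

def pixel_to_plane(uv, K, Rc0, Tc0):
    """Intersect the viewing ray through pixel uv with the world plane Z = 0.

    Assumes x_cam = Rc0 @ x_world + Tc0 and an undistorted pixel.
    Returns the 3D intersection point in world coordinates.
    """
    Tc0 = np.asarray(Tc0, dtype=float).reshape(3)
    # Ray direction in camera coordinates: K^-1 * (u, v, 1).
    d_cam = np.linalg.solve(K, np.array([uv[0], uv[1], 1.0]))
    # Camera center and ray direction expressed in world coordinates.
    c_world = -Rc0.T @ Tc0
    d_world = Rc0.T @ d_cam
    # Solve c_world.z + s * d_world.z = 0 for the ray parameter s.
    s = -c_world[2] / d_world[2]
    return c_world + s * d_world

def table_frame(origin_px, x_px, y_px, K, Rc0, Tc0):
    """Build the table frame from three clicked pixels (origin, X axis, Y axis).

    Returns (Rct, Tct) such that x_cam = Rct @ x_table + Tct.
    """
    Tc0 = np.asarray(Tc0, dtype=float).reshape(3)
    Ot = pixel_to_plane(origin_px, K, Rc0, Tc0)
    Xt = pixel_to_plane(x_px, K, Rc0, Tc0)
    Yt = pixel_to_plane(y_px, K, Rc0, Tc0)
    xt = (Xt - Ot) / np.linalg.norm(Xt - Ot)
    yt = (Yt - Ot) / np.linalg.norm(Yt - Ot)
    # Gram-Schmidt: remove the component of yt along xt, then renormalize.
    yt = yt - yt.dot(xt) * xt
    yt = yt / np.linalg.norm(yt)
    zt = np.cross(xt, yt)
    # Columns are the table axes expressed in world coordinates.
    R0t = np.column_stack((xt, yt, zt))
    Rct = Rc0 @ R0t
    Tct = Rc0 @ Ot + Tc0
    return Rct, Tct

def table_to_pixel(x, y, K, Rct, Tct):
    """Project the tabletop point (x, y, 0), in table units, to pixel coordinates."""
    p_cam = Rct @ np.array([x, y, 0.0]) + Tct
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]
```

With Rct and Tct in hand, the query from the question would be something like table_to_pixel(0.75, 1.65, K, Rct, Tct). If lens distortion matters, the same Rct and Tct could instead be passed (with Rct converted to a rotation vector) to cv2.projectPoints together with the distortion coefficients.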