pytorch, conv-neural-network, object-detection, keypoint, detectron

How can I avoid getting overlapping keypoints during inference?


I have been using Detectron2 to detect 4 keypoints in each image. My dummy dataset consists of 1000 images, and I applied the following augmentations:

from detectron2.data import DatasetMapper, build_detection_train_loader
from detectron2.data import transforms as T
from detectron2.engine import DefaultTrainer


class Trainer(DefaultTrainer):
    # custom trainer subclass that plugs the augmentations into the train loader
    @classmethod
    def build_train_loader(cls, cfg):
        augs = [
            T.RandomFlip(prob=0.5, horizontal=True),
            T.RandomFlip(prob=0.5, horizontal=False, vertical=True),
            T.RandomRotation(angle=[0, 180]),
            T.RandomSaturation(0.9, 1.9),
        ]
        return build_detection_train_loader(
            cfg,
            mapper=DatasetMapper(cfg, is_train=True, augmentations=augs),
        )

I have checked the images after applying those transforms (each type of transform was tested separately), and it seems to have worked well: the keypoints are positioned correctly.
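For reference, a check like this can be done by pulling a batch from the augmented train loader and drawing the transformed ground-truth keypoints with Detectron2's Visualizer. This is only a minimal sketch: it assumes the Trainer class shown above, the cfg used for training, and the default BGR value of cfg.INPUT.FORMAT.

import matplotlib.pyplot as plt
from detectron2.utils.visualizer import Visualizer

# Pull one batch from the augmented loader defined above and draw the
# transformed ground-truth keypoints to confirm they still land on the corners.
train_loader = Trainer.build_train_loader(cfg)
batch = next(iter(train_loader))

for d in batch[:4]:
    # DatasetMapper returns a (C, H, W) uint8 tensor in cfg.INPUT.FORMAT (BGR by default)
    img = d["image"].permute(1, 2, 0).numpy()[:, :, ::-1]
    vis = Visualizer(img)
    out = vis.overlay_instances(keypoints=d["instances"].gt_keypoints.tensor.numpy())
    plt.imshow(out.get_image())
    plt.axis("off")
    plt.show()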

Now, after the training phase (keypoint_rcnn_R_50_FPN_3x.yaml), I get some identical keypoints, meaning that in many images the predicted keypoints overlap. Here are a few samples from my results:

[[[180.4211, 332.8872,   0.7105],
[276.3517, 369.3892,   0.7390],
[276.3517, 366.9956,   0.4788],
[220.5920, 296.9836,   0.9515]]]

And from another image:

[[[611.8049, 268.8926,   0.7576],
[611.8049, 268.8926,   1.2022],
[699.7122, 261.2566,   1.7348],
[724.5556, 198.2591,   1.4403]]]

I have compared the inference results with and without augmentations, and it seems that with augmentation the keypoints are barely recognized at all. How can that be?

Can someone please suggest how to overcome these kinds of mistakes? What am I doing wrong?

Thank you!

I have added a link to my google colab notebook: https://colab.research.google.com/drive/1uIzvB8vCWdGrT7qnz2d2npEYCqOxET5S?usp=sharing


Solution

  • The problem is that there's nothing unique about the different corners of the rectangle. However, your annotations and your loss function implicitly assume that the order of the corners is significant:
    the corners are labeled in a specific order, and the network is trained to output them in that specific order.

    However, when you augment the dataset by flipping and rotating the images, you change the implicit order of the corners, and now the net does not know which of the four corners to predict at each position.

    As far as I can see you have two ways of addressing this issue:

    1. Explicitly force an order on the corners:
      Make sure that no matter what augmentation the image underwent, the ground-truth points of each rectangle are ordered "top left", "top right", "bottom left", "bottom right". This means you'll have to transform the coordinates of the corners (as you are doing now), but also reorder them; see the first sketch after this list.
      Adding this consistency should help your model overcome the ambiguity in identifying the different corners.

    2. Make the loss invariant to the order of the predicted corners:
      Suppose your ground-truth rectangle spans the domain [0, 1]x[0, 1]: the four corners you should predict are [[0, 0], [1, 1], [1, 0], [0, 1]]. Note that if you predict [[1, 1], [0, 0], [0, 1], [1, 0]] your loss is very high, although you predicted the right corners, just in a different order than the annotated one.
      Therefore, you should make your loss invariant to the order of the predicted points (see the second sketch after this list):

          loss(p̂, p) = min_π Σ_i ‖p̂_π(i) − p_i‖

      where π(i) is a permutation of the corners, p_i are the annotated corners, and p̂ the predicted ones.
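For option 1, here is a minimal sketch of such a reordering step. The function name reorder_corners is illustrative, it uses NumPy on a (4, 2) array of corner coordinates, and the simple "sort by y, then by x" rule assumes the rectangle stays roughly axis-aligned after augmentation:

import numpy as np

def reorder_corners(pts):
    """Return the 4 corner points in a fixed order:
    top-left, top-right, bottom-left, bottom-right.
    pts: (4, 2) array of (x, y) image coordinates, in any order."""
    pts = np.asarray(pts, dtype=float)
    order = np.argsort(pts[:, 1])               # smaller y = closer to the top
    top, bottom = pts[order[:2]], pts[order[2:]]
    top = top[np.argsort(top[:, 0])]            # within each pair, sort by x
    bottom = bottom[np.argsort(bottom[:, 0])]
    return np.stack([top[0], top[1], bottom[0], bottom[1]])

This would be applied to the transformed keypoint coordinates, e.g. inside a custom DatasetMapper, right after the augmentations are applied and before the annotations are packed into Instances.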
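For option 2, here is a minimal PyTorch sketch of such an order-invariant loss. The name order_invariant_loss is illustrative; it brute-forces all 4! = 24 permutations, which is cheap for four corners. Note that this is not the heatmap loss Detectron2's keypoint head uses out of the box, so plugging it in would require a custom keypoint head:

import itertools
import torch
import torch.nn.functional as F

def order_invariant_loss(pred, gt):
    """Smooth-L1 loss minimized over all orderings of the predicted corners.
    pred, gt: (N, 4, 2) tensors of (x, y) corner coordinates."""
    losses = []
    for perm in itertools.permutations(range(4)):
        permuted = pred[:, list(perm), :]
        # per-sample loss for this particular ordering of the corners
        l = F.smooth_l1_loss(permuted, gt, reduction="none").sum(dim=(1, 2))
        losses.append(l)
    # for every sample, keep the ordering that best matches the annotation
    return torch.stack(losses, dim=1).min(dim=1).values.mean()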