deep-learning object-recognition yolo

Confused about Yolo


I am a bit confused about how YOLO works. In the paper, they say that:

"The confidence prediction represents the IOU between the predicted box and any ground truth box."

But how do we have the ground truth box? Let's say I use my Yolo network (already trained) on an image that is not labelled. What is my confidence then?


Solution

  • But how do we have the ground truth box?

    You seem to be mixing up the training data with the output (prediction) that YOLO produces.

    Training data consists of bounding boxes along with their class label(s). Each annotated box is referred to as a 'ground truth box', b = [bx, by, bh, bw, class_name (or number)], where bx, by is the midpoint of the annotated bounding box and bh, bw are the height and width of the box.

    The output or prediction is a bounding box b along with a class c for an image i. Formally: y = [pc, bx, by, bh, bw, cn], where bx, by is the midpoint of the predicted bounding box, bh, bw are the height and width of the box, cn is the predicted class (name or number), and pc is the probability of having class(es) c in box b.

    Let's say I use my Yolo network (already trained) on an image that is not labelled. What is my confidence then?

    When you say you have a pre-trained model (which you refer to as already trained), your network has already learned what bounding boxes for certain object classes look like, and it tries to approximate where the object might be in a new image. While doing so, it might predict a bounding box somewhere other than where it is supposed to be. So how do you measure how far off the box is? IOU to the rescue! IOU (Intersection Over Union) gives you a score: the area of overlap divided by the area of union.

    IOU = Area of Overlap / Area of Union
    

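    The IOU is rarely exactly 1. The lower the IOU, the worse YOLO's predicted bounding box matches the ground truth; an IOU of 1 means the predicted box coincides exactly with the ground truth box. On an unlabelled image there is no ground truth box to compare against, so the confidence is simply a value the network outputs: during training it is taught to predict the IOU its box would have with the ground truth, and at test time it reports that learned estimate.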
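To make the IOU formula above concrete, here is a minimal sketch in Python. It assumes boxes are given in the same [bx, by, bh, bw] order used in this answer (midpoint, then height and width); the function name `iou_midpoint` is made up for illustration and is not part of any YOLO codebase. Note that this computation needs a ground truth box, so it applies during training or evaluation, not when running on unlabelled images.

```python
def iou_midpoint(box_a, box_b):
    """Intersection over union of two boxes given as (bx, by, bh, bw).

    bx, by is the box midpoint; bh, bw are its height and width.
    This ordering is an assumption that follows the notation in the answer.
    """
    ax, ay, ah, aw = box_a
    bx, by, bh, bw = box_b

    # Convert midpoint/size form to corner coordinates.
    a_x1, a_x2 = ax - aw / 2, ax + aw / 2
    a_y1, a_y2 = ay - ah / 2, ay + ah / 2
    b_x1, b_x2 = bx - bw / 2, bx + bw / 2
    b_y1, b_y2 = by - bh / 2, by + bh / 2

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(a_x2, b_x2) - max(a_x1, b_x1))
    inter_h = max(0.0, min(a_y2, b_y2) - max(a_y1, b_y1))
    overlap = inter_w * inter_h

    # Area of union = sum of both areas minus the area counted twice.
    union = aw * ah + bw * bh - overlap
    return overlap / union if union > 0 else 0.0


ground_truth = (1.0, 1.0, 2.0, 2.0)  # hypothetical annotated box
predicted = (2.0, 1.0, 2.0, 2.0)     # hypothetical prediction, shifted right

print(iou_midpoint(ground_truth, ground_truth))  # identical boxes
print(iou_midpoint(ground_truth, predicted))     # partial overlap
```

In this example the two boxes share a 1 x 2 strip, so the overlap is 2 and the union is 4 + 4 - 2 = 6, giving an IOU of 1/3. Identical boxes give 1, and boxes that do not touch give 0.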