TensorRT: finding bounding box data after inference

I'm trying to use TensorRT for inference with my trained YOLOv5 model. The model has been converted to an .engine file, which I have no problem loading and running inference with. My problem is accessing the data. What I end up getting as output is a 1x25200x85 tensor, which I don't know how to process. So far I have been able to copy the data to the CPU, and I have tried accessing it as follows:

    void postprocessAndDisplay(cv::Mat &img, float *gpu_output, const Dims dims, float treshold){
    // Copy to CPU
    size_t dimsSize = accumulate(dims.d+1, dims.d+dims.nbDims, 1, multiplies<size_t>());
    vector<float> cpu_output (dimsSize);

    cudaMemcpy(cpu_output.data(), gpu_output, cpu_output.size()*sizeof(float), cudaMemcpyDeviceToHost);

    vector<int> classIds, indices;
    vector<cv::Rect> boxes, boxesNMS;
    vector<float> confidences;

    int img_width = img.cols;
    int img_height = img.rows;

    int n_boxes = dims.d[1], n_classes = dims.d[2];

//    printf("Image size: %i x %i, n_boxes: %i, n_classes: %i\n", img_width, img_height, n_boxes, n_classes);

    for (int i = 0; i < n_boxes; i++){

        uint32_t maxClass = 0;
        float maxScore = -1000.0f;

        for (int j = 1; j < n_classes; j++){ // Start at 1 since 0 is the sky???
            float score = cpu_output[i * n_classes + j];

//            printf("Confidence found %f\n", score);

            if (score < treshold)continue;

            if (score > maxScore){
                maxScore = score;
                maxClass = j;
            }
        }

//        printf("Max score for %i, class %i: %f\n", i, maxClass , maxScore);
        if (maxScore > treshold){
            float left_raw = (cpu_output[4*i]);
            float top_raw = (cpu_output[4*i + 1]);
            float right_raw = (cpu_output[4*i + 2]);
            float bottom_raw = (cpu_output[4*i + 3]);

//            int width = right - left + 1;
//            int height = bottom - top + 1;
//
//            cv::rectangle(img, cv::Rect(left, top, width, height), cv::Scalar(255, 0, 0), 1);

//            printf("Drawing rectangle at: %f %f %f %f\n", left_raw, top_raw, right_raw, bottom_raw);

            //printf("Found class %i\n", maxClass);
        }
    }


    cv::resize(img, img, cv::Size(1000, 1000));
//    cv::imshow("Test", img);
//    cv::waitKey(0);
}

However, finding the confidence score with cpu_output[i * n_classes + j] doesn't seem to work, as the confidence is sometimes over 600. When I try to read the bbox data using cpu_output[4*i], I just get a lot of values that are basically 0. Here's the only similar code example I was able to find, although it doesn't use the YOLO network: https://visp-doc.inria.fr/doxygen/visp-3.5.0/tutorial-detection-tensorrt.html

Another weird thing is that the network output is 1x25200x85 while I have just 80 classes, which suggests that the 85 is something else.

Any ideas?

Solution

The output of the NN describes 25200 boxes with 85 numbers each.

Each box represents a candidate detection with its bounding rectangle and confidences for each COCO class. There are potentially up to 25200 boxes (since the NN must have a statically sized output), but in practice only a handful of them are real detections in each image.
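For reference, the 25200 figure is consistent with a 640x640 input and the usual YOLOv5 layout of three detection heads at strides 8, 16 and 32 with three anchors per grid cell. Here is a minimal sketch of that arithmetic. The stride and anchor values are assumptions about a default export, not something read from your engine, and `countPredictions` is a hypothetical helper name:

```cpp
#include <cassert>
#include <vector>

// Number of candidate boxes produced for a square input, assuming one
// detection head per stride and a fixed number of anchors per grid cell.
// The values passed in (strides 8/16/32, 3 anchors) are assumptions that
// match the usual YOLOv5 configuration; they are not read from the engine.
int countPredictions(int inputSize, const std::vector<int>& strides, int anchorsPerCell)
{
    int total = 0;
    for (int stride : strides)
    {
        assert(stride > 0 && inputSize % stride == 0);
        const int cells = inputSize / stride;        // the grid is cells x cells
        total += cells * cells * anchorsPerCell;
    }
    return total;
}
```

Under these assumptions, `countPredictions(640, {8, 16, 32}, 3)` is 3 * (80*80 + 40*40 + 20*20) = 25200, and each row holds 4 + 1 + 80 = 85 values.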

The first 5 numbers are:

x (box center)
y (box center)
width
height
objectness (confidence that the box contains an object)

The remaining 80 numbers are scores for the individual COCO classes. You can find the semantic meaning for those classes here and here.

The x, y, width, height are pixel values between 0...image size (0...640 in the model I use).
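To draw a box on your original image you still have to map these values from network-input pixels to image pixels. Below is a minimal sketch of that mapping. It assumes that x and y denote the box center (as in the YOLOv5 reference post-processing) and that the image was stretched to the network size with a plain resize. If you letterboxed the image, the padding has to be removed first. `PixelBox`, `clampTo` and `toImageBox` are hypothetical names:

```cpp
#include <algorithm>
#include <cassert>

// Axis-aligned box in pixel coordinates of the original image.
struct PixelBox
{
    int left;
    int top;
    int width;
    int height;
};

// Limit v to the range [lo, hi].
static float clampTo(float v, float lo, float hi)
{
    return std::max(lo, std::min(v, hi));
}

// Convert one prediction (center x/y, width, height in network-input pixels)
// to a top-left based box in the original image.
// Assumption: the image was resized to netW x netH without padding.
PixelBox toImageBox(float cx, float cy, float w, float h,
                    int netW, int netH, int imgW, int imgH)
{
    assert(netW > 0 && netH > 0);
    const float scaleX = static_cast<float>(imgW) / netW;
    const float scaleY = static_cast<float>(imgH) / netH;

    // Corner coordinates, clamped to the image bounds.
    const float left   = clampTo((cx - w / 2.0f) * scaleX, 0.0f, static_cast<float>(imgW));
    const float top    = clampTo((cy - h / 2.0f) * scaleY, 0.0f, static_cast<float>(imgH));
    const float right  = clampTo((cx + w / 2.0f) * scaleX, 0.0f, static_cast<float>(imgW));
    const float bottom = clampTo((cy + h / 2.0f) * scaleY, 0.0f, static_cast<float>(imgH));

    return {static_cast<int>(left), static_cast<int>(top),
            static_cast<int>(right - left), static_cast<int>(bottom - top)};
}
```

The four fields map directly onto the arguments of `cv::Rect(left, top, width, height)` if you want to pass the result to `cv::rectangle`.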

So after each inference you have up to 25200 candidate matches across 80 different classes, but usually far fewer are valid. Use the objectness to filter out rows whose confidence is too low, then take the maximum of the 80 class scores to find the most likely class for that particular bounding rectangle.

In terms of your code, cpu_output should have 25200 rows that have 85 floats each. You need to loop through all the rows like this:

    const int rowSize = 85;   // 4 box values + 1 objectness + 80 class scores
    const int nClasses = 80;
    for (int rowIndex = 0; rowIndex < 25200; ++rowIndex)
    {
      // Pointer to the first float of this row in the flat buffer
      const float* rowBeginPtr = cpu_output.data() + rowIndex * rowSize;
      const float x = rowBeginPtr[0];
      const float y = rowBeginPtr[1];
      const float w = rowBeginPtr[2];
      const float h = rowBeginPtr[3];
      const float objectness = rowBeginPtr[4];
      if (objectness < scoreThreshold)
      {
        continue;
      }

      // Then read indices 5...84 in rowBeginPtr and find the max class score
      float maxClassScore = 0.0f;
      int maxClassIndex = 0;
      for (int i = 0; i < nClasses; ++i)
      {
        const float v = rowBeginPtr[5 + i];
        if (v > maxClassScore)
        {
          maxClassScore = v;
          maxClassIndex = i;
        }
      }

      // Final confidence combines objectness and the best class score
      const float score = objectness * maxClassScore;
      if (score < scoreThreshold)
      {
        continue;
      }

      // TODO: return x, y, w, h, score and maxClassIndex
      // or save them to your data structure
    }

As you can see from the snippet above, you were reading the data from the wrong indices because your indexing did not use the row size of 85 as the stride between boxes.
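To make the indexing explicit, here is a small sketch of a row-major accessor for the flat buffer. `valueAt` is a hypothetical helper, and the row size of 85 is taken from the output shape in your question:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Read element `col` of prediction `row` from a flat, row-major buffer
// in which every row holds `rowSize` floats (85 for this model).
float valueAt(const std::vector<float>& output, std::size_t row,
              std::size_t col, std::size_t rowSize)
{
    assert(col < rowSize);
    assert(row * rowSize + col < output.size());
    return output[row * rowSize + col];
}
```

For row i, the box values are at columns 0 to 3, the objectness at column 4, and the score of class c at column 5 + c. By contrast, cpu_output[4*i] with i = 2 reads flat index 8, which is a class score of row 0 rather than a coordinate of row 2.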

I recommend looking at this repository for inspiration, especially this function. Good luck!
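One more note: the rows that pass the threshold often include several overlapping boxes for the same object, which is presumably what the boxesNMS and indices vectors in your code are for. Below is a minimal sketch of greedy, per-class non-maximum suppression in plain C++. The `Detection` struct and both function names are hypothetical, and the boxes are assumed to be already converted to top-left form:

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

struct Detection
{
    float left, top, width, height; // top-left based box
    float score;                    // objectness * best class score
    int classId;
};

// Intersection over union of two boxes.
float iou(const Detection& a, const Detection& b)
{
    const float x1 = std::max(a.left, b.left);
    const float y1 = std::max(a.top, b.top);
    const float x2 = std::min(a.left + a.width, b.left + b.width);
    const float y2 = std::min(a.top + a.height, b.top + b.height);
    const float inter = std::max(0.0f, x2 - x1) * std::max(0.0f, y2 - y1);
    const float uni = a.width * a.height + b.width * b.height - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

// Greedy NMS: returns the indices of the kept detections, highest score first.
// A detection is dropped if it overlaps an already kept one of the same class
// by more than iouThreshold.
std::vector<int> nonMaxSuppression(const std::vector<Detection>& dets, float iouThreshold)
{
    assert(iouThreshold >= 0.0f && iouThreshold <= 1.0f);

    // Visit detections in order of decreasing score.
    std::vector<int> order(dets.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&dets](int a, int b) { return dets[a].score > dets[b].score; });

    std::vector<int> kept;
    for (int idx : order)
    {
        bool suppressed = false;
        for (int k : kept)
        {
            if (dets[k].classId == dets[idx].classId &&
                iou(dets[k], dets[idx]) > iouThreshold)
            {
                suppressed = true;
                break;
            }
        }
        if (!suppressed)
        {
            kept.push_back(idx);
        }
    }
    return kept;
}
```

If you already link OpenCV's dnn module, `cv::dnn::NMSBoxes` should serve the same purpose on your boxes and confidences vectors. The hand-written version is only meant to show what the step does.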