c++ opencv opencl

GPU with OpenCL is slower than CPU. Why?


Environment:

I'm trying to use OpenCL to speed up my code, but the results show that the CPU is faster than the GPU. How can I speed up my code?

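#include <opencv2/core.hpp>
#include <opencv2/core/ocl.hpp>

void GetHoughLines(cv::Mat dst) {
    cv::ocl::setUseOpenCL(true);  // enable OpenCV's OpenCL path

    int img_w = dst.size().width; // 5000
    int img_h = dst.size().height; // 4000

    // Wrap the input as a UMat and create a zero-filled UMat of the same size.
    cv::UMat tmp_dst = dst.getUMat(cv::ACCESS_READ);
    cv::UMat tmp_mat = cv::UMat(dst.size(), CV_8UC1, cv::Scalar(0));

    // Element-wise multiply, repeated 1000 times.
    for (size_t i = 0; i < 1000; i++)
    {
        tmp_mat = tmp_mat.mul(tmp_dst);
    }
}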

It took about 3000 ms on the CPU alone. With the Intel UHD Graphics 630 it took 3500 ms. I also tried a GTX 1050, which took about 3000 ms.
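
For scale, here is a rough estimate of how much data the loop touches. This is only a sketch under stated assumptions: the input is taken to be 8-bit single-channel like `tmp_mat`, each iteration is assumed to read both operands and write one result exactly once, and the helper names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Bytes occupied by one image (hypothetical helper).
constexpr std::uint64_t bytes_per_image(std::uint64_t width,
                                        std::uint64_t height,
                                        std::uint64_t bytes_per_pixel) {
    return width * height * bytes_per_pixel;
}

// Assumption: one multiply reads two images and writes one.
constexpr std::uint64_t bytes_per_iteration(std::uint64_t image_bytes) {
    return 3 * image_bytes;
}

// Effective throughput in GB/s (1 GB = 1e9 bytes) for a measured time.
inline double effective_gb_per_s(std::uint64_t total_bytes, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return (static_cast<double>(total_bytes) / 1e9) / (elapsed_ms / 1000.0);
}
```

With a 5000 x 4000 image at 1 byte per pixel, that is 20 MB per image and 60 MB per iteration, so 1000 iterations move about 60 GB. At 3000 ms this works out to roughly 20 GB/s of memory traffic.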

Please give me some ideas to speed it up. I need to get it down to 1000 ms or less. Should I use AMP or OpenMP? As far as I know, they can only handle simple operations and are not suitable for OpenCV functions.


Solution

  • Basically, your code is slow because the way OpenCV uses OpenCL is inefficient. It has nothing to do with the underlying hardware.

    For OpenCL code (or any GPU-related code, for that matter) to be efficient, it is crucial that the host-side code uses the GPU properly. A few basic principles apply, such as avoiding unnecessary memory copies and using device memory explicitly.

    Even if you write the most optimized GPU kernels, you are very unlikely to gain any performance if you fail to adhere to these basics.

    The OpenCV codebase is a great example of how not to adhere to these principles.

    As for your example, if you rewrite your code to avoid memory copies and use device memory explicitly, you might see reasonable performance:

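    // size and format stand for your image size and type,
    // e.g. dst.size() and CV_8UC1 from the question.
    auto frame1 = cv::UMat(size, format, cv::USAGE_ALLOCATE_DEVICE_MEMORY);
    auto frame2 = cv::UMat(size, format, cv::USAGE_ALLOCATE_DEVICE_MEMORY);
    auto frame3 = cv::UMat(size, format, cv::USAGE_ALLOCATE_DEVICE_MEMORY);
    
    // All three buffers are allocated once, outside the loop.
    for (size_t i = 0; i < 10; i++)
    {
        cv::multiply(frame1, frame2, frame3);
    }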
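
    To check whether a change like this actually helps, it is worth timing the loop with a warm-up run excluded, since the first OpenCL call usually includes one-time setup such as kernel compilation. Below is a minimal sketch of such a harness using only the standard library; `time_ms` is a hypothetical helper name, not part of OpenCV, and the work to be measured is passed in as a callable:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: calls `work` `warmup` times without timing, then
// `runs` times with timing, and returns the mean wall-clock time per
// timed call in milliseconds.
template <typename F>
double time_ms(F&& work, std::size_t warmup, std::size_t runs) {
    assert(runs > 0);

    // Untimed calls absorb one-time costs (e.g. kernel compilation).
    for (std::size_t i = 0; i < warmup; ++i) {
        work();
    }

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < runs; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(runs);
}
```

    For the loop above, the callable could be something like `[&] { cv::multiply(frame1, frame2, frame3); cv::ocl::finish(); }`. The `cv::ocl::finish()` call is suggested here because OpenCL work may be queued asynchronously, so without waiting the timer could stop before the GPU has finished.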
    

    But in any case, I recommend that you learn to use the OpenCL API directly, without OpenCV.