I've implemented software that searches for a pattern inside an image. With cvMatchTemplate the execution time is around 10 ms, because I'm matching a 40x40 pattern in a search window of 120x160 pixels (the image is 640x480, so I'm not searching the whole image).
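To put the size of this search in perspective, here is a rough sketch that counts the candidate positions and the multiply-adds involved. It assumes a direct (spatial-domain) correlation, which may not be what cvMatchTemplate or the GPU version actually does internally, and the function names are made up for illustration.

```cpp
#include <cstdint>

// Number of positions at which a (pw x ph) pattern fits inside a
// (ww x wh) search window; 0 if the pattern does not fit at all.
std::int64_t candidate_positions(int ww, int wh, int pw, int ph) {
    if (pw > ww || ph > wh) return 0;
    return static_cast<std::int64_t>(ww - pw + 1) * (wh - ph + 1);
}

// Multiply-adds for a direct correlation (an assumption, see above):
// one per pattern pixel at every candidate position.
std::int64_t direct_multiply_adds(int ww, int wh, int pw, int ph) {
    return candidate_positions(ww, wh, pw, ph)
           * static_cast<std::int64_t>(pw) * ph;
}
```

For the 40x40 pattern in the 120x160 window, `candidate_positions(120, 160, 40, 40)` gives 9801 positions and `direct_multiply_adds(120, 160, 40, 40)` gives 15,681,600 multiply-adds, which is a small job for either processor.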
I implemented the same algorithm with gpu::MatchTemplate, expecting the execution time to improve. Instead, it takes 220 ms to compute the score.
What is happening?
Thanks.
EDIT: I measured the time needed to load the images: the ".upload" call takes 1 ms, because the images are already uncompressed.
Isn't it the same algorithm?
EDIT: I wrote the code in CUDA with my own kernel: it performs the FFT on the images using the CUDA functions, and the whole algorithm executes in less than 2 ms with 1024x1024 images and a 200x200 pattern. I used thread_sync to measure the execution time.
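For reference, a minimal sketch of this kind of measurement, written as plain C++ so that it does not depend on any GPU library. The helper name `average_ms` and its parameters are hypothetical. The `sync` callback stands in for whatever synchronization call is used (with the CUDA runtime that could be `cudaDeviceSynchronize()`), and the untimed warm-up calls are there because the first CUDA call in a process usually also pays for one-time initialization, which may or may not explain part of the 220 ms.

```cpp
#include <chrono>
#include <functional>

// Average wall-clock time in milliseconds of one call to `work`.
// `warmups` untimed calls run first so that one-time start-up cost is
// excluded. `sync` is called before the clock starts and again before it
// stops, so asynchronous work that is still queued gets counted.
double average_ms(const std::function<void()>& work,
                  const std::function<void()>& sync,
                  int warmups, int runs) {
    for (int i = 0; i < warmups; ++i) work();
    sync();  // make sure the warm-up work has really finished

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i) work();
    sync();  // wait for the timed work before reading the clock
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return runs > 0 ? elapsed.count() / runs : 0.0;
}
```

Passing the template-matching call as `work` and comparing the result with and without warm-up calls would show how much of the measured time is start-up cost rather than the matching itself.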
I think it depends very much on your GPU's processing power; some GPUs cannot outperform CPUs. See this question: gpuvscpu