[SOLVED] CUDA program running slower on Tesla K20 than GTX 965

I'm working on a project where I have to compare various GPU cards for performance analysis.

I ran the same CUDA code for Canny edge detection on both GPUs and found that the GTX 965 is much faster (200%) than the Tesla K20. I also observed that the Tesla C2075 performs about the same as the Tesla K20.

As far as I know, the K20 has 2496 cores, the 965 has 1024 cores, and the C2075 has 448 cores. The K20 is NVIDIA's Kepler architecture, the C2075 is Fermi, and the 965 is Maxwell.

Am I doing something wrong, or is there a hardware difference that is causing this?

Also, can we check the power consumed by a graphics card using a program or theoretical calculations?

Solution

More cores do not necessarily mean shorter execution times. Suppose your CUDA app used only a single thread and you ran it on:

K20, which has lots of cores running at 706 MHz,
as opposed to GTX 965, which has roughly half as many cores but running at 944 MHz,

... then the GTX 965 can obviously finish faster. In theory, as long as your app keeps fewer than 1024 cores busy, the GTX 965 can outperform the K20, provided memory is not the bottleneck, because the K20 has:

Higher memory bandwidth,
Much more memory in general,
A slightly higher memory clock.
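
To make the cores-versus-clock argument concrete, here is a rough back-of-the-envelope sketch in plain host-side C++. It assumes a deliberately crude model that is not from the question: every busy core retires one operation per clock cycle, and memory, instruction mix and architectural differences are ignored. The struct and function names are made up for illustration; the core counts and clocks are the ones quoted above.

```cpp
#include <algorithm>
#include <cassert>

// Core count and core clock as quoted in this thread.
struct Gpu {
    const char* name;
    int cores;
    double clock_mhz;
};

const Gpu k20{"Tesla K20", 2496, 706.0};
const Gpu gtx965{"GTX 965", 1024, 944.0};

// Crude relative throughput (arbitrary units): number of busy cores
// times clock. Threads beyond the core count add nothing in this model.
double relative_throughput(const Gpu& gpu, int active_threads) {
    assert(active_threads >= 0);
    const int busy = std::min(active_threads, gpu.cores);
    return busy * gpu.clock_mhz;
}
```

Under this model the GTX 965 wins with one active thread (944 vs 706) and still wins with 1024 active threads, because both cards keep the same number of cores busy and the GTX clocks them faster. The K20 only pulls ahead once the app keeps noticeably more than 1024 cores busy (about 1370 in this toy model). Real kernels will not follow these numbers exactly, so treat it as an intuition aid, not a prediction.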

So, to sum up, it is quite easy to "tailor" a CUDA app to suit one GPU better than the others, taking hardware limitations into account. Start by checking simple things such as the kernel launch parameters, i.e. grid size and block size.
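
As a quick way to check the launch parameters, the sketch below computes how many threads a given grid/block configuration actually launches for a 2D image kernel. It is plain host-side C++ and does not come from the asker's code, which was not posted: the names `Dim2`, `ceil_div`, `grid_for_image` and `total_threads` and the image sizes are hypothetical. In a .cu file the same values would go into `dim3` variables passed to the `<<<grid, block>>>` launch.

```cpp
#include <cassert>

// Minimal stand-in for the x/y parts of CUDA's dim3 (hypothetical helper).
struct Dim2 {
    unsigned x;
    unsigned y;
};

// Integer division rounded up, so partial tiles still get a block.
unsigned ceil_div(unsigned n, unsigned d) {
    assert(d > 0);
    return (n + d - 1) / d;
}

// Grid size needed to cover a width x height image with one thread per pixel.
Dim2 grid_for_image(unsigned width, unsigned height, Dim2 block) {
    return {ceil_div(width, block.x), ceil_div(height, block.y)};
}

// Total number of threads the launch creates.
unsigned long long total_threads(Dim2 grid, Dim2 block) {
    return 1ULL * grid.x * grid.y * block.x * block.y;
}
```

For example, a 1920x1080 image with 16x16 blocks gives a 120x68 grid, which is far more threads than either card has cores. A 32x32 test image with the same block size gives only a 2x2 grid, i.e. 1024 threads, which cannot occupy all 2496 cores of the K20. If the total comes out small, or the kernel is launched with something like `<<<1, 1>>>`, the extra cores on the K20 cannot help.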

The same reasoning applies to the C2075: according to its spec, its cores are clocked at 1.15 GHz, higher than both the K20 and the GTX 965, which can make up for its lower core count.