Tags: cuda, integer-division, approximate

Cheap approximate integer division on a GPU


So, I want to divide me some 32-bit unsigned integers on a GPU, and I don't care about getting an exact result. In fact, let's be lenient and suppose I'm willing to accept a multiplicative error factor of up to 2, i.e. if q = x/y, I'm willing to accept anything between 0.5*q and 2*q.
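
To make that acceptance criterion concrete, here is a small checker (my own illustration, not part of the original question; the name is made up) that tests whether a candidate result lies within the factor-of-2 band around the exact quotient:

__host__ __device__ bool acceptable_quotient(unsigned x, unsigned y, unsigned result)
{
    unsigned q = x / y;  // exact (truncated) quotient
    // accept anything in [0.5*q, 2*q]; ignores possible overflow of 2*q for huge quotients
    return result >= q / 2u && result <= 2u * q;
}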

I haven't yet measured anything, but it seems to me that something like this (CUDA code) should be useful:

__device__ unsigned cheap_approximate_division(unsigned dividend, unsigned divisor)
{
    // shift amount = floor(log2(dividend)) - floor(log2(divisor));
    // assumes dividend >= divisor > 0, so the shift amount is non-negative
    return 1u << (__clz(divisor) - __clz(dividend));
}

It uses the count-leading-zeros integer intrinsic (__clz()) as a cheap base-2-logarithm function.
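
To see why this stays within the factor-of-2 budget: for nonzero 32-bit x, __clz(x) equals 31 - floor(log2(x)), so the shift amount above is floor(log2(dividend)) - floor(log2(divisor)), and 2 raised to that power is always within a factor of 2 of the true quotient. Here is a host-side sketch of the same computation (my own illustration, using the GCC/Clang builtin instead of the CUDA intrinsic):

#include <cstdio>

// host-side mirror of the device function above; assumes dividend >= divisor > 0
static unsigned host_cheap_approximate_division(unsigned dividend, unsigned divisor)
{
    return 1u << (__builtin_clz(divisor) - __builtin_clz(dividend));
}

int main()
{
    printf("%u\n", host_cheap_approximate_division(1000u, 3u));      // exact: 333,    approx: 256
    printf("%u\n", host_cheap_approximate_division(1u << 20, 10u));  // exact: 104857, approx: 131072
}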

Note: I could make this non-32-bit-specific, but then I'd have to complicate the code with templates, e.g. wrap __clz() in a templated function that dispatches to __clzll() for 64-bit operands, and so on. A sketch of what that might look like follows.
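
A minimal sketch of such a templated wrapper (the helper name generic_clz is my own; CUDA provides __clz() for 32-bit and __clzll() for 64-bit operands):

template <typename T> __device__ int generic_clz(T x);

template <> __device__ int generic_clz<unsigned>(unsigned x)
{
    return __clz((int) x);
}

template <> __device__ int generic_clz<unsigned long long>(unsigned long long x)
{
    return __clzll((long long) x);
}

template <typename T>
__device__ T cheap_approximate_division(T dividend, T divisor)
{
    // same idea as above, just bit-width-agnostic
    return T{1} << (generic_clz(divisor) - generic_clz(dividend));
}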

Questions:


Solution

  • Going via floating point gives you a much more precise result, slightly lower instruction count on most architectures, and potentially a higher throughput:

    __device__ unsigned cheap_approximate_division(unsigned dividend, unsigned divisor)
    {
       return (unsigned)(__fdividef(dividend, divisor) /*+0.5f*/ );
    }
    

    The +0.5f in the comment indicates that you can also turn the float->int conversion into proper rounding at essentially no cost other than higher energy consumption (it turns an fmul into an fmad, with the constant coming straight from the constant cache). Rounding would take you further away from the exact (truncated) integer result, though.
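
    For completeness, here is the rounding variant spelled out, i.e. with the comment above uncommented (the function name is my own):

    __device__ unsigned cheap_approximate_division_rounded(unsigned dividend, unsigned divisor)
    {
        // round-to-nearest instead of truncation; the +0.5f folds into the
        // multiply by the reciprocal, turning an fmul into an fmad
        return (unsigned)(__fdividef(dividend, divisor) + 0.5f);
    }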