sequencediscrete-mathematicslogarithm

What would be a strategy to generate semilog sequence with just a few asm instructions?


I need to generate the following number sequences. The actual usage is to find the bin in a histogram with increasingly bigger bins.

The first column is just a log sequence such that the histogram's bins increase size in powers of two.

The next column (2) is a further subdivision of the first column such that each larger bin is subdivided in two sub-bins.

I have a reference discrete solution but I am looking to further reduce the number of cycles, perhaps using floating point tricks.

 Index:   (1) (2) (3) (4) (5) (6) (7)
     1:    1   1   1   1   1   1   1
     2:    2   2   2   2   2   2   2
     3:    2   3   3   3   3   3   3
     4:    3   4   4   4   4   4   4
     5:    3   4   5   5   5   5   5
     6:    3   5   6   6   6   6   6
     7:    3   5   7   7   7   7   7
     8:    4   6   8   8   8   8   8
     9:    4   6   8   9   9   9   9
    10:    4   6   9  10  10  10  10
    11:    4   6   9  11  11  11  11
    12:    4   7  10  12  12  12  12
    13:    4   7  10  13  13  13  13
    14:    4   7  11  14  14  14  14
    15:    4   7  11  15  15  15  15
    16:    5   8  12  16  16  16  16
    17:    5   8  12  16  17  17  17
    18:    5   8  12  17  18  18  18
    19:    5   8  12  17  19  19  19
    20:    5   8  13  18  20  20  20
    21:    5   8  13  18  21  21  21
    22:    5   8  13  19  22  22  22
    23:    5   8  13  19  23  23  23
    24:    5   9  14  20  24  24  24
    25:    5   9  14  20  25  25  25
    26:    5   9  14  21  26  26  26
    27:    5   9  14  21  27  27  27
    28:    5   9  15  22  28  28  28
    29:    5   9  15  22  29  29  29
    30:    5   9  15  23  30  30  30
    31:    5   9  15  23  31  31  31
    32:    6  10  16  24  32  32  32
    33:    6  10  16  24  32  33  33
    34:    6  10  16  24  33  34  34
    35:    6  10  16  24  33  35  35
    36:    6  10  16  25  34  36  36
    37:    6  10  16  25  34  37  37
    38:    6  10  16  25  35  38  38
    39:    6  10  16  25  35  39  39
    40:    6  10  17  26  36  40  40
    41:    6  10  17  26  36  41  41
    42:    6  10  17  26  37  42  42
    43:    6  10  17  26  37  43  43
    44:    6  10  17  27  38  44  44
    45:    6  10  17  27  38  45  45
    46:    6  10  17  27  39  46  46
    47:    6  10  17  27  39  47  47
    48:    6  11  18  28  40  48  48
    49:    6  11  18  28  40  49  49
    50:    6  11  18  28  41  50  50
    51:    6  11  18  28  41  51  51
    52:    6  11  18  29  42  52  52
    53:    6  11  18  29  42  53  53
    54:    6  11  18  29  43  54  54
    55:    6  11  18  29  43  55  55
    56:    6  11  19  30  44  56  56
    57:    6  11  19  30  44  57  57
    58:    6  11  19  30  45  58  58
    59:    6  11  19  30  45  59  59
    60:    6  11  19  31  46  60  60
    61:    6  11  19  31  46  61  61
    62:    6  11  19  31  47  62  62
    63:    6  11  19  31  47  63  63

Solution

  • The columns show a semi-logarithmic mapping of values. This means that while the major switchover points proceed in logarithmic progression, intermediate points proceed in linear fashion. This is strongly reminiscent of IEEE-754 floating-point encodings, where the exponent field expresses the binary logarithm of the represented quantity, while the significand field provides a linear progression between powers of two. While use of IEEE-754 floating-point formats is extremely widespread, it is not universal, so this approach can only be considered semi-portable.

    One idea for an efficient implementation on most modern processors, CPUs and GPU alike, is therefore to convert the input Index into an IEEE-754 binary32 number represented as float at C or C++ source code level. One then extracts the appropriate bits (consisting of the exponent bits and some leading bits of the significand) from the binary representation of that float number, where the number of included significand bits increases by one with each output column, i.e. the column number is a granularity factor of the mapping.

    There are various ways of accomplishing the details of the process outlined above. The most straightforward implementation is to use either ldexpf or multiplication with an appropriate power of two so the smallest input is scaled to the lowest binade of the binary32 format using multiplication with a power of two produced via exp2f().

    The biggest caveat with this approach is that the lowest binade (biased exponent of 0) contains subnormal numbers, which are not available on platforms operating in FTZ (- flush to zero) mode. This may be the case for both x86 SIMD as well as for NVIDIA GPUs, for example. Another caveat is that ldexpf() or exp2f() may not implemented efficiently, that is, via (almost) direct support in hardware. Lastly, the accuracy of these functions may be insufficient. For example, for CUDA 11.8 I found that exp2f() with a negative integer argument does not always deliver the correct power-of-two result (specifically, exp2f(-127) is off by one ulp), making the variant using exp2f() fail.

    Alternate approaches convert Index into a floating-point number without scaling, i.e. starting the mapping in a binade near unity. This raises the issue that for column j > 0, the first 2j entries incorrectly have logarithmic mapping applied to them. This can be solved by manually enforcing a linear mapping to these entries, so that the result equals Index for the first 2j entries. The IEEE-754 exponent bias for the logarithmic portion of the computed values can be removed prior or after bit field extraction, with preference depending on specifics of the instruction set architecture.

    The four design variants described above are enumerated in the exemplary code below by the macro VARIANT which can take values 0 through 3. From some initial experiments it seems that when compiling with modern compilers at high optimization level (I tried gcc, clang, and icx) coding at the assembly level may not be necessary. On platforms without IEEE-754 floating-point arithmetic, a quick simulation of integer to floating-point conversion based on a CLZ (count leading zeros) instruction may be helpful.

    #include <cstdio>
    #include <cstdlib>
    #include <cstdint>
    #include <cstring>
    #include <cmath>
    
    #define VARIANT   (3)
    
    uint32_t float_as_uint32 (float a)
    {
        uint32_t r;
        memcpy (&r, &a, sizeof r);
        return r;
    }
    
    // provide semi-logarithmic mapping of i to output based on granularity parameter j
    int semilogmap (int i, int j)
    {
        const int FP32_MANT_BITS = 23;
        const int FP32_EXPO_BIAS = 127;
    #if VARIANT == 0 // this requires subnormal support and will break when using FTZ (flush to zero)!
        return float_as_uint32 ((float)i * exp2f (1 - FP32_EXPO_BIAS - j)) >> (FP32_MANT_BITS - j);
    #elif VARIANT == 1 // this requires subnormal support and will break when using FTZ (flush to zero)!
        return float_as_uint32 (ldexpf ((float)i, 1 - FP32_EXPO_BIAS - j)) >> (FP32_MANT_BITS - j);
    #elif VARIANT == 2
        return (i < (1 << j)) ? i : 
            ((float_as_uint32 ((float)i) - ((FP32_EXPO_BIAS - 1 + j) << FP32_MANT_BITS)) >> (FP32_MANT_BITS - j));
    #elif VARIANT == 3
        return (i < (1 << j)) ? i : 
            ((float_as_uint32 ((float)i) >> (FP32_MANT_BITS - j)) - ((FP32_EXPO_BIAS - 1 + j) << j));
    #else
    #error unsupported VARIANT    
    #endif // VARIANT
    }
    
    int main (void)
    {
        int col [64][7];
        for (int i = 1; i <= 63; i++) {
            for (int j = 0; j < 7; j++) {
                col[i][j] = semilogmap (i, j);
            }
        }
        for (int i = 1; i <= 63; i++) {
            printf ("%2d: ", i);
            for (int j = 0; j < 7; j++) {
                printf ("  %2d", col[i][j]);
            }
            printf ("\n");
        }
        return EXIT_SUCCESS;
    }
    

    In terms of the number of instructions generated, it might be instructive to look at a CUDA version of variant 0 for execution on NVIDIA GPUs. I had to implement my own version of exp2f() to achieve the necessary accuracy.

    /* compute 2**x with a maximum error of 2.055 ulp */
    __forceinline__ __device__ float my_exp2f (float x)
    {
        float r, t;
        t = x;
        if (x < -126.0f) t = t + 24.0f;
        asm ("ex2.approx.ftz.f32 %0,%1;" : "=f"(r) : "f"(t));
        if (x < -126.0f) r = r * 5.9604644775390625e-8f; // 0x1.0p-24
        return r;
    }
    
    /* semi-logarithmic mapping of i to output based on granularity parameter j */
    __device__ int semilogmap (int i, int j)
    {
        const int FP32_MANT_BITS = 23;
        const int FP32_EXPO_BIAS = 127;
        return __float_as_int ((float)i * my_exp2f (1 - FP32_EXPO_BIAS - j)) >> (FP32_MANT_BITS - j);
    }
    

    Compiled with nvcc -c -rdc=true -arch=sm_75 -o semilogmap.obj semilogmap.cu using the toolchain from CUDA 11.8, the following code, (comprising 11 instructions including the function return) is generated for semilogmap():

            code for sm_75
                    Function : _Z10semilogmapii
            .headerflags    @"EF_CUDA_SM75 EF_CUDA_PTX_SM(EF_CUDA_SM75)"
            /*0000*/                   IADD3 R0, -R5.reuse, -0x7e, RZ ;
            /*0010*/                   I2F R4, R4 ;
            /*0020*/                   ISETP.GT.AND P0, PT, R5.reuse, RZ, PT ;
            /*0030*/                   IADD3 R6, -R5, 0x17, RZ ;
            /*0040*/                   I2F R3, R0 ;
            /*0050*/               @P0 FADD R3, R3, 24 ;
            /*0060*/                   MUFU.EX2 R7, R3 ;
            /*0070*/               @P0 FMUL R7, R7, 5.9604644775390625e-08 ;
            /*0080*/                   FMUL R7, R4, R7 ;
            /*0090*/                   SHF.R.S32.HI R4, RZ, R6, R7 ;
            /*00a0*/                   RET.ABS.NODEC R20 0x0 ;