chashordinals

Improving on 'if/else' strcmp() ladder to determine a useable value


//...
if( strcmp( str, "January" ) == 0 )
    month = 1;
else if( strcmp( str, "February") == 0 )
    month = 2;
//...

Q: Is there any more efficient way of determining that, for instance, "April" is the fourth month of the year? Repeated calls to strcmp() must be terribly inefficient, and if/else ladders tedious to code. Sometimes it's "March" and sometimes it's abbreviated as "MAR"... There must be a better way...

Putting the known strings in a sorted array of structs would allow, at least, binary searching, but still involves a lot of guesswork from the code.


Solution

  • This is a Can I answer my own question? answer. Other answers are welcomed.


    There are several ways of translating an arbitrary string from a finite set of strings ("keywords") to a concise, useable form. Most of these involve an iterative (or a sub-optimal linear, or an expensive branching) search involving repeated comparisons (that may also need to account for case insensitivity.)

    The 12 (English) month names form such a set. A quick and small hashing function for translation of month name to month ordinal is presented below.

    A response to my response to a recent question suggested "sharing" an (admittedly arcane) hashing function that, with awareness of false positives, returns the month ordinal (1-12) when passed a string containing the name of a month (English) in 7-bit ASCII. The function performs primitive operations on the 2nd & 3rd character and out pops the function's hash value of the string (tuned to be the month's ordinal value.)

    static int monthOrd( char cp[] ) {
        return "DIE@CB@LJF@HAG@K"[ cp[1]/4&7 ^ cp[2]*2 &0xF ] &0xF;
    }
    

    The operations of the hashing were uncovered through a brute force permutation of a number of primitive operations seeking a combination that would return 12 different values between 0x0 and 0xF (4 bits). The reader is encouraged to take apart each step of the mangling of the bits of the two ASCII characters. This result was not "invented", but was "discovered".

    After the bits of two characters have been mangled, the value is used as an index into a string (aka "a cheap LUT") whose 12 letters A-L are positioned such that "?an" (January) will mangle to an index for the letter 'A'. Masking the low 4 bits of that ASCII character yields the value 1 as the ordinal for the string "JANUARY"... 1 is the return value when the function is passed variations of the string "Jan". Likewise for the other 11 month names.

    NB: Using this function allows the caller to check that the string is indeed "JAN", "jan", "January" as suits the application. The caller need not try to match any of the names of the other 11 months. This function WILL return the false positive value 1 for the strings "Random" or "CANCEL", so the caller need only validate against a single month's name (length and case appropriate to the application.)


    Day names
    Here is an equivalent function that converts "Sun(day)" (case insensitive) to 1, "MON" to 2, "tue" to 3, etc...

    static int wkdayOrd( char cp[] ) {
        return "65013427"[*cp/2 + ~cp[1] & 0x7] & 0x7;
    }
    

    Again, the caller must confirm the string against only ONE day's name, appropriate to their source data, to avoid "false positives".

    Note: only the first two letters are used, so "Su", "Mo", "Tu", ... would be sufficient for this function to process.


    Number names
    Here is an equivalent function for "zero" to "ten", again case insensitive. (Number names are not abbreviated like month names or day names.)

    static int numberOrd( char cp[] ) {
        return "@~IBAH~FCGE~~DJ~"[ ( cp[0] ^ cp[1]/2 + cp[2]*4 ) & 0xF ] & 0xF;
    }
    

    Zodiac names

    static int ZodiacOrd( char cp[] ) {
        return "BJGA@@HIECK@@DLF"[(cp[0]/2 ^ (cp[1]/2&1) + cp[2]*2) & 0xF] & 0xF;
    }
    

    Passed the name (case ambivalent) of one of the twelve Zodiac signs, this returns the ordinal of that star sign ("Aries" = 1, ...) Again, like any hashing function, there will be collisions with other strings. The caller need only subsequently check against a single known string; not twelve.


    Alternative (64 bit) version with exposition
    An improvement to monthOrd() has revealed itself. The code above uses (16+1) bytes for its string LUT; the code below uses half of that. For posterity, the scheme is explained, with examples provided.

    NB: monthOrd() (above) returns 1-12 for a valid month name (or false positive) and 0 for other hash values in the range 0-15. Below returns 0-11 for valid month names (conforming to struct tm convention) and 12 for each of the four other hashes in the range 0-15. False positives don't magically go away.

    /*
    '#' denotes hashing the ASCII values of 2nd & 3rd letters
    
    Apr: (p#r) =>  0   /want    3 = 0011 -> 0x3
    Sep: (e#p) =>  1   /want    8 = 1000 -> 0x8
    May: (a#y) =>  2   /want    4 = 0100 -> 0x4
    ---: (-#-) =>  3   /dont care = 1100 -> 0xc // 12 out of range
    Mar: (a#r) =>  4   /want    2 = 0010 -> 0x2
    Feb: (e#b) =>  5   /want    1 = 0001 -> 0x1
    ---: (-#-) =>  6   /dont care = 1100 -> 0xc // 12 out of range
    Dec: (e#c) =>  7   /want   11 = 1011 -> 0xB
    Oct: (c#t) =>  8   /want    9 = 1001 -> 0x9
    Jun: (u#n) =>  9   /want    5 = 0101 -> 0x5
    ---: (-#-) => 10   /dont care = 1100 -> 0xc // 12 out of range
    Aug: (u#g) => 11   /want    7 = 0111 -> 0x7
    Jan: (a#n) => 12   /want    0 = 0000 -> 0x0
    Jul: (u#l) => 13   /want    6 = 0110 -> 0x6
    ---: (-#-) => 14   /dont care = 1100 -> 0xc // 12 out of range
    Nov: (o#v) => 15   /want   10 = 1010 -> 0xA
    
    ==>> uint64_t LUT = 0xAc607c59Bc12c483;
    */
    
    #include <stdio.h>
    #include <stdint.h>
    
    int main( void ) {
        const uint64_t LUT = 0xAc607c59Bc12c483;
    
        char *mon[] = {
            "jan", "feb", "mar", "apr", "may", "jun",
            "JUL", "AUG", "SEP", "OCT", "NOV", "DEC",
            NULL
        };
    
        for( char **pm = mon; *pm; pm++ ) {
        //  char a = pm[0][0]; // irrelevant
            char b = pm[0][1];
            char c = pm[0][2];
    
            int bmngl  = b >> 2 & 0x7; // mangle the bits of 2nd character
            int cmngl  = c << 1;       // mangle the bits of 3rd character
            int LUTidx = bmngl ^ cmngl & 0xF; // combine and mask
    
            int rShift = 4 * LUTidx; // will right shift LUT 4*n bits
    
            int monOrd = (int)( LUT >> rShift ) & 0xF; // shift and mask
    
            printf("%s: hash = %2d ==> LU1 = %2d, LU2 = %2d, LU3 = %2d\n",
                pm[0],
                LUTidx,
                monOrd,
                // casting to int simulates return: int monthOrd( char *name );
                (int)( LUT >> 4 * ( ( b / 4 & 7 ) ^ ( c * 2 ) & 15 ) & 15 ),
                (int)( LUT>>4*(b/4&7^c*2&15)&15 ) // confident of precedence
            );
        }
    
        puts( "\nFalse positives:" );
        char *junk[] = { "COVID-19", "junk", "technical", "fopbar", NULL };
    
        /*
         * The hash function appears below as the 3rd parameter of printf()
         */
    
        for( char **pj = junk; *pj; pj++ ) {
            char b = pj[0][1];
            char c = pj[0][2];
            printf("%10s: LU = %2d\n", pj[0], (int)( LUT>>4*(b/4&7^c*2&15)&15 ) );
        }
    
        return 0;
    }
    

    Output:

    jan: hash = 12 ==> LU1 =  0, LU2 =  0, LU3 =  0
    feb: hash =  5 ==> LU1 =  1, LU2 =  1, LU3 =  1
    mar: hash =  4 ==> LU1 =  2, LU2 =  2, LU3 =  2
    apr: hash =  0 ==> LU1 =  3, LU2 =  3, LU3 =  3
    may: hash =  2 ==> LU1 =  4, LU2 =  4, LU3 =  4
    jun: hash =  9 ==> LU1 =  5, LU2 =  5, LU3 =  5
    JUL: hash = 13 ==> LU1 =  6, LU2 =  6, LU3 =  6
    AUG: hash = 11 ==> LU1 =  7, LU2 =  7, LU3 =  7
    SEP: hash =  1 ==> LU1 =  8, LU2 =  8, LU3 =  8
    OCT: hash =  8 ==> LU1 =  9, LU2 =  9, LU3 =  9
    NOV: hash = 15 ==> LU1 = 10, LU2 = 10, LU3 = 10
    DEC: hash =  7 ==> LU1 = 11, LU2 = 11, LU3 = 11
    
    False positives:
      COVID-19: LU = 10 // not "November" or any other month name
          junk: LU =  5 // not "June" or any other month name
     technical: LU = 11 // not "December" or any other month name
        fopbar: LU = 12 // outside of range 0-11. definitely not a month name!
    

    Update:
    Inspired by the clarity of code in the recent answer by @chrqlie, there's this... It compiles to the same amount of assembler, but doesn't have the baggage of additional data. (Requiring a 64bit machine, it's probably not suitable for MCUs.) Again, the size of the LUT is only 8 bytes...

    static int monthOrd( char *mn ) {
    /*
     * Branchless hashing function
     */
    #define HASH(a,b,c) ((((b + b + c)&15) - ((b & 3) == 3))*4)
    
        /*
         * compile-time initialisation of one uint64_t
         */
        static const uint64_t LUT
            = (uint64_t) 1 << HASH('j','a','n')
            | (uint64_t) 2 << HASH('f','e','b')
            | (uint64_t) 3 << HASH('m','a','r')
            | (uint64_t) 4 << HASH('a','p','r')
            | (uint64_t) 5 << HASH('m','a','y')
            | (uint64_t) 6 << HASH('j','u','n')
            | (uint64_t) 7 << HASH('j','u','l')
            | (uint64_t) 8 << HASH('a','u','g')
            | (uint64_t) 9 << HASH('s','e','p')
            | (uint64_t)10 << HASH('o','c','t')
            | (uint64_t)11 << HASH('n','o','v')
            | (uint64_t)12 << HASH('d','e','c')
            ;
    
        /*
         * Branchless run-time hashing of passed string
         */
        return -1 + (0xF & (int)(LUT >> HASH('x', mn[1], mn[2])));
    #undef HASH
    }
    

    The first version of monthOrd() calculated a 4 bit hash and returned 0 for 4/16 cases where the passed string bore no resemblance to the name of a month (a true negative). It returned 1-12 for 12/16 true positive and false positive cases. This was 'improved' in a later version, returning 0-11 to follow the convention of struct tm. Unfortunately, this meant the caller would have to use strcmp() to discern "Jan(uary)" (0) from the 4 true negative return values (also 0.) Not good!

    The most recent version still returns the desired 0-11 for the 12 hashed month names (including false positives). This version now returns -1 for the 4/16 true negative cases. The caller need not invoke strcmp() when monthOrd( str ) returns -1.