c++c++11stlcontainersunordered-set

Fastest Way to Determine if Character Belongs to a Set of Known Characters C++


Given any character, what's the fastest way for me to determine if that character belongs to a set (not the container-type) of known characters.

In other words, what's the fastest elegant way to implement the conditional:
char c = 'a';
if(c == ch1 || c == ch2 || c == ch3 ...) // Do something...

Is there an STL container (I'm thinking it might be unordered_set?) that I can just pass the character as a key to and it'll return true if the key exists?

Anything with an O(1) lookup time will work for me.


Solution

  • I went a little further and wrote two versions, one based on a lookup array, the other on a set using an underlying hash.

    class CharLookup {
    public:
      CharLookup(const std::string & set) : lookup(*std::max_element(set.begin(), set.end()) + 1) {
        for ( auto c : set) lookup[c] = true;
      }
      inline bool has(const unsigned char c) const {
        return c > lookup.size() ? false : lookup[c];
      }
    private:
      std::vector<bool> lookup;
    };
    
    class CharSet {
    public:
      CharSet(const std::string & cset) {
        for ( auto c : cset) set.insert(c);
      }
      inline bool has(const unsigned char c) const {
        return set.contains(c);
      }
    private:
      QSet<unsigned char> set;
    };
    

    Then wrote a little benchmark, added a few more containers for the sake of comparison. Lower is better, the data points are for "character set size / text size":

    enter image description here

    Seems like for short char sets and text, std::string::find_first_of is fastest, even faster than using a lookup array, but quickly dwindles as the test size increases. std::vector<bool> seems like the "golden mean", QBitArray probably has a little different implementation because it pulls ahead as the test size increases, at the largest test QVector<bool> is fastest, presumably because it doesn't have the overhead of bit access. The two hash sets are close, trading places, last and least there is the std::set.

    Tested on an i7-3770k Win7 x64 box, using MinGW 4.9.1 x32 with -O3.