c++randomdata-structuresprobability-distributiontreap

Generating random priorities for a treap in C++


I am creating a treap, and I want to know, which random number generator is most suitable for generating priorities at insertion.

The data set is about 6000 items long.

I am modifying an existing template class(largely just declared methods without definitions) that was given to us. The predefined generator is std::default_random_engine which only generates pseudo-random numbers. I would like to know, if this generator is sufficient, and if not, what are the alternatives? The data will be read from a file all at once.

The random number generator is declared as:

std::default_random_engine* generator_;

It's only used when creating in a constructor of a wrapper class

TreapItem<K, T>(key, data, (*generator_)())

I'd like to have the least number of collisions possible. Is std::default_random_engine* generator_; enough, to achieve no collisions, or is there a need for some other generator?

EDIT: I'd prefer uniform distribution, or something that is close to it. Normal distribution might work too, however.

The pointer to the generator was in the given code, it didn't appear as a flaw at first glance.


Solution

  • This is a simple (but not exhaustive!) benchmark of the c++ random generators plus the ancient C rand function and a simple rot-xor generator.

    There is a simple smoke test, taking a few bits from the middle of the number, but by no means crypto-proof.

    I think they would all work well for a randomised binary search tree.

    #include <random>
    #include <iostream>
    #include <chrono>
    #include <stdlib.h>
    
    struct rot_xor {
      int32_t seed = 0x95abcfad;
      inline uint32_t operator() () {
        return seed = (seed << 1) ^ ((seed >> 31) & 0xa53a9be9);
      }
    };
    
    struct crand {
      int32_t seed = 0x95abcfad;
      inline uint32_t operator() () {
        return rand();
      }
    };
    
    template <class Generator>
    void benchmark(std::vector<int> &histo) {
      Generator r;
      int mask = histo.size() - 1;
      for (int i = 0; i != 10000000; ++i) {
        uint32_t val = (uint32_t)r();
        histo[(val>>16) & mask]++;
      }
    }
    
    int main() {
      using std::chrono::high_resolution_clock;
      using std::chrono::duration_cast;
      using std::chrono::microseconds;
    
      for (int i = 0; i != 9; ++i) {
        std::vector<int> histo(0x100);
        auto t0 = high_resolution_clock::now();
        switch (i) {
          case 0: benchmark<std::minstd_rand0>(histo); break;
          case 1: benchmark<std::minstd_rand>(histo); break;
          case 2: benchmark<std::mt19937>(histo); break;
          case 3: benchmark<std::mt19937_64>(histo); break;
          case 4: benchmark<std::ranlux24_base>(histo); break;
          case 5: benchmark<std::ranlux48_base>(histo); break;
          case 6: benchmark<std::default_random_engine>(histo); break;
          case 7: benchmark<crand>(histo); break;
          case 8: benchmark<rot_xor>(histo); break;
        }
        auto t1 = high_resolution_clock::now();
    
        int min_histo = histo[0];
        int max_histo = histo[0];
        for (auto h : histo) {
          min_histo = std::min(min_histo, h);
          max_histo = std::max(max_histo, h);
        }
        std::cout << "test " << i << " took " << duration_cast<microseconds>(t1-t0).count() << "us\n";
        std::cout << " smoke test = " << min_histo << " .. " << max_histo << "\n";
      }
    }
    

    Results show surprising performance for the rather complex C++ defaults, only 3-5 times slower than a simple RNG. The best of the standard ones seems to be the subtract with carry versions ranlux_*. The old C rand() function, which I think contains a divide, is unsurprisingly the slowest.

    test 0 took 58066us
     smoke test = 38486 .. 39685
    test 1 took 39310us
     smoke test = 38533 .. 39604
    test 2 took 26382us
     smoke test = 38503 .. 39591
    test 3 took 29146us
     smoke test = 38591 .. 39670
    test 4 took 27721us <- not bad, ranlux24
     smoke test = 38419 .. 39597
    test 5 took 27310us
     smoke test = 38608 .. 39622
    test 6 took 38629us
     smoke test = 38486 .. 39685
    test 7 took 65377us
     smoke test = 38551 .. 39541
    test 8 took 10984us <-- fastest (rot-xor)
     smoke test = 38656 .. 39710