Tags: c++, string, algorithm, performance, suffix-tree

Strange algorithm performance


For context, I wrote this algorithm to get the number of unique substrings of any string. It builds the suffix tree for the string, counting the nodes it contains, and returns that count as the answer. The problem I wanted to solve required an O(n) algorithm, so this question is only about how this code behaves and not about how bad it is at what it does.

struct node{
    char value = ' ';
    vector<node*> children;
    ~node()
    {
        for (node* child: children)
        {
            delete child;
        }
    }
};

int numberOfUniqueSubstrings(string aString, node*& root)
{
    root = new node();
    int substrings = 0;
    for (int i = 0; i < aString.size(); ++i)
    {
        string tmp = aString.substr(i, aString.size());
        node* currentNode = root;
        char indexToNext = 0;
        for (int j = 0; j < currentNode->children.size(); ++j)
        {
            if (currentNode->children[j]->value == tmp[indexToNext])
            {
                currentNode = currentNode->children[j];
                j = -1;
                indexToNext++;
            }
        }
        for (int j = indexToNext; j < tmp.size(); ++j)
        {
            node* theNewNode = new node;
            theNewNode->value = tmp[j];
            currentNode->children.push_back(theNewNode);
            currentNode = theNewNode;
            substrings++;
        }
    }
    return substrings;
}

I decided to benchmark this algorithm: I simply looped over a big string, taking a bigger substring each iteration, calling numberOfUniqueSubstrings on it and measuring how long it took to finish.

I plotted it in Octave, and this is what I got (x is string size and y is time in microseconds):

Performance plot, x is string length and y is time in microseconds

I first thought the problem lay in the input string, but it's just an alphanumeric string I took from a book (any other text behaves just as strangely).

I also tried averaging many calls to the function with the same parameter, and the result is pretty much the same.

This is compiled with g++ problem.cpp -std=c++14 -O3, but it seems to do the same on -O2 and -O0.

Edit: After @interjay's answer I tried doing just that, which leaves the function as:

int numberOfUniqueSubstrings(string aString, node*& root)
{
    root = new node();
    int substrings = 0;
    for (int i = 0; i < aString.size(); ++i)
    {
        node* currentNode = root;
        char indexToNext = i;
        for (int j = 0; j < currentNode->children.size(); ++j)
        {
            if (currentNode->children[j]->value == aString[indexToNext])
            {
                currentNode = currentNode->children[j];
                j = -1;
                indexToNext++;
            }
        }
        for (int j = indexToNext; j < aString.size(); ++j)
        {
            node* theNewNode = new node;
            theNewNode->value = aString[j];
            currentNode->children.push_back(theNewNode);
            currentNode = theNewNode;
            substrings++;
        }
    }
    return substrings;
}

And it is indeed a bit faster. But no less strange, for I plotted this:

plot without tmp string

Something is happening at x = 1000, and I have no clue what it could be.

Another plot for good measure:

another performance plot

I've now run gprof for a string of size 999:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
100.15      0.02     0.02      974    20.56    20.56  node::~node()
  0.00      0.02     0.00   498688     0.00     0.00  void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&)
  0.00      0.02     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z7imprimePK4node
  0.00      0.02     0.00        1     0.00     0.00  numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&)
            Call graph


granularity: each sample hit covers 2 byte(s) for 49.93% of 0.02 seconds

index % time    self  children    called     name
                               54285             node::~node() [1]
                0.02    0.00     974/974         test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
[1]    100.0    0.02    0.00     974+54285   node::~node() [1]
                               54285             node::~node() [1]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00    0.02                 test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
                0.02    0.00     974/974         node::~node() [1]
                0.00    0.00       1/1           numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
-----------------------------------------------
                0.00    0.00  498688/498688      numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
[10]     0.0    0.00    0.00  498688         void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&) [10]
-----------------------------------------------
                0.00    0.00       1/1           __libc_csu_init [21]
[11]     0.0    0.00    0.00       1         _GLOBAL__sub_I__Z7imprimePK4node [11]
-----------------------------------------------
                0.00    0.00       1/1           test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
[12]     0.0    0.00    0.00       1         numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
                0.00    0.00  498688/498688      void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&) [10]
-----------------------------------------------

And for a string of size 1001:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
100.15      0.02     0.02      974    20.56    20.56  node::~node()
  0.00      0.02     0.00   498688     0.00     0.00  void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&)
  0.00      0.02     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z7imprimePK4node
  0.00      0.02     0.00        1     0.00     0.00  numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&)


            Call graph


granularity: each sample hit covers 2 byte(s) for 49.93% of 0.02 seconds

index % time    self  children    called     name
                               54285             node::~node() [1]
                0.02    0.00     974/974         test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
[1]    100.0    0.02    0.00     974+54285   node::~node() [1]
                               54285             node::~node() [1]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00    0.02                 test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
                0.02    0.00     974/974         node::~node() [1]
                0.00    0.00       1/1           numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
-----------------------------------------------
                0.00    0.00  498688/498688      numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
[10]     0.0    0.00    0.00  498688         void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&) [10]
-----------------------------------------------
                0.00    0.00       1/1           __libc_csu_init [21]
[11]     0.0    0.00    0.00       1         _GLOBAL__sub_I__Z7imprimePK4node [11]
-----------------------------------------------
                0.00    0.00       1/1           test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
[12]     0.0    0.00    0.00       1         numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
                0.00    0.00  498688/498688      void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&) [10]
-----------------------------------------------


Index by function name

  [11] _GLOBAL__sub_I__Z7imprimePK4node [1] node::~node()
  [12] numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [10] void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&)

However, it seems that running the profiler removes the effect, and the times are pretty much the same in both cases.


Solution

  • Most people's working hypothesis seems to be that there's some kind of magic number hard coded into the libraries that results in a phase transition in performance around 999-1000 (except for LSerni, who makes the prescient observation that there may be multiple magic numbers).

    I'll try to systematically explore this and a few other hypotheses below (source code is available at the end of this answer).

    I then ran my code to see if I could duplicate your results on my Intel(R) Core(TM) i5 CPU M480, Linux 4.8.0-34-generic machine, using G++ 6.2.0-5ubuntu2 as my compiler with -O3 optimizations.

    Sure enough, there's a magic drop from 999-1000 (and, another one near 1600):

    Data from one machine

    Note that my trans-1000 dataset is not as clean as yours: this may be because I'm playing with a few other things in the background on my machine whereas you had a quieter testing environment.

    My next question was: is this magic 1000 number stable between environments?

So I tried running the code on an Intel(R) Xeon(R) CPU E5-2680 v3, Linux 2.6.32-642.6.1.el6.x86_64 machine, using G++ 4.9.2. And, not surprisingly, the magic number was different, occurring at 975-976:

    Data from another machine

    This tells us that, if there was a magic number, it's changed between versions. This decreases my confidence in the magic number theory for a few reasons. (a) It changes. (b) 1000+24 bytes of overhead is a good candidate for magic. 975+49 bytes is less so. (c) The first environment has better software on a slower processor, yet the first environment shows what I would consider worse performance: waiting until 1000 to speed things up. This seems like a regression.

    I tried a different test: running the program with different random input data. That gives this result:

    Data from multiple runs

    The salient point in the above graph is that the 999-1000 drop is not so special. It looks like many of the drops before it: a slow decrease in speed followed by a sharp improvement. It's also worth noting that many of the previous drops do not align.

    This suggested to me that this is an input-dependent behavior and that there's correlation between runs. Therefore, I wondered what would happen if I reduced the correlation between runs by randomizing their order. This gave:

    Randomized order of plots

    Something's still happening around 999-1000:

    Randomized order of plots (zoom)

    Let's zoom in even more:

    Randomized order of plots (super zoom)

    Running this on the faster computer with the older software gives a similar result:

    Randomized order of plots on faster machine

    Zoomed:

    Randomized order of plots on faster machine (zoomed)

    Since randomizing the order in which strings of different lengths are considered essentially eliminated the slow build-up between runs (the aforementioned correlation), this suggests that the phenomenon you're seeing requires some sort of global state. Therefore, C++ string/vector cannot be an explanation. Therefore, malloc, "the OS", or architectural constraints must be the explanation.

    Note that when the order of lengths is randomized, there's a point at which the code runs slower rather than faster. In my mind this is consistent with some sort of cache size being exceeded, but the noise in the signal coupled with the very first plot in this post also suggest possible memory fragmentation. Therefore, I decided to restart the program before each run to ensure a fresh heap. That resulted in the following:

    Randomized order of plots with a fresh heap

    And now we see that there are no more breaks or jumps. This suggests that cache size was not the issue but, rather, that the observed behavior has something to do with the program's overall memory usage.

    Another argument against a caching effect is as follows. Both machines have 32kB and 256kB L1 and L2 caches, so their cache performance should be similar. My slow machine has a 3,072kB L3 cache. If you assume a 4kB page per allocation, 1000 nodes gives 4,000kB allocated, which is close to the cache size. However, the fast machine has a 30,720kB L3 cache and shows a break at 975. If the phenomenon were a caching effect, you would expect the break, if anything, to come later. Therefore, I'm pretty sure caching is not at work here.

    The only remaining culprit is malloc.

    Why is this happening? I'm not sure. But, as a programmer, I don't think I need to care, for the following reasons.

    There's probably an explanation for this, but it's at a level that's too deep to change or really worry about. I could do something exotic to fix it, but that would require thinking about what's going on somewhere in its dark underbelly. We use higher-level languages like C++ specifically to avoid messing with those kinds of details unless we really have to.

    And my results say we don't have to in this case. (a) The last graph tells us that any independent run of the code is likely to exhibit near-optimal behavior, (b) randomizing sequential runs can level performance, and (c) the loss in efficiency is on the order of a hundredth of a second, which is entirely acceptable unless you're processing massive amounts of data.

    Source code follows. Note that the code changes your version's char indexToNext to int indexToNext, fixing possible integer overflow issues. Testing interjay's suggestion that we avoid making copies of the string actually resulted in worse performance.

    #include <string>
    #include <chrono>
    #include <cstdlib>
    #include <iostream>
    #include <vector>
    #include <time.h>
    #include <algorithm>
    
    struct profiler
    {
      std::string name;
      std::chrono::high_resolution_clock::time_point p;
      profiler(std::string const &n) :
          name(n), p(std::chrono::high_resolution_clock::now()) { }
      ~profiler()
      {
          using dura = std::chrono::duration<double>;
          auto d = std::chrono::high_resolution_clock::now() - p;
          std::cout //<< name << ": "
              << std::chrono::duration_cast<dura>(d).count()
              << std::endl;
      }
    };
    
    #define PROFILE_BLOCK(pbn) profiler _pfinstance(pbn)
    
    struct node {
      char value = ' ';
      std::vector<node*> children;
      ~node(){
        for (node* child: children)
          delete child;
      }
    };
    
    int numberOfUniqueSubstrings(const std::string aString, node*& root)
    {
        root = new node();
        int substrings = 0;
        for (int i = 0; i < aString.size(); ++i)
        {
            node* currentNode = root;
            int indexToNext = i;
            for (int j = 0; j < currentNode->children.size(); ++j)
            {
                if (currentNode->children[j]->value == aString[indexToNext])
                {
                    currentNode = currentNode->children[j];
                    j = -1;
                    indexToNext++;
                }
            }
            for (int j = indexToNext; j < aString.size(); ++j)
            {
                node* theNewNode  = new node;
                theNewNode->value = aString[j];
                currentNode->children.push_back(theNewNode);
                currentNode = theNewNode;
                substrings++;
            }
        }
        return substrings;
    }
    
    
    int main(int argc, char **argv){
      const int MAX_LEN = 1300;
    
      if(argc==1){
        std::cerr<<"Syntax: "<<argv[0]<<" <SEED> [LENGTH]"<<std::endl;
        std::cerr<<"A seed of -1 means all lengths are explored and the input is randomized from the time."<<std::endl;
        std::cerr<<"A positive seed sets the seed and explores a single input of length LENGTH."<<std::endl;
        return -1;
      }
    
      int seed = std::stoi(argv[1]);
    
      if(seed==-1)
        srand(time(NULL));
      else
        srand(seed);
    
      //Generate a random string of the appropriate length
      std::string a;
      for(int fill=0;fill<MAX_LEN;fill++)
          a.push_back('a'+rand()%26);
    
      //Generate a list of lengths of strings to experiment with
      std::vector<int> lengths_to_try;
      if(seed==-1){
        for(int i=1;i<MAX_LEN;i++)
          lengths_to_try.push_back(i);
      } else {  
        lengths_to_try.push_back(std::stoi(argv[2]));
      }
    
      //Randomly sort the string lengths (comment this line out to run them in order)
      std::random_shuffle(lengths_to_try.begin(),lengths_to_try.end());
    
      for(auto len: lengths_to_try){
        std::string test(a.begin(),a.begin()+len);
    
        std::cout<<len<<" ";
        {
          PROFILE_BLOCK("Some time");
          node *n;
          int c = numberOfUniqueSubstrings(test,n);
          delete n;
        }
      }
    }
    

    substr is a "constant"

    OP's original code included the following:

    for (int i = 0; i < aString.size(); ++i)
    {
      string tmp = aString.substr(i, aString.size());
    

    The substr operation here takes O(n) time in the length of the string. In an answer below, it is argued that this O(n) operation results in the poor performance of OP's original code.

    I disagree with this assessment. Due to caching and SIMD operations, CPUs can read and copy data in blocks of up to 64 bytes (or more!) at a time. Because of this, the cost of memory allocation can dominate the cost of copying the string. Thus, for OP's input sizes, the substr operation acts more like an expensive constant than like an additional loop.

    This can be demonstrated via testing by compiling the code with, e.g. g++ temp.cpp -O3 --std=c++14 -g and profiling with, e.g. sudo operf ./a.out -1. The resulting time-use profile looks like this:

    25.24%  a.out    a.out                [.] _ZN4nodeD2Ev        #Node destruction                                                                           
    24.77%  a.out    libc-2.24.so         [.] _int_malloc                                                                                    
    13.93%  a.out    libc-2.24.so         [.] malloc_consolidate                                                                            
    11.06%  a.out    libc-2.24.so         [.] _int_free                                                                                      
     7.39%  a.out    libc-2.24.so         [.] malloc                                                                                        
     5.62%  a.out    libc-2.24.so         [.] free                                                                                          
     3.92%  a.out    a.out                [.] _ZNSt6vectorIP4nodeSaIS1_EE19_M_emplace_back_auxIJRKS1_EEEvDpOT_                              
     2.68%  a.out    a.out                [.]
     8.07%  OTHER STUFF
    

    From which it is evident that memory management dominates the run-time.