c++ · algorithm · sorting · linked-list · mergesort

`std::list<>::sort()` - why the sudden switch to top-down strategy?




I remember that since the beginning of time the most popular approach to implementing std::list<>::sort() was the classic Merge Sort algorithm implemented in bottom-up fashion (see also What makes the gcc std::list sort implementation so fast?).

I remember seeing someone aptly refer to this strategy as the "onion chaining" approach.

At least that's the way it is in GCC's implementation of the C++ standard library (see, for example, here). And this is how it was in the old Dinkumware STL in MSVC's version of the standard library, as well as in all versions of MSVC all the way up to VS2013.
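The classic bottom-up scheme can be sketched roughly like this (a minimal illustration with made-up names and bin count, not the actual library code):

```cpp
#include <list>

// Sketch of the bottom-up "onion chaining" scheme: bins[i] is either empty or
// holds a sorted run of exactly 2^i elements; each node taken from the list is
// merged "up" through the bins like a carry propagating in binary addition.
// Handles up to about 2^MAXBINS elements. Note that this pattern relies on
// default-constructing an array of lists.
template <typename T>
void bottom_up_sort(std::list<T>& lst) {
    if (lst.size() < 2)
        return;
    const int MAXBINS = 25;
    std::list<T> carry;
    std::list<T> bins[MAXBINS + 1];
    int max_used = 0;
    while (!lst.empty()) {
        carry.splice(carry.begin(), lst, lst.begin()); // take one node
        int i = 0;
        while (i < max_used && !bins[i].empty()) {
            bins[i].merge(carry); // merge the 2^i run with carry's contents
            carry.swap(bins[i++]);
        }
        carry.swap(bins[i]); // park the run in the first empty bin
        if (i == max_used)
            ++max_used;
    }
    for (int i = 1; i < max_used; ++i)
        bins[i].merge(bins[i - 1]); // merge runs, newest (smallest) first
    lst.swap(bins[max_used - 1]);
}
```

Since std::list::merge keeps equivalent elements of *this ahead of those of its argument, and older runs always sit in *this here, the result stays stable.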

However, the standard library supplied with VS2015 suddenly no longer follows this sorting strategy. The library shipped with VS2015 uses a rather straightforward recursive implementation of top-down Merge Sort. This strikes me as strange, since the top-down approach requires access to the mid-point of the list in order to split it in half. Since std::list<> does not support random access, the only way to find that mid-point is to literally iterate through half of the list. Also, at the very beginning it is necessary to know the total number of elements in the list (which was not necessarily an O(1) operation before C++11).

Nevertheless, std::list<>::sort() in VS2015 does exactly that. Here's an excerpt from that implementation that locates the mid-point and performs recursive calls:

...
iterator _Mid = _STD next(_First, _Size / 2);
_First = _Sort(_First, _Mid, _Pred, _Size / 2);
_Mid = _Sort(_Mid, _Last, _Pred, _Size - _Size / 2);
...

As you can see, they just nonchalantly use std::next to walk through the first half of the list and arrive at _Mid iterator.

What could be the reason behind this switch, I wonder? All I see is a seemingly obvious inefficiency of repetitive calls to std::next at each level of recursion. Naive logic says that this is slower. If they are willing to pay this kind of price, they probably expect to get something in return. What are they getting then? I don't immediately see this algorithm as having better cache behavior (compared to the original bottom-up approach). I don't immediately see it as behaving better on pre-sorted sequences.

Granted, since C++11 std::list<> is basically required to store its element count, which makes the above slightly more efficient, since we always know the element count in advance. But that still does not seem to be enough to justify the sequential scan on each level of recursion.

(Admittedly, I haven't tried to race the implementations against each other. Maybe there are some surprises there.)


Solution

  • Update - Visual Studio 2022 switched to an efficient recursive implementation of merge sort where each level of recursion just divides an integer size by 2 instead of scanning lists to split them, and switched back to using pointers instead of iterators. The merge logic was improved to splice multiple nodes at a time when possible. The changes preserve the no-memory-allocation and exception-safety properties given as reasons for the VS2015 change.

        template <class _Pr2>
        static _Nodeptr _Sort(_Nodeptr& _First, const size_type _Size, _Pr2 _Pred) {
            // order [_First, _First + _Size), return _First + _Size
            switch (_Size) {
            case 0:
                return _First;
            case 1:
                return _First->_Next;
            default:
                break;
            }
    
            auto _Mid        = _Sort(_First, _Size / 2, _Pred);
            const auto _Last = _Sort(_Mid, _Size - _Size / 2, _Pred);
            _First           = _Merge_same(_First, _Mid, _Last, _Pred);
            return _Last;
        }
    // ...
        void sort() { // order sequence
            sort(less<>{});
        }
    
        template <class _Pr2>
        void sort(_Pr2 _Pred) { // order sequence
            auto& _My_data = _Mypair._Myval2;
            _Scary_val::_Sort(_My_data._Myhead->_Next, _My_data._Mysize, _Pass_fn(_Pred));
        }
    

    This could be slightly improved (about 4% faster) by checking for (_My_data._Mysize < 2) in sort(), which then only requires checking for (_Size == 1) in _Sort().

        template <class _Pr2>
        static _Nodeptr _Sort(_Nodeptr& _First, const size_type _Size, _Pr2 _Pred) {
            // order [_First, _First + _Size), return _First + _Size
            if (_Size == 1)
                return _First->_Next;
            auto _Mid        = _Sort(_First, _Size / 2, _Pred);
            const auto _Last = _Sort(_Mid, _Size - _Size / 2, _Pred);
            _First           = _Merge_same(_First, _Mid, _Last, _Pred);
            return _Last;
        }
    // ...
        void sort() { // order sequence
            sort(less<>{});
        }
    
        template <class _Pr2>
        void sort(_Pr2 _Pred) { // order sequence
            auto& _My_data = _Mypair._Myval2;
            if (_My_data._Mysize < 2)
                return;
            _Scary_val::_Sort(_My_data._Myhead->_Next, _My_data._Mysize, _Pass_fn(_Pred));
        }
    

    The remainder of this answer is historical, mostly about my implementation of a bottom-up merge sort using iterators, written to replace the top-down merge sort of Visual Studio 2015 through 2019.


    Initially I assumed that Microsoft would not have switched to a less efficient top-down merge sort when it switched to using iterators unless it was necessary, so I was looking for alternatives. It was only when I tried to analyze the issues (out of curiosity) that I realized the original bottom-up merge sort could be modified to work with iterators.

    In a comment, @sbi asked the author of the top-down approach, Stephan T. Lavavej, why the change to iterators was made. Stephan's response was "to avoid memory allocation and default constructing allocators". VS2015 introduced non-default-constructible and stateful allocators, which presents an issue for the prior version's internal array of lists: each instance of a list allocates a dummy node, and a change would be needed to handle the absence of a default allocator.

    Lavavej's solution was to switch to using iterators to keep track of run boundaries within the original list instead of an internal array of lists. The merge logic was changed to take 3 iterator parameters: the first is an iterator to the start of the left run, the second is an iterator to the end of the left run (which is also the start of the right run), and the third is an iterator to the end of the right run. The merge process uses std::list::splice to move nodes within the original list during merge operations. This has the added benefit of being exception safe: if a caller's compare function throws an exception, the list will be re-ordered, but no loss of data will occur (assuming splice can't fail). With the prior scheme, some (or most) of the data would be in the internal array of lists if an exception occurred, and that data would be lost from the original list.
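    The exception-safety point can be demonstrated directly (the helper name below is made up for the example; the standard only promises that the order is unspecified after a throwing comparison):

```cpp
#include <cstddef>
#include <list>
#include <stdexcept>

// Returns how many elements remain after std::list::sort is aborted by a
// comparator that throws on its third call. With splice-based merges the
// nodes only ever move within the list, so none of them can be lost.
std::size_t elements_after_failed_sort(std::list<int> lst) {
    int calls = 0;
    try {
        lst.sort([&calls](int a, int b) {
            if (++calls == 3)
                throw std::runtime_error("comparator failed");
            return a < b;
        });
    } catch (const std::runtime_error&) {
        // Order is unspecified here, but every node is still in the list.
    }
    return lst.size();
}
```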

    I changed bottom up merge sort to use an array of iterators instead of an array of lists, where array[i] is an iterator to the start of a sorted run with 2^i nodes, or it is empty (using std::list::end to indicate empty, since iterators can't be null). Similar to the top down approach, the array of iterators is only used to keep track of sorted run boundaries within the original linked list, with the same merge logic as top down that uses std::list::splice to move nodes within the original linked list.

    A single scan of the list is done, building up sorted runs to the left of the current scan position according to the sorted run boundaries in the array, until all nodes are merged into the sorted runs. Then the sorted runs are merged, resulting in a sorted list.

    For example, for a list with 7 nodes, after the scan:

    2                       1           0          array index
    run0->run0->run0->run0->run1->run1->run2->end
    

    Then the 3 sorted runs are merged right to left via merge(left, right), so that the sort is stable.

    If a linked list is large and the nodes are scattered, there will be a lot of cache misses, and top down will be about 40% to 50% slower than bottom up depending on the processor. Then again, if there's enough memory, it would usually be faster to move the list to an array or vector, sort the array or vector, then create a new list from the sorted array or vector.
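    A minimal sketch of that alternative (assuming element values are cheap to copy; use std::stable_sort if stability is required):

```cpp
#include <algorithm>
#include <list>
#include <vector>

// "Copy out, sort flat, copy back": trades O(n) extra memory for sorting
// contiguous, cache-friendly storage. assign() reuses the list's existing
// nodes, so the whole thing is two linear passes plus the vector sort.
template <typename T>
void sort_via_vector(std::list<T>& lst) {
    std::vector<T> v(lst.begin(), lst.end()); // linear pass over the nodes
    std::sort(v.begin(), v.end());            // contiguous sort
    lst.assign(v.begin(), v.end());           // write values back in order
}
```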

    Example C++ code:

    template <typename T>
    typename std::list<T>::iterator Merge(std::list<T> &ll,
                        typename std::list<T>::iterator li,
                        typename std::list<T>::iterator ri,
                        typename std::list<T>::iterator ei);
    
    // iterator array size
    #define ASZ 32
    
    template <typename T>
    void SortList(std::list<T> &ll)
    {
        if (ll.size() < 2)                  // return if nothing to do
            return;
        typename std::list<T>::iterator ai[ASZ]; // array of iterator (bgn lft)
        typename std::list<T>::iterator ri;      // right    iterator (end lft, bgn rgt)
        typename std::list<T>::iterator ei;      // end      iterator (end rgt)
        size_t i;
        for (i = 0; i < ASZ; i++)           // "clear" array
            ai[i] = ll.end();
        // merge nodes into array of runs
        for (ei = ll.begin(); ei != ll.end();) {
            ri = ei++;
            for (i = 0; (i < ASZ) && ai[i] != ll.end(); i++) {
                ri = Merge(ll, ai[i], ri, ei);
                ai[i] = ll.end();
            }
            if(i == ASZ)
                i--;
            ai[i] = ri;
        }
        // merge array of runs into single sorted list
        // ei = ll.end();                              
        for(i = 0; (i < ASZ) && ai[i] == ei; i++);
        ri = ai[i++];
        while(1){
            for( ; (i < ASZ) && ai[i] == ei; i++);
            if (i == ASZ)
                break;
            ri = Merge(ll, ai[i++], ri, ei);
        }
    }
    
    template <typename T>
    typename std::list<T>::iterator Merge(std::list<T> &ll,
                                 typename std::list<T>::iterator li,
                                 typename std::list<T>::iterator ri,
                                 typename std::list<T>::iterator ei)
    {
        typename std::list<T>::iterator ni;
        ni = (*ri < *li) ? ri : li;  // iterator to whichever first node is smaller
        while(1){
            if(*ri < *li){
                ll.splice(li, ll, ri++);
                if(ri == ei)
                    return ni;
            } else {
                if(++li == ri)
                    return ni;
            }
        }
    }
    

    Example replacement code for VS2019's std::list::sort(), in include file list. The merge logic was made into a separate internal function, since it's now used in two places. The call to _Sort from std::list::sort() is _Sort(begin(), end(), _Pred, this->_Mysize());, where _Pred is the compare function object (defaulting to std::less<>()).

    private:
        template <class _Pr2>
        iterator _Merge(_Pr2 _Pred, iterator _First, iterator _Mid, iterator _Last){
            iterator _Newfirst = _First;
            for (bool _Initial_loop = true;;
                _Initial_loop       = false) { // [_First, _Mid) and [_Mid, _Last) are sorted and non-empty
                if (_DEBUG_LT_PRED(_Pred, *_Mid, *_First)) { // consume _Mid
                    if (_Initial_loop) {
                        _Newfirst = _Mid; // update return value
                    }
                    splice(_First, *this, _Mid++);
                    if (_Mid == _Last) {
                        return _Newfirst; // exhausted [_Mid, _Last); done
                    }
                }
                else { // consume _First
                    ++_First;
                    if (_First == _Mid) {
                        return _Newfirst; // exhausted [_First, _Mid); done
                    }
                }
            }
        }
    
        template <class _Pr2>
        void _Sort(iterator _First, iterator _Last, _Pr2 _Pred,
            size_type _Size) { // order [_First, _Last), using _Pred, return new first
                               // _Size must be distance from _First to _Last
            if (_Size < 2) {
                return;        // nothing to do
            }
            const size_t _ASZ = 32;         // array size
            iterator _Ai[_ASZ];             // array of   iterator to run (bgn lft)
            iterator _Mi;                   // middle     iterator to run (end lft, bgn rgt)
            iterator _Li;                   // last (end) iterator to run (end rgt)
            size_t _I;                      // index to _Ai
            for (_I = 0; _I < _ASZ; _I++)   // "empty" array
                _Ai[_I] = _Last;            //   _Ai[] == _Last => empty entry
            // merge nodes into array of runs
            for (_Li = _First; _Li != _Last;) {
                _Mi = _Li++;
                for (_I = 0; (_I < _ASZ) && _Ai[_I] != _Last; _I++) {
                    _Mi = _Merge(_Pass_fn(_Pred), _Ai[_I], _Mi, _Li);
                    _Ai[_I] = _Last;
                }
                if (_I == _ASZ)
                    _I--;
                _Ai[_I] = _Mi;
            }
            // merge array of runs into single sorted list
            for (_I = 0; _I < _ASZ && _Ai[_I] == _Last; _I++);
            _Mi = _Ai[_I++];
            while (1) {
                for (; _I < _ASZ && _Ai[_I] == _Last; _I++);
                if (_I == _ASZ)
                    break;
                _Mi = _Merge(_Pass_fn(_Pred), _Ai[_I++], _Mi, _Last);
            }
        }
    

    I noticed this change back in July 2016 and emailed P.J. Plauger about it on August 1, 2016. A snippet of his reply:

    Interestingly enough, our change log doesn't reflect this change. That probably means it was "suggested" by one of our larger customers and got by me on the code review. All I know now is that the change came in around the autumn of 2015. When I reviewed the code, the first thing that struck me was the line:

        iterator _Mid = _STD next(_First, _Size / 2);
    

    which, of course, can take a very long time for a large list.

    The code looks a bit more elegant than what I wrote in early 1995(!), but definitely has worse time complexity. That version was modeled after the approach by Stepanov, Lee, and Musser in the original STL. They are seldom found to be wrong in their choice of algorithms.

    I'm now reverting to our latest known good version of the original code.