While standard algorithms should be as optimized as possible based on the iterator category, it seems to me like a lot of performance is left on the table by not being able to consider the underlying container.
The best example I can think of is `std::rotate`, which has to be implemented using element-wise swaps, leading to O(n) runtime with n = distance(first, last), while the underlying containers might provide a better interface:

- `std::(forward_)list` - O(1) using `splice` for all cases (see the sketch below)
- `std::deque` - O(d), where d = min(distance(first, middle), distance(middle, last)), using repeated `pop_front` and `push_back` (or the other way around) when rotating the entire container, so O(1) for the common case of rotating by a single step
- `std::vector` - technically always O(n) due to the removal or insertion at the beginning, but it might be better optimized for trivially copyable types by copying multiple elements at once

Is my reasoning correct here? Can the cases above be optimized with `std::ranges::rotate`?
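To illustrate the list case, here is a minimal sketch of what I have in mind; `rotate_list` is just a name I made up, not an existing standard facility:

```cpp
#include <iostream>
#include <iterator>
#include <list>

// Hypothetical helper: rotate a std::list so that `middle` becomes the new
// front element. Only node pointers are rewired (a same-list splice of an
// iterator range is constant time), so no elements are moved or copied.
template <typename T>
void rotate_list(std::list<T>& l, typename std::list<T>::iterator middle)
{
    // Move the prefix [begin, middle) to the end of the list.
    l.splice(l.end(), l, l.begin(), middle);
}

int main()
{
    std::list<int> l{1, 2, 3, 4, 5};
    rotate_list(l, std::next(l.begin(), 2)); // make 3 the new front
    for (int x : l)
        std::cout << x << ' ';               // prints: 3 4 5 1 2
}
```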
The STL generally provides the best generic algorithms currently available. A common misunderstanding is to think they are the best, period, which is largely false...
Most STL containers or algorithms can easily be outperformed in terms of speed or memory usage, provided you lock down certain aspects specific to your needs: a fixed size known at compile time (allowing static allocation), particular use cases (e.g. a circular buffer of size `2^N`, where the index can wrap using a binary AND), specific data types (enabling tweaks in alignment, allocation, etc.), operation without mutexes, and who knows what else. And this remains true even when using STL specializations, custom allocators, traits, ranges, or any other technique that lets you "finely" tune the containers used. The underlying issue is that the added code is not always "simple", nor necessarily optimized correctly during compilation. And the worst part is that you might waste an infernal amount of time trying to customize the STL and benchmarking your attempts, only to give up and fall back to plan B: custom code.
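As a concrete illustration of the `2^N` circular-buffer trick mentioned above, here is a minimal sketch; the class name and interface are made up for the example and are not part of any library:

```cpp
#include <array>
#include <cstddef>
#include <iostream>

// Fixed-size circular buffer whose capacity is a power of two, so the
// read/write indices wrap with a cheap binary AND instead of a modulo.
// Illustrative only: no bounds checking beyond full/empty, no thread safety.
template <typename T, std::size_t N>
class RingBuffer {
    static_assert(N > 0 && (N & (N - 1)) == 0, "capacity must be a power of two");
    std::array<T, N> data_{};   // static allocation, size known at compile time
    std::size_t head_ = 0;      // total number of elements read so far
    std::size_t tail_ = 0;      // total number of elements written so far

public:
    bool push(const T& value) {
        if (tail_ - head_ == N) return false;   // full
        data_[tail_++ & (N - 1)] = value;       // wrap with AND, not %
        return true;
    }
    bool pop(T& out) {
        if (tail_ == head_) return false;       // empty
        out = data_[head_++ & (N - 1)];
        return true;
    }
};

int main()
{
    RingBuffer<int, 8> rb;
    for (int i = 0; i < 5; ++i) rb.push(i);
    int v;
    while (rb.pop(v)) std::cout << v << ' ';    // prints: 0 1 2 3 4
}
```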
In short, you're trading away generality, adaptability, and interface richness in exchange for increased performance and/or a reduced memory footprint, with the (potentially major) drawback of reduced, or even near-zero, maintainability. The gain might be negligible, even invisible to the naked eye (e.g. parsing a config file), or critical (20% faster execution in a critical loop). It's your project that will determine whether the optimization makes sense. If your custom code lies in a "buried" part of the project, never touched and very thoroughly testable (and thus likely bug-free), then having specific code isn't a problem: back in the day, people even inserted assembly code directly for this reason, code that probably still exists and will never be touched again.
But what is certain is that trying to optimize from the start is a very bad idea: in 95% of cases you'll waste time (and thus money) and sacrifice maintainability (costing even more money when fixes or future changes are needed), all for an insignificant gain. So do your job with standard STL components, and IF a performance issue is identified, then consider the specific algorithm that might (or might not...) solve it.
Because it's crucial to remember that these kinds of optimizations are, in general, MARGINAL – except in critical loops executed millions or billions of times. The real optimizations, the ones that yield an order-of-magnitude improvement, come from changing the underlying algorithm, not just the containers used to implement it.