Tags: c++, c++20, string-view

Is `std::map<std::string, int>` faster than `std::map<std::string_view, int>`?


I wanted to index substrings of a std::string whose lifetime lasts for the entire program, and relate each substring to an int using a good old std::map. The substrings are delimited by blank spaces (' '). I generated the string and stored it in a file using this Python code:

import random
all_chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ 123456789!\"·$%&/()=?¿¡'|@#¬"
for i in range(0, 50000000):
    r = int(random.random() * len(all_chars))
    print(all_chars[r], end = '')

So I wanted to compare the efficiency of splitting a std::string into substrings stored either as std::string or as std::string_view, via these two (simple) functions:

[[nodiscard]]
std::vector<std::string_view> split_sv(const std::string& s, char c) noexcept {
    std::vector<std::string_view> res;
    std::size_t b = 0;
    std::size_t p = s.find(c);
    while (p != std::string_view::npos) {
        res.push_back( std::string_view{&s[b], p - b} );
        b = p + 1;
        p = s.find(c, b);
    }
    return res;
}

[[nodiscard]]
std::vector<std::string> split_s(const std::string& s, char c) noexcept {
    std::vector<std::string> res;
    std::size_t b = 0;
    std::size_t p = s.find(c);
    while (p != std::string::npos) {
        res.push_back( s.substr(b, p - b) );
        b = p + 1;
        p = s.find(c, b);
    }
    return res;
}
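As an aside, note that both functions above drop whatever follows the last delimiter. A hedged sketch of a variant (here called split_all, a hypothetical name) that also keeps the trailing token, so the views stay valid as long as the source string is alive:

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical variant of split_sv that also keeps the token after the
// last delimiter. The returned views point into s's buffer, so they are
// valid only while s is alive and unmodified.
std::vector<std::string_view> split_all(const std::string& s, char c) {
    std::vector<std::string_view> res;
    std::size_t b = 0;
    std::size_t p = s.find(c);
    while (p != std::string::npos) {
        res.push_back(std::string_view(s).substr(b, p - b));
        b = p + 1;
        p = s.find(c, b);
    }
    res.push_back(std::string_view(s).substr(b)); // trailing token
    return res;
}
```

For the benchmark in the question this difference only costs one token, but for general-purpose splitting it matters.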

The split algorithm is clearly faster for std::string_view than it is for std::string (as expected):

std::string_view  28.6 ms
std::string       107.6 ms

(flags -std=c++20 -O3, gcc 11.4.0)
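For context, intervals like these could be measured with a small std::chrono helper; this is only a sketch with a hypothetical time_ms function, not the actual harness used (that one is on quick-bench):

```cpp
#include <chrono>

// Hypothetical helper: run f once and return the elapsed wall-clock
// time in milliseconds, using a monotonic clock.
template <class F>
double time_ms(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

In a real benchmark you would also repeat the measurement and prevent the compiler from optimizing the work away, as quick-bench does.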

However, storing these substrings into a std::map is slower for std::string_view than for std::string.

{
    const std::vector<std::string_view> tokens = split_sv(s, ' ');
    std::map<std::string_view, int> m;
    for (const std::string_view& t : tokens) {
        m.insert( {t, 0} );
    }
}
{
    const std::vector<std::string> tokens = split_s(s, ' ');
    std::map<std::string, int> m;
    for (const std::string& t : tokens) {
        m.insert( {t, 0} );
    }
}

The execution time of only the for loops above is:

std::string_view  445.9 ms
std::string       411.7 ms

(flags -std=c++20 -O3, gcc 11.4.0)

The difference is not that large, but I expected std::string_view to be faster in both parts of the program.

Why is std::map<std::string, int> faster in gcc 11.4.0? Is std::map<std::string, int> generally faster than std::map<std::string_view, int>?

Full C++ code in compiler explorer and also in quick-bench.

I've run the code with 3 different seed strings, each generated with a different number of blank spaces (NBS) in the all_chars alphabet: one with a single blank space, one with 10, and one with 20. The table below shows the execution time (in seconds) for each seed string. I thought it would be interesting since the quick-bench results do not agree with this.

| NBS | Tokens / avg. token length (chars) | Function | map<string_view> | map<string> | unordered_map<string_view> | unordered_map<string> |
|----:|-----------------------------------:|----------|-----------------:|------------:|---------------------------:|----------------------:|
|   1 |  617864 / 83.97 | split | 0.010 | 0.153 | 0.023 | 0.025 |
|     |                 | store | 0.441 | 0.408 | 0.106 | 0.167 |
|  10 | 4988259 / 9.35  | split | 0.594 | 0.883 | 0.469 | 0.206 |
|     |                 | store | 4.430 | 3.806 | 1.361 | 1.470 |
|  20 | 8062256 / 5.20  | split | 0.696 | 0.981 | 0.559 | 0.298 |
|     |                 | store | 6.568 | 5.590 | 1.820 | 1.892 |

Solution

  • This comes down to cache locality.

    When you store the data as std::string, only the necessary characters are copied, and they are placed compactly on the heap for the comparatively limited number of keys the map actually keeps. (For small strings, SSO (Small String Optimization) also comes into play, and the characters are stored inside the string object itself, i.e. even "more locally".) This is quite cache-friendly.

    When you work with the tree inside the std::map, most of these strings are accessed frequently, which keeps cache utilization high.

    When you use std::string_view, every key comparison has to dereference into the huge original string, and those accesses are scattered across it. Since that array is large and doesn't fit into the caches, this intensive random jumping across a large memory region leads to poor cache utilization.

    Some details on the cause

    Caches are loaded in lines (typically 64 bytes), so when values are stored with high locality, one load is likely to bring in several useful values at once. Conversely, with poor locality you need to load more lines, which eventually leads to thrashing and cache performance degradation.

    On other causes and comparisons

    Regarding the other performance comparisons and concerns in the original question: I would suggest opening new questions on those specific topics, as SO recommends, since they involve very subtle and specific performance effects and require additional measurements and research that would dilute the current topic. Nevertheless, here is some input, as the author requested, for further research.

    std::unordered_map requires fewer reads from memory on average: while std::map needs O(log n) reads per lookup, std::unordered_map needs O(1) (even with a few colliding elements per bucket), so the effect of sparse memory addressing decreases dramatically. On the other hand, it pays for hash-code calculation (cheap) and for the hash table's extra memory and the reads from it (not so expensive). So there are at least three opposing influences here; overall, in most cases the effect of sparse addressing becomes quite small, and std::string_view's avoidance of extra memory allocations and std::string object construction starts to pay off. For small strings, SSO again helps std::string. So this is a balance of at least five influences at play in this case.

    Implementations from different libraries and compilers could differ in many terms, including approach, source code implementation, advanced CPU instructions usage, cache utilization, etc., so they definitely worth putting in separate questions with lots of research. It is not enough to measure only on one type of CPU/RAM, etc. to say that something of such matter is faster than something else; it could be much slower on another architecture. And it would be nice to separate measurements with different numbers of tokens and average length, since they would affect different influencers and in order to have full picture, they must be separated; performance would depend differently from both. Again, these effects are too subtle and too specific to cover in one question.