c++ constants stdstring const-cast copy-on-write

Unexpected behavior involving const_cast


I came up with the following example, which exposes some unexpected behavior. I would expect the vector to hold its own independent copy of the string after push_back. Instead, it looks as if the vector's element somehow shares the memory used by str.

Could someone explain what is happening in this example? Is this valid C++ code?

The original problem arises from code responsible for serializing / deserializing messages and it uses const_cast to remove constness. After noticing some unexpected behavior with that code, I created this simplified example, which tries to demonstrate the issue.

#include <vector>
#include <iostream>
#include <string>
using namespace std;
int main()
{
    auto str = std::string("XYZ"); // mutable string
    const auto& cstr(str);         // const ref to it

    vector<string> v;
    v.push_back(cstr);

    cout << v.front() << endl;  // XYZ is printed as expected

    *const_cast<char*>(&cstr[0])='*'; // this will modify the first element in the VECTOR (is this expected?)
    str[1]='#';  //

    cout << str << endl;  // prints *#Z as expected
    cout << cstr << endl; // prints *#Z as expected
    cout << v.front() << endl; // Why *YZ is printed, not XYZ and not *#Z ?

    return 0;
}

Solution

  • Understanding the bug

    The unexpected behavior is caused by a quirk of a deprecated implementation of std::string. Older versions of GCC implemented std::string with copy-on-write semantics. It's a clever idea, but it causes bugs like the one you're seeing. Copy-on-write means that copying a std::string does not copy the internal buffer right away: the copy and the original share it until one of them is modified through a non-const access. For example:

    std::string A = "Hello, world";
    std::string B = A; // No copy occurs (yet)
    A[3] = '*'; // Copy occurs now because A got modified.
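
    To see the sharing directly, one option is to compare buffer addresses before and after a write. The following is only a sketch: the helper names (buffer_of, copy_shares_buffer, write_unshares_buffer) are made up for illustration, and the result of copy_shares_buffer depends on which std::string implementation you build against.

```cpp
#include <string>

// Reads the buffer address through a const reference, so that even a
// copy-on-write string has no reason to unshare its buffer.
const char* buffer_of(const std::string& s) { return s.data(); }

// True if a fresh copy still points at the original's buffer, which is
// what a copy-on-write std::string does. The text is long enough to
// exceed typical small-string buffers.
bool copy_shares_buffer() {
    std::string a = "Hello, world, this is a long string";
    std::string b = a;
    return buffer_of(a) == buffer_of(b);
}

// True if, after writing to `a` through the non-const operator[], the
// two strings use separate buffers and `b` still has its old contents.
bool write_unshares_buffer() {
    std::string a = "Hello, world, this is a long string";
    std::string b = a;
    a[3] = '*'; // a copy-on-write string copies its buffer here
    return buffer_of(a) != buffer_of(b) && buffer_of(b)[3] == 'l';
}
```

    With a copy-on-write std::string, copy_shares_buffer() should return true. With the newer implementation it should return false. write_unshares_buffer() should return true in both cases.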
    

    When you access the string through a const reference, however, no copy occurs, because the const overload of operator[] assumes that the string will not be modified through it:

    std::string A = "Hello, world"; 
    std::string B = A;
    std::string const& A_ref = A;
    
    const_cast<char&>(A_ref[3]) = '*'; // No copy occurs (your bug)
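
    For comparison, here is a minimal sketch of the question's example rewritten to write through the non-const string instead of using const_cast. The function name element_after_safe_write is hypothetical. The non-const operator[] gives a copy-on-write string the chance to unshare its buffer first, so the vector's element should be left alone with either implementation.

```cpp
#include <string>
#include <vector>

// Mirrors the question's example, but modifies the string through its
// non-const interface. Returns the vector's element after the writes.
std::string element_after_safe_write() {
    std::string str = "XYZ";
    std::vector<std::string> v;
    v.push_back(str); // may share str's buffer under copy-on-write

    str[0] = '*';     // non-const access: a shared buffer is copied first
    str[1] = '#';     // str is now "*#Z"
    return v.front(); // the vector's copy is untouched
}
```

    Calling element_after_safe_write() should return "XYZ" regardless of which std::string implementation is in use.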
    

    As you've noticed, copy-on-write semantics tend to cause bugs. Because of this, and because copying a string is pretty cheap (all things considered), the copy-on-write implementation of std::string was deprecated, and GCC 5 replaced it with a new default implementation.

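    So why are you seeing this bug if you're using GCC 5? It's likely that you're compiling and linking against an older version of the C++ standard library (one where copy-on-write is still the implementation of std::string), or that your build selects the old implementation, which libstdc++ still ships and enables when _GLIBCXX_USE_CXX11_ABI is defined to 0. This is what's causing the bug for you.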

    Check which version of the C++ standard library you're compiling against, and if possible, update your compiler.

    How can I tell which implementation of std::string my compiler is using?

    If your compiler is using the old implementation of std::string, then sizeof(std::string) is the same as sizeof(char*) because std::string is implemented as a pointer to a block of memory. The block of memory is the one that actually contains things like the size and capacity of the string.

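    struct string { // Old data layout (simplified)
        // Illustrative offsets, in size_t units, of the header fields
        // stored in memory just before the characters.
        static constexpr int SIZE_OFFSET = 3;
        static constexpr int CAPACITY_OFFSET = 2;
        size_t* _data; // points just past the header, at the first character
        size_t size() const {
            return *(_data - SIZE_OFFSET);
        }
        size_t capacity() const {
            return *(_data - CAPACITY_OFFSET);
        }
        char const* data() const {
            return (char const*)_data;
        }
    };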
    

    On the other hand, if you're using the newer implementation of std::string, then sizeof(std::string) should be 32 bytes (on 64-bit systems). This is because the newer implementation stores the size and capacity of the string within the std::string object itself, rather than in the block of memory it points to:

    struct string { // New data layout (simplified)
        char* _data;
        size_t _size;
        size_t _capacity;
        size_t _padding; // in libstdc++, these last 16 bytes double as the small-string buffer
        // ...
    };
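
    The check described above can be sketched in code. This assumes you are building against libstdc++, whose headers define the macros __GLIBCXX__ and (since GCC 5) _GLIBCXX_USE_CXX11_ABI. The function names below are made up for illustration.

```cpp
#include <string>

// Reports which std::string implementation the standard library
// headers select, based on libstdc++'s configuration macros.
const char* describe_string_layout() {
#if defined(_GLIBCXX_USE_CXX11_ABI) && _GLIBCXX_USE_CXX11_ABI
    return "libstdc++, new std::string implementation";
#elif defined(__GLIBCXX__)
    return "libstdc++, old copy-on-write std::string implementation";
#else
    return "not libstdc++, macros do not apply";
#endif
}

// True if std::string is just one pointer wide, as in the old layout.
bool string_is_pointer_sized() {
    return sizeof(std::string) == sizeof(char*);
}
```

    Printing both results from main gives a quick diagnosis: the old implementation should report a pointer-sized string, while the new one should not.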
    

    What's good about the new implementation? The new implementation has a number of benefits:

    We can see below that GDB uses the old implementation of std::string, because sizeof(std::string) returns 8 bytes:

    [Screenshot: sizeof(std::string) evaluating to 8]