I came up with the following example, which exposes some unexpected behavior. I would expect that after push_back, the vector holds its own copy of the string and is unaffected by later changes to str. Instead, it looks like the compiler somehow decided to reuse the memory used by str.
Could someone explain what is happening in this example? Is this valid C++ code?
The original problem arises from code responsible for serializing / deserializing messages and it uses const_cast to remove constness. After noticing some unexpected behavior with that code, I created this simplified example, which tries to demonstrate the issue.
#include <vector>
#include <iostream>
#include <string>
using namespace std;
int main()
{
    auto str = std::string("XYZ"); // mutable string
    const auto& cstr(str); // const ref to it
    vector<string> v;
    v.push_back(cstr);
    cout << v.front() << endl; // XYZ is printed as expected
    *const_cast<char*>(&cstr[0])='*'; // this will modify the first element in the VECTOR (is this expected?)
    str[1]='#'; //
    cout << str << endl; // prints *#Z as expected
    cout << cstr << endl; // prints *#Z as expected
    cout << v.front() << endl; // Why *YZ is printed, not XYZ and not *#Z ?
    return 0;
}
The unexpected behavior occurs because of quirks in a deprecated implementation of std::string. Older versions of GCC implemented std::string using copy-on-write semantics. It's a clever idea, but it causes bugs like the one you're seeing. What that means is that GCC tried to define std::string so that a copy shares the original's internal buffer, and the buffer only gets copied once one of the strings sharing it is modified. For example:
std::string A = "Hello, world";
std::string B = A; // No copy occurs (yet)
A[3] = '*'; // Copy occurs now because A got modified.
When you access the string through a const reference, however, no copy occurs because the library assumes that the string will not be modified through it:
std::string A = "Hello, world";
std::string B = A;
std::string const& A_ref = A;
const_cast<char&>(A_ref[3]) = '*'; // No copy occurs (your bug)
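To make the mechanism concrete, here is a minimal sketch of a copy-on-write string. It is a toy model only: CowString, Replay, and replay_question are hypothetical names introduced for illustration, and the class is not how libstdc++ actually implements std::string. It assumes single-threaded use.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical toy class illustrating copy-on-write; not libstdc++'s code.
class CowString {
public:
    explicit CowString(const char* s)
        : buf_(std::make_shared<std::string>(s)) {}

    // Non-const access may be used for writing, so stop sharing first.
    char& operator[](std::size_t i) {
        if (buf_.use_count() > 1) {
            buf_ = std::make_shared<std::string>(*buf_); // the "copy on write"
        }
        return (*buf_)[i];
    }

    // Const access promises not to write, so the buffer stays shared.
    const char& operator[](std::size_t i) const { return (*buf_)[i]; }

    const std::string& str() const { return *buf_; }

private:
    // The implicitly generated copy constructor copies this pointer,
    // so a copy shares the same buffer as the original.
    std::shared_ptr<std::string> buf_;
};

struct Replay {
    std::string original;
    std::string copy;
};

// Replays the steps from the question against the toy class.
Replay replay_question() {
    CowString str("XYZ");
    CowString copy = str;              // shares str's buffer
    const CowString& cstr = str;
    const_cast<char&>(cstr[0]) = '*';  // const path: no detach, both strings change
    str[1] = '#';                      // non-const path: str detaches, then changes
    return {str.str(), copy.str()};
}
```

In this model, replay_question() returns "*#Z" for the original and "*YZ" for the copy, the same pattern as in the question: the write through const_cast lands in the shared buffer, while the later write through the non-const operator[] only affects the freshly detached copy.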
As you've noticed, copy-on-write semantics tend to cause bugs. Because of this, and because copying a string is pretty cheap (all things considered), the copy-on-write implementation of std::string was deprecated and replaced as the default in GCC 5.
So why are you seeing this bug if you're using GCC 5? It's likely that you're compiling and linking against an older version of the C++ standard library (one where copy-on-write is still the implementation of std::string). This is what's causing the bug for you.
Check which version of the C++ standard library you're compiling against, and if possible, update your compiler.
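One way to check, sketched below, is to look at the _GLIBCXX_USE_CXX11_ABI macro, which libstdc++ (as shipped with GCC 5 and later) defines to select between the two implementations. The function name string_abi is a hypothetical helper, and the sketch assumes libstdc++; with another standard library, or an older libstdc++, the macro is simply not defined.

```cpp
#include <string> // including any standard header makes the library macros visible

// Hypothetical helper: reports which std::string implementation libstdc++ selected.
const char* string_abi() {
#if defined(_GLIBCXX_USE_CXX11_ABI)
#  if _GLIBCXX_USE_CXX11_ABI
    return "libstdc++ with the new (non copy-on-write) std::string";
#  else
    return "libstdc++ with the old (copy-on-write) std::string";
#  endif
#else
    return "macro not defined: not libstdc++, or a libstdc++ older than GCC 5";
#endif
}
```

Printing string_abi() from the same build that shows the problem tells you which implementation that build actually uses.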
How do I check which implementation of std::string my compiler is using?
New implementation: sizeof(std::string) == 32 (when compiling for 64 bit)
Old implementation: sizeof(std::string) == 8 (when compiling for 64 bit)
If your compiler is using the old implementation of std::string, then sizeof(std::string) is the same as sizeof(char*) because std::string is implemented as a pointer to a block of memory. The block of memory is the one that actually contains things like the size and capacity of the string.
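#include <cstddef>

// Illustrative offsets (in size_t units) of the fields stored in front of the characters.
constexpr std::size_t SIZE_OFFSET = 3;
constexpr std::size_t CAPACITY_OFFSET = 2;

struct string { // Old data layout: a single pointer
    std::size_t* _data; // points just past the header that holds size and capacity
    std::size_t size() const {
        return *(_data - SIZE_OFFSET);
    }
    std::size_t capacity() const {
        return *(_data - CAPACITY_OFFSET);
    }
    char const* data() const {
        return (char const*)_data;
    }
};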
On the other hand, if you're using the newer implementation of std::string, then sizeof(std::string) should be 32 bytes (on 64 bit systems). This is because the newer implementation stores the size and capacity of the string within the std::string itself, rather than in the data it points to:
struct string { // New data layout
    char* _data;
    size_t _size;
    size_t _capacity;
    size_t _padding;
    // ...
};
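The size check described above can be sketched as a pair of small helpers. The names is_pointer_sized_string and guess_string_layout are hypothetical, and the result is only a guess based on size: it tells you whether std::string has the pointer-only footprint of the old layout, nothing more.

```cpp
#include <string>

// Hypothetical helper: true when std::string is exactly one pointer wide,
// which matches the footprint of the old pointer-only layout.
constexpr bool is_pointer_sized_string() {
    return sizeof(std::string) == sizeof(char*);
}

// Hypothetical helper: a human-readable guess based only on the size.
const char* guess_string_layout() {
    return is_pointer_sized_string()
        ? "old layout (single pointer)"
        : "not the pointer-only layout";
}
```

Because sizeof is evaluated at compile time, the same condition could also be placed in a static_assert to reject builds that pick up the old implementation.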
What's good about the new implementation?
The new implementation has a number of benefits:
Because std::string is 32 bytes, we can take advantage of Small String Optimization. Small String Optimization allows strings less than 16 characters long to be stored within the space normally taken up by _capacity and _padding. This avoids heap allocations, and is faster for most use cases.
We can see below that GDB uses the old implementation of std::string, because sizeof(std::string) returns 8 bytes: