C++11 thread_local destructor behaviour

I have the following situation. In a header "test.hpp" I define:

class ObjectA {
    public:
        ObjectA();
        ~ObjectA();
        static ObjectA & get_A();
};
class ObjectB {
    public:
        ~ObjectB();
        static ObjectB & get_B();
        void do_cleanup();
};

And in separate compilation units I implement ObjectB:

#include "test.hpp"
#include <iostream>
ObjectB::~ObjectB() {
    std::cout<<"ObjectB dtor"<<std::endl;
}
ObjectB & ObjectB::get_B() {
    thread_local ObjectB b_instance;
    return b_instance;
}
void ObjectB::do_cleanup() {
    std::cout<<"Clearing up B garbage..."<<std::endl;
}

ObjectA:

#include "test.hpp"
#include <iostream>
ObjectA::ObjectA() {
    ObjectB::get_B(); // <-- dummy call to initialize thread_local ObjectB
}
ObjectA::~ObjectA() {
     std::cout<<"ObjectA dtor"<<std::endl;
     ObjectB::get_B().do_cleanup(); // <-- is this undefined behaviour??
}
ObjectA & ObjectA::get_A() {
     thread_local ObjectA a_instance;
     return a_instance;
}

And finally a test main():

#include <thread>
#include "test.hpp"
int main() {
    std::thread check([](){
        ObjectA::get_A(); // <-- dummy call just to initialize thread_local object.
    });
    check.join();
    return 0;
}

Is the above program well behaved, or is accessing ObjectB, which has thread_local storage, from the destructor of ObjectA, which also has thread_local storage, undefined behaviour? If so, why does it break and how do I fix it?

most related question I found

[edit, @Soonts answer]

In the real use case, class A is a template, and quite a complex one, while class B is just large. A objects hold references to B's using shared_ptr<>, and B's thread_locals are accessed on an as-needed basis. (A's are constructed in the main thread and passed to workers.) ObjectB::get_B() thus may not be called by worker threads before ObjectA::get_A() gets called.

Solution

The spec says a couple of things about lifetimes:

Storage class specifiers

thread storage duration. The storage for the object is allocated when the thread begins and deallocated when the thread ends. Each thread has its own instance of the object.

Termination

If the completion of the constructor or dynamic initialization of an object with thread storage duration is sequenced before that of another, the completion of the destructor of the second is sequenced before the initiation of the destructor of the first.
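To see that rule in isolation, here is a minimal sketch. The names Tracer, first(), second(), g_order and destruction_order() are made up for illustration and are not part of your code. Each worker thread touches two function-local thread_local objects in a chosen order, and the log records the order in which their destructors run at thread exit. On a conforming implementation I would expect the destructors to run in the reverse of the order in which the constructors completed.

```cpp
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical log. It has static storage duration, so it outlives every
// thread_local object created by the worker threads below.
std::vector<std::string> g_order;

struct Tracer {
    std::string name;
    explicit Tracer(std::string n) : name(std::move(n)) {}
    ~Tracer() { g_order.push_back(name); }  // record destruction
};

// Each function owns one thread_local object, constructed on first call
// in a given thread.
Tracer& first()  { thread_local Tracer t("first");  return t; }
Tracer& second() { thread_local Tracer t("second"); return t; }

// Starts a fresh thread that touches the two objects in the requested order
// and returns the order in which their destructors ran.
std::vector<std::string> destruction_order(bool touch_first_before_second) {
    g_order.clear();
    std::thread worker([=] {
        if (touch_first_before_second) { first(); second(); }
        else                           { second(); first(); }
    });
    worker.join();  // the log is only read after the worker has finished
    return g_order;
}
```

The point relevant to your edit: the order depends on which object a given thread happens to construct first, not on where the objects are declared.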

Now back to your code.

You construct A, and in its constructor you construct B. Therefore, the B constructor completes before the A constructor completes. According to the above, when the thread is about to exit, it will first destroy A and then B, so B is still alive while A's destructor runs. According to the letter of the spec, your code is OK.
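If you want to check this on your own toolchain, a single-file reduction of your example could look like the sketch below. It is only an illustration under my own assumptions: the classes are shortened to A and B, and g_events and run_worker() are hypothetical helpers that record events in a vector instead of printing them.

```cpp
#include <string>
#include <thread>
#include <vector>

// Hypothetical event log with static storage duration.
std::vector<std::string> g_events;

struct B {
    B()  { g_events.push_back("B ctor"); }
    ~B() { g_events.push_back("B dtor"); }
    static B& get() {
        thread_local B instance;
        return instance;
    }
    void do_cleanup() { g_events.push_back("B cleanup"); }
};

struct A {
    A() {
        B::get();  // B's constructor completes before A's constructor does
        g_events.push_back("A ctor");
    }
    ~A() {
        g_events.push_back("A dtor");
        B::get().do_cleanup();  // B has not been destroyed yet at this point
    }
    static A& get() {
        thread_local A instance;
        return instance;
    }
};

// Runs the scenario on a fresh thread and returns the recorded events.
std::vector<std::string> run_worker() {
    g_events.clear();
    std::thread worker([] { A::get(); });
    worker.join();
    return g_events;
}
```

With a conforming implementation the log should read: B ctor, A ctor, A dtor, B cleanup, B dtor. Note that this only holds because A's constructor touches B. If a thread first calls B::get() after A is already constructed, B is destroyed first and the call in A's destructor is no longer safe.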

Practically speaking, I’m not sure C++ compilers implement the spec to that level of detail. If I were writing that code, I wouldn’t use thread_local objects this way. Instead, I would make B a non-static member of A. It’s simpler and, IMO, more reliable than relying on such nuances of the language standard.
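A rough sketch of what I mean, assuming B does not need to be reachable on its own through get_B(). The member name b_, the log g_log and the helper run_member_demo() are hypothetical. Non-static members are destroyed only after the destructor body of the enclosing object has finished, so b_ is guaranteed to be alive inside ~ObjectA().

```cpp
#include <string>
#include <thread>
#include <vector>

// Hypothetical log standing in for the std::cout calls in the question.
std::vector<std::string> g_log;

class ObjectB {
public:
    ~ObjectB() { g_log.push_back("ObjectB dtor"); }
    void do_cleanup() { g_log.push_back("Clearing up B garbage..."); }
};

class ObjectA {
public:
    ~ObjectA() {
        g_log.push_back("ObjectA dtor");
        b_.do_cleanup();  // b_ is destroyed only after this body returns
    }
    static ObjectA& get_A() {
        thread_local ObjectA a_instance;
        return a_instance;
    }
private:
    ObjectB b_;  // lifetime tied to the enclosing ObjectA
};

// Runs the scenario on a fresh thread and returns the recorded events.
std::vector<std::string> run_member_demo() {
    g_log.clear();
    std::thread check([] { ObjectA::get_A(); });
    check.join();
    return g_log;
}
```

This trades the separate per-thread B for one B per A, which may or may not fit the shared_ptr setup described in your edit.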