How does memory usage of thread_local scale with number of threads?

I presume C/C++ standards do not say anything about complexity so I am curious about specific implementations (I presume they all have same behavior).

Assume I have the following C++ function.

void fn() {
    thread_local char arr[1024*1024]{};
    // do something with arr
}

And my program has 80 threads, 47 of which run fn() at least once.

Does the memory usage of my program grow by around 47 times some constant, 80 times some constant, or is there some other formula for this?

Note: there is this Java question that got closed for some reason, but I don't know whether Java uses the same primitives as C/C++.

Solution

This is likely to be largely implementation-dependent, though you can verify the behaviour of your implementation fairly easily. For example, consider the following program, run on Windows (using a debug Visual Studio build so that the optimiser does not remove the unused code):

#include <array>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <thread>

struct Foo
{
    std::array<char, 1'000'000'000> data;
};

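void bar()
{
    // Every thread has its own instance of foo.
    thread_local Foo foo;
    // Touch every byte so the storage is really in use.
    for (std::size_t i = 0; i < foo.data.size(); i++)
    {
        foo.data[i] = static_cast<char>(i);
    }
    // Keep the thread alive long enough to inspect the memory usage.
    std::this_thread::sleep_for(std::chrono::seconds(1000));
}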

int main()
{
    std::thread thread1([]
    {
        bar();
    });

    std::thread thread2([]
    {
        std::this_thread::sleep_for(std::chrono::seconds(1000));
    });

    thread1.join();
    thread2.join();
}

It uses 3GB of memory (1GB for each of the two threads and 1GB for the main thread). Removing thread2 drops the memory usage to 2GB. On Linux this behaviour is likely to be different, as the kernel overcommits memory and pages are not physically allocated until they are first touched.
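The reason the figure tracks the number of threads is that every thread has its own instance of a thread_local object. A minimal sketch that makes this visible is below; distinct_instances is a hypothetical helper name, and the 64-byte slot is just a small stand-in for the large array. The threads are kept alive until all of them have recorded an address, so no storage can be reused by a later thread.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Hypothetical helper: returns how many distinct addresses `n` simultaneously
// live threads observe for the same thread_local variable.
std::size_t distinct_instances(int n)
{
    std::mutex m;
    std::condition_variable cv;
    std::set<const void*> addresses;
    int arrived = 0;

    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i)
    {
        threads.emplace_back([&]
        {
            // One instance of slot per thread.
            thread_local char slot[64]{};

            std::unique_lock<std::mutex> lock(m);
            addresses.insert(slot);
            ++arrived;
            cv.notify_all();
            // Wait until every thread has recorded its address, so that all
            // instances exist at the same time.
            cv.wait(lock, [&] { return arrived == n; });
        });
    }
    for (auto& t : threads)
    {
        t.join();
    }
    return addresses.size();
}
```

Calling distinct_instances(8) should return 8: one separate instance per live thread. This only shows that the instances are distinct; how much memory each one really costs is still up to the implementation.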

You can avoid this by using smart pointers to allocate the memory only when it's actually used, for example by changing bar to:

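void bar()
{
    // Needs #include <memory>. The array is heap-allocated the first time
    // a given thread reaches this declaration.
    thread_local std::unique_ptr<Foo> foo = std::make_unique<Foo>();
    for (std::size_t i = 0; i < foo->data.size(); i++)
    {
        foo->data[i] = static_cast<char>(i);
    }
    std::this_thread::sleep_for(std::chrono::seconds(1000));
}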

This reduces the memory usage to 1GB, as only thread1 actually allocates the large array; thread2 and the main thread only have to store the unique_ptr.
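Applied to the numbers in the question, the same idea can be checked without a memory profiler by counting constructions. The following is only a sketch: Buffer, g_allocations and count_allocations are hypothetical names introduced for illustration, and it counts heap allocations of the buffer rather than measuring the per-thread pointer that every thread still carries.

```cpp
#include <atomic>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical counter: number of Buffer objects constructed so far.
std::atomic<int> g_allocations{0};

struct Buffer
{
    Buffer() { g_allocations.fetch_add(1); }
    char data[1024 * 1024]{};
};

void fn()
{
    // Initialised the first time the calling thread reaches this line,
    // so threads that never call fn() never allocate a Buffer.
    thread_local std::unique_ptr<Buffer> buf = std::make_unique<Buffer>();
    buf->data[0] = 1;
}

// Starts `total` threads, of which the first `callers` call fn() twice,
// and returns how many Buffer objects were allocated.
int count_allocations(int total, int callers)
{
    g_allocations = 0;
    std::vector<std::thread> threads;
    for (int i = 0; i < total; ++i)
    {
        threads.emplace_back([i, callers]
        {
            if (i < callers)
            {
                fn();
                fn(); // second call reuses this thread's existing Buffer
            }
        });
    }
    for (auto& t : threads)
    {
        t.join();
    }
    return g_allocations.load();
}
```

With this layout, count_allocations(80, 47) should return 47: the large buffer scales with the number of threads that actually call fn(), while the other threads only pay for the unique_ptr.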