It seems like there is a bug in the windows implementation of std::async. Under heavy load (on the order of 1000 threads launched async per second), async tasks are never scheduled and waiting on the returned futures leads to deadlocks. See this piece of code (modified with launch policy deferred instead of async):
BundlingChunk(size_t numberOfInputs, Bundler* parent, ChunkIdType chunkId)
: m_numberOfInputs(numberOfInputs), m_parent(parent), m_chunkId(chunkId)
{
const BundlerChunkDescription& chunk = m_parent->m_chunks[m_chunkId];
const ChunkInfo& original = chunk.m_original;
auto& deserializers = m_parent->m_deserializers;
// Fetch all chunks in parallel.
std::vector<std::map<ChunkIdType, std::shared_future<ChunkPtr>>> chunks;
chunks.resize(chunk.m_secondaryChunks.size());
static std::atomic<unsigned long long int> chunksInProgress = 0;
for (size_t i = 0; i < chunk.m_secondaryChunks.size(); ++i)
{
for (const auto& c : chunk.m_secondaryChunks[i])
{
const auto chunkCreationLambda = ([this, c, i] {
chunksInProgress++;
ChunkPtr chunk = m_parent->m_weakChunkTable[i][c].lock();
if (chunk) {
chunksInProgress--;
return chunk;
}
chunksInProgress--;
return m_parent->m_deserializers[i]->GetChunk(c);
});
std::future<ChunkPtr> chunkCreateFuture = std::async(std::launch::deferred, chunkCreationLambda);
chunks[i].emplace(c, chunkCreateFuture.share());
}
}
std::vector<SequenceInfo> sequences;
sequences.reserve(original.m_numberOfSequences);
// Creating chunk mapping.
m_parent->m_primaryDeserializer->SequenceInfosForChunk(original.m_id, sequences);
ChunkPtr drivingChunk = chunks.front().find(original.m_id)->second.get();
m_sequenceToSequence.resize(deserializers.size() * sequences.size());
m_innerChunks.resize(deserializers.size() * sequences.size());
for (size_t sequenceIndex = 0; sequenceIndex < sequences.size(); ++sequenceIndex)
{
if (chunk.m_invalid.find(sequenceIndex) != chunk.m_invalid.end())
{
continue;
}
size_t currentIndex = sequenceIndex * deserializers.size();
m_sequenceToSequence[currentIndex] = sequences[sequenceIndex].m_indexInChunk;
m_innerChunks[currentIndex] = drivingChunk;
}
// Creating sequence mapping and requiring underlying chunks.
SequenceInfo s;
for (size_t deserializerIndex = 1; deserializerIndex < deserializers.size(); ++deserializerIndex)
{
auto& chunkTable = m_parent->m_weakChunkTable[deserializerIndex];
for (size_t sequenceIndex = 0; sequenceIndex < sequences.size(); ++sequenceIndex)
{
if (chunk.m_invalid.find(sequenceIndex) != chunk.m_invalid.end())
{
continue;
}
size_t currentIndex = sequenceIndex * deserializers.size() + deserializerIndex;
bool exists = deserializers[deserializerIndex]->GetSequenceInfo(sequences[sequenceIndex], s);
if (!exists)
{
if(m_parent->m_verbosity >= (int)TraceLevel::Warning)
fprintf(stderr, "Warning: sequence '%s' could not be found in the deserializer responsible for stream '%ls'\n",
m_parent->m_corpus->IdToKey(sequences[sequenceIndex].m_key.m_sequence).c_str(),
deserializers[deserializerIndex]->StreamInfos().front().m_name.c_str());
m_sequenceToSequence[currentIndex] = SIZE_MAX;
continue;
}
m_sequenceToSequence[currentIndex] = s.m_indexInChunk;
ChunkPtr secondaryChunk = chunkTable[s.m_chunkId].lock();
if (!secondaryChunk)
{
secondaryChunk = chunks[deserializerIndex].find(s.m_chunkId)->second.get();
chunkTable[s.m_chunkId] = secondaryChunk;
}
m_innerChunks[currentIndex] = secondaryChunk;
}
}
}
My version above is modified so that the async tasks are launched as deferred instead of async, which fixes the issue. Has anyone else seen something like this as of VS2017 redistributable 14.12.25810? Reproducing this issue is as easy as training CNTK model that uses the text and image readers on a machine with a GPU and SSD so that the CPU deserialization becomes the bottleneck. After about 30 minutes of training, a deadlock usually occurs. Has anyone seen a similar issue on Linux? If so, it could be a bug in the code, although I doubt it because the debug counter chunksInProgress
is always 0 after deadlock. For reference, the entire source file is located at https://github.com/Microsoft/CNTK/blob/455aef80eeff675c0f85c6e34a03cb73a4693bff/Source/Readers/ReaderLib/Bundler.cpp.
New day, better answer (much better). Read on.
I spent some time investigating the behaviour of std::async
on Windows and you're right. It's a different animal, see here.
So, if your code relies on std::async
always starting a new thread of execution and returning immediately then you can't use it. Not on Windows, anyway. On my machine, the limit seems to be 768 background threads, which would fit in, more or less, with what you have observed.
Anyway, I wanted to learn a bit more about modern C++ so I had a crack at rolling my own replacement for std::async
that can be used on Windows with the semantics deaired by the OP. I therefore humbly present the following:
AsyncTask: drop-in replacement for std::async
#include <future>
#include <thread>
template <class Func, class... Args>
std::future <std::result_of_t <std::decay_t <Func> (std::decay_t <Args>...)>>
AsyncTask (Func&& f, Args&&... args)
{
using decay_func = std::decay_t <Func>;
using return_type = std::result_of_t <decay_func (std::decay_t <Args>...)>;
std::packaged_task <return_type (decay_func f, std::decay_t <Args>... args)>
task ([] (decay_func f, std::decay_t <Args>... args)
{
return f (args...);
});
auto task_future = task.get_future ();
std::thread t (std::move (task), f, std::forward <Args> (args)...);
t.detach ();
return task_future;
};
Test program
#include <iostream>
#include <string>
int add_two_integers (int a, int b)
{
return a + b;
}
std::string append_to_string (const std::string& s)
{
return s + " addendum";
}
int main ()
{
auto /* i.e. std::future <int> */ f1 = AsyncTask (add_two_integers , 1, 2);
auto /* i.e. int */ i = f1.get ();
std::cout << "add_two_integers : " << i << std::endl;
auto /* i.e. std::future <std::string> */ f2 = AsyncTask (append_to_string , "Hello world");
auto /* i.e. std::string */ s = f2.get (); std::cout << "append_to_string : " << s << std::endl;
return 0;
}
Output
add_two_integers : 3
append_to_string : Hello world addendum
Live demo here (gcc) and here (clang).
I learnt a lot from writing this and it was a lot of fun. I'm fairly new to this stuff, so all comments welcome. I'll be happy to update this post if I've got anything wrong.