There seems to be a bug in the Windows implementation of std::async. Under heavy load (on the order of 1,000 tasks launched via std::async per second), the tasks are never scheduled, and waiting on the returned futures leads to deadlocks. See the code below (modified to use the deferred launch policy instead of async):
BundlingChunk(size_t numberOfInputs, Bundler* parent, ChunkIdType chunkId)
    : m_numberOfInputs(numberOfInputs), m_parent(parent), m_chunkId(chunkId)
{
    const BundlerChunkDescription& chunk = m_parent->m_chunks[m_chunkId];
    const ChunkInfo& original = chunk.m_original;
    auto& deserializers = m_parent->m_deserializers;

    // Fetch all chunks in parallel.
    std::vector<std::map<ChunkIdType, std::shared_future<ChunkPtr>>> chunks;
    chunks.resize(chunk.m_secondaryChunks.size());
    static std::atomic<unsigned long long> chunksInProgress{0};
    for (size_t i = 0; i < chunk.m_secondaryChunks.size(); ++i)
    {
        for (const auto& c : chunk.m_secondaryChunks[i])
        {
            const auto chunkCreationLambda = [this, c, i] {
                chunksInProgress++;
                ChunkPtr chunk = m_parent->m_weakChunkTable[i][c].lock();
                if (chunk)
                {
                    chunksInProgress--;
                    return chunk;
                }
                chunksInProgress--;
                return m_parent->m_deserializers[i]->GetChunk(c);
            };
            std::future<ChunkPtr> chunkCreateFuture = std::async(std::launch::deferred, chunkCreationLambda);
            chunks[i].emplace(c, chunkCreateFuture.share());
        }
    }

    std::vector<SequenceInfo> sequences;
    sequences.reserve(original.m_numberOfSequences);

    // Creating chunk mapping.
    m_parent->m_primaryDeserializer->SequenceInfosForChunk(original.m_id, sequences);
    ChunkPtr drivingChunk = chunks.front().find(original.m_id)->second.get();
    m_sequenceToSequence.resize(deserializers.size() * sequences.size());
    m_innerChunks.resize(deserializers.size() * sequences.size());
    for (size_t sequenceIndex = 0; sequenceIndex < sequences.size(); ++sequenceIndex)
    {
        if (chunk.m_invalid.find(sequenceIndex) != chunk.m_invalid.end())
        {
            continue;
        }
        size_t currentIndex = sequenceIndex * deserializers.size();
        m_sequenceToSequence[currentIndex] = sequences[sequenceIndex].m_indexInChunk;
        m_innerChunks[currentIndex] = drivingChunk;
    }

    // Creating sequence mapping and requiring underlying chunks.
    SequenceInfo s;
    for (size_t deserializerIndex = 1; deserializerIndex < deserializers.size(); ++deserializerIndex)
    {
        auto& chunkTable = m_parent->m_weakChunkTable[deserializerIndex];
        for (size_t sequenceIndex = 0; sequenceIndex < sequences.size(); ++sequenceIndex)
        {
            if (chunk.m_invalid.find(sequenceIndex) != chunk.m_invalid.end())
            {
                continue;
            }
            size_t currentIndex = sequenceIndex * deserializers.size() + deserializerIndex;
            bool exists = deserializers[deserializerIndex]->GetSequenceInfo(sequences[sequenceIndex], s);
            if (!exists)
            {
                if (m_parent->m_verbosity >= (int)TraceLevel::Warning)
                    fprintf(stderr, "Warning: sequence '%s' could not be found in the deserializer responsible for stream '%ls'\n",
                            m_parent->m_corpus->IdToKey(sequences[sequenceIndex].m_key.m_sequence).c_str(),
                            deserializers[deserializerIndex]->StreamInfos().front().m_name.c_str());
                m_sequenceToSequence[currentIndex] = SIZE_MAX;
                continue;
            }
            m_sequenceToSequence[currentIndex] = s.m_indexInChunk;
            ChunkPtr secondaryChunk = chunkTable[s.m_chunkId].lock();
            if (!secondaryChunk)
            {
                secondaryChunk = chunks[deserializerIndex].find(s.m_chunkId)->second.get();
                chunkTable[s.m_chunkId] = secondaryChunk;
            }
            m_innerChunks[currentIndex] = secondaryChunk;
        }
    }
}
My version above is modified so that the tasks are launched with std::launch::deferred instead of std::launch::async, which fixes the issue. Has anyone else seen something like this as of the VS2017 redistributable 14.12.25810? Reproducing the issue is as easy as training a CNTK model that uses the text and image readers on a machine with a GPU and an SSD, so that CPU-side deserialization becomes the bottleneck; after about 30 minutes of training, a deadlock usually occurs. Has anyone seen a similar issue on Linux? If so, that would point to a bug in this code rather than in the runtime, although I doubt that, because the debug counter chunksInProgress is always 0 after the deadlock. For reference, the entire source file is located at https://github.com/Microsoft/CNTK/blob/455aef80eeff675c0f85c6e34a03cb73a4693bff/Source/Readers/ReaderLib/Bundler.cpp.
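For context on why the workaround sidesteps the scheduler entirely: with std::launch::deferred the callable is never handed to any thread pool; it runs synchronously on whichever thread first calls get() or wait() on the future. A minimal, self-contained sketch of the difference (nothing CNTK-specific; all names here are illustrative):

#include <chrono>
#include <cstdio>
#include <future>

int main()
{
    // launch::async: the task must run "as if" on a new thread; on MSVC this
    // is dispatched through the runtime's thread pool.
    std::future<int> eager = std::async(std::launch::async, [] { return 1; });

    // launch::deferred: nothing is scheduled; the lambda runs on the thread
    // that first calls get()/wait() -- here, the main thread.
    std::future<int> lazy = std::async(std::launch::deferred, [] { return 2; });

    // A deferred future reports future_status::deferred; it never becomes
    // ready on its own, only when forced with get()/wait().
    if (lazy.wait_for(std::chrono::seconds(0)) == std::future_status::deferred)
        std::puts("lazy task has not run yet");

    std::printf("%d %d\n", eager.get(), lazy.get()); // lazy runs inside get()
    return 0;
}

The cost of the workaround is that the chunk fetches are no longer actually parallel; each one runs lazily on the thread that forces the corresponding future.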
std::async on MSVC has always been implemented on top of concurrency::task (see ppltasks.h); pity you never checked. – Gusella

I see, however, that this is built on top of Windows' native ThreadPool API, see here, so what makes you think it's such a lemon? – Hedi

I have looked into std::async on Windows and concluded that this is governed by the way the underlying threadpool behaves (and indeed is designed to behave), so I wouldn't say that PPL is implemented 'badly', although how the threadpool it is built on works might not be to everyone's taste. More details here. – Hedi
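If the pool's scheduling is indeed what starves the tasks, another way around it (besides deferred) would be to bypass std::async and give each task a dedicated thread via std::packaged_task. A hedged sketch of that alternative, not what the CNTK source does; the helper name is made up:

#include <future>
#include <functional>
#include <thread>
#include <utility>

// Minimal stand-in for std::async(std::launch::async, ...) that always uses a
// dedicated std::thread instead of the implementation's thread pool.
template <typename F, typename... Args>
auto async_on_new_thread(F&& f, Args&&... args)
    -> std::future<decltype(f(args...))>
{
    using R = decltype(f(args...));
    std::packaged_task<R()> task(
        std::bind(std::forward<F>(f), std::forward<Args>(args)...));
    std::future<R> result = task.get_future();
    std::thread(std::move(task)).detach(); // one thread per task, no pool
    return result;
}

Each task then owns an OS thread, so a blocked waiter cannot starve other tasks of pool threads; the trade-off is the thread-creation overhead, which is worth measuring at a rate of ~1,000 launches per second.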