Tags: c++, c, parallel-processing, openmp

Parallelize output using OpenMP


I've written a C++ app that has to process a lot of data. Using OpenMP I parallelized the processing phase quite well and, embarrassingly, found that the output writing is now the bottleneck. I decided to use a parallel for there as well, as the order in which I output items is irrelevant; they just need to be output as coherent chunks.

Below is a simplified version of the output code, showing all the variables except for two custom iterators in the "collect data in related" loop. My question is: is this the correct and optimal way to solve this problem? I read about the barrier pragma; do I need it here?

long i, n = nrows();

#pragma omp parallel for
for (i=0; i<n; i++) {
    std::vector<MyData> related;
    for (size_t j=0; j < data[i].size(); j++)
        related.push_back(data[i][j]);
    sort(related.rbegin(), related.rend());

    #pragma omp critical
    {
        std::cout << data[i].label << "\n";
        for (size_t j=0; j<related.size(); j++)
            std::cout << "    " << related[j].label << "\n";
    }
}

(I tagged this question c as well because I imagine OpenMP usage is very similar in C and C++. Please correct me if I'm wrong.)


Solution

  • One way to get around output contention is to write each chunk to a thread-local string stream (this can be done in parallel) and then push its contents to cout (this requires synchronization).

    Something like this:

    #pragma omp parallel for
    for (i=0; i<n; i++) {
        std::vector<MyData> related;
        for (size_t j=0; j < data[i].size(); j++)
            related.push_back(data[i][j]);
        sort(related.rbegin(), related.rend());
    
        std::stringstream buf;
        buf << data[i].label << "\n";
        for (size_t j=0; j<related.size(); j++)
            buf << "    " << related[j].label << "\n";
    
        #pragma omp critical
        std::cout << buf.rdbuf();
    }
    

    This offers much finer-grained locking, and performance should improve accordingly. It still uses a lock, though. Another way is to keep an array of string streams, one per thread, and to push their contents to cout sequentially after the parallel loop (a sketch follows below). This avoids the lock entirely, and the output to cout has to be serialized anyway.
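
    Here is a minimal sketch of that per-thread buffer variant, reusing the data, nrows() and MyData names from the question. Sizing and indexing the buffers via omp_get_max_threads() / omp_get_thread_num() is my assumption about how you would wire it up, and it needs C++11 so that std::stringstream can live inside a std::vector:

    // Per-thread buffers; needs <omp.h>, <sstream>, <vector>, <iostream> (C++11).
    long n = nrows();
    std::vector<std::stringstream> bufs(omp_get_max_threads());

    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        std::vector<MyData> related;
        for (size_t j = 0; j < data[i].size(); j++)
            related.push_back(data[i][j]);
        sort(related.rbegin(), related.rend());

        // Each thread appends only to its own buffer, so no lock is needed.
        std::stringstream &buf = bufs[omp_get_thread_num()];
        buf << data[i].label << "\n";
        for (size_t j = 0; j < related.size(); j++)
            buf << "    " << related[j].label << "\n";
    }

    // Serial phase: push the per-thread buffers to cout one after another.
    // Buffers of threads that ran no iterations are simply empty.
    for (size_t t = 0; t < bufs.size(); t++)
        std::cout << bufs[t].str();

    The chunks now come out grouped by the thread that produced them, which is fine since you said the order of the chunks does not matter.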

    Alternatively, you can even try to omit the critical section in the code above. In my experience this happens to work because the underlying stream buffers do their own synchronization; since C++11, concurrent insertions into std::cout are at least guaranteed not to race. However, nothing guarantees that a complete insertion stays together, so characters from different threads may interleave and the chunks would no longer be coherent. I would not rely on this behaviour; it is effectively implementation dependent and not portable.