Printing to std from parallel iterator in Rust

I have a task where I need to perform pairwise dot products of some vectors in structs, and print their result to stdout.

Below is a simplified MWE of what I am trying to do (using rayon = "1.10.0"):

use std::time::Instant;
use rayon::iter::{IntoParallelIterator, ParallelBridge, ParallelIterator};

struct Object {
    data: Vec<f64>,
}

fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(a, b)| a * b).sum()
}

fn main() {
    let time = Instant::now();

    // read some data from a file and process it
    let mut objects = Vec::with_capacity(100);
    for _ in 0..objects.capacity() - 1 {
        objects.push(Object {
            data: vec![0.001; 1_000_000],
        });
    }

    // perform pairwise comparisons
    let pairs = (0..objects.len() - 1).flat_map(|i| (i + 1..objects.len()).map(move |j| (i, j)));

    pairs
        .par_bridge()
        .into_par_iter()
        .for_each(|(i, j)| {
            let c = dot(&objects[i].data, &objects[j].data);
            println!("{:2} {:2} -> {:<.20}", i, j, c)
        });

    let duration = time.elapsed();
    println!("time elapsed: {:.9?}", duration);
}

This program runs for 1.25 s, and when I remove the println! macro, it only takes 151 ms. This is comparable to the actual program I am working on, where the duration goes from 26 s to 2.5 s. So including println! increases the runtime by around 1000%.

In case anyone wanted to try this also, using a sequential loop takes around 6 s for me:

for (i, j) in pairs {
    let c = dot(&objects[i].data, &objects[j].data);
    println!("{:2} {:2} -> {:<.20}", i, j, c);
}

From my understanding, I am assuming that threads are spending a long time waiting for their turn to write to stdout? If so, what alternative approach can I take?

As all my comparisons output a string of the same length and format, "{i} {j} -> {dot}\n", I also considered pre-allocating a buffer of size (n * (n - 1)) / 2 * 32 (number of comparisons * length of formatted string), and this ran a little slower at 1.43 s. Here is that code:

let n = objects.len();
let comparisons = (n * (n - 1)) / 2;
let mut buffer = vec![0u8; comparisons * 32];

let pairs = (0..objects.len() - 1).flat_map(|i| (i + 1..objects.len()).map(move |j| (i, j)));

buffer
    .chunks_mut(32) // split buffer every 32 characters (every line of stdout)
    .zip(pairs)
    .par_bridge()
    .into_par_iter()
    .for_each(|(chunk, (i, j))| {
        let c = dot(&objects[i].data, &objects[j].data);

        // write to chunk instead of printing individual lines
        let bytes = format!("{:2} {:2} -> {:<.20}\n", i, j, c);
        let bytes = bytes.as_bytes();

        chunk.copy_from_slice(&bytes[..32]); // copy formatted string to non-overlapping chunk
    });

println!("{}", String::from_utf8(buffer).unwrap());

What I like about this approach is that all my output is in the correct order, like in the sequential version.

I appreciate any help that anyone can provide. Thank you for reading!

Solution

This program runs for 1.25 s, and when I remove the println! macro, it only takes 151 ms.

This is where the mistake lies. I discovered this through a few different methods.

Experiments

First, I tried running flamegraph. This showed a lot of time spent in iterators and rayon, but an extremely thin slice formatting and writing to stdout.

Second, I tried to replicate your 1.2s -> 0.1s result. I was trying this on Windows, but both printing and not printing was taking the same 1.2s. However, changing to a sequential iterator did show the big difference (4s -> 0.1s). Switching to Linux also showed the big difference for both parallel and sequential iterators. I wasn't expecting these results to be OS-dependent.

Third, I changed the code to use black_box.

pairs.par_bridge().into_par_iter().for_each(|(i, j)| {
    let c = dot(&objects[i].data, &objects[j].data);
    std::hint::black_box(c);
    // println!("{:2} {:2} -> {:<.20}", i, j, c)
});

All versions with black_box ran as slow as the printing versions.

Conclusion

So actually, printing isn't taking that long. What happens when you remove the print is the compiler optimizes out the entire inner loop since it doesn't have any effect.

If you want to learn more about speeding up code in general, check out the Rust Performance Book.