MPI with C slower if more processes are used

I am learning MPI with C and I wrote a code based on the one presented in this link: http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml.

In this code a vector containing 1e8 values are summed. However, I am observing that when using more processes the run time is getting bigger. The code is given bellow:

/*
Based on the code presented at http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml

Code which splits a vector and send information to other processes.
In case of main vector does not split equally to all processes, the leftover is passed to process id 1.
Process id 0 is the root process. Therefore it does not count while passing information.

Each process will calculate the partial sum of vector values and send it back to root process, which will calculate the total sum.
Since the processes are independent, the printing order will be different at each run.

compile as: mpicc -o vector_sum vector_send.c -lm
run as: time mpirun -n x vector_sum

x = number of splits desired + root process. For example: if * = 3, the vector will be splited in two.
*/

#include<stdio.h>
#include<mpi.h>
#include<math.h>

#define vec_len 100000000
double vec1[vec_len];
double vec2[vec_len];

int main(int argc, char* argv[]){
    // defining program variables
    int i;
    double sum, partial_sum;

    // defining parallel step variables
    int my_id, num_proc, ierr, an_id, root_process; // id of process and total number of processes
    int num_2_send, num_2_recv, start_point, vec_size, rows_per_proc, leftover;

    ierr = MPI_Init(&argc, &argv);

    root_process = 0;

    ierr = MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);

    if(my_id == root_process){
        // Root process: Define vector size, how to split vector and send information to workers
        vec_size = 1e8; // size of main vector

        for(i = 0; i < vec_size; i++){
            //vec1[i] = pow(-1.0,i+2)/(2.0*(i+1)-1.0); // defining main vector...  Correct answer for total sum = 0.78539816339
            vec1[i] = pow(i,2)+1.0; // defining main vector... 
            //printf("Main vector position %d: %f\n", i, vec1[i]); // uncomment if youwhish to print the main vector
        }

        rows_per_proc = vec_size / (num_proc - 1); // average values per process: using (num_proc - 1) because proc 0 does not count as a worker.
        rows_per_proc = floor(rows_per_proc); // getting the maximum integer possible.
        leftover = vec_size - (num_proc - 1)*rows_per_proc; // counting the leftover.

        // spliting and sending the values
        
        for(an_id = 1; an_id < num_proc; an_id++){
            if(an_id == 1){ // worker id 1 will have more values if there is any leftover.
                num_2_send = rows_per_proc + leftover; // counting the amount of data to be sent.
                start_point = (an_id - 1)*num_2_send; // defining initial position in the main vector (data will be sent from here)
            }
            else{
                num_2_send = rows_per_proc;
                start_point = (an_id - 1)*num_2_send + leftover; // starting point for other processes if there is leftover.
            }
            
            ierr = MPI_Send(&num_2_send, 1, MPI_INT, an_id, 1234, MPI_COMM_WORLD); // sending the information of how many data is going to workers.
            ierr = MPI_Send(&vec1[start_point], num_2_send, MPI_DOUBLE, an_id, 1234, MPI_COMM_WORLD); // sending pieces of the main vector.
        }

        sum = 0;
        for(an_id = 1; an_id < num_proc; an_id++){
            ierr = MPI_Recv(&partial_sum, 1, MPI_DOUBLE, an_id, 4321, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // recieving partial sum.
            sum = sum + partial_sum;
        }

        printf("Total sum = %f.\n", sum);

    }
    else{
        // Workers:define which operation will be carried out by each one
        ierr = MPI_Recv(&num_2_recv, 1, MPI_INT, root_process, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // recieving the information of how many data worker must expect.
        ierr = MPI_Recv(&vec2, num_2_recv, MPI_DOUBLE, root_process, 1234, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // recieving main vector pieces.
        
        partial_sum = 0;
        for(i=0; i < num_2_recv; i++){
            //printf("Position %d from worker id %d: %d\n", i, my_id, vec2[i]); // uncomment if youwhish to print position, id and value of splitted vector
            partial_sum = partial_sum + vec2[i];
        }

        printf("Partial sum of %d: %f\n",my_id, partial_sum);

        ierr = MPI_Send(&partial_sum, 1, MPI_DOUBLE, root_process, 4321, MPI_COMM_WORLD); // sending partial sum to root process.
        
    }

    ierr = MPI_Finalize();
    
}

Obs.: Compile as


mpicc -o vector_sum vector_send.c -lm

and run as:

time mpirun -n x vector_sum

with x = 2 and 5. You will see that with x=5 it takes more time to run.

Did I do something wrong? I did not expected it to be slower, since the summation of each chunk is independent. Or it is a matter of how the program is sending the information for each process? It seems to me that the loops for sending the information for each process is the responsible for this longer time.

Solution

The new code using MPI_Reduce() is faster and simpler than the previous one:

/*
Based on the code presented at http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtml

Code which calculate the sum of a vector using parallel computation.
In case of main vector does not split equally to all processes, the leftover is passed to process id 0.
Process id 0 is the root process. However, it will also perform part of calculations.

Each process will generate and calculate the partial sum of the vector values. It will be used MPI_Reduce() to calculate the total sum.
Since the processes are independent, the printing order will be different at each run.

compile as: mpicc -o vector_sum vector_sum.c -lm
run as: time mpirun -n x vector_sum

x = number of splits desired + root process. For example: if x = 3, the vector will be splited in two.

Acknowledgements: I would like to thanks Gilles Gouaillardet (https://stackoverflow.com/users/8062491/gilles-gouaillardet) for the helpful suggestion.
*/

#include<stdio.h>
#include<mpi.h>
#include<math.h>

#define vec_len 100000000
double vec2[vec_len];

int main(int argc, char* argv[]){
    // defining program variables
    int i;
    double sum, partial_sum;

    // defining parallel step variables
    int my_id, num_proc, ierr, an_id, root_process;
    int vec_size, rows_per_proc, leftover, num_2_gen, start_point;

    vec_size = 1e8; // defining the main vector size
    
    ierr = MPI_Init(&argc, &argv);

    root_process = 0;

    ierr = MPI_Comm_size(MPI_COMM_WORLD, &num_proc);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);

    rows_per_proc = vec_size/num_proc; // getting the number of elements for each process.
    rows_per_proc = floor(rows_per_proc); // getting the maximum integer possible.
    leftover = vec_size - num_proc*rows_per_proc; // counting the leftover.

    if(my_id == 0){
        num_2_gen = rows_per_proc + leftover; // if there is leftover, it is calculate in process 0
        start_point = my_id*num_2_gen; // the corresponding position on the main vector
    }
    else{
        num_2_gen = rows_per_proc;
        start_point = my_id*num_2_gen + leftover; // the corresponding position on the main vector
    }

    partial_sum = 0;
    for(i = start_point; i < start_point + num_2_gen; i++){
        vec2[i] = pow(i,2) + 1.0; // defining vector values
        partial_sum += vec2[i]; // calculating partial sum
    }

    printf("Partial sum of process id %d: %f.\n", my_id, partial_sum);

    MPI_Reduce(&partial_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, root_process, MPI_COMM_WORLD); // calculating total sum

    if(my_id == root_process){
        printf("Total sum is %f.\n", sum);
    }

    ierr = MPI_Finalize();
    return 0;
    
}