Tags: c, multithreading, parallel-processing, openmp, haar-wavelet

Parallelize C code for 2D Haar wavelet transform with OpenMP


This is my first question. I'm trying to parallelize a 2D Haar transform function in C with OpenMP. I obtained the code here and modified it accordingly. The program takes a black-and-white image, loads it into a matrix, and computes one level of the Haar wavelet transform. At the end it normalizes the values and writes the transformed image to disk.
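For reference, one level of the transform pairs adjacent pixels and stores a scaled average and a scaled difference (this is exactly what the inner loops below compute, with s = sqrt(2)):

a[i] = ( u[2i] + u[2i+1] ) / sqrt(2)
d[i] = ( u[2i] - u[2i+1] ) / sqrt(2)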

This is a resulting image (1 level of HDT).

My problem is that the parallelized version runs considerably slower than the serial one. For now I attach a snippet from the main part I want to parallelize (later on I can post all the surrounding code):

void haar_2d ( int m, int n, double u[] )
// m & n are the dimensions (every image is a perfect square)
// u is the input array, stored in row-major (not column-major) order
{
int i;
int j;
int k;
double s;
double *v;

int tid, nthreads, chunk;

s = sqrt ( 2.0 );

v = ( double * ) malloc ( m * n * sizeof ( double ) );

for ( j = 0; j < n; j++ )
{
    for ( i = 0; i < m; i++ )
    {
        v[i+j*m] = u[i+j*m];
    }
}
/*
Determine K, the largest power of 2 such that K <= M.
*/
k = 1;
while ( k * 2 <= m )
{
    k = k * 2;
}

/*   Transform all columns.  */

while ( n/2 < k ) // just 1 level of transformation
{
    k = k / 2;

    clock_t begin = clock();

    #pragma omp parallel shared(s,v,u,n,m,nthreads,chunk) private(i,j,tid)
    {
        tid = omp_get_thread_num();
        printf("Thread %d starting...\n",tid);

        #pragma omp for schedule (dynamic)
        for ( j = 0; j < n; j++ )
        {
            for ( i = 0; i < k; i++ )
            {               
                v[i  +j*m] = ( u[2*i+j*m] + u[2*i+1+j*m] ) / s;
                v[k+i+j*m] = ( u[2*i+j*m] - u[2*i+1+j*m] ) / s;
            }
        }

        #pragma omp for schedule (dynamic)
        for ( j = 0; j < n; j++ )
        {
            for ( i = 0; i < 2 * k; i++ )
            {
                u[i+j*m] = v[i+j*m];
            }
        }
    }//end parallel

    clock_t end = clock();
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf ( "Time for COLUMNS: %f ms\n", time_spent * 1000);

}//end while

// [...]code for rows
free ( v );

return;
}

The timings are, more or less:

Time for COLUMNS: 160.519000 ms // parallel
Time for COLUMNS: 62.842000 ms // serial

I have tried re-arranging the pragmas in lots of different ways, e.g. with static schedule, with sections, with tasks, and so on (one such variant is sketched below), as well as re-arranging the data scopes of the variables and allocating dynamically inside the parallel region. I thought it would be simple to parallelize a two-level for loop, but I have now been struggling with it for two days. I'm asking for your help: I've already checked nearly all the related questions here, but I'm still unable to make progress or at least understand the reasons. Thank you in advance. (CPU: Intel Core i3-4005U @ 1.70GHz, 2 cores, 4 threads)
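For example, this is a minimal sketch of the plain parallel-for variant with static scheduling (same u, v, k, s, m, n as in the function above; only the scheduling differs from the dynamic version):

#pragma omp parallel for schedule(static) private(i)
for ( j = 0; j < n; j++ )
{
    for ( i = 0; i < k; i++ )
    {
        /* Same column transform as above. */
        v[i  +j*m] = ( u[2*i+j*m] + u[2*i+1+j*m] ) / s;
        v[k+i+j*m] = ( u[2*i+j*m] - u[2*i+1+j*m] ) / s;
    }
}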

UPDATE:

1) As for m and n: the function is supposed to support rectangular images one day too, so I just left both parameters there.

2) I figured out that u is actually a plain array containing a linearized matrix, stored row by row (I use PGM images).

3) memcpy is a better option, so I'm using it now (see the one-liner below).
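For reference, the initial copy loop at the top of haar_2d collapses to a single call, since u and v are contiguous m*n buffers:

/* Requires <string.h>; replaces the initial i/j copy loop. */
memcpy ( v, u, m * n * sizeof ( double ) );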

As for the main topic, I've tried dividing the work over n by spawning a task for each chunk, and the result is a little bit faster than the serial code. Now I know the input matrix u is in proper row-major order, and the two for loops seem to proceed accordingly, but I'm not sure about the timings: using both omp_get_wtime() and clock(), I don't know how to measure the speedup. I tested different image sizes, from 16x16 up to 4096x4096, and the parallel version seems to be slower according to clock() and faster according to omp_get_wtime() and gettimeofday(). Do you have suggestions on how to handle this correctly with OpenMP, or at least how to measure the speedup correctly?

while ( n/2 < k )
{
    k = k / 2;
    double start_time = omp_get_wtime();
    // clock_t begin = clock();
    #pragma omp parallel shared(s,v,u,n,m,nthreads,chunk) private(i,j,tid) firstprivate(k)
    {
        nthreads = omp_get_num_threads();

        #pragma omp single
        {
            printf("Number of threads = %d\n", nthreads);

            int chunk = n/nthreads;
            printf("Chunk size = %d\n", chunk);
            printf("Thread %d is starting the tasks.\n", omp_get_thread_num());

            int h;

            for ( h = 0; h < n; h = h + chunk )
            {
                printf("FOR CYCLE h=%d\n", h);

                #pragma omp task shared(s,v,u,n,m,nthreads,chunk) private(i,j,tid) firstprivate(h,k)
                {
                    tid = omp_get_thread_num();
                    printf("Thread %d starts at position %d\n", tid, h);

                    for ( j = h; j < h + chunk && j < n; j++ ) // guard in case chunk does not divide n
                    {
                        for ( i = 0; i < k; i++ )
                        {
                            v[i  +j*m] = ( u[2*i+j*m] + u[2*i+1+j*m] ) / s;
                            v[k+i+j*m] = ( u[2*i+j*m] - u[2*i+1+j*m] ) / s;
                        }
                    }
                }// end task
            }// end task-launching for
            #pragma omp taskwait
        }// end single
    }// end parallel region

    // clock_t end = clock();
    // double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    // printf ( "COLUMNS: %f ms\n", time_spent * 1000);

    double time = omp_get_wtime() - start_time;
    printf ( "COLUMNS: %f ms\n", time*1000);

    for ( j = 0; j < n; j++ )
    {
        for ( i = 0; i < 2 * k; i++ )
        {
            u[i+j*m] = v[i+j*m];
        }
    }
}//end while

Solution

  • The problem was that I was measuring time with clock() instead of omp_get_wtime(): clock() adds up CPU time across all threads, so the parallel version looked slower even though its wall-clock time was lower. Thanks to Z boson.
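For anyone who hits the same issue, here is a minimal self-contained sketch (a dummy reduction loop, not the transform itself) showing why the two timers disagree; compile with -fopenmp:

#include <omp.h>
#include <stdio.h>
#include <time.h>

int main ( void )
{
    clock_t c0 = clock();          /* CPU time, summed over all threads */
    double  w0 = omp_get_wtime();  /* wall-clock time */

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for ( long i = 1; i <= 100000000L; i++ )
    {
        sum += 1.0 / (double) i;
    }

    double cpu_ms  = 1000.0 * (double)( clock() - c0 ) / CLOCKS_PER_SEC;
    double wall_ms = 1000.0 * ( omp_get_wtime() - w0 );

    /* With N busy threads, cpu_ms is roughly N * wall_ms, so clock()
       makes parallel code look slower even when wall time drops. */
    printf ( "CPU: %f ms  wall: %f ms  (sum = %f)\n", cpu_ms, wall_ms, sum );
    return 0;
}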