
A curious case in parallel programming


I have a parallel program that sometimes runs and sometimes crashes with a segmentation fault. When forced to run with 3 threads, the executable runs fine (it also runs with a single thread, which is just serial execution), but it segfaults when forced to run with any other thread count. Here is the scenario:

From main.c, inside the main function:

cilk_for ( line_count = 0; line_count != no_of_lines ; ++line_count )
{
     //some stuff here
     for ( j=line_count+1; j<no_of_lines; ++j )
     {
         //some stuff here
         final_result[line_count][j] = bf_dup_eleminate ( table_bloom[line_count], file_names[j], j );
         //some stuff here
     }
     //some stuff here
}

bf_dup_eleminate function from bloom-filter.c file:

int bf_dup_eleminate ( const bloom_filter *bf, const char *file_name, int j )
{
    int count=-1;
    FILE *fp = fopen (file_name, "rb" );
    if (fp)
    {
        count = bf_dup_eleminate_read ( bf, fp, j);
        fclose ( fp );
    }
    else
    {
        printf ( "Could not open file\n" );
    }
    return count;
}

bf_dup_eleminate_read from bloom-filter.c file:

int bf_dup_eleminate_read ( const bloom_filter *bf, FILE *fp, int j )
{
    //some stuff here
    printf ( "before while loop. j is %d ** worker id: **********%d***********\n", j, __cilkrts_get_worker_number());
    while (/*somecondition*/)
    {/*some stuff*/}
    //some stuff
}

Intel Inspector reports this error:

ID | Problem                         |  Sources       
P1 | Unhandled application exception | bloom-filter.c

and the call stack is:

exec!bf_dup_eleminate_read - bloom-filter.c:550
exec!bf_dup_eleminate - bloom-filter.c:653
exec!__cilk_for_001.10209 - main.c:341

gdb reports the error at the same location:

0x0000000000406fc4 in bf_dup_eleminate_read (bf=<error reading variable: Cannot access memory at address 0x7ffff7edba58>, fp=<error reading variable: Cannot access memory at address 0x7ffff7edba50>, j=<error reading variable: Cannot access memory at address 0x7ffff7edba4c>) at bloom-filter.c:536

Line 536 is the function signature: int bf_dup_eleminate_read ( const bloom_filter *bf, FILE *fp, int j )

Additional details:

My bloom filter is a structure defined as:

struct bloom_filter
{
    int64_t m;      //size of bloom filter.
    int32_t k;      //number of hash functions.
    uint8_t *array;
    int64_t no_of_elements_added;
    int64_t expected_no_of_elements;
};

and memory for it is allocated as follows:

    bloom_filter *bf = (bloom_filter *)malloc( sizeof(bloom_filter));
    if ( bf != NULL )
    {
        bf->m = filter_size*8;      /* Size of bloom filter */
        bf->k = num_hashes;
        bf->expected_no_of_elements = expected_no_of_elements;
        bf->no_of_elements_added = (int64_t)0;
        bf->array = (uint8_t *)malloc(filter_size);
        if ( bf->array == NULL )
        {
            free(bf);
            return NULL;
        }
    }  

There is only one copy of the bloom_filter, and every thread is supposed to access that same copy (I only read from it and never modify it).

Could anyone please help? I have been stuck on this for the last 4 days and can't think of a way out. The worst part is that it runs fine with 3 threads!

Note: cilk_for is the Cilk keyword that runs the iterations of a loop in parallel across worker threads.


Solution

  • When a debugger reports an error like this:

    0x0000000000406fc4 in bf_dup_eleminate_read (
        bf=<error reading variable: Cannot access memory at address 0x7ffff7edba58>,
        fp=<error reading variable: Cannot access memory at address 0x7ffff7edba50>,
        j=<error reading variable: Cannot access memory at address 0x7ffff7edba4c>
    ) at bloom-filter.c:536
    
    536: int bf_dup_eleminate_read ( const bloom_filter *bf, FILE *fp, int j )
    

    it usually indicates that the function entry code (called the function "prologue") is crashing. In short, your stack has most likely become corrupted or run out of space: the CPU faults while the prologue is setting up the stack frame, and the debugger cannot even read the three parameters (bf, fp and j) because their stack addresses are not accessible.

    Things I would check for or try to fix this error (none of which are guaranteed to work, and some of which you may have tried already):

    1. Make sure that you are not overrunning any space used by any local variables you have declared in other parts of your program.

    2. Make sure you're not writing through pointers to local variables that were returned from a function in other parts of your program; such pointers are invalid once that function has returned.

    3. Make sure that each thread has enough stack space to handle all the local variables you declare. Are you declaring any large stack-based buffers? The default per-thread stack size depends on the compiler and platform settings, or in this case the Cilk runtime library. Try increasing the per-thread stack size and see if the crash goes away.
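Checks 1 and 2 can be illustrated with a minimal sketch. Both helpers below (copy_name and make_label) are hypothetical and do not come from the program in the question; they only show a bounded write into a local buffer and a safe alternative to returning the address of a local array.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Check 1 (hypothetical helper): bound every write into a buffer.
   snprintf never writes more than dst_size bytes, including the
   terminating NUL, so a long source string cannot overrun dst. */
int copy_name(char *dst, size_t dst_size, const char *src)
{
    int needed = snprintf(dst, dst_size, "%s", src);
    /* 1 if the whole string fit, 0 if it was truncated or an error occurred */
    return needed >= 0 && (size_t)needed < dst_size;
}

/* Check 2 (hypothetical helper): never return the address of a local array.
   The local buffer stops existing when the function returns, so the text
   is copied into heap memory that the caller must free. */
char *make_label(int j)
{
    char local[32];
    snprintf(local, sizeof local, "file-%d", j);

    char *copy = malloc(strlen(local) + 1);
    if (copy != NULL)
        strcpy(copy, local);
    return copy;    /* NULL if the allocation failed */
}
```

A caller would check the return value of copy_name before trusting the buffer contents, and would free the pointer returned by make_label once it is done with it.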

    With a bit of luck, one of the above should enable you to narrow down the source of the problem.
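For check 3, one common way to reduce stack pressure is to move large scratch buffers from the stack to the heap. The sketch below is an assumption about what such a change could look like: count_bytes and CHUNK_SIZE are hypothetical names, and the 8 MiB size is purely illustrative, not taken from the question.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_SIZE (8u * 1024u * 1024u)   /* 8 MiB, an illustrative size */

/* Hypothetical reader that counts the bytes in a file.
   Declaring `uint8_t chunk[CHUNK_SIZE];` as a local would require 8 MiB
   of stack on every call, which may exceed a worker thread's stack.
   Allocating the buffer with malloc keeps the stack frame small. */
int64_t count_bytes(const char *file_name)
{
    FILE *fp = fopen(file_name, "rb");
    if (fp == NULL)
        return -1;

    uint8_t *chunk = malloc(CHUNK_SIZE);
    if (chunk == NULL) {
        fclose(fp);
        return -1;
    }

    int64_t total = 0;
    size_t got;
    while ((got = fread(chunk, 1, CHUNK_SIZE, fp)) > 0)
        total += (int64_t)got;

    free(chunk);
    fclose(fp);
    return total;   /* number of bytes read, or -1 on failure */
}
```

If the crash disappears after a change like this, that points toward stack exhaustion rather than a stray write.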