A curious case in parallel programming
I have a parallel program that sometimes runs and sometimes crashes with a segmentation fault. When forced to run with 3 threads, the executable runs fine (it also runs with a single thread, which is just the serial case), but it gives a segmentation fault when forced to run with any other thread count. Here is the scenario:

From main.c inside main function:

cilk_for ( line_count = 0; line_count != no_of_lines ; ++line_count )
{
     //some stuff here
     for ( j=line_count+1; j<no_of_lines; ++j )
     {
         //some stuff here
         final_result[line_count][j] = bf_dup_eleminate ( table_bloom[line_count], file_names[j], j );
         //some stuff here
     }
     //some stuff here
}

bf_dup_eleminate function from bloom-filter.c file:

int bf_dup_eleminate ( const bloom_filter *bf, const char *file_name, int j )
{
    int count=-1;
    FILE *fp = fopen (file_name, "rb" );
    if (fp)
    {
        count = bf_dup_eleminate_read ( bf, fp, j);
        fclose ( fp );
    }
    else
    {
        printf ( "Could not open file\n" );
    }
    return count;
}

bf_dup_eleminate_read from bloom-filter.c file:

int bf_dup_eleminate_read ( const bloom_filter *bf, FILE *fp, int j )
{
    //some stuff here
    printf ( "before while loop. j is %d ** worker id: **********%d***********\n", j, __cilkrts_get_worker_number());
    while (/*somecondition*/)
    {/*some stuff*/}
    //some stuff
}

Intel Inspector reports this error:

ID | Problem                         |  Sources       
P1 | Unhandled application exception | bloom-filter.c

and the call stack is:

exec!bf_dup_eleminate_read - bloom-filter.c:550
exec!bf_dup_eleminate - bloom-filter.c:653
exec!__cilk_for_001.10209 - main.c:341

Similarly, gdb reports the error at the same location:

0x0000000000406fc4 in bf_dup_eleminate_read (bf=<error reading variable: Cannot access memory at address 0x7ffff7edba58>, fp=<error reading variable: Cannot access memory at address 0x7ffff7edba50>, j=<error reading variable: Cannot access memory at address 0x7ffff7edba4c>) at bloom-filter.c:536

Line 536 is the function signature: int bf_dup_eleminate_read ( const bloom_filter *bf, FILE *fp, int j )

Additional details:

My bloom filter is a structure defined as:

struct bloom_filter
{
    int64_t m;      //size of bloom filter.
    int32_t k;      //number of hash functions.
    uint8_t *array;
    int64_t no_of_elements_added;
    int64_t expected_no_of_elements;
};

and memory for it is allocated as follows:

    bloom_filter *bf = (bloom_filter *)malloc( sizeof(bloom_filter));
    if ( bf != NULL )
    {
        bf->m = filter_size*8;      /* Size of bloom filter */
        bf->k = num_hashes;
        bf->expected_no_of_elements = expected_no_of_elements;
        bf->no_of_elements_added = (int64_t)0;
        bf->array = (uint8_t *)malloc(filter_size);
        if ( bf->array == NULL )
        {
            free(bf);
            return NULL;
        }
    }  
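The fragment above omits its enclosing function. A minimal self-contained sketch of how it might be wrapped is shown here. The names bf_create and bf_destroy, the typedef, and the parameter types are assumptions made for illustration and are not taken from the original program. The sketch also swaps malloc for calloc on the bit array so that every bit starts cleared, which the original fragment does not do.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct bloom_filter
{
    int64_t m;      /* size of the bloom filter in bits */
    int32_t k;      /* number of hash functions */
    uint8_t *array;
    int64_t no_of_elements_added;
    int64_t expected_no_of_elements;
} bloom_filter;

/* Hypothetical helper: allocate a filter backed by filter_size bytes.
   Returns NULL if either allocation fails. */
bloom_filter *bf_create(size_t filter_size, int32_t num_hashes,
                        int64_t expected_no_of_elements)
{
    bloom_filter *bf = malloc(sizeof *bf);
    if (bf == NULL)
        return NULL;

    bf->m = (int64_t)filter_size * 8;
    bf->k = num_hashes;
    bf->expected_no_of_elements = expected_no_of_elements;
    bf->no_of_elements_added = 0;

    /* Assumption: calloc instead of malloc, so the bit array starts zeroed. */
    bf->array = calloc(filter_size, 1);
    if (bf->array == NULL)
    {
        free(bf);
        return NULL;
    }
    return bf;
}

/* Hypothetical helper: release a filter created by bf_create. */
void bf_destroy(bloom_filter *bf)
{
    if (bf != NULL)
    {
        free(bf->array);
        free(bf);
    }
}
```

A caller would check the result of bf_create for NULL before use and pass the pointer to bf_destroy once all threads have finished reading it.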

There is only one copy of the bloom_filter, and every thread is supposed to access that same copy (I am not modifying anything, only reading).

Could anyone please help? I have been stuck on this for the last 4 days and I just can't think of a way out. The worst part is that it runs fine with 3 threads!

Note: cilk_for is just the keyword that spawns threads in Cilk.

Candi answered 2/7, 2012 at 0:23 Comment(7)
It looks like you have a defect in the bloom_filter allocation: after doing the malloc() for bf->array, you are checking bf for NULL rather than bf->array. This is not your problem, just something I noticed. – Twopenny
I admit that it was a mistake, but I ran the updated code and it still didn't work. Can you please suggest some other possible cause for such an error? – Candi
It appears from the error that you are accessing memory that does not belong to you. The other immediate change I would make is in the cilk_for() condition: I would use line_count < no_of_lines rather than !=. Can I assume that no_of_lines is actually the number of concurrent threads to run? It appears that you are dividing the work among a number of threads, each of which uses bf_dup_eleminate() to do part of the job; however, the way the code indexes the table final_result does not make sense. – Twopenny
Also check out this article on data races in cilk_for, as well as this article on correcting race conditions in cilk_for. – Twopenny
I am thinking the same, and the reported error states that pretty clearly, but why is it happening? Is it the case that memory allocated by a thread should only be used by that thread? (I don't think so, as it makes little sense.) As far as the number of threads is concerned, the cilk_for loop makes log(no_of_lines) to base 2 tasks and then assigns them to the number of threads that the programmer forces the program to use. – Candi
By the way, in your bloom_filter memory allocation listing, I'm sure you wanted to write if ( bf->array == NULL ) { free(bf); ... } as your last statement, didn't you... :-) – Couplet
Yes, I corrected that. It was stupid of me to make that mistake; I don't know how I missed it. I have arrived at the conclusion that the error happens whenever two threads access the same memory location, so I guess that when you do not explicitly specify that you are accessing that location for read-only purposes, you are granted an exclusive lock. I think that is the problem. Am I correct? – Candi

When a debugger tells you an error like this:

0x0000000000406fc4 in bf_dup_eleminate_read (
    bf=<error reading variable: Cannot access memory at address 0x7ffff7edba58>,
    fp=<error reading variable: Cannot access memory at address 0x7ffff7edba50>,
    j=<error reading variable: Cannot access memory at address 0x7ffff7edba4c>
) at bloom-filter.c:536

536: int bf_dup_eleminate_read ( const bloom_filter *bf, FILE *fp, int j )

it usually indicates that the function entry code (called the function "prologue") is crashing. In short, your stack has become corrupted (or has run out of space), and the CPU is crashing when it sets up the stack frame: it cannot compute the addresses of the three parameters or allocate space for the function's locals on the stack.

Things I would check for or try to fix this error (none of which are guaranteed to work, and some of which you may have tried already):

  1. Make sure that you are not overrunning any space used by any local variables you have declared in other parts of your program.

  2. Make sure you're not writing to pointers that have been declared as local variables and then returned from a function in other parts of your program.

  3. Make sure that each thread has enough stack space to handle all the local variables you declare. Are you declaring any large stack-based buffers? The default per-thread stack size depends on the compiler settings, or in this case the cilk library. Try increasing the per-thread stack size at compilation time and see if the crash goes away.

With a bit of luck, one of the above should enable you to narrow down the source of the problem.
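As an illustration of the third point, one common way to shrink a stack frame is to move a large local buffer onto the heap. The following is a minimal sketch under stated assumptions: count_bytes and BUF_SIZE are hypothetical names chosen for this example, and the 4 MiB size is arbitrary rather than taken from the question's code.

```c
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE (4u * 1024u * 1024u)   /* 4 MiB, an illustrative size only */

/* Hypothetical helper: count the bytes remaining in a stream
   using a large scratch buffer. Returns -1 if allocation fails. */
long count_bytes(FILE *fp)
{
    /* Declaring `unsigned char buf[BUF_SIZE];` here would place 4 MiB
       on the calling thread's stack. Allocating the buffer on the heap
       keeps the stack frame small. */
    unsigned char *buf = malloc(BUF_SIZE);
    if (buf == NULL)
        return -1;

    long total = 0;
    size_t n;
    while ((n = fread(buf, 1, BUF_SIZE, fp)) > 0)
        total += (long)n;

    free(buf);
    return total;
}
```

Because every call allocates its own buffer, each parallel worker gets private scratch space. Moving the array to global scope also removes it from the stack, but a single global buffer would then be shared by all workers and would need synchronisation if they write to it.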

Couplet answered 2/7, 2012 at 4:1 Comment(2)
Thanks, man! You are awesome. I was stuck here for the past 5 days and had tried everything. I learnt 3 new software tools and did whatever I could, but nothing helped. The cause was the third point, and the crash has gone away. :) Thanks a lot again. – Candi
@Aman, I was in a similar situation. aps2012, you've saved me many minutes of frustration :) Moving a couple of large arrays from function-local to global scope fixed the problem :D I was getting segfaults at the function declaration. – Chordophone
