Tags: c, multithreading, posix, mutex, matmul

Mutex lock problem in a parallel matrix multiplication computation


I am designing a parallel matrix multiplication algorithm using the POSIX threads (pthreads) library, as an improvement on the CPU version of Andrej Karpathy's gpt.c.

First, I ran it directly without multi-threading, and it took 3.746 seconds. Then I ran it with 4 threads but without a mutex lock: the output was correct and it took 1.68 seconds, but perf showed that only about 2 CPUs were being used. Later, I added a mutex lock: the output was still correct, but it took 4.58 seconds and the CPU utilization was worse than the single-threaded run.

My question is: in this code, is there no critical section? In the compute function, shouldn't val and wrow be critical sections? Also, I set 4 threads, but why are there only 2 threads working? Besides, how do I check for critical sections in C code? What are some useful tools?

Here is the original implementation:

void matmul_forward(float* out,
                    float* inp, float* weight, float* bias,
                    int B, int T, int C, int OC) {
    // most of the running time is spent here and in matmul_backward
    // OC is short for "output channels"
    // inp is (B,T,C), weight is (OC, C), bias is (OC)
    // out will be (B,T,OC)
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            float* out_bt = out + b * T * OC + t * OC;
            float* inp_bt = inp + b * T * C + t * C;
            for (int o = 0; o < OC; o++) {
                float val = (bias != NULL) ? bias[o] : 0.0f;
                float* wrow = weight + o*C;
                for (int i = 0; i < C; i++) {
                    val += inp_bt[i] * wrow[i];
                }
                out_bt[o] = val;
            }
        }
    }
}

And I got time1.txt:

403
1130
288
962
534
3133
394
345

 Performance counter stats for './gpt-64 31373 312':

           3744.88 msec task-clock                #    1.000 CPUs utilized          
                 4      context-switches          #    0.001 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
              3406      page-faults               #    0.910 K/sec                  
   <not supported>      cycles                                                      
   <not supported>      instructions                                                
   <not supported>      branches                                                    
   <not supported>      branch-misses                                               

       3.746514034 seconds time elapsed

       3.595022000 seconds user
       0.150209000 seconds sys


Then I changed it to use multi-threading without a mutex lock:
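
The ThreadData struct definition is not shown here; a sketch consistent with the fields the code below uses would be:

// sketch only -- reconstructed from the fields referenced in compute()
// and matmul_forward(); requires <pthread.h> for pthread_mutex_t
typedef struct {
    float* out;             // out_bt: output row for the current (b, t)
    float* inp;             // inp_bt: input row for the current (b, t)
    float* weight;          // (OC, C) weight matrix
    float* bias;            // (OC) bias vector, may be NULL
    int B, T, C, OC;        // tensor dimensions
    int b, t;               // current batch / time indices
    int start, end;         // half-open range of output channels for this thread
    pthread_mutex_t* mutex; // shared mutex (used only in the locked version)
} ThreadData;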

void* compute(void* arg) {
    ThreadData* data = (ThreadData*)arg;
   // pthread_mutex_lock(data->mutex);
    for (int o = data->start; o < data->end; o++) {
        
        float val = (data->bias != NULL) ? data->bias[o] : 0.0f;
        float* wrow = data->weight + o * data->C;

        for (int i = 0; i < data->C; i++) {
            val += data->inp[i] * wrow[i];
        }
        data->out[o] = val; 
        
    }
    //pthread_mutex_unlock(data->mutex);
    return NULL;
}

void matmul_forward(float* out,
                    float* inp, float* weight, float* bias,
                    int B, int T, int C, int OC) {
    int num_threads = 4; 
    pthread_t threads[num_threads];
    ThreadData thread_data[num_threads];
    pthread_mutex_t mutex;
   // pthread_mutex_init(&mutex, NULL);

    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            float* out_bt = out + b * T * OC + t * OC;
            float* inp_bt = inp + b * T * C + t * C;
            // Assign output channels to each thread
            for (int i = 0; i < num_threads; i++) {
                thread_data[i].out = out_bt;
                thread_data[i].inp = inp_bt;
                thread_data[i].weight = weight;
                thread_data[i].bias = bias;
                thread_data[i].B = B;
                thread_data[i].T = T;
                thread_data[i].C = C;
                thread_data[i].OC = OC;
                thread_data[i].b = b;
                thread_data[i].t = t;
                thread_data[i].mutex = &mutex;
                if(i!=num_threads-1){
                    thread_data[i].start = i * (OC / num_threads);
                    thread_data[i].end = (i + 1) * (OC / num_threads);
                }else{
                    thread_data[i].start = i*(OC/num_threads);
                    thread_data[i].end = OC;
                }
                pthread_create(&threads[i], NULL, compute, &thread_data[i]);
            }
            for (int i = 0; i < num_threads; i++) {
                pthread_join(threads[i], NULL);
            }
        }
    }

   // pthread_mutex_destroy(&mutex); 
}

And I got time4.txt:

403
1130
288
962
534
3133
394
345

 Performance counter stats for './gpt-64 31373 312':

           4037.94 msec task-clock                #    2.390 CPUs utilized          
              5411      context-switches          #    0.001 M/sec                  
                22      cpu-migrations            #    0.005 K/sec                  
              4467      page-faults               #    0.001 M/sec                  
   <not supported>      cycles                                                      
   <not supported>      instructions                                                
   <not supported>      branches                                                    
   <not supported>      branch-misses                                               

       1.689577665 seconds time elapsed

       0.864298000 seconds user
       0.134603000 seconds sys



Here you can see that the elapsed time decreased from 3.7 seconds to 1.68 seconds (by diffing time1.txt and time4.txt).

After that, I implemented the mutex-lock version:

void* compute(void* arg) {
    ThreadData* data = (ThreadData*)arg;
    pthread_mutex_lock(data->mutex);
    for (int o = data->start; o < data->end; o++) {
        
        float val = (data->bias != NULL) ? data->bias[o] : 0.0f;
        float* wrow = data->weight + o * data->C;

        for (int i = 0; i < data->C; i++) {
            val += data->inp[i] * wrow[i];
        }
        data->out[o] = val; 
        
    }
    pthread_mutex_unlock(data->mutex);
    return NULL;
}

void matmul_forward(float* out,
                    float* inp, float* weight, float* bias,
                    int B, int T, int C, int OC) {
    int num_threads = 4; 
    pthread_t threads[num_threads];
    ThreadData thread_data[num_threads];
    pthread_mutex_t mutex;
    pthread_mutex_init(&mutex, NULL);

    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            float* out_bt = out + b * T * OC + t * OC;
            float* inp_bt = inp + b * T * C + t * C;
            // Assign output channels to each thread
            for (int i = 0; i < num_threads; i++) {
                thread_data[i].out = out_bt;
                thread_data[i].inp = inp_bt;
                thread_data[i].weight = weight;
                thread_data[i].bias = bias;
                thread_data[i].B = B;
                thread_data[i].T = T;
                thread_data[i].C = C;
                thread_data[i].OC = OC;
                thread_data[i].b = b;
                thread_data[i].t = t;
                thread_data[i].mutex = &mutex;
                if(i!=num_threads-1){
                    thread_data[i].start = i * (OC / num_threads);
                    thread_data[i].end = (i + 1) * (OC / num_threads);
                }else{
                    thread_data[i].start = i*(OC/num_threads);
                    thread_data[i].end = OC;
                }
                pthread_create(&threads[i], NULL, compute, &thread_data[i]);
            }
            for (int i = 0; i < num_threads; i++) {
                pthread_join(threads[i], NULL);
            }
        }
    }

    pthread_mutex_destroy(&mutex); 
}

And I got time4_2.txt:

403
1130
288
962
534
3133
394
345

 Performance counter stats for './gpt-64 31373 312':

           4211.43 msec task-clock                #    0.919 CPUs utilized          
             12100      context-switches          #    0.003 M/sec                  
               100      cpu-migrations            #    0.024 K/sec                  
              5970      page-faults               #    0.001 M/sec                  
   <not supported>      cycles                                                      
   <not supported>      instructions                                                
   <not supported>      branches                                                    
   <not supported>      branch-misses                                               

       4.581137082 seconds time elapsed

       1.216625000 seconds user
       0.231582000 seconds sys



I also used the OpenMP method, and comparing it with my own multi-threaded implementation gave me a new question.

Here is my OpenMP implementation:

void matmul_forward(float* out,
                    float* inp, float* weight, float* bias,
                    int B, int T, int C, int OC) {
    // most of the running time is spent here and in matmul_backward
    // OC is short for "output channels"
    // inp is (B,T,C), weight is (OC, C), bias is (OC)
    // out will be (B,T,OC)
    #pragma omp parallel for collapse(2)
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            float* out_bt = out + b * T * OC + t * OC;
            float* inp_bt = inp + b * T * C + t * C;
            #pragma omp parallel for
            for (int o = 0; o < OC; o++) {
                float val = (bias != NULL) ? bias[o] : 0.0f;
                float* wrow = weight + o*C;
                for (int i = 0; i < C; i++) {
                    val += inp_bt[i] * wrow[i];
                }
                out_bt[o] = val;
            }
        }
    }
}

And I used diff to compare the OpenMP implementation with my own multi-threaded implementation:

diff -u time_omp.txt time4.txt
--- time_omp.txt        2024-10-31 16:20:13.782282431 +0800
+++ time4.txt   2024-10-31 10:53:45.901382289 +0800
@@ -9,18 +9,18 @@

  Performance counter stats for './gpt-64 31373 312':

-          42922.72 msec task-clock                #   15.987 CPUs utilized
-               493      context-switches          #    0.011 K/sec
-                21      cpu-migrations            #    0.000 K/sec
-              4553      page-faults               #    0.106 K/sec
+           4037.94 msec task-clock                #    2.390 CPUs utilized
+              5411      context-switches          #    0.001 M/sec
+                22      cpu-migrations            #    0.005 K/sec
+              4467      page-faults               #    0.001 M/sec
    <not supported>      cycles
    <not supported>      instructions
    <not supported>      branches
    <not supported>      branch-misses

-       2.684908823 seconds time elapsed
+       1.689577665 seconds time elapsed

-      42.639439000 seconds user
-       0.301765000 seconds sys
+       0.864298000 seconds user
+       0.134603000 seconds sys

As you can see, the OpenMP method uses nearly all 16 of my CPUs, but the effect is not as good as when I use 2.3 CPUs, which means that my implementation is worse than the OpenMP library function.

And in this diff output, why is the task-clock of the 16-thread run nearly 10 times that of the 2-thread run?

So is there a better way to reduce this problem in the OpenMP method, so that it is at least better than my manual implementation?


Solution

  • My question is: in this code, is there no critical section?

    I take you to be asking whether the program needs any critical sections, where execution is restricted to one thread at a time. That depends on how you split up the work among threads, but in the most natural divisions of work for this particular code, no two worker threads will ever have conflicting accesses to the same scalar object. The workers and the main thread potentially have conflicting accesses, but the main thread joining all the workers before accessing the computation results provides adequate synchronization for those accesses.

    In the compute function, shouldn't val and wrow be critical sections?

    Individual objects are not critical sections. Moreover, val and wrow are local variables, so each execution of compute(), whether in the same or different threads, has different val and wrow objects to work with. You are not providing any way for the val and wrow objects of one execution of compute() to be accessed anywhere else, so there cannot be any conflicting accesses to these objects. They themselves are not a synchronization concern, but see also below.

    All the potential synchronization concerns in the original code revolve around the objects accessible indirectly through pointers out, inp, weight, and bias, including via the various derived pointer values stored in out_bt, inp_bt, and wrow. This is a different consideration from the values of these pointers themselves. Among these, however, only out is used for writes (via various derived pointers), so as long as there is no hidden aliasing, the objects accessed through out are the only ones to which the worker threads could have conflicting accesses.

    But if you scrutinize the computation, you should see that over its whole course, out and its derived pointers are used only for writes, never reads, and that no scalar object is written more than once through them. There are then no conflicting accesses here, either, so no need for any synchronization.

    Also, I set 4 threads, but why are there only 2 threads working?

    You are misreading the perf output. You start four threads at a time and successfully join all of them, so they all run. You should verify that the output is what you expected, but supposing so, all four threads must have done their work. In a four-threaded computation in which perf reports "2.390 CPUs utilized", you should understand that as the average CPU utilization for the calculation being 2.390 / 4 ≈ 60% (the leading and trailing serial parts factor in, too).

    That's considerably worse utilization than the single-threaded case achieves, presumably owing to the fairly large overhead of starting and joining threads. You could improve that by using a thread pool instead of starting a new thread for each unit of work. Or you could just split up the computation much more coarsely -- at the B level, say, so that you start only num_threads threads in total (in addition to the program's initial thread).

    Besides, how do I check for critical sections in C code? What are some useful tools?

    "Critical section" -- you keep using that term. I don't think it means what you think it means.

    You should start with your own analysis, especially of potential data races. An example of such an analysis is given above, though this is a particularly simple case. In a program that uses multiple synchronization objects, you should also be doing your own analysis of opportunities for deadlock, but that's not necessary here.

    Having done your own analyses, and implemented any needed mitigations they reveal, you can then consider engaging tools aimed at identifying any thread-related issues you may have missed. Which of these are available to you depends on your C toolchain and your operating environment. Among them are Valgrind's "Helgrind" module, Valgrind's "DRD" module, and ThreadSanitizer; that is not an exhaustive list.
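
    For example, assuming your program is a single gpt.c built with gcc (clang accepts the same flag), ThreadSanitizer and the Valgrind tools could be invoked roughly like this; the gpt-64-tsan output name is made up:

    # rebuild with ThreadSanitizer instrumentation
    gcc -O1 -g -fsanitize=thread -pthread gpt.c -lm -o gpt-64-tsan
    ./gpt-64-tsan 31373 312              # data races are reported at run time

    # or run the existing binary under Valgrind's thread checkers
    valgrind --tool=helgrind ./gpt-64 31373 312
    valgrind --tool=drd ./gpt-64 31373 312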

    As you can see, the OpenMP method uses nearly all 16 of my CPUs, but the effect is not as good as when I use 2.3 CPUs, which means that my implementation is worse than the OpenMP library function.

    That interpretation is completely backward. Your OMP version consumed almost 43 CPU seconds to complete the computation in 2.68 wall seconds. Your hand-rolled multi-thread version, on the other hand, consumed 4 CPU seconds to complete the computation in 1.69 wall seconds. So which of those is better?

    And in this diff output, why is the task-clock of the 16-thread run nearly 10 times that of the 2-thread run?

    Because your OMP version is suffering from way more overhead than your hand-rolled version. Unlike your hand-rolled code, OMP does use a thread pool, so its overhead must have a different source than the other version's. Most likely, that takes the form of a lot of (unneeded in this case) synchronization.

    Note also that your OMP version splits up the computation somewhat differently, with a nest of two parallel regions. Nesting regions is probably counterproductive here, and I'd recommend choosing just one. In fact, since you're open to using OMP, I would start here:

    void matmul_forward(float* out,
                        float* inp, float* weight, float* bias,
                        int B, int T, int C, int OC) {
        // most of the running time is spent here and in matmul_backward
        // OC is short for "output channels"
        // inp is (B,T,C), weight is (OC, C), bias is (OC)
        // out will be (B,T,OC)
    
        #pragma omp parallel for
        for (int b = 0; b < B; b++) {
            for (int t = 0; t < T; t++) {
                float* out_bt = out + b * T * OC + t * OC;
                float* inp_bt = inp + b * T * C + t * C;
                for (int o = 0; o < OC; o++) {
                    float val = (bias != NULL) ? bias[o] : 0.0f;
                    float* wrow = weight + o*C;
                    for (int i = 0; i < C; i++) {
                        val += inp_bt[i] * wrow[i];
                    }
                    out_bt[o] = val;
                }
            }
        }
    }
    

    You could experiment with adding collapse(2), but if your B is large relative to the number of threads you want to use then I don't expect collapse(2) to help much, and it might even hurt.

    Similarly, for your hand-rolled version, I would pull the multi-threading out to the B level (only). If you want to split the work four ways (say), then creating only four threads in total will save you a lot of overhead. It may cost you a little efficiency if that leaves the work unevenly divided, but you have a lot of room for improvement, so you might still be better off overall even then.
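
    For illustration, here is a sketch of that B-level split (untested against your program; BRangeData and compute_b_range are names I made up, and the struct differs from your ThreadData because each thread now carries a range of batch indices):

    #include <pthread.h>
    #include <stddef.h>

    typedef struct {
        float *out, *inp, *weight, *bias;
        int T, C, OC;
        int b_start, b_end;   // half-open range of batch indices for this thread
    } BRangeData;

    static void* compute_b_range(void* arg) {
        BRangeData* d = (BRangeData*)arg;
        for (int b = d->b_start; b < d->b_end; b++) {
            for (int t = 0; t < d->T; t++) {
                float* out_bt = d->out + (size_t)b * d->T * d->OC + (size_t)t * d->OC;
                float* inp_bt = d->inp + (size_t)b * d->T * d->C  + (size_t)t * d->C;
                for (int o = 0; o < d->OC; o++) {
                    float val = (d->bias != NULL) ? d->bias[o] : 0.0f;
                    float* wrow = d->weight + (size_t)o * d->C;
                    for (int i = 0; i < d->C; i++) {
                        val += inp_bt[i] * wrow[i];
                    }
                    out_bt[o] = val;
                }
            }
        }
        return NULL;
    }

    void matmul_forward(float* out,
                        float* inp, float* weight, float* bias,
                        int B, int T, int C, int OC) {
        enum { NUM_THREADS = 4 };
        pthread_t threads[NUM_THREADS];
        BRangeData data[NUM_THREADS];
        int chunk = (B + NUM_THREADS - 1) / NUM_THREADS;     // ceil(B / NUM_THREADS)
        int started = 0;
        for (int i = 0; i < NUM_THREADS; i++) {
            int b_start = i * chunk;
            int b_end = (b_start + chunk < B) ? (b_start + chunk) : B;
            if (b_start >= b_end) break;                     // handles B < NUM_THREADS
            data[started] = (BRangeData){ out, inp, weight, bias, T, C, OC, b_start, b_end };
            pthread_create(&threads[started], NULL, compute_b_range, &data[started]);
            started++;
        }
        for (int i = 0; i < started; i++) {
            pthread_join(threads[i], NULL);
        }
    }

    With this split, only a handful of threads are created for the entire call, each thread writes a disjoint slice of out, and everything else is read-only, so no mutex is needed; the only synchronization is the final joins.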