I'm currently working on a program that multiplies matrices in C. It receives the size of the matrices on the command line. The program works for matrices up to about size 900, but for matrices larger than that I get the error segmentation fault (core dumped).
After checking many resources available on the internet I'm still unable to solve the problem. Here is the code that I'm using:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
#include <sys/time.h>

int main(int argc, char* argv[])
{
    int tam = atoi(argv[1]);
    int first[tam][tam];
    int second[tam][tam];
    int third[tam][tam];
    int i, j, k, l, f;

    srand(time(NULL));
    omp_set_num_threads(omp_get_num_procs());

    for (i = 0; i < tam; i++)
        for (j = 0; j < tam; j++)
        {
            l = rand();
            f = rand();
            first[i][j]  = l;
            second[i][j] = f;
        }

    #pragma omp parallel for private(i,j,k) shared(first,second,third)
    for (i = 0; i < tam; ++i) {
        for (j = 0; j < tam; ++j) {
            for (k = 0; k < tam; ++k) {
                third[i][j] += first[i][k] * second[k][j];
            }
        }
    }
}
You might have noticed that 900 was not a fixed threshold for going into the [segfault] crashes: three 900 x 900 int matrices already need about 3 x 900 x 900 x 4 B ≈ 9.7 MB of automatic (STACK) storage, which is more than the typical 8 MiB default STACK limit, and that limit differs from system to system.
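To confirm the diagnosis on a given machine, the effective STACK limit can be queried with ulimit -s in the shell, or from C with the POSIX getrlimit() call. A minimal sketch (the printout wording is illustrative only):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) == 0)        // soft limit enforced for this process
        printf("stack soft-limit: %llu bytes\n",  // RLIM_INFINITY prints as a huge number
               (unsigned long long)rl.rlim_cur);
    return 0;
}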
Yet, let's show one of the several ways forward: move the data off the STACK, here into static storage (a genuine HEAP variant using malloc() is sketched after the demo):
The TiO.RUN online, run-verifiable demo is ready to click and run here (array sizes above ~ 8000 x 8000 x 8-B x 3-matrices start to get soft-killed by the platform, as runtimes grow beyond the online-platform demo quota of about one minute of CPU-time).
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
#include <sys/time.h>

#define cTAM 1234

static int m3[cTAM][cTAM],   // data in static storage does not devastate STACK-resources
           m2[cTAM][cTAM],   // data in static storage does not devastate STACK-resources
           m1[cTAM][cTAM];   // data in static storage does not devastate STACK-resources

int main(int argc, char* argv[])
{
    int tam = cTAM;          // proxy for the not-accessible command-line parameter passing
 // int first[ tam][tam];
 // int second[tam][tam];
 // int third[ tam][tam];
    int i, j, k, l, f;

    srand(time(NULL));
 // ------------------------------------ // root-cause problem is NOT related to the art of OpenMP
 // omp_set_num_threads(omp_get_num_procs());

    for (i = 0; i < tam; i++)
        for (j = 0; j < tam; j++)
        {
            l = rand();
            f = rand();
         // first[i][j]  = l;
         // second[i][j] = f;
            m1[i][j] = l;
            m2[i][j] = f;
        }

 // ------------------------------------ // root-cause problem is NOT related to the art of OpenMP
 // #pragma omp parallel for private(i,j,k) shared(first,second,third)
    for (i = 0; i < tam; ++i) {
        for (j = 0; j < tam; ++j) {
            for (k = 0; k < tam; ++k) {
             // third[i][j] += first[i][k] * second[k][j];
                m3[i][j] += m1[i][k] * m2[k][j];   // static m3[][] starts zero-initialised
            }
        }
    }
    return 0;
}
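For completeness, a minimal sketch of a genuine HEAP-based variant, assuming the command-line size argument is still wanted and the original multiply is kept; the OpenMP pragma can be re-enabled here, because the root cause was the STACK, not the parallelisation (the flattened 1-D indexing and the fallback size are illustrative choices, not part of your original code):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

int main(int argc, char* argv[])
{
    int tam = (argc > 1) ? atoi(argv[1]) : 1234;                 // fall back to a demo size
    int *first  = malloc((size_t)tam * tam * sizeof *first );   // HEAP, not STACK
    int *second = malloc((size_t)tam * tam * sizeof *second);   // HEAP, not STACK
    int *third  = calloc((size_t)tam * tam, sizeof *third  );   // zero-initialised result
    if (!first || !second || !third) { perror("malloc"); return 1; }

    srand(time(NULL));
    for (int i = 0; i < tam; i++)
        for (int j = 0; j < tam; j++)
        {
            first [i * tam + j] = rand();
            second[i * tam + j] = rand();
        }

    #pragma omp parallel for                      // each thread owns a block of rows i
    for (int i = 0; i < tam; ++i)
        for (int j = 0; j < tam; ++j)
            for (int k = 0; k < tam; ++k)
                third[i * tam + j] += first[i * tam + k] * second[k * tam + j];

    free(first); free(second); free(third);
    return 0;
}

Compiled with something like gcc -O2 -fopenmp, this keeps running for tam well above 900, because the ~3 x tam x tam x 4 B of data now live on the HEAP instead of the STACK.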
A further speed-up will come once second[][] (or m2[][] above) gets transposed before the product loops, as the innermost k-iteration then follows the row-wise memory blocks already pre-fetched into the cache-line (data-access costs drop from about ~1E2 [ns] for a main-memory fetch down to ~0.5 [ns] on cache-hits, i.e. roughly 200 X). It would be worth reading more and learning the technical details about cache-hierarchies, coalescing memory-access-patterns and smart cache-line tricks that increase cache-line re-use, instead of re-fetching data that was already fetched once but then evicted by a poor order of index iterations. You had the same problem with the GPU-kernel code, so it is indeed worth spending a few days on a deep dive into these techniques.
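A minimal, self-contained sketch of that transposition idea (the size N, the helper array m2T[][] and the final printf are illustrative assumptions, not part of the original code):

#include <stdio.h>
#include <stdlib.h>

#define N 1234                                    // illustrative size, mirrors cTAM above

static int m1[N][N], m2[N][N], m2T[N][N], m3[N][N];   // all in static storage, off the STACK

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
        {
            m1[i][j] = rand();
            m2[i][j] = rand();
        }

    for (int k = 0; k < N; ++k)                   // one-off O(N^2) transposition,
        for (int j = 0; j < N; ++j)               // repaid inside the O(N^3) product
            m2T[j][k] = m2[k][j];

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)           // both operands now stream row-wise:
                m3[i][j] += m1[i][k] * m2T[j][k]; // consecutive k hits consecutive addresses

    printf("m3[0][0] = %d\n", m3[0][0]);          // keep the result observable
    return 0;
}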