I am new to the parallelizing paradigm. I already have the algorithm working in its serial form but I can't parallelize it.
I have asked around and some have told me that the way my program is written can't be parallelized.
Here is my serial code, any help or recommendation will be kindly appreciated.
import numpy as np #Importar la librería de numpy para usar sus funciones
if __name__ == "__main__":
#Numero de elementos debe ser par
N = 1_000_000_000#int(input("Número de iteraciones/intervalos (número par): "))
#Declara la funcion a integrar
f = lambda x : x*0.5
#Limites
a = 3#int(input("Dame el limite inferior: "))
b = 10#int(input("Dame el limite superior: "))
#delta de x
dx = (b-a)/N
#División de intervalos
x = np.linspace(a,b,N+1)
y = f(x)
#Suma de cada uno de los intervalos
resultado = dx/3 * np.sum(y[0:-1:2] + 4*y[1::2] + y[2::2])
print("N = " + str(N))
print("El valor de la integral es: ")
print(resultado)
Q : "Is it possible to parallelize this program (Simpson's Rule) in Python?"
A more important side of the same coin is not how to parallelise a program ( or its part ), but at what add-on costs' penalties that will actually happen. From several options how to parallelise, the best one should be selected. Here, due to large data/memory-footprints ( an adverse O( n )-scaling in SpaceDOMAIN, due to sizes above V * 8 * 1E9 [B]
(sum of memory-footprints of all objects used, incl. interim storage variables), that next, indirectly, increases the O( n )-scaling in TimeDOMAIN (duration), due to memory-I/O volume & bottle-necking of all RAM-I/O channels available ) a kind of fine-grain parallelism, called vectorisation, seems to fit best, as it adds almost zero add-on costs, yet helping reduce the RAM-I/O costs for an extremely low-complexity f(x)
-lambda, which makes memory pre-fetches of in-contiguous blocks (due to index-hopping) almost impossible to latency-mask.
There are at least two places, where a code-conversion will help, most on CPU micro-architectures, for which their native vector-instructions can get harnessed in most recent versions for indeed an HPC-grade parallel execution of numpy
-vectorised code.
A performance tuning mandatory disclaimer :
Detailed profiling will show you, for each particular target code-execution platform { x86 | ARM | ... } and its actual UMA / NUMA memory-I/O & CPU-core cache-hierarchy, where your code loses most of the time (and how successfully or how poor does the actual processor cache-hierarchy mask the real-costs of accessing vast footprints of RAM with all the adverse effects of the physical RAM memory-I/O costs as these do not, for 1E9-sizes, fit into CPU-core cache/registers ).
a )
we may spend less time in producing/storing interim objects and use the brave numpy
-code smart-vectorised trick:
y = f( np.linspace( a, # lo-bound
b, # hi-bound
N + 1 # steps
) # no storage of a temporary, yet HUGE object x ~ 8[GB] in RAM
) # +ask lambda to in-flight compute & create just y = f( x[:] )
If in doubt, feel free to read more about the costs of various kinds of computing / storage related access latencies.
b )
we may reduce, at least some parts of the repetitive memory-access patterns, that ( as expressed above, due to sizes of about ~1E9, cannot remain in-cache and will have to be re-executed and re-fetched from physical RAM ) still meet the computation:
# proper fusing of the code against non-even N is left for clarity of vectorised
# and left to the kind user
#_________________________________________________________________ double steps
resultado = ( 4 * np.sum( f( np.linspace( a + dx, # lo-bound
b - dx, # hi-bound
( N / 2 ) + 1 # inner double steps
)
) #--------- lambda on a smaller, RAM/cache-compact object
) #----------- numpy-controls the summation over contiguous RAM
+ 2 * np.sum( f( np.linspace( a + dx + dx, # lo-bound
b - dx - dx, # hi-bound
( N / 2 ) + 1 # inner double steps
)
) #--------- lambda on a smaller, RAM/cache-compact object
) #----------- numpy-controls the summation overcontiguous RAM
+ f( a )
+ f( b )
) * dx / 3
While the mock-up code does not aspire to solve all corner cases, the core-benefit arrives from using RAM-contiguous memory-layouts, that are very efficiently numpy.sum()
-ed and from avoiding of replicated-visits to memory-areas, re-visited just due to imperatively dictated (non-contiguous) index-jumping (numpy
can optimise some of its own indexing so as to maximise memory-access pattern/cache-hits, yet the "outer", coded index-jumping is almost always beyond of the reach of such smart, but hard-wired, numpy
-optimisation know-how ( the less a silicon-based thinking or a clairvoyance ;o) )