Ways to optimize Python code with GPU/PyOpenCL: calling an external function inside a kernel

I have used the following command to profile my Python code :

python2.7 -m cProfile -o X2_non_flat_multiprocessing_dummy.prof X2_non_flat.py
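For reference, a file written with the -o option can also be inspected in text form with the standard pstats module. The following is only a minimal sketch: slow_sum and top_functions are hypothetical names, and a small temporary profile stands in for the real .prof file.

```python
import cProfile
import os
import pstats
import tempfile


def slow_sum(n):
    # Hypothetical stand-in for the real workload in X2_non_flat.py.
    return sum(i * i for i in range(n))


def top_functions(prof_path, limit=10):
    """Print the `limit` most expensive entries, sorted by cumulative time."""
    stats = pstats.Stats(prof_path)
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return stats


# Produce a small profile file, comparable to what `-m cProfile -o FILE` writes.
prof_path = os.path.join(tempfile.mkdtemp(), "demo.prof")
profiler = cProfile.Profile()
profiler.runcall(slow_sum, 100000)
profiler.dump_stats(prof_path)

stats = top_functions(prof_path)
```

Pointing top_functions at X2_non_flat_multiprocessing_dummy.prof instead of the temporary file should list the same functions that the graphical view highlights.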

Then I can visualize how the run time is distributed among the most expensive functions.

The profile shows that a lot of time is spent in Pobs_C and in the interpolation routine, which correspond to the following code snippet:

def Pobs_C(z, zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, C_IAp, A_IAp, n_IAp, B_IAp, E_T, R_T, DG_T_fid, DG_T, WGT_T, WT_T, WIAT_T, cl, P_dd_spec, RT500):
    # Evaluate each spline only where its argument lies in (0.0005, 35); other entries stay 0.
    P_dd_ok = np.zeros(len(z_pk))
    for cc in range(len(z_pk)):
        x = (cl + 0.5) / RT500[cc]
        if 0.0005 < x < 35:
            P_dd_ok[cc] = P_dd_spec[cc](x)

    P_dd_ok = CubicSpline(z_pk, P_dd_ok)
    if paramo == 8:
        P_dd_ok = P_dd_ok(z)*(DG_T(z)/DG_T_fid(z))**2
    else:
        P_dd_ok = P_dd_ok(z)

    # C_gg is set to zero when paramo is 9, 10 or 11.
    if paramo not in (9, 10, 11):
        C_gg = c/(100.*h_p)*0.5*delta_zpm*np.sum((F_dd_GG(z[1:], zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, E_T(z[1:]), R_T(z[1:]), WGT_T[aa][1:], WGT_T[bb][1:], DG_T(z[1:]), P_dd_ok[1:]) + F_dd_GG(z[:-1], zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, E_T(z[:-1]), R_T(z[:-1]), WGT_T[aa][:-1], WGT_T[bb][:-1], DG_T(z[:-1]), P_dd_ok[:-1]))) + P_shot_GC(zi, zj)
    else:
        C_gg = 0.
    if paramo < 12:
        C_ee = c/(100.*h_p)*0.5*delta_zpm*(np.sum(F_dd_LL(z[1:], zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, E_T(z[1:]), R_T(z[1:]), WT_T[aa][1:], WT_T[bb][1:], DG_T(z[1:]), P_dd_ok[1:]) + F_dd_LL(z[:-1], zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, E_T(z[:-1]), R_T(z[:-1]), WT_T[aa][:-1], WT_T[bb][:-1], DG_T(z[:-1]), P_dd_ok[:-1])) + np.sum(F_IA_d(z[1:], zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, C_IAp, A_IAp, n_IAp, B_IAp, E_T(z[1:]), R_T(z[1:]), DG_T(z[1:]), WT_T[aa][1:], WT_T[bb][1:], WIAT_T[aa][1:], WIAT_T[bb][1:], P_dd_ok[1:]) + F_IA_d(z[:-1], zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, C_IAp, A_IAp, n_IAp, B_IAp, E_T(z[:-1]), R_T(z[:-1]), DG_T(z[:-1]), WT_T[aa][:-1], WT_T[bb][:-1], WIAT_T[aa][:-1], WIAT_T[bb][:-1], P_dd_ok[:-1])) + np.sum(F_IAIA(z[1:], zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, C_IAp, A_IAp, n_IAp, B_IAp, E_T(z[1:]), R_T(z[1:]), DG_T(z[1:]), WIAT_T[aa][1:], WIAT_T[bb][1:], P_dd_ok[1:]) + F_IAIA(z[:-1], zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, C_IAp, A_IAp, n_IAp, B_IAp, E_T(z[:-1]), R_T(z[:-1]), DG_T(z[:-1]), WIAT_T[aa][:-1], WIAT_T[bb][:-1], P_dd_ok[:-1]))) + P_shot_WL(zi, zj)
    else:
        C_ee = 0.
    C_gl = c/(100.*h_p)*0.5*delta_zpm*np.sum((F_dd_GL(z[1:], zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, C_IAp, A_IAp, n_IAp, B_IAp, E_T(z[1:]), R_T(z[1:]), DG_T(z[1:]), WGT_T[aa][1:], WT_T[bb][1:], WIAT_T[bb][1:], P_dd_ok[1:]) + F_dd_GL(z[:-1], zi, zj, h_p, wm_p, wDE_p, w0_p, wa_p, C_IAp, A_IAp, n_IAp, B_IAp, E_T(z[:-1]), R_T(z[:-1]), DG_T(z[:-1]), WGT_T[aa][:-1], WT_T[bb][:-1], WIAT_T[bb][:-1], P_dd_ok[:-1])))
    return C_gg, C_ee, C_gl

1) Main question: Is there a way to implement a GPU/OpenCL layer in this routine, especially for CubicSpline or for the whole Pobs_C function? What alternatives would allow me to reduce the time spent in Pobs_C and in its inner call to CubicSpline?

I have some basic knowledge of OpenCL (not PyOpenCL), for example the map-reduce method or solving the 2D heat equation with a classical kernel.

2) Previous feedback: I know that I cannot expect a speed-up by naively calling an external function inside a kernel on the grounds that the GPU can perform many calls in parallel. Instead, I should put the full content of the different functions inside the kernel to get any optimization. Do you agree, and can you confirm this? Put differently, can I call, from inside the kernel code, an external function (I mean a function defined outside the kernel, i.e. in the classical part of the code, called the host code)?

3) Optional question: Could I instead declare this external function inside the kernel? Is that possible with an explicit declaration? That would avoid copying the whole content of every function that could potentially be GPU-parallelized.

PS: Sorry if this is a general topic, but it will help me see more clearly the available ways to include GPU/OpenCL in my code above and then optimize it.

Solution

Is there a way to implement a GPU/OpenCL layer in this routine, especially for CubicSpline or the whole Pobs_C function

In all probability, no. Most of the time in the profile seems to be spent in 12 million polynomial evaluations, and each evaluation call takes only about 6 microseconds on the CPU (roughly 72 seconds in total). It is unclear whether that operation exposes enough parallelism to be worth offloading, and GPUs only pay off for tasks that are massively, ideally embarrassingly, parallel.
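When each call is this cheap, the fixed cost of a Python-level call can matter as much as the arithmetic, so one CPU-side alternative worth checking is to evaluate a spline once on a whole array instead of once per point. The sketch below uses hypothetical data and helper names (eval_scalar_loop, eval_vectorized) and only illustrates that both forms give the same result. Whether it applies to Pobs_C is an open question: in the posted loop, each P_dd_spec[cc] is a different spline evaluated at a single point, so batching would require restructuring.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical data: a smooth function sampled on a coarse grid.
x_nodes = np.linspace(0.0, 10.0, 50)
spline = CubicSpline(x_nodes, np.sin(x_nodes))

x_eval = np.linspace(0.0, 10.0, 10000)


def eval_scalar_loop(sp, xs):
    # One Python-level call per point: the call overhead is paid len(xs) times.
    out = np.empty(len(xs))
    for i, x in enumerate(xs):
        out[i] = sp(x)
    return out


def eval_vectorized(sp, xs):
    # One call for the whole array: the call overhead is paid once.
    return sp(xs)


y_loop = eval_scalar_loop(spline, x_eval)
y_vec = eval_vectorized(spline, x_eval)
```

Timing the two functions (for example with timeit) on the real data would show whether call overhead is a significant part of the 6 microseconds per evaluation.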

So, can I declare inside the kernel code a call to an extern function (I mean a function not inside the kernel, i.e. the classical part of the code, called host code)?

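No. That is impossible: a kernel is compiled for and runs on the device, so it can only call functions that are part of the device program, not host-side Python functions. It is also difficult to see what benefit that could bring, given that the Python code has to run on the host CPU anyway.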

Maybe can I declare this extern function inside the kernel: is it possible by doing explicitly this declaration inside?

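No. A declaration alone does not make a host function available on the device; any function the kernel calls has to be defined in OpenCL C as part of the device program.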
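To illustrate what is possible instead, here is a minimal PyOpenCL sketch in which the helper is written in OpenCL C and lives in the same program source as the kernel. All names (horner3, eval_cubic, eval_cubic_on_device, eval_cubic_on_host) are hypothetical, the example evaluates a single cubic polynomial rather than a real spline, and it assumes pyopencl and an OpenCL runtime are installed.

```python
import numpy as np

# OpenCL C source. `horner3` is a hypothetical helper; the kernel can call it
# only because it is defined in the same device program.
KERNEL_SRC = """
float horner3(float c0, float c1, float c2, float c3, float x)
{
    return ((c3 * x + c2) * x + c1) * x + c0;
}

__kernel void eval_cubic(__global const float *x,
                         __global float *y,
                         const float c0, const float c1,
                         const float c2, const float c3)
{
    int i = get_global_id(0);
    y[i] = horner3(c0, c1, c2, c3, x[i]);
}
"""


def eval_cubic_on_device(x, coeffs):
    """Evaluate c0 + c1*x + c2*x^2 + c3*x^3 for a 1-D array on an OpenCL device."""
    import pyopencl as cl  # imported lazily: needs an OpenCL runtime

    x = np.ascontiguousarray(x, dtype=np.float32)
    ctx = cl.create_some_context(interactive=False)
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags
    x_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
    y_buf = cl.Buffer(ctx, mf.WRITE_ONLY, x.nbytes)
    prg = cl.Program(ctx, KERNEL_SRC).build()
    c0, c1, c2, c3 = (np.float32(c) for c in coeffs)
    # One work-item per element of x.
    prg.eval_cubic(queue, x.shape, None, x_buf, y_buf, c0, c1, c2, c3)
    y = np.empty_like(x)
    cl.enqueue_copy(queue, y, y_buf)
    return y


def eval_cubic_on_host(x, coeffs):
    # Same Horner scheme in NumPy, usable as a reference result.
    c0, c1, c2, c3 = coeffs
    x = np.asarray(x, dtype=np.float32)
    return ((c3 * x + c2) * x + c1) * x + c0
```

A call such as eval_cubic_on_device(np.linspace(0, 1, 1024), (1.0, 2.0, 3.0, 4.0)) can be compared against eval_cubic_on_host on the same input. Note that the data has to be copied to the device and back, which is part of why offloading such cheap evaluations is unlikely to pay off.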