Tags: cuda, pycuda

Why is PyCUDA faster than C CUDA in this example?


I am exploring a move from OpenCL to CUDA, and I ran a few tests to benchmark the speed of CUDA across different implementations. To my surprise, in the examples below, the PyCUDA implementation is about 20% faster than the C CUDA one.

I read many posts about building C CUDA code in "release" mode. I tried adding -Xptxas -O3 to the makefile, and it made no real difference. I also tried adjusting the block size with which the kernel was executed; unfortunately, that did not improve the speed either.

My question here is: why is the PyCUDA implementation faster, and what changes would make the C CUDA version at least match it?

While I appreciate general comments, I am looking for actionable suggestions that I can validate on my machine. Thanks!

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np

from pycuda.compiler import SourceModule
import time


mod = SourceModule(
    """
__global__ void saxpy(int n, const float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n){
        y[i] = a * x[i] + y[i];
    }
}
"""
)

saxpy = mod.get_function("saxpy")

N = 1 << 25
time_elapse = 0.0

for i in range(100):
    # print(i)
    # print(N)

    x = np.ones(N).astype(np.float32)
    y = 2 * np.ones(N).astype(np.float32)
    start = time.time()
    saxpy(
        np.int32(N),
        np.float32(2.0),
        drv.In(x),
        drv.InOut(y),
        block=(512, 1, 1),
        grid=(int(N / 512) + 1, 1),
    )
    time_elapse += (time.time() - start)


print(time_elapse)
print(y[-100:-1])
print(y.sum())
print(N * 4.0)

Here is the equivalent C CUDA implementation:

#include <stdio.h>
#include <time.h>
#define DIM 512

__global__ void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main(int num_iterations)
{
    double start;
    double cputime = 0.0;
    int N = 1 << 25;
    float *x, *y, *d_x, *d_y;
    int i, j;
    for (j = 0; j < num_iterations; j++)
    {
        x = (float *)malloc(N * sizeof(float));
        y = (float *)malloc(N * sizeof(float));

        cudaMalloc(&d_x, N * sizeof(float));
        cudaMalloc(&d_y, N * sizeof(float));

        for (i = 0; i < N; i++)
        {
            x[i] = 1.0f;
            y[i] = 2.0f;
        }

        cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y, N * sizeof(float), cudaMemcpyHostToDevice);

        // Perform SAXPY on 32M elements
        start = clock();
        saxpy<<<(N + DIM) / DIM, DIM>>>(N, 2.0f, d_x, d_y);
        cputime += ((double)(clock() - start) / CLOCKS_PER_SEC);
        cudaMemcpy(y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);

        // float maxError = 0.0f;
        // for (int i = 0; i < N; i++){
        //     maxError = max(maxError, abs(y[i] - 4.0f));
        //     //printf("y[%d]: %f\n", i,y[i]);
        // }
        // printf("Max error: %f\n", maxError);

        cudaFree(d_x);
        cudaFree(d_y);
        free(x);
        free(y);
    }

    printf("cpu time is %f\n", cputime);
    return 0;
}

I saved the above file as cuda_example.cu and compiled it with the following command in a makefile:

nvcc -arch=sm_61 -Xptxas -O3,-v -o main cuda_example.cu


Solution

  • If I execute your CUDA-C code as is, and set num_iterations to 300 like this:

    int num_iterations = 300;
    

    then the execution of your program takes about 60 s on a GeForce GTX 1650. Your code is extremely inefficient, because at every iteration it reallocates the buffers and copies the data back and forth between host and device. So, let's restrict the loop to just the kernel execution:

    #include <stdio.h>
    #include <time.h>
    #define DIM 512
    
    __global__ void saxpy(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }
    
    int main()
    {
        double start = clock();
        int N = 1 << 25;
        float *x, *y, *d_x, *d_y;
        int i, j;
    
        int num_iterations = 300;
        x = (float *)malloc(N * sizeof(float));
        y = (float *)malloc(N * sizeof(float));
    
        cudaMalloc(&d_x, N * sizeof(float));
        cudaMalloc(&d_y, N * sizeof(float));
    
        for (i = 0; i < N; i++)
        {
            x[i] = 1.0f;
            y[i] = 2.0f;
        }
        cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y, N * sizeof(float), cudaMemcpyHostToDevice);
    
        // Only the kernel launches are repeated; allocation and copies happen once.
        for (j = 0; j < num_iterations; j++){
            saxpy<<<(N + DIM) / DIM, DIM>>>(N, 2.0f, d_x, d_y);
            cudaDeviceSynchronize();
        }
        cudaMemcpy(y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_x);
        cudaFree(d_y);
        free(x);
        free(y);
    
        double cputime = ((double)(clock() - start) / CLOCKS_PER_SEC);
        printf("cpu time is %f\n", cputime);
        return 0;
    }
    

    If I do that, the execution time becomes 1.36 seconds. Doing something similar to the PyCUDA code, I got about 19 s of execution time.
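
    For reference, here is a minimal sketch of what "doing something similar" could look like on the PyCUDA side (my reconstruction, not code from the question): allocate and copy the arrays to the device once with explicit mem_alloc/memcpy_htod calls instead of the drv.In/drv.InOut wrappers (which re-copy the arrays on every launch), run the kernel in a loop, and synchronize before reading the clock, since kernel launches are asynchronous:

    import time
    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void saxpy(int n, const float a, float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }
    """)
    saxpy = mod.get_function("saxpy")

    N = 1 << 25
    num_iterations = 300

    x = np.ones(N, dtype=np.float32)
    y = 2 * np.ones(N, dtype=np.float32)

    # Allocate and copy to the device once, outside the timed loop.
    d_x = drv.mem_alloc(x.nbytes)
    d_y = drv.mem_alloc(y.nbytes)
    drv.memcpy_htod(d_x, x)
    drv.memcpy_htod(d_y, y)

    start = time.time()
    for _ in range(num_iterations):
        saxpy(np.int32(N), np.float32(2.0), d_x, d_y,
              block=(512, 1, 1), grid=((N + 511) // 512, 1))
    # Kernel launches are asynchronous: wait for them all to finish
    # before reading the clock, otherwise only the launch overhead is timed.
    drv.Context.synchronize()
    print("elapsed: %f s" % (time.time() - start))

    drv.memcpy_dtoh(y, d_y)

    Timed this way, both versions measure the same thing (kernel execution only), so the comparison between PyCUDA and C CUDA becomes meaningful.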