c++ cuda gpu

How do I copy memory from CPU to GPU using CUDA C++?


I want to use my GPU instead of the CPU for threading, but I'm not really sure how to do that. I tried doing something like this:

auto data_array = readfile();
int array_size = data_array.size();
int iterations = 25;
vector<person> result_array;
run_on_GPU<<<8, 32>>>(data_array, result_array, array_size, iterations);
cudaDeviceSynchronize();

for (int i = 0; i < result_array.size(); i++) {
    if (result_array[i] == condition) break;

    output_file << result_array[i].encoded << endl;
}

I want something like this. I tried using ChatGPT, but the result still didn't run.

The program did not work, and I got an error like this:

CUDA Error: invalid argument at launch. Error in file <secret :)> at line 48: cudaDeviceSynchronize() returned error 11 (cudaErrorInvalidConfiguration)


Solution

  • It seems you forgot to actually allocate device memory before launching the kernel. You should first do something like this (instead of the DataClass and ResultClass types, use your own; these are just examples):

    DataClass* device_entries = NULL;
    ResultClass* device_results = NULL;

    // Reserve room on the GPU for the inputs and the results.
    cudaMalloc(&device_entries, entry_count * sizeof(DataClass));
    cudaMalloc(&device_results, entry_count * sizeof(ResultClass));
    

    Here, entry_count is the number of elements in your data array.
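
    Tying that to your snippet (assuming readfile() returns a std::vector):

    size_t entry_count = data_array.size();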

    After that, you need to copy the actual array to the GPU using these lines:

    cudaMemcpy(device_entries, &entries[0], entry_count * sizeof(DataClass), cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();
    

    cudaMemcpyHostToDevice, as the name suggests, copies the memory from the host to the device (the GPU). We will use the same call in the other direction later.
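
    It also helps to check the return value of every CUDA call so failures show up right away instead of at the next synchronize. A small sketch (CUDA_CHECK is my own helper name, not something from the CUDA toolkit; cudaGetErrorString is a real runtime function):

    #include <cstdio>   // fprintf
    #include <cstdlib>  // exit
    #include <cuda_runtime.h>

    #define CUDA_CHECK(call)                                               \
        do {                                                               \
            cudaError_t err = (call);                                      \
            if (err != cudaSuccess) {                                      \
                fprintf(stderr, "CUDA error: %s at %s:%d\n",               \
                        cudaGetErrorString(err), __FILE__, __LINE__);      \
                exit(EXIT_FAILURE);                                        \
            }                                                              \
        } while (0)

    // For example:
    // CUDA_CHECK(cudaMemcpy(device_entries, &entries[0],
    //                       entry_count * sizeof(DataClass),
    //                       cudaMemcpyHostToDevice));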

    The block_count and block_size are your own choice, but you should use a multiple of 32 (the warp size) for block_size; a sketch for deriving both follows the launch below. The next part will then look something like this:

    run_on_GPU<<<block_count, block_size>>>(device_entries, device_results, entry_count);
    cudaDeviceSynchronize();
    
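    One common way to pick them, assuming one thread per entry (these numbers are just an example, not something your code requires):

    int block_size = 256;  // a multiple of 32, the warp size
    int block_count = (int)((entry_count + block_size - 1) / block_size);  // round up so every entry gets a thread
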

    The run_on_GPU kernel should be declared something like this:

    __global__ void run_on_GPU(DataClass* entries, ResultClass* results, size_t entry_count)
    
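    Inside the kernel, each thread usually computes its own global index and bails out if it is past the end of the array. A minimal sketch of a body (the per-entry work is yours to fill in):

    __global__ void run_on_GPU(DataClass* entries, ResultClass* results, size_t entry_count)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= entry_count) return;  // surplus threads in the last block do nothing

        // ... compute results[i] from entries[i] ...
    }
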

    To get the results from the GPU, you have to copy the memory back from device to host:

    ResultClass* results = (ResultClass*)malloc(entry_count * sizeof(ResultClass));
    cudaMemcpy(results, device_results, entry_count * sizeof(ResultClass), cudaMemcpyDeviceToHost);
    

    Before ending the program, also don't forget to free the memory you used:

    free(results);             // host buffer from malloc
    cudaFree(device_entries);  // device buffers from cudaMalloc
    cudaFree(device_results);
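
    Putting the whole pipeline together, here is a self-contained sketch. DataClass, ResultClass, and the "work" inside the kernel are dummy placeholders rather than your actual program (and I use std::vector instead of malloc for the host buffers), so substitute your own types and logic:

    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    struct DataClass   { int value; };    // stand-in for your input type
    struct ResultClass { int encoded; };  // stand-in for your result type

    __global__ void run_on_GPU(DataClass* entries, ResultClass* results, size_t entry_count)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= entry_count) return;
        results[i].encoded = entries[i].value * 2;  // placeholder work
    }

    int main()
    {
        // Host-side input (stands in for your readfile()).
        std::vector<DataClass> entries(1024);
        for (size_t i = 0; i < entries.size(); ++i) entries[i].value = (int)i;
        size_t entry_count = entries.size();

        // 1. Allocate device memory.
        DataClass* device_entries = NULL;
        ResultClass* device_results = NULL;
        cudaMalloc(&device_entries, entry_count * sizeof(DataClass));
        cudaMalloc(&device_results, entry_count * sizeof(ResultClass));

        // 2. Copy the input to the device.
        cudaMemcpy(device_entries, entries.data(),
                   entry_count * sizeof(DataClass), cudaMemcpyHostToDevice);

        // 3. Launch the kernel.
        int block_size = 256;
        int block_count = (int)((entry_count + block_size - 1) / block_size);
        run_on_GPU<<<block_count, block_size>>>(device_entries, device_results, entry_count);
        cudaDeviceSynchronize();

        // 4. Copy the results back to the host.
        std::vector<ResultClass> results(entry_count);
        cudaMemcpy(results.data(), device_results,
                   entry_count * sizeof(ResultClass), cudaMemcpyDeviceToHost);

        printf("results[10].encoded = %d\n", results[10].encoded);

        // 5. Free the device memory.
        cudaFree(device_entries);
        cudaFree(device_results);
        return 0;
    }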