I want to use my GPU instead of my CPU for threading, but I'm not really sure how to do that. I tried doing something like this:
vector<int> data_array = readfile();
int array_size = data_array.size();
int iterations = 25;
vector<person> result_array;
run_on_GPU<<<8, 32>>>(data_array, result_array, array_size, iterations);
cudaDeviceSynchronize();
for (int i = 0; i < result_array.size(); i++) {
    if (result_array[i] == condition) break;
    output_file << result_array[i].encoded << endl;
}
I want something like this. I tried using ChatGPT, but it still didn't run.
The program did not work, and I got an error like this:
CUDA Error: invalid argument at launch. Error in file <secret :)> at line 48: cudaDeviceSynchronize() returned error 11 (cudaErrorInvalidConfiguration)
It seems you forgot to allocate device memory before launching the kernel.
You should first do something like this:
Instead of DataClass
and ResultClass
datatypes use your own, theese are just for an example.
DataClass* device_entries = NULL;
ResultClass* device_results = NULL;
cudaMalloc(&device_entries, entry_count * sizeof(DataClass));
cudaMalloc(&device_results, entry_count * sizeof(ResultClass));
Here entry_count is the number of elements in your data array.
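Since you are already running into CUDA errors, it also helps to check the return value of every CUDA call; cudaMalloc, cudaMemcpy and cudaDeviceSynchronize all return a cudaError_t. Here is a minimal sketch of a checking macro (CUDA_CHECK is my own name, not part of the CUDA API; it needs <cstdio> and <cstdlib>):

#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",           \
                    __FILE__, __LINE__, cudaGetErrorString(err));  \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// usage:
CUDA_CHECK(cudaMalloc(&device_entries, entry_count * sizeof(DataClass)));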
Then copy the actual data to the GPU using these lines:
cudaMemcpy(device_entries, &entries[0], entry_count * sizeof(DataClass), cudaMemcpyHostToDevice);
cudaDeviceSynchronize();
cudaMemcpyHostToDevice, as the name suggests, copies memory from the host to the device (the GPU). We will use the same call the other way around (cudaMemcpyDeviceToHost) later.
The block_count and block_size values are your own choice, but block_size should be a multiple of 32, since the GPU schedules threads in warps of 32; a common sizing pattern is sketched below.
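A common way to pick them is to fix block_size and compute block_count with a rounding-up division, so every entry gets a thread even when entry_count is not a multiple of block_size. A minimal sketch:

int block_size = 256;  // a multiple of 32 (the warp size)
int block_count = (int)((entry_count + block_size - 1) / block_size);  // round up

Note that your original launch, <<<8, 32>>>, creates only 256 threads in total; if your array is larger than that, the remaining entries are never processed.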
The kernel launch will then look something like this:
run_on_GPU<<<block_count, block_size>>>(device_entries, device_results, entry_count);
cudaDeviceSynchronize();
The run_on_GPU kernel should look something like this:
__global__ void run_on_GPU(DataClass* entries, ResultClass* results, size_t entry_count)
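Inside the kernel, each thread computes its own global index from its block and thread IDs, and skips any index past the end of the array (the last block usually has spare threads). A minimal sketch; transform() is a made-up placeholder for whatever per-entry work you need:

__global__ void run_on_GPU(DataClass* entries, ResultClass* results, size_t entry_count)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= entry_count) return;      // spare threads in the last block do nothing
    results[i] = transform(entries[i]);  // placeholder: replace with your real computation
}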
To get the results back from the GPU, you have to copy the memory from device to host:
ResultClass* results = (ResultClass*)malloc(entry_count * sizeof(ResultClass));
cudaMemcpy(results, device_results, entry_count * sizeof(ResultClass), cudaMemcpyDeviceToHost);
Before ending the program, don't forget to free the memory you used:
free(results);
cudaFree(device_entries);
cudaFree(device_results);
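Putting it all together, here is a minimal end-to-end sketch that compiles with nvcc. It squares plain int entries instead of using your person class, so treat it as a template rather than a drop-in solution:

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void run_on_GPU(const int* entries, int* results, size_t entry_count)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= entry_count) return;           // spare threads do nothing
    results[i] = entries[i] * entries[i];   // placeholder per-entry work
}

int main()
{
    std::vector<int> entries = {1, 2, 3, 4, 5};
    size_t entry_count = entries.size();

    // allocate device memory
    int* device_entries = NULL;
    int* device_results = NULL;
    cudaMalloc(&device_entries, entry_count * sizeof(int));
    cudaMalloc(&device_results, entry_count * sizeof(int));

    // copy the input to the device
    cudaMemcpy(device_entries, entries.data(), entry_count * sizeof(int),
               cudaMemcpyHostToDevice);

    // round block_count up so every entry gets a thread
    int block_size = 256;
    int block_count = (int)((entry_count + block_size - 1) / block_size);
    run_on_GPU<<<block_count, block_size>>>(device_entries, device_results, entry_count);
    cudaDeviceSynchronize();

    // copy the results back to the host
    std::vector<int> results(entry_count);
    cudaMemcpy(results.data(), device_results, entry_count * sizeof(int),
               cudaMemcpyDeviceToHost);

    for (size_t i = 0; i < entry_count; i++)
        printf("entries[%zu]^2 = %d\n", i, results[i]);

    cudaFree(device_entries);
    cudaFree(device_results);
    return 0;
}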