The GPU device that I have is GeForce GT 750M, which I found is compute capability 3.0. I downloaded the CUDA code found here: (https://github.com/ChenglongChen/word2vec_cbow. Its makefile had the flag -arch=sm_35
in makefile
.
Since my device is compute capability 3.0, I changed the flag to -arch=sm_30
. It compiled fine, but when I run the code, I get the following error:
word2vec.cu 449 : unspecified launch failure
word2vec.cu 449 : unspecified launch failure
It shows it multiple times, because there are multiple CPU threads launching the CUDA kernel. Please note that the threads do not use different streams to launch the kernel, so the kernel launches are all in order.
Now, when I let the flag be, i.e. -arch=sm_35
, then the code runs fine. Can someone please explain why the code won't run when I set the flag to match my device?
Unfortunately your conclusion that the code works when compiled for sm_35 and run on an sm_30 GPU is incorrect. The culprit is this:
void cbow_cuda(long window, long negative, float alpha, long sentence_length,
int *sen, long layer1_size, float *syn0, long hs, float *syn1,
float *expTable, int *vocab_codelen, char *vocab_code,
int *vocab_point, int *table, long table_size,
long vocab_size, float *syn1neg){
int blockSize = 256;
int gridSize = (sentence_length)/(blockSize/32);
size_t smsize = (blockSize/32)*(2*layer1_size+3)*sizeof(float);
//printf("sm size is %d\n", smsize);
//fflush(stdout);
cbow_kernel<1><<<gridSize, blockSize, smsize>>>
(window, negative, alpha, sentence_length, sen,
layer1_size, syn0, syn1, expTable, vocab_codelen,
vocab_code, vocab_point, table, table_size,
vocab_size, syn1neg);
}
This code will silently fail if the kernel launch fails because of incomplete API error checking. And the kernel launch does fail if you build for sm_35 and run on sm_30. If you change the code of that function to this (adding kernel launch error checking):
void cbow_cuda(long window, long negative, float alpha, long sentence_length,
int *sen, long layer1_size, float *syn0, long hs, float *syn1,
float *expTable, int *vocab_codelen, char *vocab_code,
int *vocab_point, int *table, long table_size,
long vocab_size, float *syn1neg){
int blockSize = 256;
int gridSize = (sentence_length)/(blockSize/32);
size_t smsize = (blockSize/32)*(2*layer1_size+3)*sizeof(float);
//printf("sm size is %d\n", smsize);
//fflush(stdout);
cbow_kernel<1><<<gridSize, blockSize, smsize>>>
(window, negative, alpha, sentence_length, sen,
layer1_size, syn0, syn1, expTable, vocab_codelen,
vocab_code, vocab_point, table, table_size,
vocab_size, syn1neg);
checkCUDAError( cudaPeekAtLastError() );
}
and compile and run it for sm_35, you should get this on an sm_30 device:
~/cbow/word2vec_cbow$ make
nvcc word2vec.cu -o word2vec -O3 -Xcompiler -march=native -w -Xptxas="-v" -arch=sm_35 -lineinfo
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z11cbow_kernelILx1EEvllflPKilPVfS3_PKfS1_PKcS1_S1_llS3_' for 'sm_35'
ptxas info : Function properties for _Z11cbow_kernelILx1EEvllflPKilPVfS3_PKfS1_PKcS1_S1_llS3_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 34 registers, 448 bytes cmem[0], 8 bytes cmem[2]
~/cbow/word2vec_cbow$ ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 7 -negative 1 -hs 1 -sample 1e-3 -threads 1 -binary 1 -save-vocab voc #> out 2>&1
Starting training using file text8
Vocab size: 71290
Words in train file: 16718843
vocab size = 71290
cbow.cu 114 : invalid device function
ie. the kernel launch failed because no appropriate device code was found in the CUDA cubin payload in your application. This also answers your earlier question about why the output of this code is incorrect. The analysis kernel simply never runs on your hardware when built with the default options.
If I build this code for sm_30 and run it on a GTX 670 with 2gb of memory (compute capability 3.0), I get this:
~/cbow/word2vec_cbow$ make
nvcc word2vec.cu -o word2vec -O3 -Xcompiler -march=native -w -Xptxas="-v" -arch=sm_30 -lineinfo
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z11cbow_kernelILx1EEvllflPKilPVfS3_PKfS1_PKcS1_S1_llS3_' for 'sm_30'
ptxas info : Function properties for _Z11cbow_kernelILx1EEvllflPKilPVfS3_PKfS1_PKcS1_S1_llS3_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 34 registers, 448 bytes cmem[0], 12 bytes cmem[2]
~/cbow/word2vec_cbow$ ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 7 -negative 1 -hs 1 -sample 1e-3 -threads 1 -binary 1 -save-vocab voc #> out 2>&1
Starting training using file text8
Vocab size: 71290
Words in train file: 16718843
vocab size = 71290
Alpha: 0.000009 Progress: 100.00% Words/thread/sec: 1217.45k
ie. the code runs correctly to completion without any errors. I can't tell you why you are not able to get the code to run on your hardware because I cannot reproduce your error on my hardware. You will need to do some debugging on your own to find the root cause of that.