debugging cuda gpgpu word2vec

CUDA code runs when compiled with sm_35, but fails with sm_30


My GPU is a GeForce GT 750M, which I found has compute capability 3.0. I downloaded the CUDA code found here: https://github.com/ChenglongChen/word2vec_cbow. Its makefile uses the flag -arch=sm_35.

Since my device is compute capability 3.0, I changed the flag to -arch=sm_30. It compiled fine, but when I run the code, I get the following error:

word2vec.cu 449 : unspecified launch failure

word2vec.cu 449 : unspecified launch failure

The error appears multiple times because multiple CPU threads launch the CUDA kernel. Please note that the threads do not use different streams to launch the kernel, so the kernel launches are all in order.

When I leave the flag unchanged, i.e. -arch=sm_35, the code runs fine. Can someone please explain why the code won't run when I set the flag to match my device?


Solution

  • Unfortunately your conclusion that the code works when compiled for sm_35 and run on an sm_30 GPU is incorrect. The culprit is this:

    void cbow_cuda(long window, long negative, float alpha, long sentence_length, 
                   int *sen, long layer1_size, float *syn0, long hs, float *syn1, 
                   float *expTable, int *vocab_codelen, char *vocab_code,
                   int *vocab_point, int *table, long table_size, 
                   long vocab_size, float *syn1neg){
        int blockSize = 256;
        int gridSize = (sentence_length)/(blockSize/32);
        size_t smsize = (blockSize/32)*(2*layer1_size+3)*sizeof(float);
    //printf("sm size is %d\n", smsize);
    //fflush(stdout);
        cbow_kernel<1><<<gridSize, blockSize, smsize>>>
                       (window, negative, alpha, sentence_length, sen,
                        layer1_size, syn0, syn1, expTable, vocab_codelen,
                        vocab_code, vocab_point, table, table_size,
                        vocab_size, syn1neg);
    }
    

    Because its API error checking is incomplete, this code fails silently when the kernel launch fails. And the kernel launch does fail if you build for sm_35 and run on sm_30. If you change the code of that function to this (adding kernel launch error checking):

    void cbow_cuda(long window, long negative, float alpha, long sentence_length, 
                   int *sen, long layer1_size, float *syn0, long hs, float *syn1, 
                   float *expTable, int *vocab_codelen, char *vocab_code,
                   int *vocab_point, int *table, long table_size, 
                   long vocab_size, float *syn1neg){
        int blockSize = 256;
        int gridSize = (sentence_length)/(blockSize/32);
        size_t smsize = (blockSize/32)*(2*layer1_size+3)*sizeof(float);
    //printf("sm size is %d\n", smsize);
    //fflush(stdout);
        cbow_kernel<1><<<gridSize, blockSize, smsize>>>
                       (window, negative, alpha, sentence_length, sen,
                        layer1_size, syn0, syn1, expTable, vocab_codelen,
                        vocab_code, vocab_point, table, table_size,
                        vocab_size, syn1neg);
        checkCUDAError( cudaPeekAtLastError() );
    }
    

    and compile and run it for sm_35, you should get this on an sm_30 device:

    ~/cbow/word2vec_cbow$ make
    nvcc word2vec.cu -o word2vec -O3 -Xcompiler -march=native -w  -Xptxas="-v" -arch=sm_35 -lineinfo
    ptxas info    : 0 bytes gmem
    ptxas info    : Compiling entry function '_Z11cbow_kernelILx1EEvllflPKilPVfS3_PKfS1_PKcS1_S1_llS3_' for 'sm_35'
    ptxas info    : Function properties for _Z11cbow_kernelILx1EEvllflPKilPVfS3_PKfS1_PKcS1_S1_llS3_
        0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas info    : Used 34 registers, 448 bytes cmem[0], 8 bytes cmem[2]
    
    ~/cbow/word2vec_cbow$ ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 7 -negative 1 -hs 1 -sample 1e-3 -threads 1 -binary 1 -save-vocab voc #> out 2>&1
    Starting training using file text8
    Vocab size: 71290
    Words in train file: 16718843
    vocab size = 71290
    cbow.cu 114 : invalid device function
    

    i.e. the kernel launch failed because no appropriate device code was found in the CUDA cubin payload in your application. This also answers your earlier question about why the output of this code is incorrect: the analysis kernel simply never runs on your hardware when built with the default options.
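
    To make the mismatch concrete: -arch=sm_35 embeds sm_35 machine code plus compute_35 PTX, and neither can be used on a compute capability 3.0 device, because PTX can only be JIT-compiled for the same or a newer architecture. Below is a minimal sketch of how you could confirm this at run time. It is not part of the repository: probe_kernel, smNumber and hasDeviceCodeFor are hypothetical names, and probe_kernel merely stands in for cbow_kernel. The sketch assumes that cudaFuncGetAttributes reports an error when the executable holds no code the current GPU can use.

    ```cuda
    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical stand-in for cbow_kernel; any __global__ function works here.
    __global__ void probe_kernel(int *out) {
        if (out != nullptr) *out = 1;
    }

    // Pure host helper: encode a compute capability the way nvcc names it,
    // e.g. (3, 0) becomes 30 and (3, 5) becomes 35.
    int smNumber(int major, int minor) {
        return major * 10 + minor;
    }

    // Returns true when the runtime finds device code for probe_kernel that is
    // usable on `device`, and prints what it found.
    bool hasDeviceCodeFor(int device) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
        if (cudaSetDevice(device) != cudaSuccess) return false;
        std::printf("%s is sm_%d\n", prop.name, smNumber(prop.major, prop.minor));

        cudaFuncAttributes attr;
        cudaError_t err = cudaFuncGetAttributes(&attr, probe_kernel);
        if (err != cudaSuccess) {
            // An sm_35-only build queried on an sm_30 GPU is expected to end up here.
            std::printf("no usable device code: %s\n", cudaGetErrorString(err));
            return false;
        }
        // binaryVersion and ptxVersion use the same major * 10 + minor encoding.
        std::printf("loaded code: binary sm_%d, PTX compute_%d\n",
                    attr.binaryVersion, attr.ptxVersion);
        return true;
    }
    ```

    Calling hasDeviceCodeFor(0) from main in a binary built with -arch=sm_35 should report the failure on a compute capability 3.0 GPU, while a build with -arch=sm_30 should print the loaded versions instead.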

    If I build this code for sm_30 and run it on a GTX 670 with 2 GB of memory (compute capability 3.0), I get this:

    ~/cbow/word2vec_cbow$ make
    nvcc word2vec.cu -o word2vec -O3 -Xcompiler -march=native -w  -Xptxas="-v" -arch=sm_30 -lineinfo
    ptxas info    : 0 bytes gmem
    ptxas info    : Compiling entry function '_Z11cbow_kernelILx1EEvllflPKilPVfS3_PKfS1_PKcS1_S1_llS3_' for 'sm_30'
    ptxas info    : Function properties for _Z11cbow_kernelILx1EEvllflPKilPVfS3_PKfS1_PKcS1_S1_llS3_
        0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas info    : Used 34 registers, 448 bytes cmem[0], 12 bytes cmem[2]
    
    ~/cbow/word2vec_cbow$ ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 7 -negative 1 -hs 1 -sample 1e-3 -threads 1 -binary 1 -save-vocab voc #> out 2>&1
    Starting training using file text8
    Vocab size: 71290
    Words in train file: 16718843
    vocab size = 71290
    Alpha: 0.000009  Progress: 100.00%  Words/thread/sec: 1217.45k
    

    i.e. the code runs correctly to completion without any errors. I can't tell you why the code fails on your hardware, because I cannot reproduce your error on mine. You will need to do some debugging on your own to find the root cause.
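
    As a starting point for that debugging, here is a minimal sketch of checking a kernel launch in two stages. cudaOk, CUDA_OK, scale and runScale are hypothetical names, not the repository's checkCUDAError or cbow_kernel. The point it illustrates: cudaPeekAtLastError only catches launch-time problems such as "invalid device function", whereas a fault while the kernel executes (for example an out-of-bounds access) is reported as "unspecified launch failure" by a later synchronizing call, which may be far from the launch that caused it.

    ```cuda
    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical helper (not the repository's checkCUDAError): prints a failed
    // CUDA call as "file line : message" and returns true only on success.
    bool cudaOk(cudaError_t err, const char *file, int line) {
        if (err == cudaSuccess) return true;
        std::fprintf(stderr, "%s %d : %s\n", file, line, cudaGetErrorString(err));
        return false;
    }
    #define CUDA_OK(call) cudaOk((call), __FILE__, __LINE__)

    // Trivial stand-in kernel; cbow_kernel would take its place.
    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    // Launches the kernel on n > 0 elements and checks both error stages.
    bool runScale(int n) {
        float *d_data = nullptr;
        if (!CUDA_OK(cudaMalloc(&d_data, n * sizeof(float)))) return false;

        bool ok = CUDA_OK(cudaMemset(d_data, 0, n * sizeof(float)));
        if (ok) {
            scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
            // Stage 1: was the launch itself accepted?
            ok = CUDA_OK(cudaPeekAtLastError());
            // Stage 2: did the kernel run to completion without faulting?
            ok = ok && CUDA_OK(cudaDeviceSynchronize());
        }
        cudaFree(d_data);
        return ok;
    }
    ```

    Synchronizing after every launch costs performance, so I would only enable it while tracking down which kernel launch actually faults.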