opengl cuda video-streaming nvidia nvenc

nvEncRegisterResource() fails with -23


I've hit a complete brick wall in my attempt to use NVEnc to stream OpenGL frames as H264. I've been at this particular issue for close to 8 hours without any progress.

The problem is the call to nvEncRegisterResource(), which invariably fails with code -23 (enum value NV_ENC_ERR_RESOURCE_REGISTER_FAILED, documented as "failed to register the resource" - thanks NVidia).

I'm trying to follow a procedure outlined in this document from the University of Oslo (page 54, "OpenGL interop"), so I know for a fact that this is supposed to work, though unfortunately said document does not provide the code itself.

The idea is fairly straightforward:

  1. map the texture produced by the OpenGL frame buffer object into CUDA;
  2. copy the texture into a (previously allocated) CUDA buffer;
  3. map that buffer as an NVEnc input resource;
  4. use that input resource as the source for the encoding.

As I said, the problem is step (3). Here are the relevant code snippets (I'm omitting error handling for brevity.)

// Round up width and height
priv->encWidth  = (_resolution.w + 31) & ~31;
priv->encHeight = (_resolution.h + 31) & ~31;

// Allocate CUDA "pitched" memory to match the input texture (YUV, one byte per component)
cuErr = cudaMallocPitch(&priv->cudaMemPtr, &priv->cudaMemPitch, 3 * priv->encWidth, priv->encHeight);

This should allocate on-device CUDA memory (the "pitched" variety; I've also tried non-pitched memory, without any change in the outcome).

// Register the CUDA buffer as an input resource
NV_ENC_REGISTER_RESOURCE regResParams = { 0 };
regResParams.version = NV_ENC_REGISTER_RESOURCE_VER;
regResParams.resourceType = NV_ENC_INPUT_RESOURCE_TYPE_CUDADEVICEPTR;
regResParams.width  = priv->encWidth;
regResParams.height = priv->encHeight;
regResParams.bufferFormat = NV_ENC_BUFFER_FORMAT_YUV444_PL;
regResParams.resourceToRegister = priv->cudaMemPtr;
regResParams.pitch = priv->cudaMemPitch;
encStat = nvEncApi.nvEncRegisterResource(priv->nvEncoder, &regResParams);
//                 ^^^ FAILS
priv->nvEncInpRes = regResParams.registeredResource;

This is the brick wall. No matter what I try, nvEncRegisterResource() fails.

I should note that, as far as I can tell (though I may be wrong), I've done all the required initializations. Here is the code that creates and activates the CUDA context:

// Pop the current context
cuRes = cuCtxPopCurrent(&priv->cuOldCtx);

// Create a context for the device
priv->cuCtx = nullptr;
cuRes = cuCtxCreate(&priv->cuCtx, CU_CTX_SCHED_BLOCKING_SYNC, priv->cudaDevice);

// Push our context
cuRes = cuCtxPushCurrent(priv->cuCtx);

... followed by the creation of the encoding session:

// Create an NV Encoder session
NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS nvEncSessParams = { 0 };
nvEncSessParams.apiVersion = NVENCAPI_VERSION;
nvEncSessParams.version = NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS_VER;
nvEncSessParams.deviceType = NV_ENC_DEVICE_TYPE_CUDA;
nvEncSessParams.device = priv->cuCtx; // nullptr
auto encStat = nvEncApi.nvEncOpenEncodeSessionEx(&nvEncSessParams, &priv->nvEncoder);

And finally, the code initializing the encoder:

// Configure the encoder via preset 
NV_ENC_PRESET_CONFIG presetConfig = { 0 };
GUID codecGUID = NV_ENC_CODEC_H264_GUID;
GUID presetGUID = NV_ENC_PRESET_LOW_LATENCY_DEFAULT_GUID;
presetConfig.version = NV_ENC_PRESET_CONFIG_VER;
presetConfig.presetCfg.version = NV_ENC_CONFIG_VER;
encStat = nvEncApi.nvEncGetEncodePresetConfig(priv->nvEncoder, codecGUID, presetGUID, &presetConfig);

NV_ENC_INITIALIZE_PARAMS initParams = { 0 };
initParams.version = NV_ENC_INITIALIZE_PARAMS_VER;
initParams.encodeGUID = codecGUID;
initParams.encodeWidth  = priv->encWidth;
initParams.encodeHeight = priv->encHeight;
initParams.darWidth  = 1;
initParams.darHeight = 1;
initParams.frameRateNum = 25;   // TODO: make this configurable
initParams.frameRateDen = 1;    // ditto
//   .max_surface_count = (num_mbs >= 8160) ? 32 : 48;
//   .buffer_delay ? necessary
initParams.enableEncodeAsync = 0;
initParams.enablePTD = 1;
initParams.presetGUID = presetGUID;
memcpy(&priv->nvEncConfig, &presetConfig.presetCfg, sizeof(priv->nvEncConfig));
initParams.encodeConfig = &priv->nvEncConfig;
encStat = nvEncApi.nvEncInitializeEncoder(priv->nvEncoder, &initParams);

All the above initializations report success.

I'd be extremely grateful to anyone who can get me past this hurdle.


EDIT: here is the complete code to reproduce the problem. The only observable difference from the original code is that cuCtxPopCurrent() returns an error here (which can be ignored) - probably my original program creates such a context as a side effect of using OpenGL. Otherwise, the code behaves exactly as the original does. I've built the code with Visual Studio 2013. You must link the following library file (adapt the path if your installation is not on C:): C:\Program Files (x86)\NVIDIA GPU Computing Toolkit\CUDA\v7.5\lib\Win32\cuda.lib

You must also make sure that C:\Program Files (x86)\NVIDIA GPU Computing Toolkit\CUDA\v7.5\include\ (or similar) is in the include path.

NEW EDIT: modified the code to only use the CUDA driver interface, instead of mixing with the runtime API. Still the same error code.

#ifdef _WIN32
#include <Windows.h>
#else
#include <dlfcn.h>  // dlopen()/dlsym() for the non-Windows path below
#endif
#include <cassert>
#include <GL/gl.h>
#include <iostream>
#include <string>

#include <stdexcept>

#include <cuda.h>
//#include <cuda_runtime.h>
#include <cuda_gl_interop.h>
#include <nvEncodeAPI.h>

// NV Encoder API ---------------------------------------------------

#if defined(_WIN32)
#define LOAD_FUNC(l, s) GetProcAddress(l, s)
#define DL_CLOSE_FUNC(l) FreeLibrary(l)
#else
#define LOAD_FUNC(l, s) dlsym(l, s)
#define DL_CLOSE_FUNC(l) dlclose(l)
#endif

typedef NVENCSTATUS(NVENCAPI* PNVENCODEAPICREATEINSTANCE)(NV_ENCODE_API_FUNCTION_LIST *functionList);

struct NVEncAPI : public NV_ENCODE_API_FUNCTION_LIST {
public:
    // ~NVEncAPI() { cleanup(); }

    void init() {
#if defined(_WIN32)
        if (sizeof(void*) == 8) {
            nvEncLib = LoadLibrary(TEXT("nvEncodeAPI64.dll"));
        }
        else {
            nvEncLib = LoadLibrary(TEXT("nvEncodeAPI.dll"));
        }
        if (nvEncLib == NULL) throw std::runtime_error("Failed to load NVidia Encoder library: " + std::to_string(GetLastError()));
#else
        nvEncLib = dlopen("libnvidia-encode.so.1", RTLD_LAZY);
        if (nvEncLib == nullptr)
            throw std::runtime_error("Failed to load NVidia Encoder library: " + std::string(dlerror()));
#endif
        auto nvEncodeAPICreateInstance = (PNVENCODEAPICREATEINSTANCE) LOAD_FUNC(nvEncLib, "NvEncodeAPICreateInstance");

        version = NV_ENCODE_API_FUNCTION_LIST_VER;
        NVENCSTATUS encStat = nvEncodeAPICreateInstance(static_cast<NV_ENCODE_API_FUNCTION_LIST *>(this));
    }

    void cleanup() {
#if defined(_WIN32)
        if (nvEncLib != NULL) {
            FreeLibrary(nvEncLib);
            nvEncLib = NULL;
        }
#else
        if (nvEncLib != nullptr) {
            dlclose(nvEncLib);
            nvEncLib = nullptr;
        }
#endif
    }

private:

#if defined(_WIN32)
    HMODULE nvEncLib;
#else
    void* nvEncLib;
#endif
    bool init_done;
};

static NVEncAPI nvEncApi;

// Encoder class ----------------------------------------------------

class Encoder {
public:
    typedef unsigned int uint_t;
    struct Size { uint_t w, h; };

    Encoder() { 
        CUresult cuRes = cuInit(0);
        nvEncApi.init(); 
    }

    void init(const Size & resolution, uint_t texture) {

        NVENCSTATUS encStat;
        CUresult cuRes;

        texSize = resolution;
        yuvTex = texture;

        // Purely for information
        int devCount = 0;
        cuRes = cuDeviceGetCount(&devCount);

        // Initialize NVEnc
        initEncodeSession();            // start an encoding session
        initEncoder();

        // Register the YUV texture as a CUDA graphics resource
        // CODE COMMENTED OUT AS THE INPUT TEXTURE IS NOT NEEDED YET (TO MY UNDERSTANDING) AT SETUP TIME
        //cudaGraphicsGLRegisterImage(&priv->cudaInpTexRes, priv->yuvTex, GL_TEXTURE_2D, cudaGraphicsRegisterFlagsReadOnly);

        // Allocate CUDA "pitched" memory to match the input texture (YUV, one byte per component)
        encWidth = (texSize.w + 31) & ~31, encHeight = (texSize.h + 31) & ~31;
        cuRes = cuMemAllocPitch(&cuDevPtr, &cuMemPitch, 4 * encWidth, encHeight, 16);

        // Register the CUDA buffer as an input resource
        NV_ENC_REGISTER_RESOURCE regResParams = { 0 };
        regResParams.version = NV_ENC_REGISTER_RESOURCE_VER;
        regResParams.resourceType = NV_ENC_INPUT_RESOURCE_TYPE_CUDADEVICEPTR;
        regResParams.width = encWidth;
        regResParams.height = encHeight;
        regResParams.bufferFormat = NV_ENC_BUFFER_FORMAT_YUV444_PL;
        regResParams.resourceToRegister = (void*) cuDevPtr;
        regResParams.pitch = cuMemPitch;
        encStat = nvEncApi.nvEncRegisterResource(nvEncoder, &regResParams);
        assert(encStat == NV_ENC_SUCCESS); // THIS IS THE POINT OF FAILURE
        nvEncInpRes = regResParams.registeredResource;
    }

    void cleanup() { /* OMITTED */ }

    void encode() {
        // THE FOLLOWING CODE WAS NEVER REACHED YET BECAUSE OF THE ISSUE.
        // INCLUDED HERE FOR REFERENCE.

        CUresult cuRes;
        NVENCSTATUS encStat;

        cuRes = cuGraphicsResourceSetMapFlags(cuInpTexRes, CU_GRAPHICS_MAP_RESOURCE_FLAGS_READ_ONLY);

        cuRes = cuGraphicsMapResources(1, &cuInpTexRes, 0);

        CUarray mappedArray;
        cuRes = cuGraphicsSubResourceGetMappedArray(&mappedArray, cuInpTexRes, 0, 0);

        // Copy the mapped texture (a CUDA array) into the CUDA buffer;
        // cuMemcpyAtoD() copies array -> device, matching step (2) above
        cuRes = cuMemcpyAtoD(cuDevPtr, mappedArray, 0, 4 * encWidth * encHeight);

        NV_ENC_MAP_INPUT_RESOURCE mapInputResParams = { 0 };
        mapInputResParams.version = NV_ENC_MAP_INPUT_RESOURCE_VER;
        mapInputResParams.registeredResource = nvEncInpRes;
        encStat = nvEncApi.nvEncMapInputResource(nvEncoder, &mapInputResParams);

        // TODO: encode...

        cuRes = cuGraphicsUnmapResources(1, &cuInpTexRes, 0);
    }

private:
    struct PrivateData;

    void initEncodeSession() {

        CUresult cuRes;
        NVENCSTATUS encStat;

        // Pop the current context
        cuRes = cuCtxPopCurrent(&cuOldCtx); // THIS IS ALLOWED TO FAIL (and does here; the error can be ignored)

        // Create a context for the device
        cuCtx = nullptr;
        cuRes = cuCtxCreate(&cuCtx, CU_CTX_SCHED_BLOCKING_SYNC, 0);

        // Push our context
        cuRes = cuCtxPushCurrent(cuCtx);

        // Create an NV Encoder session
        NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS nvEncSessParams = { 0 };
        nvEncSessParams.apiVersion = NVENCAPI_VERSION;
        nvEncSessParams.version = NV_ENC_OPEN_ENCODE_SESSION_EX_PARAMS_VER;
        nvEncSessParams.deviceType = NV_ENC_DEVICE_TYPE_CUDA;
        nvEncSessParams.device = cuCtx;
        encStat = nvEncApi.nvEncOpenEncodeSessionEx(&nvEncSessParams, &nvEncoder);
    }

    void initEncoder()
    {
        NVENCSTATUS encStat;

        // Configure the encoder via preset 
        NV_ENC_PRESET_CONFIG presetConfig = { 0 };
        GUID codecGUID = NV_ENC_CODEC_H264_GUID;
        GUID presetGUID = NV_ENC_PRESET_LOW_LATENCY_DEFAULT_GUID;
        presetConfig.version = NV_ENC_PRESET_CONFIG_VER;
        presetConfig.presetCfg.version = NV_ENC_CONFIG_VER;
        encStat = nvEncApi.nvEncGetEncodePresetConfig(nvEncoder, codecGUID, presetGUID, &presetConfig);

        NV_ENC_INITIALIZE_PARAMS initParams = { 0 };
        initParams.version = NV_ENC_INITIALIZE_PARAMS_VER;
        initParams.encodeGUID = codecGUID;
        initParams.encodeWidth = texSize.w;
        initParams.encodeHeight = texSize.h;
        initParams.darWidth = texSize.w;
        initParams.darHeight = texSize.h;
        initParams.frameRateNum = 25;
        initParams.frameRateDen = 1;
        initParams.enableEncodeAsync = 0;
        initParams.enablePTD = 1;
        initParams.presetGUID = presetGUID;
        memcpy(&nvEncConfig, &presetConfig.presetCfg, sizeof(nvEncConfig));
        initParams.encodeConfig = &nvEncConfig;
        encStat = nvEncApi.nvEncInitializeEncoder(nvEncoder, &initParams);
    }

    //void cleanupEncodeSession();
    //void cleanupEncoder();

    Size                    texSize;

    GLuint                  yuvTex;
    uint_t                  encWidth, encHeight;
    CUdeviceptr             cuDevPtr;
    size_t                  cuMemPitch;
    NV_ENC_CONFIG           nvEncConfig;
    NV_ENC_INPUT_PTR        nvEncInpBuf;
    NV_ENC_REGISTERED_PTR   nvEncInpRes;
    CUdevice                cuDevice;
    CUcontext               cuCtx, cuOldCtx;
    void                    *nvEncoder;
    CUgraphicsResource      cuInpTexRes;
};


int main(int argc, char *argv[])
{
    Encoder encoder;

    encoder.init({1920, 1080}, 0); // OMITTED THE TEXTURE AS IT IS NOT NEEDED TO REPRODUCE THE ISSUE

    return 0;
}

Solution

  • After comparing the NVidia sample NvEncoderCudaInterop with my minimal code, I finally found the item that makes the difference between success and failure: it's the pitch parameter of the NV_ENC_REGISTER_RESOURCE structure passed to nvEncRegisterResource().

    I haven't seen it documented anywhere, but there appears to be a hard limit on that value, which I've determined experimentally to be 2560. Anything above that results in NV_ENC_ERR_RESOURCE_REGISTER_FAILED.

    It does not appear to matter that the pitch I was passing was calculated by another API call, cuMemAllocPitch().

    (Another thing that was missing from my code was "locking" and unlocking the CUDA context to the current thread via cuCtxPushCurrent() and cuCtxPopCurrent(). Done in the sample via a RAII class.)
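    The push/pop pairing is easy to get wrong by hand; the general shape of the RAII guard the sample uses can be sketched like this (hypothetical stand-in callables are used in place of cuCtxPushCurrent()/cuCtxPopCurrent() so the pattern itself is self-contained and checkable):

    ```cpp
    #include <cassert>
    #include <functional>
    #include <utility>

    // RAII guard: push a context on construction, pop it on scope exit,
    // even on early return or exception.
    class CtxGuard {
    public:
        CtxGuard(std::function<void()> push, std::function<void()> pop)
            : pop_(std::move(pop)) { push(); }   // cuCtxPushCurrent() in real code
        ~CtxGuard() { pop_(); }                  // cuCtxPopCurrent() in real code
        CtxGuard(const CtxGuard&) = delete;
        CtxGuard& operator=(const CtxGuard&) = delete;
    private:
        std::function<void()> pop_;
    };

    int main() {
        int depth = 0;
        {
            CtxGuard guard([&]{ ++depth; }, [&]{ --depth; });
            assert(depth == 1);   // context is current inside the scope
            // ... NVEnc calls would go here ...
        }
        assert(depth == 0);       // context reliably popped at scope exit
        return 0;
    }
    ```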


    EDIT:

    I have worked around the problem by doing something for which I had another reason: using NV12 as input format for the encoder instead of YUV444.

    With NV12, the pitch parameter drops below the 2560 limit because the byte size per row is equal to the width, so in my case 1920 bytes.

    This was necessary (at the time) because my graphics card was a GTX 760 with a "Kepler" GPU, which (as I was initially unaware) only supports NV12 as input format for NVEnc. I have since upgraded to a GTX 970, but as I just found out, the 2560 limit is still there.

    This makes me wonder just how exactly one is expected to use NVEnc with YUV444. The only possibility that comes to my mind is to use non-pitched memory, which seems bizarre. I'd appreciate comments from people who've actually used NVEnc with YUV444.
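    The arithmetic behind the workaround is easy to verify in isolation (plain C++; the 2560 figure is the experimentally determined limit from above, not a documented constant):

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <iostream>

    int main() {
        const std::size_t kPitchLimit = 2560;  // empirically observed limit
        const std::size_t width = 1920;

        // NV12: one byte per pixel per row in the luma plane (the interleaved
        // chroma plane below it uses the same pitch), so row size == width.
        const std::size_t nv12Row = width;        // 1920 -> under the limit

        // Packed 3-bytes-per-pixel YUV444 rows, as allocated in the question.
        const std::size_t yuv444Row = 3 * width;  // 5760 -> over the limit

        std::cout << nv12Row << " " << yuv444Row << "\n";
        assert(nv12Row <= kPitchLimit);
        assert(yuv444Row > kPitchLimit);
        return 0;
    }
    ```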


    EDIT #2 - PENDING FURTHER UPDATE:

    New information has surfaced in the form of another SO question: NVencs Output Bitstream is not readable

    It is quite possible that my answer so far was wrong. It seems now that the pitch should not only be set when registering the CUDA resource, but also when actually sending it to the encoder via nvEncEncodePicture(). I cannot check this right now, but I will next time I work on that project.