Tags: android, android-ndk, vulkan, renderscript

Why is NDK slower than RenderScript on a non-parallelizable operation?


Like most RenderScript (RS) users, I was caught by surprise by its deprecation. Understandable, but nonetheless frustrating.

A bit of context first.

Two image-processing blocks of my algorithm rely on RS: Canny and the distance transform.

Canny was "straightforward" enough to migrate to Vulkan, and I even achieved the same results as RenderScript (with Vulkan sometimes being faster).

The distance transform algorithm [Rosenfeld and Pfaltz 1966] is not parallelizable, so its current RenderScript implementation is purely serial and is run through invoke(). The RS code below is otherwise ordinary: RS Allocations, set/get, etc.

Because I need a replacement for RS and Vulkan is not suitable for non-parallel operations, I thought the NDK should be comparable to RS speed-wise. I actually expected it to be faster, given that you don't need to copy data between Allocations and Java.

After implementing the NDK C++ equivalent of the RS code, I was surprised to see that the NDK version is 2 to 3 times slower.

I keep wondering why this is the case. Are RenderScript Allocations optimal speed-wise for memory access? Is there some hidden magic going on in RenderScript?

How can a simple for loop with invoke() and Allocations be faster than the same for loop in NDK C++?

(Tested on several Android smartphones with the same result: 2 to 3 times slower.)

Update I

Some code added as requested by solidpixel.

kernel.rs

#pragma version(1)
#pragma rs java_package_name(distancetransform)

rs_allocation inAlloc;
uint32_t width;
uint32_t height;
uint max_value;

uint __attribute__((kernel)) initialize(uint32_t x, uint32_t y) {

    if(rsGetElementAt_uint(inAlloc,x,y)==1) {
        return 0;
    } else{
        return max_value;
    }
    
}

uint __attribute__((kernel)) clear(uint32_t x, uint32_t y) {
    return 0;
}

// SEQUENTIAL: no mapping over x,y

void first_pass_() {
    
    int i,j;
    
    for (i=1;i<height-1;i++){
        for (j=1;j<width-1;j++){
            uint c00 = rsGetElementAt_uint(inAlloc,j-1,i-1)+4;
            uint c01 = rsGetElementAt_uint(inAlloc,j,i-1)+3;
            uint c02 = rsGetElementAt_uint(inAlloc,j+1,i-1)+4;
            uint c10 = rsGetElementAt_uint(inAlloc,j-1,i)+3;
            uint c11 = rsGetElementAt_uint(inAlloc,j,i);
        
            uint min_a = min(c00,c01);
            uint min_b = min(c02,c10);
            uint min_ab = min(min_a,min_b);
            uint min_sum = min(min_ab,c11);
            
            rsSetElementAt_uint(inAlloc,min_sum,j,i);
        }
    }
}

void second_pass_() {
    
    int i,j;
    
    for (i=height-2;i>0;i--){
        for (j=width-2;j>0;j--){
            uint c00 = rsGetElementAt_uint(inAlloc,j,i);
            uint c01 = rsGetElementAt_uint(inAlloc,j+1,i)+3;
            uint c02 = rsGetElementAt_uint(inAlloc,j-1,i+1)+4;
            uint c10 = rsGetElementAt_uint(inAlloc,j,i+1)+3;
            uint c11 = rsGetElementAt_uint(inAlloc,j+1,i+1)+4;
            
            uint min_a = min(c00,c01);
            uint min_b = min(c02,c10);
            uint min_ab = min(min_a,min_b);
            uint min_sum = min(min_ab,c11);
            
            rsSetElementAt_uint(inAlloc,min_sum,j,i);
        }
    }
}

java*

public void distanceTransform(IntBuffer edgeBuffer) {
        
        long total_0 = System.nanoTime();
        
        edgeBuffer.get(_input);
        edgeBuffer.rewind();
        _allocK.copyFrom(_input);
        _script.forEach_initialize(_allocK);
        
        _script.invoke_first_pass_();
        _script.invoke_second_pass_();
        
        _allocK.copyTo(_result);
        
        _distMapBuffer.put(_result);
        _distMapBuffer.rewind();
        
        long total_1 = System.nanoTime();
        Log.d(TAG,"total call time = "+((total_1-total_0)*0.000001)+"ms");
    }

(*) Not relevant to the question, but for completeness: edgeBuffer and distMapBuffer are Java NIO buffers, used for efficient binding to other languages.

ndk.cpp

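#include <jni.h>
#include <android/log.h>
#include <chrono>
#include <cstdint>

// Note: w, h (image width and height) and LOG_TAG are not shown in this
// snippet; they are assumed to be defined elsewhere in the file.

extern "C" JNIEXPORT void JNICALL Java_distanceTransform(
        JNIEnv* env, jobject /* this */,jobject edgeMap, jobject distMap) {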
    auto* dt = (int32_t*)env->GetDirectBufferAddress(distMap);
    auto* edgemap = (int32_t*)env->GetDirectBufferAddress(edgeMap);

    auto s_init = std::chrono::high_resolution_clock::now();

    int32_t i, j;
    int32_t size = h*w;
    int32_t max_val = w+h;
    for (i = 0; i < size; i++) {
        if (edgemap[i]!=0) {
            dt[i] = 0;
        } else {
            dt[i] = max_val;
        }
    }

    auto e_init = std::chrono::high_resolution_clock::now();
    auto elapsed_init = std::chrono::duration_cast<std::chrono::nanoseconds>(e_init - s_init);
    __android_log_print(ANDROID_LOG_INFO, LOG_TAG, "Time init = %f", elapsed_init.count() * 1e-9);

    auto s_first = std::chrono::high_resolution_clock::now();

    for (i = 1; i < h-1; i++) {
        for (j = 1; j < w-1; j++) {
            int32_t c00 = dt[(i-1)*w+(j-1)]+4;
            int32_t c01 = dt[(i-1)*w+j]+3;
            int32_t c02 = dt[(i-1)*w+(j+1)]+4;
            int32_t c10 = dt[i*w+(j-1)]+3;
            int32_t c11 = dt[i*w+j];

            int32_t min_a = c00<c01?c00:c01;
            int32_t min_b = c02<c10?c02:c10;
            int32_t min_ab = min_a<min_b?min_a:min_b;
            int32_t min_sum = min_ab<c11?min_ab:c11;
            dt[i*w+j] = min_sum;
        }
    }

    auto e_first = std::chrono::high_resolution_clock::now();
    auto elapsed_first = std::chrono::duration_cast<std::chrono::nanoseconds>(e_first - s_first);
    __android_log_print(ANDROID_LOG_INFO, LOG_TAG, "Time first pass = %f", elapsed_first.count() * 1e-9);

    auto s_second = std::chrono::high_resolution_clock::now();

    for (i = h-2; i > 0; i--) {
        for (j = w-2; j > 0; j--) {
            int32_t c00 = dt[i*w+(j+1)]+3;
            int32_t c01 = dt[(i+1)*w+(j-1)]+4;
            int32_t c02 = dt[(i+1)*w+j]+3;
            int32_t c10 = dt[(i+1)*w+(j+1)]+4;
            int32_t c11 = dt[i*w+j];

            int32_t min_a = c00<c01?c00:c01;
            int32_t min_b = c02<c10?c02:c10;
            int32_t min_ab = min_a<min_b?min_a:min_b;
            int32_t min_sum = min_ab<c11?min_ab:c11;
            dt[i*w+j] = min_sum;
        }
    }

    auto e_second = std::chrono::high_resolution_clock::now();
    auto elapsed_second = std::chrono::duration_cast<std::chrono::nanoseconds>(e_second - s_second);
    __android_log_print(ANDROID_LOG_INFO, LOG_TAG, "Time second pass = %f", elapsed_second.count() * 1e-9);
}
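To compare optimization levels without the JNI and logging overhead, the two passes can be pulled out into a plain function. The following is only a sketch of the same 3-4 chamfer scheme, not the code that was actually benchmarked: the name chamfer34, the explicit w and h parameters, and the use of std::vector are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical standalone version of the two-pass 3-4 chamfer transform.
// edges: row-major w*h image; a non-zero value marks an edge pixel,
// matching the initialization loop in ndk.cpp.
std::vector<int32_t> chamfer34(const std::vector<int32_t>& edges,
                               int32_t w, int32_t h) {
    assert(w > 0 && h > 0);
    assert(edges.size() ==
           static_cast<std::size_t>(w) * static_cast<std::size_t>(h));

    const int32_t max_val = w + h;  // same "far away" seed as in ndk.cpp
    std::vector<int32_t> dt(edges.size());
    for (std::size_t k = 0; k < edges.size(); ++k) {
        dt[k] = (edges[k] != 0) ? 0 : max_val;
    }

    // Forward pass: top-left to bottom-right, border pixels are skipped.
    for (int32_t i = 1; i < h - 1; ++i) {
        int32_t* row = dt.data() + static_cast<std::size_t>(i) * w;
        const int32_t* up = row - w;
        for (int32_t j = 1; j < w - 1; ++j) {
            row[j] = std::min({up[j - 1] + 4, up[j] + 3, up[j + 1] + 4,
                               row[j - 1] + 3, row[j]});
        }
    }

    // Backward pass: bottom-right to top-left, border pixels are skipped.
    for (int32_t i = h - 2; i > 0; --i) {
        int32_t* row = dt.data() + static_cast<std::size_t>(i) * w;
        const int32_t* down = row + w;
        for (int32_t j = w - 2; j > 0; --j) {
            row[j] = std::min({row[j + 1] + 3, down[j - 1] + 4, down[j] + 3,
                               down[j + 1] + 4, row[j]});
        }
    }
    return dt;
}
```

For a 5x5 image with a single edge pixel in the center, the interior comes out as 4 3 4 / 3 0 3 / 4 3 4, while the untouched border keeps the seed value w + h = 10. Because the function has no Android dependencies, it can also be compiled on a desktop at different optimization levels as a quick sanity check.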

Solution

  • Mirroring my comment from our internal bug tracker:

    The problem is that the "debug" build variant in Android Studio is compiled with -O0. If you optimize more aggressively, NDK is faster.

    It turns out to be a bit tricky to change this. If you do set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O2"), the flag gets inserted BEFORE -O0 and so has no effect. Instead, following the question "Turn on compiler optimization for Android Studio debug build via Cmake", do this: target_compile_options(dt-ndk-jni PRIVATE "$<$<CONFIG:DEBUG>:-O2>"). Then -O2 goes AFTER -O0 and overrides it.

    You can see what flags are being passed by looking at app/.cxx/cmake/debug/arm64-v8a/compile_commands.json

    Here are the results I got on a Pixel 6 Pro, making sure that the phone was awake when running the benchmark so that everything ran on a performance core.

    With -O0:

    With -Os:

    With -O2:

    With -O2 and the phone asleep, I got:

    Edit: Using the "release" build variant will also optimize the build, but using that may not always be an option.
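Besides reading compile_commands.json, the native code itself can report whether it was built with optimization. The sketch below assumes the Clang/GCC predefined macros __OPTIMIZE__ (set at -O1 and above) and __OPTIMIZE_SIZE__ (set for size-optimizing levels such as -Os); the helper name optimization_level_hint is hypothetical, and other compilers may define neither macro.

```cpp
// Hypothetical helper: describes how this translation unit was compiled,
// based on macros that Clang and GCC predefine (an assumption; other
// compilers may not define them and will fall through to the last branch).
inline const char* optimization_level_hint() {
#if defined(__OPTIMIZE_SIZE__)
    return "optimized for size (-Os or similar)";
#elif defined(__OPTIMIZE__)
    return "optimized (-O1 or higher)";
#else
    return "unoptimized (-O0, or compiler does not report it)";
#endif
}
```

Logging this string once, for example with the same __android_log_print call used in ndk.cpp, makes it obvious when a benchmark run was accidentally built without optimization.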