unicode collation NIF running slower than Pure Erlang implementation

I'm trying to optimise existing unicode collation library(written in Erlang) by rewriting it as a NIF implementation. Prime reason is because collation is CPU intensive operation.

Link to implementation: https://github.com/abhi-bit/merger

Unicode collation of 1M rows via Pure Erlang based priority queue:

erlc *.erl; ERL_LIBS="..:$ERL_LIBS" erl -noshell -s perf_couch_skew main 1000000 -s init stop
Queue size: 1000000
12321.649 ms

Unicode collation of 1M rows via NIF based binomial heap:

erlc *.erl; ERL_LIBS="..:$ERL_LIBS" erl -noshell -s perf_merger main 1000000 -s init stop
Queue size: 1000000
15871.965 ms

This is unusual, I was expecting it to be probably ~10X faster.

I turned on eprof/fprof but they aren't of much use when it comes to NIF modules, below is what eprof said about the prominent functions

FUNCTION                                         CALLS      %     TIME  [uS / CALLS]
--------                                         -----    ---     ----  [----------]
merger:new/0                                         1   0.00        0  [      0.00]
merger:new/2                                         1   0.00        0  [      0.00]
merger:size/1                                   100002   0.31    19928  [      0.20]
merger:in/3                                     100000   3.29   210620  [      2.11]
erlang:put/2                                   2000000   6.63   424292  [      0.21]
merger:out/1                                    100000  14.35   918834  [      9.19]

I'm sure, NIF implementation could be made faster because I've a pure C implementation of unicode collation based on binary Heap using dynamic array and that's much much faster.

$ make
gcc -I/usr/local/Cellar/icu4c/55.1/include  -L/usr/local/Cellar/icu4c/55.1/lib  min_heap.c collate_json.c kway_merge.c kway_merge_test.c -o output -licui18n -licuuc -licudata
./output
Merging 1 arrays each of size 1000000
mergeKArrays took 84.626ms

Specific questions I've here:

How much slowdown is expected because of Erlang <-> C communication in a NIF module? In this case, slowdown is probably 30x or more between pure C and NIF implementation
What tools could be useful to debug NIF related slowdown(like in this case)? I tried using perf top to see the function call, top ones(some hex addresses were showing) were coming from "beam.smp".
What are possible areas that I should look at optimising a NIF? For example: I've heard that one should keep data being transferred between Erlang to C and vice-versa minimal, are there more such areas to consider?

Solution

The overhead of calling a NIF is tiny. When the Erlang runtime loads a module that loads a NIF, it patches the module's beam code with an emulator instruction to call into the NIF. The instruction itself performs just a small amount of setup prior to calling the C function implementing the NIF. This is not the area that's causing your performance issues.

Profiling a NIF is much the same as profiling any other C/C++ code. Judging from your Makefile it appears you're developing this code on OS X. On that platform, assuming you have XCode installed, you can use the Instruments application with the CPU Samples instrument to see where your code is spending most of its time. On Linux, you can use the callgrind tool of valgrind together with an Erlang emulator built with valgrind support to measure your code.

What you'll find if you use these tools on your code is, for example, that perf_merger:main/1 spends most of its time in merger_nif_heap_get, which in turn spends a noticeable amount of time in CollateJSON. That function seems to call convertUTF8toUChar and createStringFromJSON quite a bit. Your NIF also seems to perform a lot of memory allocation. These are the areas you should focus on to speed up your code.