armneonodroid

Error compiling NEON code under ARM


I am trying to port SSE4 optimized code to NEON optimized with following header: https://github.com/jratcliff63367/sse2neon/blob/master/SSE2NEON.h

Got a compilation error during compiling on ODROID-xu4 this code: https://github.com/k06a/creepMiner/tree/feature/neon

[  2%] Building CXX object CMakeFiles/creepMiner.dir/src/shabal/mshabal/mshabal_neon.cpp.o
In file included from /root/creepMiner-neon/src/shabal/mshabal/sse2neon.hpp:123:0,
                 from /root/creepMiner-neon/src/shabal/mshabal/mshabal_neon.cpp:22:
/usr/lib/gcc/arm-linux-gnueabihf/7/include/arm_neon.h: In function '__m128i _mm_set1_epi32(int)':
/usr/lib/gcc/arm-linux-gnueabihf/7/include/arm_neon.h:6733:1: error: inlining failed in call to always_inline 'int32x4_t vdupq_n_s32(int32_t)': target specific option mismatch
 vdupq_n_s32 (int32_t __a)
 ^~~~~~~~~~~
In file included from /root/creepMiner-neon/src/shabal/mshabal/mshabal_neon.cpp:22:0:
/root/creepMiner-neon/src/shabal/mshabal/sse2neon.hpp:230:7: note: called from here
     (x)
       ^
/root/creepMiner-neon/src/shabal/mshabal/sse2neon.hpp:383:12: note: in expansion of macro 'vreinterpretq_m128i_s32'
     return vreinterpretq_m128i_s32(vdupq_n_s32(_i));
            ^~~~~~~~~~~~~~~~~~~~~~~
CMakeFiles/creepMiner.dir/build.make:878: recipe for target 'CMakeFiles/creepMiner.dir/src/shabal/mshabal/mshabal_neon.cpp.o' failed
make[2]: *** [CMakeFiles/creepMiner.dir/src/shabal/mshabal/mshabal_neon.cpp.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/creepMiner.dir/all' failed
make[1]: *** [CMakeFiles/creepMiner.dir/all] Error 2
Makefile:151: recipe for target 'all' failed
make: *** [all] Error 2

Source file have following specific options:

-marm -march=armv7-a+simd -mtune=cortex-a15.cortex-a7

CMakeLists.txt:

if (USE_NEON AND NOT MINIMAL_BUILD)
    add_definitions(-DUSE_NEON)
    set(SOURCE_FILES ${SOURCE_FILES} src/shabal/mshabal/mshabal_neon.cpp)
    if (UNIX OR APPLE)
        set_source_files_properties(src/shabal/mshabal/mshabal_neon.cpp PROPERTIES COMPILE_FLAGS -marm)
        set_source_files_properties(src/shabal/mshabal/mshabal_neon.cpp PROPERTIES COMPILE_FLAGS -march=armv7-a+simd)
        set_source_files_properties(src/shabal/mshabal/mshabal_neon.cpp PROPERTIES COMPILE_FLAGS -mtune=cortex-a15.cortex-a7)
    elseif (MSVC)
        set_source_files_properties(src/shabal/mshabal/mshabal_neon.cpp PROPERTIES COMPILE_FLAGS /arch:ARMv7)
    endif ()
endif ()

Looks like current architecture do not support vdupq_n_s32, but it should, because of armv7 support.

Processor info:

$ cat /proc/cpuinfo

Gives following:

processor   : 0
model name  : ARMv7 Processor rev 3 (v7l)
BogoMIPS    : 90.00
Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae 
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part    : 0xc07
CPU revision    : 3

processor   : 1
model name  : ARMv7 Processor rev 3 (v7l)
BogoMIPS    : 90.00
Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae 
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part    : 0xc07
CPU revision    : 3

processor   : 2
model name  : ARMv7 Processor rev 3 (v7l)
BogoMIPS    : 90.00
Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae 
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part    : 0xc07
CPU revision    : 3

processor   : 3
model name  : ARMv7 Processor rev 3 (v7l)
BogoMIPS    : 90.00
Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae 
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part    : 0xc07
CPU revision    : 3

processor   : 4
model name  : ARMv7 Processor rev 3 (v7l)
BogoMIPS    : 120.00
Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae 
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part    : 0xc0f
CPU revision    : 3

processor   : 5
model name  : ARMv7 Processor rev 3 (v7l)
BogoMIPS    : 120.00
Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae 
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part    : 0xc0f
CPU revision    : 3

processor   : 6
model name  : ARMv7 Processor rev 3 (v7l)
BogoMIPS    : 120.00
Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae 
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part    : 0xc0f
CPU revision    : 3

processor   : 7
model name  : ARMv7 Processor rev 3 (v7l)
BogoMIPS    : 120.00
Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae 
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part    : 0xc0f
CPU revision    : 3

Hardware    : ODROID-XU4
Revision    : 0100
Serial      : 0000000000000000

Getting native arch:

gcc -march=native -v

GIves following:

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/arm-linux-gnueabihf/7/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 7.3.0-16ubuntu3' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --with-as=/usr/bin/arm-linux-gnueabihf-as --with-ld=/usr/bin/arm-linux-gnueabihf-ld --program-suffix=-7 --program-prefix=arm-linux-gnueabihf- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libitm --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --enable-multilib --disable-sjlj-exceptions --with-arch=armv7-a --with-fpu=vfpv3-d16 --with-float=hard --with-mode=thumb --disable-werror --enable-multilib --enable-checking=release --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf
Thread model: posix
gcc version 7.3.0 (Ubuntu/Linaro 7.3.0-16ubuntu3)

Maybe this is a problem? I see only --with-arch=armv7-a --with-fpu=vfpv3-d16 support, but it should be vfpv4 support. Is it? Should I reconfigure GCC? Will this help?


Solution

  • -mfpu=neon should solve the problem.

    BTW, do you honestly expect just including the header file will do the trick?

    NEON has tons of instructions that aren't available on Intel machines, especially in terms of permutation.

    What you will get is lots of vtbl instructions that come with nasty latencies here and there that consumes cycles like crazy.

    Simply relying on someone else's generic solution cannot be called optimization IMO.