I've been struggling with this one for a while, so here we go: I've been trying to match the inference speed of an ML model I generated with Edge Impulse, originally for Arduino and then ported to ESP-IDF, on my ESP32-CAM device.
The algorithm takes ~1300 ms to run on Arduino and ~6600 ms on ESP-IDF, with -Os optimization in both cases. The closest I got was by setting the compile optimization to -O2, which brought ESP-IDF down to around 2000 ms.
In both cases the CPU frequency is set to 240 MHz. I tried to figure out exactly how Arduino compiles and to mimic it to see what I could be missing, but I'm not figuring it out.
I verified with a test sample that does matrix calculations with both volatile floats and integers, to ensure that the raw CPU compute capacity is the same in both environments, and I got similar results on both projects.
I also logged everything thread-related, and it matches (runs on core 1, same priority, same CPU speed).
I ensured that memory allocation is static in both TensorFlow Lite libraries with the flag -DTF_LITE_STATIC_MEMORY.
I ensured that there are no parallelism shenanigans and that OPEN_MP is disabled in both cases.
I switched compilers to check whether it comes from the compiler's libc or something similar.
I tried to get as close as possible to Arduino's compiler arguments.
Here is a dump of the Arduino compile arguments:
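For anyone wanting to repeat the raw-compute sanity check mentioned above, here is a minimal sketch of what such a test could look like. It is not the original test code: the names `bench_matmul`, `BenchResult`, and `kN` are made up for illustration, and it times with `std::chrono` (on ESP-IDF, `esp_timer_get_time()` would be an alternative). Multiplying by an identity matrix makes the result easy to validate.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>

// Hypothetical matrix dimension, kept small so the buffers fit in internal RAM.
constexpr std::size_t kN = 16;

template <typename T>
struct BenchResult {
    T checksum;        // sum of the output matrix, used to validate the run
    long long micros;  // elapsed time in microseconds
};

// Multiplies two kN x kN matrices `repeats` times and reports the elapsed time.
// The buffers are volatile so the compiler cannot drop or hoist the memory
// accesses, which keeps the measurement comparable across optimization levels.
template <typename T>
BenchResult<T> bench_matmul(int repeats) {
    assert(repeats > 0);
    static volatile T a[kN][kN], b[kN][kN], c[kN][kN];

    // a[i][j] = i + j, b = identity, so the product c must equal a.
    for (std::size_t i = 0; i < kN; ++i) {
        for (std::size_t j = 0; j < kN; ++j) {
            a[i][j] = static_cast<T>(i + j);
            b[i][j] = static_cast<T>(i == j ? 1 : 0);
        }
    }

    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        for (std::size_t i = 0; i < kN; ++i) {
            for (std::size_t j = 0; j < kN; ++j) {
                T acc = 0;
                for (std::size_t k = 0; k < kN; ++k) {
                    acc += a[i][k] * b[k][j];
                }
                c[i][j] = acc;
            }
        }
    }
    const auto stop = std::chrono::steady_clock::now();

    T checksum = 0;
    for (std::size_t i = 0; i < kN; ++i) {
        for (std::size_t j = 0; j < kN; ++j) {
            checksum += c[i][j];
        }
    }
    const auto elapsed =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    return BenchResult<T>{checksum, static_cast<long long>(elapsed.count())};
}
```

Calling `bench_matmul<float>(1000)` and `bench_matmul<int32_t>(1000)` in both the Arduino and the ESP-IDF project and comparing the `micros` fields gives a rough picture of whether plain arithmetic runs at the same speed in both builds. If it does, as it did here, the slowdown is more likely in which code gets compiled than in how the CPU is configured.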
COLLECT_GCC_OPTIONS='-c' '-mlongcalls' '-Wno-frame-address' '-ffunction-sections' '-fdata-sections' '-Wno-error=unused-function' '-Wno-error=unused-variable' '-Wno-error=unused-but-set-variable' '-Wno-error=deprecated-declarations' '-Wno-unused-parameter' '-Wno-sign-compare' '-Wno-enum-conversion' '-gdwarf-4' '-ggdb' '-freorder-blocks' '-Wwrite-strings' '-fstack-protector' '-fstrict-volatile-bitfields' '-fno-jump-tables' '-fno-tree-switch-conversion' '-std=gnu++23' '-fexceptions' '-fno-rtti' '-w' '-Os' '-v' '-w' '-E' '-CC' '-D' 'F_CPU=240000000L' '-D' 'ARDUINO=10607' '-D' 'ARDUINO_ESP32_DEV' '-D' 'ARDUINO_ARCH_ESP32' '-D' 'ARDUINO_BOARD="ESP32_DEV"' '-D' 'ARDUINO_VARIANT="esp32"' '-D' 'ARDUINO_PARTITION_huge_app' '-D' 'ARDUINO_HOST_OS="windows"' '-D' 'ARDUINO_FQBN="esp32:esp32:esp32cam:CPUFreq=240,FlashFreq=80,FlashMode=qio,PartitionScheme=huge_app,DebugLevel=none,EraseFlash=none"' '-D' 'ESP32' '-D' 'CORE_DEBUG_LEVEL=0' '-D' 'BOARD_HAS_PSRAM' '-mfix-esp32-psram-cache-issue' '-mfix-esp32-psram-cache-strategy=memw' '-D' 'ARDUINO_USB_CDC_ON_BOOT=0' '-D' 'ESP_PLATFORM' '-D' 'IDF_VER="v5.1.4-497-gdc859c1e67-dirty"' '-D' 'MBEDTLS_CONFIG_FILE="mbedtls/esp_config.h"' '-D' 'SOC_MMU_PAGE_SIZE=CONFIG_MMU_PAGE_SIZE' '-D' 'UNITY_INCLUDE_CONFIG_H' '-D' '_GNU_SOURCE' '-D' '_POSIX_READER_WRITER_LOCKS' '-D' 'configENABLE_FREERTOS_DEBUG_OCDAWARE=1' '-D' 'TF_LITE_STATIC_MEMORY' '-I'
Here is my compile line on ESP-IDF:
C:\Espressif\tools\xtensa-esp32-elf\esp-12.2.0_20230208\xtensa-esp32-elf\bin\xtensa-esp32-elf-g++.exe -mlongcalls -Wno-frame-address -DNDEBUG -fdiagnostics-color=always -Wno-unused-variable -Wno-deprecated-declarations -Wno-missing-field-initializers -Wno-maybe-uninitialized -Wno-error=uninitialized -DTF_LITE_STATIC_MEMORY -mlongcalls -ffunction-sections -fdata-sections -fstrict-volatile-bitfields -fno-jump-tables -fno-tree-switch-conversion -fno-rtti -w -Wall -Werror=all -Wno-error=unused-function -Wno-error=unused-variable -Wno-error=unused-but-set-variable -Wno-error=deprecated-declarations -Wextra -Wno-unused-parameter -Wno-sign-compare -Wno-enum-conversion -gdwarf-4 -ggdb -mfix-esp32-psram-cache-issue -mfix-esp32-psram-cache-strategy=memw -Os -freorder-blocks -fmacro-prefix-map=path -fmacro-prefix-map=other_path -DconfigENABLE_FREERTOS_DEBUG_OCDAWARE=1 -std=gnu++2b -fno-exceptions -DESP32=ESP32 -MD -MT file.cpp.obj -MF file.cpp.obj.d -o file.cpp.obj
-c file.cpp
What is strangest to me is that I need -O2 on ESP-IDF to get anywhere close, while Arduino is still faster with just -Os...
Anyway, any help would be greatly appreciated. Have a good day everyone, and thanks for reading,
Aloïs
Long story short, it was the TFLite kernels' accelerated math functions that weren't being compiled, because the flag that defines the board wasn't always passed by the CMake file.
If your problem looks similar, look for the EI_CLASSIFIER_TFLITE_ENABLE_ESP_NN flag and make sure your board is always defined.
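One way to spot this kind of problem early is to have the build tell you what a given source file actually sees. The snippet below is only a sketch: `esp_nn_enabled` and `report_esp_nn_status` are hypothetical helper names, and it assumes the flag is either passed on the command line or set by an SDK header that you include before this code.

```cpp
#include <cstdio>

// Hypothetical helper: true only if this translation unit was compiled with
// EI_CLASSIFIER_TFLITE_ENABLE_ESP_NN defined and set to 1.
constexpr bool esp_nn_enabled() {
#if defined(EI_CLASSIFIER_TFLITE_ENABLE_ESP_NN) && (EI_CLASSIFIER_TFLITE_ENABLE_ESP_NN == 1)
    return true;
#else
    return false;
#endif
}

// Hypothetical helper: prints the status once at startup so a missing flag
// shows up in the serial log instead of only as slow inference.
void report_esp_nn_status() {
    std::printf("ESP-NN accelerated kernels: %s\n",
                esp_nn_enabled() ? "enabled" : "NOT enabled");
}
```

Keep in mind that this only reflects the flags of the file it is compiled in. Since the issue here was that the CMake file did not pass the board definition everywhere, it is worth also checking the verbose build output to confirm the definition reaches the SDK's own source files.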