c gcc esp32

Compiled 16-bit literals use 32 bits on ESP32


I implemented a decision tree with if-else statements involving unsigned 16-bit variables (features) and constants (thresholds), as shown below:

static inline int16_t tree(const uint16_t *input) {
  if ( input[16] <= ((uint16_t)0) ) {
    if ( input[15] <= ((uint16_t)71) ) {
      if ( input[10] <= ((uint16_t)44) ) {
//...

However, when I disassemble the compiled file, each 16-bit constant seems to occupy 32 bits in ROM.

Disassembly of section .literal.tree:

00000000 <.literal.tree>:
   0:   00000d3c    
   4:   0000138c    
   8:   000018e5    
   c:   00001867    
  10:   0000110c    

// and goes on...

Question: Is it possible to make these literals use only 16 bits each in a 32-bit architecture?

My target is an ESP32 device. I ran xtensa-esp32s3-elf-objdump -dS tree.c.obj to inspect the literals and expected each one to take only 16 bits of memory, but they are placed in 32-bit entries.

The compiler flags are hidden inside the SDK build system (IDF), and the sdkconfig file is too large to share, but the toolchain is based on gcc. I was originally optimizing for speed and have now switched to optimizing for size with -Os (thanks Eric Postpischil), which reduced the ROM occupied by this file by 1/3 and also reduced the average inference time. Awesome, but the thresholds are still stored as 32-bit literals, so the question remains.


Solution

  • No. On Xtensa (ESP32/ESP32-S3), constants that don’t fit in an instruction’s immediate field are materialized from a literal pool and fetched with L32R. A literal-pool entry is a 32-bit word, so each such constant costs 4 bytes even if the value would fit in 16 bits.

    Why you’re seeing 4 bytes:

    GCC emits L32R to load the constant into a register; L32R is a PC-relative 32-bit load from the pool. There’s no 16-bit “L16R” equivalent for literal pools on these cores. (Small values may be encoded with immediates like MOVI/ADDI, but once the value doesn’t fit, it becomes a pooled literal.)
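As a rough check of which thresholds end up in the pool, the sketch below tests whether a value fits the signed 12-bit immediate of MOVI (-2048 to 2047). The helper name fits_movi_imm12 is made up for illustration, and it only models MOVI; the compiler may also pick other encodings (for example compare-and-branch immediates), so treat it as an approximation rather than a description of what GCC does in every case.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not part of any SDK): returns true if v fits the
   signed 12-bit immediate field of the Xtensa MOVI instruction. */
static inline bool fits_movi_imm12(int32_t v) {
    return v >= -2048 && v <= 2047;
}
```

By this test the thresholds 0, 71 and 44 from the question fit an immediate, while the pool entries shown in the disassembly (0x0d3c, 0x138c, 0x18e5, 0x1867, 0x110c) are all above 2047, which is consistent with them being pooled.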

    What you can do instead (to actually use 16-bit storage):

    Put thresholds in a table of uint16_t in .rodata (Flash) and load them at run time, instead of writing inline literals in expressions. That lets the linker pack them at 2 bytes each (modulo alignment), and the compiler can load them with 16-bit loads (l16ui) and then compare.
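A minimal sketch of the table approach, under a few assumptions: the node layout, the names tree_node_t, tree_nodes and tree_lookup, and the leaf class values are all invented for illustration; only the feature indices (16, 15, 10) and thresholds (0, 71, 44) come from the snippet in the question. The tree is walked in a loop because indexing a const table with constant indices inside the original if-else chain would likely let the compiler fold the values straight back into literals.

```c
#include <stdint.h>

/* Hypothetical node layout. A child value >= 0 is the index of the next
   node; a negative value marks a leaf whose class is (-value - 1). */
typedef struct {
    uint16_t threshold;  /* compared against input[feature] */
    uint8_t  feature;    /* index into input[] */
    int8_t   left;       /* taken when input[feature] <= threshold */
    int8_t   right;      /* taken otherwise */
} tree_node_t;

/* Placeholder tree: features and thresholds follow the question's snippet,
   the leaf classes (0, 1, 2) are assumptions since the original is elided. */
static const tree_node_t tree_nodes[] = {
    {  0, 16,  1, -3 },  /* node 0: input[16] <= 0  ? node 1  : class 2 */
    { 71, 15,  2, -2 },  /* node 1: input[15] <= 71 ? node 2  : class 1 */
    { 44, 10, -1, -2 },  /* node 2: input[10] <= 44 ? class 0 : class 1 */
};

/* Walk the table from the root until a leaf (negative index) is reached.
   Thresholds are read from the table at run time as 16-bit values. */
static inline int16_t tree_lookup(const uint16_t *input) {
    int8_t idx = 0;
    while (idx >= 0) {
        const tree_node_t *n = &tree_nodes[idx];
        idx = (input[n->feature] <= n->threshold) ? n->left : n->right;
    }
    return (int16_t)(-idx - 1);
}
```

Whether this is a net win depends on the tree: the loop adds loads and a data-dependent branch per level, so it is worth measuring inference time and the combined size of .rodata and code against the if-else version before committing to it.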