Tags: assembly, gcc, cortex-m, arm-none-eabi-gcc

arm-none-eabi-as bne.n misbehaves (or I do)


I sometimes have to dabble in a bit of assembler, and I am not too sure of the correct use of directives. While investigating what should be the simplest possible delay loop I got an unexpected result, and my question is: am I misusing directives, or does the code below actually expose an assembler bug?

In case the answer is "assembler bug": please note that I know there are newer versions of arm-none-eabi-as out there. The question is not "how do I get this code to work", but whether I am using the assembler directives correctly. The target system is the plain vanilla STM32F1xx range of Cortex-M3 processors.

The following code:

        .syntax unified
        .cpu  cortex-m3
        .thumb
    
        .align 1
        .global myDelayWorks       
        .thumb_func
myDelayWorks:   
.FileLocalLabel:
        subs  r0,#1
        bne.n .FileLocalLabel
        bx    lr
            
        .align 1
        .global myDelayFails       
        .thumb_func
myDelayFails:
        subs  r0,#1
        bne.n myDelayFails
        bx    lr

assembles to the following listing (arm-none-eabi-as --version reports: GNU assembler (GNU Tools for ARM Embedded Processors) 2.24.0.20150604):

   8                myDelayWorks:   
   9                .FileLocalLabel:
  10 0000 0138              subs  r0,#1
  11 0002 FDD1              bne.n .FileLocalLabel
  12 0004 7047              bx    lr
  13                        
  14                        .align 1
  15                        .global myDelayFails       
  16                        .thumb_func
  17                myDelayFails:
  18 0006 0138              subs  r0,#1
  19 0008 FED1              bne.n myDelayFails
  20 000a 7047              bx    lr

The problem with an incorrect branch offset seems to arise because myDelayFails is declared .global.
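To spell out the observation, the two branch halfwords can be decoded by hand. A small sketch in Python, assuming the Thumb 16-bit conditional-branch encoding (B<cond>.N, encoding T1: 1101 in bits 15:12, the condition in bits 11:8, a signed imm8 in bits 7:0, and target = PC + imm8*2, where PC reads as the instruction address plus 4):

```python
def decode_bcond_n(halfword, insn_addr):
    """Decode a Thumb B<cond>.N (encoding T1) halfword.

    Returns (condition, branch_target). The target is PC-relative,
    where the Thumb PC reads as the instruction address + 4.
    """
    assert (halfword >> 12) == 0b1101, "not a 16-bit conditional branch"
    cond = (halfword >> 8) & 0xF
    imm8 = halfword & 0xFF
    if imm8 >= 0x80:            # sign-extend the 8-bit immediate
        imm8 -= 0x100
    offset = imm8 * 2           # displacement is halfword-aligned
    return cond, insn_addr + 4 + offset

# bne.n at 0x0002 in myDelayWorks: bytes FD D1 = halfword 0xD1FD
print(decode_bcond_n(0xD1FD, 0x0002))   # (1, 0): back to the label, as expected
# bne.n at 0x0008 in myDelayFails: bytes FE D1 = halfword 0xD1FE
print(decode_bcond_n(0xD1FE, 0x0008))   # (1, 8): branches to itself
```

So the local-label branch encodes a displacement of -6 back to the loop, while the global-label branch encodes -4, i.e. a branch to its own address.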


Solution

  • I suppose your concern is that the displacement encoded in the machine code FED1 (byte-swapped on disk; read it as the halfword D1FE in the usual notation) appears to be -4, where it should be -6? That's because the displacement is ultimately computed by the linker, not the assembler.

    The contents the assembler leaves in that field are not directly meaningful in themselves (with REL-style relocations, as used in ARM ELF, the field typically holds an addend that the linker folds into the address computation), but there will be a relocation entry in the object file telling the linker to insert the correct displacement to the label myDelayFails there.

    So all this is normal. If you disassemble the actual executable produced by the linker, you should see the correct displacement.

    It's also normal that global and non-global labels behave differently in this respect. For a non-global label in the same section, the assembler gets to see exactly where the target label is relative to the branch (even though it does not know their absolute addresses), so it can compute and insert the displacement itself. For a global label, it could be in another section whose eventual location is not known to the assembler, or defined in a different source file altogether, so the assembler leaves it to the linker.

    In this case, the global label is defined in the current file, so the assembler could in principle compute the displacement itself. I'm not exactly sure why it leaves the job to the linker instead. It might be so that the linker can redefine the symbol myDelayFails (for example, to override or wrap it), if for whatever reason that is useful.
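The linker-side fix-up described above can be sketched as a toy model. This assumes an R_ARM_THM_JUMP8-style relocation against a 16-bit conditional branch, with the function's bytes taken from the listing; the real relocation processing in ld of course handles far more cases:

```python
def resolve_thm_branch(code, reloc_offset, symbol_addr):
    """Patch a Thumb B<cond>.N halfword so it branches to symbol_addr.

    code is a mutable little-endian byte buffer; reloc_offset is the
    byte offset of the branch instruction within it.
    """
    pc = reloc_offset + 4                      # Thumb reads PC as insn + 4
    displacement = symbol_addr - pc
    assert displacement % 2 == 0 and -256 <= displacement <= 254
    imm8 = (displacement >> 1) & 0xFF
    old = int.from_bytes(code[reloc_offset:reloc_offset + 2], "little")
    new = (old & 0xFF00) | imm8                # keep cond, replace imm8
    code[reloc_offset:reloc_offset + 2] = new.to_bytes(2, "little")

# Bytes of myDelayFails as the assembler left them:
# subs r0,#1 ; bne.n <placeholder> ; bx lr
section = bytearray.fromhex("0138" "FED1" "7047")
resolve_thm_branch(section, 2, 0)   # relocation target: start of the function
print(section.hex())                # 0138fdd17047 -- same bytes as myDelayWorks
```

After the fix-up, the branch halfword becomes D1FD, the same displacement of -6 that the assembler produced directly for the file-local label.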