linux assembly x86-64 nasm code-size

Print newline with as little code as possible with NASM


I'm learning a bit of assembly for fun and I am probably too green to know the right terminology and find the answer myself.

I want to print a newline at the end of my program.

Below works fine.

section .data
    newline db 10

section  .text
_end:
    mov rax, 1
    mov rdi, 1
    mov rsi, newline
    mov rdx, 1
    syscall

    mov rax, 60
    mov rdi, 0
    syscall

But I'm hoping to achieve the same result without defining the newline in .data. Is it possible to call sys_write directly with the byte you want, or must it always be done with a reference to some predefined data (which I assume is what mov rsi, newline is doing)?

In short, why can't I replace mov rsi, newline by mov rsi, 10?


Solution

  • You always need the data in memory to copy it to a file-descriptor. There is no system-call equivalent of C stdio fputc that takes data by value instead of by pointer.

    mov rsi, newline puts a pointer into a register (with a huge mov r64, imm64 instruction). sys_write doesn't special-case size=1 and treat its void *buf arg as a char value if it's not a valid pointer.

    There aren't any other system calls that would do the trick. pwrite and writev are both more complicated (taking a file offset as well as a pointer, or taking an array of pointer+length to gather the data in kernel space).


    There is a lot you can do to optimize this for code-size, though. See https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code

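    First, putting the newline character in static storage means you need to generate a static address in a register. Your options here are:

      • 5-byte mov esi, newline (mov r32, imm32), which only works in a non-PIE executable where static addresses fit in 32 bits.
      • 7-byte lea rsi, [rel newline], a RIP-relative LEA that works in position-independent code too.
      • 10-byte mov rsi, newline (mov r64, imm64), which is what your code uses.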


    But we can avoid static addressing entirely: use push to put immediate data on the stack. This works even if we need zero-terminated strings, because push imm8 and push imm32 both sign-extend the immediate to 64 bits. Since ASCII uses the low half of the 0..255 range, the sign bit is always clear and this is equivalent to zero-extension.
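    To make the sign-extension point concrete, here is a small fragment showing what two pushes leave in memory. The 3-character string "hi\n" is a made-up example, not something from the question, and the byte layouts in the comments assume 64-bit mode.

    ```nasm
    bits 64

        push   10          ; push imm8:  [rsp] now holds 0A 00 00 00 00 00 00 00
        push   0x0a6968    ; push imm32: [rsp] now holds 'h' 'i' 0A 00 00 00 00 00
                           ; x86 is little-endian, so the low byte (0x68 = 'h') comes first,
                           ; and the upper zero bytes act as the string terminator
    ```

    Either way, RSP ends up pointing at the first byte of the data, followed by zeros.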

    Then we just need to copy RSP to RSI, because push leaves RSP pointing to the data that was pushed. mov rsi, rsp would be 3 bytes because it needs a REX prefix. If you were targeting 32-bit code or the x32 ABI (32-bit pointers in long mode) you could use 2-byte mov esi, esp. But Linux puts the stack at the top of user virtual address space, so on x86-64 that's 0x007ff..., right at the top of the low canonical range. So truncating a pointer to stack memory to 32 bits isn't an option; we'd get -EFAULT.

    But we can copy a 64-bit register with 1-byte push + 1-byte pop. (Assuming neither register needs a REX prefix to access.)

    default rel     ; We don't use any explicit addressing modes, but no reason to leave this out.
    
    _start:
        push   10         ; \n
    
        push   rsp
        pop    rsi        ; 2 bytes total vs. 3 for mov rsi,rsp
    
        push   1          ; __NR_write call number
        pop    rax        ; 3 bytes, vs. 5 for mov eax, 1
    
        mov    edx, eax   ; length = call number by coincidence
        mov    edi, eax   ; fd = length = call number  also coincidence
        syscall           ;   write(1, "\n", 1)
    
        mov    al, 60     ; assuming write didn't return -errno, replace the low byte and keep the high zeros
        ;xor    edi, edi    ; leave rdi = 1  from write
        syscall           ; _exit(1)
    
    .size: db $ - _start
    

    xor-zeroing is the most well-known x86 peephole optimization: it saves 3 bytes of code size, and is actually more efficient than mov edi, 0. But you only asked for the smallest code to print a newline, without specifying that it had to exit with status = 0. So we can save 2 bytes by leaving that out.

    Since we're just making an _exit system call, we don't need to clean up the stack from the 10 we pushed.

    BTW, this will crash if the write returns an error. (e.g. redirected to /dev/full, or closed with ./newline >&-, or whatever other condition.) That would leave RAX=-something, so mov al, 60 would give us RAX=0xffff...3c. Then we'd get -ENOSYS from the invalid call number, and fall off the end of _start and decode whatever is next as instructions. (Probably zero bytes which decode with [rax] as an addressing mode. Then we'd fault with a SIGSEGV.)
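    If you'd rather trade a few bytes for a clean exit, a variant along these lines sets the whole of RAX before the second syscall and zeroes EDI. This is only a sketch of that trade-off, not a claim about the smallest possible robust version. By my count it comes to 20 bytes instead of 17.

    ```nasm
    global _start

    section .text
    _start:
        push   10          ; '\n' on the stack
        push   rsp
        pop    rsi         ; rsi = pointer to the newline

        push   1
        pop    rax         ; __NR_write
        mov    edx, eax    ; length = 1
        mov    edi, eax    ; fd = 1 (stdout)
        syscall            ; write(1, "\n", 1); RAX may hold -errno afterwards

        push   60
        pop    rax         ; __NR_exit in the full register: 3 bytes vs. 2 for mov al, 60
        xor    edi, edi    ; exit status 0: 2 more bytes
        syscall            ; _exit(0), even if write failed
    ```

    Because push 60 / pop rax overwrites all 64 bits, a negative return value from write no longer leaks into the call number.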


    objdump -d -Mintel disassembly of that code, after building with nasm -felf64 and linking with ld:

    0000000000401000 <_start>:
      401000:       6a 0a                   push   0xa
      401002:       54                      push   rsp
      401003:       5e                      pop    rsi
      401004:       6a 01                   push   0x1
      401006:       58                      pop    rax
      401007:       89 c2                   mov    edx,eax
      401009:       89 c7                   mov    edi,eax
      40100b:       0f 05                   syscall 
      40100d:       b0 3c                   mov    al,0x3c
      40100f:       0f 05                   syscall 
    
    0000000000401011 <_start.size>:
      401011:       11                      .byte 0x11
    

    So the total code-size is 0x11 = 17 bytes, vs. 39 bytes of code + 1 byte of static data for your version. Your first 3 mov instructions alone are 5, 5, and 10 bytes long (or 7 bytes for mov rax, 1 if you use YASM, which doesn't optimize it to mov eax, 1).

    Running it:

    $ strace ./newline 
    execve("./newline", ["./newline"], 0x7ffd4e98d3f0 /* 54 vars */) = 0
    write(1, "\n", 1
    )                       = 1
    exit(1)                                 = ?
    +++ exited with 1 +++
    

    If this was part of a larger program:

    If you already have a pointer to some nearby static data in a register, you could do something like a 4-byte lea rsi, [rdx + newline-foo] (REX.W + opcode + modrm + disp8), assuming the newline-foo offset fits in a sign-extended disp8 and that RDX holds the address of foo.

    Then you can have newline: db 10 in static storage after all. (Put it in .rodata or .data, depending on which section you already had a pointer to.)
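    A minimal sketch of that idea follows. The label foo and its contents are placeholders for whatever data your program already uses, and the first lea only stands in for "RDX already holds the address of foo".

    ```nasm
    default rel
    global _start

    section .rodata
    foo:     db "some existing data"    ; hypothetical data the program already points at
    newline: db 10                      ; placed close enough for a disp8 offset

    section .text
    _start:
        lea    rdx, [foo]                   ; stand-in: a real program would already have this pointer
        lea    rsi, [rdx + newline - foo]   ; 4 bytes: REX.W + opcode + modrm + disp8

        push   1
        pop    rax         ; __NR_write
        mov    edi, eax    ; fd = 1 (stdout)
        mov    edx, eax    ; length = 1 (overwrites RDX only after it was used above)
        syscall            ; write(1, newline, 1)

        push   60
        pop    rax         ; __NR_exit
        xor    edi, edi    ; status = 0
        syscall
    ```

    Since foo and newline are in the same section, newline - foo is an assemble-time constant, so the assembler can encode it directly as the displacement.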