Print newline with as little code as possible with NASM

I'm learning a bit of assembly for fun and I am probably too green to know the right terminology and find the answer myself.

I want to print a newline at the end of my program.

Below works fine.

section .data
    newline db 10

section  .text
_end:
    mov rax, 1
    mov rdi, 1
    mov rsi, newline
    mov rdx, 1
    syscall

    mov rax, 60
    mov rdi, 0
    syscall

But I'm hoping to achieve the same result without defining the newline in .data. Is it possible to call sys_write directly with the byte you want, or must it always be done with a reference to some predefined data (which I assume is what mov rsi, newline is doing)?

In short, why can't I replace mov rsi, newline by mov rsi, 10?

Solution

You always need the data in memory to copy it to a file-descriptor. There is no system-call equivalent of C stdio fputc that takes data by value instead of by pointer.

mov rsi, newline puts a pointer into a register (with a huge mov r64, imm64 instruction). sys_write doesn't special-case size=1 and treat its void *buf arg as a char value if it's not a valid pointer.

There aren't any other system calls that would do the trick. pwrite and writev are both more complicated (taking a file offset as well as a pointer, or taking an array of pointer+length to gather the data in kernel space).

There is a lot you can do to optimize this for code-size, though. See https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code

First, putting the newline character in static storage means you need to generate a static address in a register. Your options here are:

5-bytes mov esi, imm32 (only in Linux non-PIE executables, so static addresses are link-time constants and are known to be in the low 2GiB of virtual address space and thus work as 32-bit zero-extended or sign-extended)
7-byte lea rsi, [rel newline] Works everywhere, the only good option if you can't use the 5-byte mov-immediate.
10-byte mov rsi, imm64. This works even in PIE executables (e.g. if you link with gcc -nostdlib without -static, on a distro where PIE is the default.) But only via a runtime relocation fixup, and the code-size is terrible. Compilers never use this because it's not faster than LEA.

But like I said, we can avoid static addressing entirely: Use push to put immediate data on the stack. This works even if we need zero-terminated strings, because push imm8 and push imm32 both sign-extend the immediate to 64-bit. Since ASCII uses the low half of the 0..255 range, this is equivalent to zero-extension.

Then we just need to copy RSP to RSI, because push leave RSP pointing to the data that was pushed. mov rsi, rsp would be 3 bytes because it needs a REX prefix. If you were targeting 32-bit code or the x32 ABI (32-bit pointers in long mode) you could use 2-byte mov esi, esp. But Linux puts the stack pointer at top of user virtual address space, so on x86-64 that's 0x007ff..., right at the top of the low canonical range. So truncating a pointer to stack memory to 32 bits isn't an option; we'd get -EFAULT.

But we can copy a 64-bit register with 1-byte push + 1-byte pop. (Assuming neither register needs a REX prefix to access.)

default rel     ; We don't use any explicit addressing modes, but no reason to leave this out.

_start:
    push   10         ; \n

    push   rsp
    pop    rsi        ; 2 bytes total vs. 3 for mov rsi,rsp

    push   1          ; _NR_write call number
    pop    rax        ; 3 bytes, vs. 5 for mov edi, 1

    mov    edx, eax   ; length = call number by coincidence
    mov    edi, eax   ; fd = length = call number  also coincidence
    syscall           ;   write(1, "\n", 1)

    mov    al, 60     ; assuming write didn't return -errno, replace the low byte and keep the high zeros
    ;xor    edi, edi    ; leave rdi = 1  from write
    syscall           ; _exit(1)

.size: db $ - _start

xor-zeroing is the most well-known x86 peephole optimization: it saves 3 bytes of code size, and is actually more efficient than mov edi, 0. But you only asked for the smallest code to print a newline, without specifying that it had to exit with status = 0. So we can save 2 bytes by leaving that out.

Since we're just making an _exit system call, we don't need to clean up the stack from the 10 we pushed.

BTW, this will crash if the write returns an error. (e.g. redirected to /dev/full, or closed with ./newline >&-, or whatever other condition.) That would leave RAX=-something, so mov al, 60 would give us RAX=0xffff...3c. Then we'd get -ENOSYS from the invalid call number, and fall off the end of _start and decode whatever is next as instructions. (Probably zero bytes which decode with [rax] as an addressing mode. Then we'd fault with a SIGSEGV.)

objdump -d -Mintel disassembly of that code, after building with nasm -felf64 and linking with ld

0000000000401000 <_start>:
  401000:       6a 0a                   push   0xa
  401002:       54                      push   rsp
  401003:       5e                      pop    rsi
  401004:       6a 01                   push   0x1
  401006:       58                      pop    rax
  401007:       89 c2                   mov    edx,eax
  401009:       89 c7                   mov    edi,eax
  40100b:       0f 05                   syscall 
  40100d:       b0 3c                   mov    al,0x3c
  40100f:       0f 05                   syscall 

0000000000401011 <_start.size>:
  401011:       11                      .byte 0x11

So the total code-size is 0x11 = 17 bytes. vs. your version with 39 bytes of code + 1 byte of static data. Your first 3 mov instructions alone are 5, 5, and 10 bytes long. (Or 7 bytes long for mov rax,1 if you use YASM which doesn't optimize it to mov eax,1).

Running it:

$ strace ./newline 
execve("./newline", ["./newline"], 0x7ffd4e98d3f0 /* 54 vars */) = 0
write(1, "\n", 1
)                       = 1
exit(1)                                 = ?
+++ exited with 1 +++

If this was part of a larger program:

If you already have a pointer to some nearby static data in a register, you could do something like a 4-byte lea rsi, [rdx + newline-foo] (REX.W + opcode + modrm + disp8), assuming the newline-foo offset fits in a sign-extended disp8 and that RDX holds the address of foo.

Then you can have newline: db 10 in static storage after all. (Put it .rodata or .data, depending on which section you already had a pointer to).