I'm learning a bit of assembly for fun and I am probably too green to know the right terminology and find the answer myself.
I want to print a newline at the end of my program.
Below works fine.
section .data
newline db 10
section .text
_end:
mov rax, 1
mov rdi, 1
mov rsi, newline
mov rdx, 1
syscall
mov rax, 60
mov rdi, 0
syscall
But I'm hoping to achieve the same result without defining the newline in .data. Is it possible to call sys_write directly with the byte you want, or must it always be done with a reference to some predefined data (which I assume is what mov rsi, newline is doing)?
In short, why can't I replace mov rsi, newline by mov rsi, 10?
You always need the data in memory to copy it to a file-descriptor. There is no system-call equivalent of C stdio fputc that takes data by value instead of by pointer.
mov rsi, newline puts a pointer into a register (with a huge mov r64, imm64 instruction). sys_write doesn't special-case size=1 and treat its void *buf arg as a char value if it's not a valid pointer.
There aren't any other system calls that would do the trick. pwrite and writev are both more complicated (taking a file offset as well as a pointer, or taking an array of pointer+length to gather the data in kernel space).
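To see what the kernel does with mov rsi, 10, here is a minimal sketch (the file name efault.asm is just an assumption for the build comment). It passes 10 as the buffer argument and then uses the negated return value as the exit status, so the error number shows up in the shell.

```nasm
; Sketch: pass the value 10 where write() expects a pointer.
; Build (assumed file name):
;   nasm -felf64 efault.asm && ld -o efault efault.o
global _start
section .text
_start:
    mov  eax, 1          ; __NR_write
    mov  edi, 1          ; fd = stdout
    mov  esi, 10         ; buf = address 10, NOT the byte value 10
    mov  edx, 1          ; length = 1
    syscall              ; kernel tries to read 1 byte from address 10

    ; On failure RAX holds -errno. EFAULT is 14 on Linux.
    neg  eax             ; turn -errno into a small positive number
    mov  edi, eax        ; use it as the exit status
    mov  eax, 60         ; __NR_exit
    syscall
```

On a typical Linux system, running this and then echo $? should report 14 (EFAULT) with no newline printed, because low addresses like 10 are not mapped in a normal process.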
There is a lot you can do to optimize this for code-size, though. See https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code
First, putting the newline character in static storage means you need to generate a static address in a register. Your options here are:
mov esi, imm32: only in Linux non-PIE executables, where static addresses are link-time constants known to be in the low 2GiB of virtual address space, and thus work as 32-bit zero-extended or sign-extended immediates.
lea rsi, [rel newline]: works everywhere, and is the only good option if you can't use the 5-byte mov-immediate.
mov rsi, imm64: this works even in PIE executables (e.g. if you link with gcc -nostdlib without -static, on a distro where PIE is the default), but only via a runtime relocation fixup, and the code-size is terrible. Compilers never use this because it's not faster than LEA.
But we can avoid static addressing entirely: use push to put immediate data on the stack. This works even if we need zero-terminated strings, because push imm8 and push imm32 both sign-extend the immediate to 64-bit. Since ASCII uses the low half of the 0..255 range, this is equivalent to zero-extension.
Then we just need to copy RSP to RSI, because push leaves RSP pointing to the data that was pushed. mov rsi, rsp would be 3 bytes because it needs a REX prefix. If you were targeting 32-bit code or the x32 ABI (32-bit pointers in long mode) you could use 2-byte mov esi, esp. But Linux puts the stack pointer at the top of user virtual address space, so on x86-64 that's 0x007ff..., right at the top of the low canonical range. So truncating a pointer to stack memory to 32 bits isn't an option; we'd get -EFAULT.
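Before any size golfing, the push-based approach described so far could be sketched like this. It is a plain, readable variant (not the smallest encoding) that also exits with status 0:

```nasm
; Sketch: print a newline from the stack, no .data section needed.
global _start
section .text
_start:
    push 10              ; 8-byte stack slot: low byte is '\n', the rest is zero
    mov  rsi, rsp        ; buf = address of that byte (x86 is little-endian)
    mov  eax, 1          ; __NR_write
    mov  edi, 1          ; fd = stdout
    mov  edx, 1          ; length = 1 byte
    syscall              ; write(1, "\n", 1)

    mov  eax, 60         ; __NR_exit
    xor  edi, edi        ; status = 0
    syscall              ; exit(0)
```

Writing to the 32-bit registers (eax, edi, edx) zero-extends into the full 64-bit registers, so these 5-byte mov instructions do the same job as the mov rax, 1 style in the question.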
But we can copy a 64-bit register with 1-byte push + 1-byte pop. (Assuming neither register needs a REX prefix to access.)
default rel ; We don't use any explicit addressing modes, but no reason to leave this out.
_start:
push 10 ; \n
push rsp
pop rsi ; 2 bytes total vs. 3 for mov rsi,rsp
push 1 ; __NR_write call number
pop rax ; 3 bytes total, vs. 5 for mov eax, 1
mov edx, eax ; length = call number by coincidence
mov edi, eax ; fd = length = call number also coincidence
syscall ; write(1, "\n", 1)
mov al, 60 ; assuming write didn't return -errno, replace the low byte and keep the high zeros
;xor edi, edi ; leave rdi = 1 from write
syscall ; _exit(1)
.size: db $ - _start
xor-zeroing is the most well-known x86 peephole optimization: it saves 3 bytes of code size, and is actually more efficient than mov edi, 0. But you only asked for the smallest code to print a newline, without specifying that it had to exit with status = 0. So we can save 2 bytes by leaving that out.
Since we're just making an _exit system call, we don't need to clean up the stack from the 10 we pushed.
BTW, this will crash if the write returns an error (e.g. redirected to /dev/full, or closed with ./newline >&-, or whatever other condition). That would leave RAX=-something, so mov al, 60 would give us RAX=0xffff...3c. Then we'd get -ENOSYS from the invalid call number, and fall off the end of _start and decode whatever is next as instructions. (Probably zero bytes which decode with [rax] as an addressing mode. Then we'd fault with a SIGSEGV.)
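If that failure mode matters, one possible variant (a sketch, not part of the 17-byte version) is to set all of RAX with push/pop instead of patching only the low byte. That costs 3 bytes instead of 2, so the code grows to 18 bytes, but the exit call number no longer depends on what write returned:

```nasm
; Sketch: same golfed program, but the exit call works even if write fails.
global _start
section .text
_start:
    push 10              ; '\n' on the stack
    push rsp
    pop  rsi             ; rsi = pointer to the newline byte
    push 1               ; __NR_write
    pop  rax
    mov  edx, eax        ; length = 1
    mov  edi, eax        ; fd = 1
    syscall              ; RAX = 1 on success, or -errno on failure
    push 60              ; __NR_exit
    pop  rax             ; overwrites the whole register, unlike mov al, 60
    syscall              ; exit(1): RDI is still 1 from the write call
```

This relies on the Linux x86-64 syscall convention preserving RDI across the first syscall; only RAX, RCX and R11 are overwritten.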
objdump -d -Mintel disassembly of the golfed _start above, after building with nasm -felf64 and linking with ld:
0000000000401000 <_start>:
401000: 6a 0a push 0xa
401002: 54 push rsp
401003: 5e pop rsi
401004: 6a 01 push 0x1
401006: 58 pop rax
401007: 89 c2 mov edx,eax
401009: 89 c7 mov edi,eax
40100b: 0f 05 syscall
40100d: b0 3c mov al,0x3c
40100f: 0f 05 syscall
0000000000401011 <_start.size>:
401011: 11 .byte 0x11
So the total code-size is 0x11 = 17 bytes, vs. your version with 39 bytes of code + 1 byte of static data. Your first 3 mov instructions alone are 5, 5, and 10 bytes long. (Or 7 bytes long for mov rax,1 if you use YASM, which doesn't optimize it to mov eax,1.)
Running it:
$ strace ./newline
execve("./newline", ["./newline"], 0x7ffd4e98d3f0 /* 54 vars */) = 0
write(1, "\n", 1
) = 1
exit(1) = ?
+++ exited with 1 +++
If you already have a pointer to some nearby static data in a register, you could do something like a 4-byte lea rsi, [rdx + newline-foo] (REX.W + opcode + modrm + disp8), assuming the newline-foo offset fits in a sign-extended disp8 and that RDX holds the address of foo.
Then you can have newline: db 10 in static storage after all. (Put it in .rodata or .data, depending on which section you already had a pointer to.)
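As a rough illustration of that last idea, here is a sketch in which foo and its contents are hypothetical stand-ins for whatever data your program already points at. The first lea only simulates "RDX already holds the address of foo"; in a real program that pointer would come from earlier code.

```nasm
; Sketch: reach newline via a small displacement from an existing pointer.
global _start
section .rodata
foo:     db "hi"         ; hypothetical data that RDX already points to
newline: db 10           ; placed close enough for a disp8 offset

section .text
_start:
    lea  rdx, [rel foo]               ; stand-in for "RDX already holds &foo"
    lea  rsi, [rdx + (newline - foo)] ; buf = &newline, offset known at assemble time
    mov  eax, 1                       ; __NR_write
    mov  edi, eax                     ; fd = stdout
    mov  edx, eax                     ; length = 1 (RDX is free to reuse now)
    syscall                           ; write(1, "\n", 1)
    mov  eax, 60                      ; __NR_exit
    xor  edi, edi                     ; status = 0
    syscall
```

Because both labels are in the same section, newline - foo is a plain constant, so NASM should be able to pick the short disp8 encoding; checking the result with objdump -d is a good idea.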