I am wondering what it takes to develop a game in assembly language. For example, what are the limitations or advantages of using assembly language in game development? Also, are there any programs or other software to aid the development of games in assembly language?
Yes, in fact it is possible. As the saying goes,
Anything is possible, if you put your mind to it.
A rather popular game from a while back, RollerCoaster Tycoon was written almost entirely in x86 assembly, although a few C functions were used to interface with the OS and DirectX.
However, as you might imagine, this can be extremely painful. Higher level languages exist to make things easier. My favorite, C, is rather commonly used for close-to-the-metal applications, and is considered fairly low level.
Here's an example of a direct, unoptimized compiler translation of a simple C function (bubble sort, one of the simplest of them all) to x64 assembly:
void bubble_sort (int *a, int n) {
    int i, t, s = 1;
    while (s) {
        s = 0;
        for (i = 1; i < n; i++) {
            if (a[i] < a[i - 1]) {
                t = a[i];
                a[i] = a[i - 1];
                a[i - 1] = t;
                s = 1;
            }
        }
    }
}
bubble_sort:
    push rbp
    mov rbp, rsp
    mov QWORD PTR [rbp-24], rdi
    mov DWORD PTR [rbp-28], esi
    mov DWORD PTR [rbp-8], 1
    jmp .L2
.L6:
    mov DWORD PTR [rbp-8], 0
    mov DWORD PTR [rbp-4], 1
    jmp .L3
.L5:
    mov eax, DWORD PTR [rbp-4]
    cdqe
    lea rdx, [0+rax*4]
    mov rax, QWORD PTR [rbp-24]
    add rax, rdx
    mov edx, DWORD PTR [rax]
    mov eax, DWORD PTR [rbp-4]
    cdqe
    sal rax, 2
    lea rcx, [rax-4]
    mov rax, QWORD PTR [rbp-24]
    add rax, rcx
    mov eax, DWORD PTR [rax]
    cmp edx, eax
    jge .L4
    mov eax, DWORD PTR [rbp-4]
    cdqe
    lea rdx, [0+rax*4]
    mov rax, QWORD PTR [rbp-24]
    add rax, rdx
    mov eax, DWORD PTR [rax]
    mov DWORD PTR [rbp-12], eax
    mov eax, DWORD PTR [rbp-4]
    cdqe
    lea rdx, [0+rax*4]
    mov rax, QWORD PTR [rbp-24]
    add rdx, rax
    mov eax, DWORD PTR [rbp-4]
    cdqe
    sal rax, 2
    lea rcx, [rax-4]
    mov rax, QWORD PTR [rbp-24]
    add rax, rcx
    mov eax, DWORD PTR [rax]
    mov DWORD PTR [rdx], eax
    mov eax, DWORD PTR [rbp-4]
    cdqe
    sal rax, 2
    lea rdx, [rax-4]
    mov rax, QWORD PTR [rbp-24]
    add rdx, rax
    mov eax, DWORD PTR [rbp-12]
    mov DWORD PTR [rdx], eax
    mov DWORD PTR [rbp-8], 1
.L4:
    add DWORD PTR [rbp-4], 1
.L3:
    mov eax, DWORD PTR [rbp-4]
    cmp eax, DWORD PTR [rbp-28]
    jl .L5
.L2:
    cmp DWORD PTR [rbp-8], 0
    jne .L6
    pop rbp
    ret
The choice of signed int i for array indexing leads to a lot of extra cdqe instructions (sign-extending 32 to 64 bit), for example. And with nothing being kept in registers across C statements, there's a huge amount of reloading. See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about how bad unoptimized code is.
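As a minimal sketch of the indexing point, here is the same function with a pointer-width unsigned index. The name bubble_sort_sz and the switch to size_t for both i and n are assumptions made for illustration, not part of the original code. On x86-64 a size_t index is already 64 bits wide, so the compiler should not need cdqe to use it in an address. Unoptimized output would still reload everything from the stack.

```c
#include <stddef.h>

/* Hypothetical variant of bubble_sort: only the index types change.
   size_t is pointer-width on x86-64, so no 32->64 sign extension is
   needed before the index is used in an address calculation. */
void bubble_sort_sz(int *a, size_t n) {
    size_t i;
    int t, s = 1;               /* s: "a swap happened on this pass" flag */
    while (s) {
        s = 0;
        for (i = 1; i < n; i++) {
            if (a[i] < a[i - 1]) {
                t = a[i];
                a[i] = a[i - 1];
                a[i - 1] = t;
                s = 1;
            }
        }
    }
}
```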
Now, in this day and age, people expect their programs to run quickly (even if they used a slow algorithm like Bubble Sort). If you're going to do anything fancy or use a lot of data, you need optimized code. Here's what the same code looks like with full optimizations (-O3 -march=corei7, aka Nehalem; see previous Godbolt link):
bubble_sort:
    lea eax, [rsi-2]
    cmp esi, 1
    lea r8, [rdi+8+rax*4]
    jg .L11
    rep; ret
.L11:
    add rdi, 4
.L3:
    mov rax, rdi
    xor esi, esi
.L6:
    mov edx, DWORD PTR [rax]
    mov ecx, DWORD PTR [rax-4]
    cmp edx, ecx
    jge .L4
    mov DWORD PTR [rax], ecx
    mov esi, 1
    mov DWORD PTR [rax-4], edx
.L4:
    add rax, 4
    cmp rax, r8
    jne .L6
    test esi, esi
    jne .L3
    rep; ret
(Editor's note: Godbolt no longer has the exact GCC version that generated this, gcc-4.9.0-0909-concepts, but plain GCC 4.7.4 in C or C++ mode exactly reproduces the asm instruction choices, including defeating macro-fusion by placing the second LEA between CMP and JCC. The two LEAs generate an end pointer, a+n, scaling n by sizeof(int).)
Wait a minute. It looks much shorter. Certainly. But can you tell which instructions to put where, how it has reordered the instructions, or what combinations to use? The compiler can. (And debug info will even associate each asm instruction with a source line, which Godbolt will show you with mouseover highlighting.)
But after transforming from indexing to pointer increments, some asm instructions aren't exactly implementing a source operation. And -fverbose-asm doesn't help because the operands for most instructions are inventions of the compiler, not original C local variables. In this case it's still easy to follow if you know asm (since the algorithm is so simple, just load + compare + conditional-branch over two stores), but later GCC will vectorize bubble sort... which turns out not to be a good thing.
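To make the indexing-to-pointer transformation concrete, here is a rough C rendering of what the optimized asm appears to be doing. This is a hand-written sketch, not GCC's internal representation, and the names bubble_sort_ptr, end, p, cur, prev and swapped are invented for illustration. The comments map each step back to the asm listing.

```c
/* Hypothetical C mirror of the optimized asm: walk a pointer from a+1
   up to a precomputed end pointer instead of indexing with i. */
void bubble_sort_ptr(int *a, int n) {
    if (n <= 1)
        return;                         /* cmp esi, 1 / jg .L11 / rep; ret */
    int *end = a + n;                   /* the two LEAs: r8 = a + n        */
    int swapped;
    do {
        swapped = 0;                    /* xor esi, esi                    */
        for (int *p = a + 1; p != end; p++) {   /* add rax, 4 / cmp rax, r8 */
            int cur = p[0];             /* mov edx, [rax]                  */
            int prev = p[-1];           /* mov ecx, [rax-4]                */
            if (cur < prev) {           /* cmp edx, ecx / jge .L4          */
                p[0] = prev;            /* mov [rax], ecx                  */
                p[-1] = cur;            /* mov [rax-4], edx                */
                swapped = 1;            /* mov esi, 1                      */
            }
        }
    } while (swapped);                  /* test esi, esi / jne .L3         */
}
```

Note that end and p have no counterpart in the original source, which is why per-instruction source comments are of limited use here.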
All in all, writing a game in assembly is probably an extremely bad idea. It's a better idea to use a compiled language such as C to write your game and possibly rewrite one or two functions by hand in assembly later.
If there really are many common instances where your handwritten assembly outperforms the compiler with max optimizations, perhaps you ought to notify the developers of that compiler, so they can work with you to add the optimization you have in mind. This will allow you to just have the compiler do the same without your intervention in the future.
Compiler intrinsics will often allow you to do many of the things you do in assembly. Hell, you can even write near pure MMX/SSE/SSSE3/AVX code by using compiler intrinsics, while letting the compiler still optimize it further, allocate registers well, and integrate it with the rest of your code.
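As a small illustration of the intrinsics approach, the sketch below adds two int arrays four elements at a time using SSE2 intrinsics from <emmintrin.h>. The function name add_arrays and its signature are assumptions made for this example. The __SSE2__ guard, which GCC and Clang define when SSE2 is enabled, lets the same source fall back to a plain scalar loop on other targets.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>          /* SSE2 intrinsics */
#endif

/* Hypothetical helper: dst[i] = x[i] + y[i] for i in [0, n). */
void add_arrays(int *dst, const int *x, const int *y, size_t n) {
    size_t i = 0;
#if defined(__SSE2__)
    /* Process four 32-bit ints per iteration with unaligned loads/stores. */
    for (; i + 4 <= n; i += 4) {
        __m128i vx = _mm_loadu_si128((const __m128i *)(x + i));
        __m128i vy = _mm_loadu_si128((const __m128i *)(y + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(vx, vy));
    }
#endif
    /* Scalar loop handles the tail (or everything, without SSE2). */
    for (; i < n; i++)
        dst[i] = x[i] + y[i];
}
```

The compiler still picks the registers and schedules the loads, and it can inline this into callers, which is what hand-written asm would give up.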
(Editor's note: I do know assembly well enough to write programs in it, but it takes much longer to write efficient asm by hand than to write C which compiles to asm that's pretty close. And optimization requires taking advantage of details of known constants, and inlining, so a large codebase would be extremely hard to maintain. Changing a constant could require rewriting a lot of code if it had been a power of 2 but now isn't, for example. And a lot of code in a lot of places... Beating the compiler for a single loop is very doable for an expert, but usually still not worth doing for maintenance reasons, and because what's faster now on your current CPU might not be faster on future CPUs. Beating the compiler for all of a medium to large program is totally impractical, because it can redo inlining and constant-propagation every time you change something.
Checking the compiler-generated asm and adjusting your high-level source into something that compiles to better asm is a much better way to use asm knowledge to write efficient programs. Or just thinking about what machine operations are going to be required for the C (or whatever other language) you're writing. Sometimes compilers can be stubborn and insist on doing something less efficiently, even when you try to hand-hold them towards the asm you want, but usually it's not important enough to resort to hand-written asm even for a single function.)
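The power-of-2 point from the note can be sketched in a few lines. TILE_SIZE, tile_index and tile_offset are hypothetical names chosen for illustration. With an unsigned power-of-2 constant, optimizing compilers typically turn the division into a shift and the remainder into an AND. If the constant later becomes 10, the compiler re-derives the code on the next build (usually a multiply by a fixed-point reciprocal), whereas hand-written shifts and masks would have to be found and rewritten at every use.

```c
/* Hypothetical constant: a power of 2 today, but it might change. */
#define TILE_SIZE 8u

/* Written as plain division and remainder; the compiler chooses the
   cheapest instruction sequence for whatever TILE_SIZE currently is. */
static inline unsigned tile_index(unsigned x)  { return x / TILE_SIZE; }
static inline unsigned tile_offset(unsigned x) { return x % TILE_SIZE; }
```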