outerLoop:
skipCarry:
mov eax, 2
mov inCompare, eax ;Set our inCompare to 2
mov eax, outCompare
push ecx ;Save status of these registers before entering the innerLoop
push eax
mov ecx, innerLoopCount
isComposite:
mov eax, outCompare ;Value to be checked for composite
mov edx, 0 ;Make sure edx does not have a left over value
div inCompare ;Divide outCompare by inCompare to see if we get a remainder of 0 in edx
cmp edx, 0
jne skipPrint ;If the remainder is not 0, we have not proven that the number in question is composite, so do not print
;Print out a composite value if its been proven
mov eax, outCompare
call WriteDec
mov edx, OFFSET spaces
call WriteString ;Write spaces between the composites
mov ebx, writeCounter
inc ebx ;This will help to make sure we start a new line after writing 10 composites
mov writeCounter, ebx
cmp ebx, 10 ;If counter is at 10, we start a new line
jne exitInnerLoop ;exit the inner loop because we found that the number in question is composite
call CrLf
mov writeCounter, 0
jmp exitInnerLoop ;exit the inner loop because we found that the number in question is composite
skipPrint: ;Jumped to if the number in question was not confirmed to be composite
;This is to continue testing for composite of the number in question by incrementing the divisor
mov ebx, inCompare
inc ebx
mov eax, outCompare
cmp eax, ebx
je exitInnerLoop
mov inCompare, ebx
skipIncrement: ;Will be jumped to if there are no new divisors to be tested
loop isComposite ;Retest with new divisior if there is a possibility
exitInnerLoop:
pop eax ;restore outer loop status
pop ecx
inc eax ;Next number to check for being composite
mov outCompare, eax
loop outerLoop
I get the error code saying that the jump is too long by 5 bytes referring to the outerLoop. I know that it would be simpler to just use the jmp command and keep track of the counter myself, but my homework assignment requires that the loop is used instead. Is there anyway to extend the range of the loop or is there a way to shrink the content in the loops by 5 bytes that I am not seeing?
The loops are supposed to be checking if a number is composite, just so that the functionality is known.
No, loop
/loopne
only have one encoding each, using a SHORT (signed 8-bit rel8
) relative displacement.
x86 since 386 has jcc rel32
aka NEAR conditional branches as well as jcc rel8
. If you need to jump farther than rel8
, loop
is the wrong choice for small code size. dec ecx
/ jnz
is mostly equivalent, except it clobbers some flags.
Code-size is one of the only reasons you should ever use loop
; it's slow on everything except recent AMD). (speed ~= code-size on 8086, so it's useful there. Later CPUs made it slow on purpose for compatibility with (bad?) delay loop in drivers.
If you really want to jump farther with loop
, you could do this:
loop .stay_in_loop
jmp .leave_loop
.stay_in_loop:
jmp .far_away
.leave_loop:
The .stay_in_loop
block can be after the end of the function, or somewhere else that's never reached but within 127B, which lets you omit the jmp .leave_loop
.
You can also "cheat" and put some of the loop body out-of-line. e.g. you could move the Print
block to after the end of the function (or before the beginning), and jcc
to it and jmp
back to inside the loop.
Using two nested loop
instructions is just braindead, but it seems you're required to use that. push/pop to save/restore the outer loop's ecx
isn't essential; You could use another register and mov ecx, esi
or something. Either way kinda sucks because the push or mov esi, ecx
has to be at the top of the loop. Unless you fake it by doing dec esi
yourself instead of copying back the value produced by loop
. This makes the code cleaner, but may be considered cheating / defeating the purpose of the requirement. :P
Perhaps the real goal of your assignment was to get you to optimize for code-size so a short jump can reach. That's definitely the right approach for code-quality overall in your case.
It's normal for beginner code to be painfully inefficient, so there are many easy local optimizations here. e.g. xor edx,edx
is 2 bytes (and faster) than 5 byte mov edx, 0
. Use test edx,edx
to compare against zero. It's smaller and better, and sets flags identically to cmp edx,0
Also, keep your variables in registers. An instruction like mov writeCounter, 0
assembles to mov r/m32, imm32
, using a disp32 addressing mode. So that's 8 bytes of address and data, plus 2 bytes of opcode + ModR/M. Every absolute address instead of a register operand costs you 4 bytes.
If you need to save even more code-size inside the loop, you could copy things to the stack and access them with [ebp+disp8]
addressing modes (1 byte beyond ModR/M instead of 4).
To free up more registers for use in the loop, push/pop them at the start/end of your function.
The displacement in the outer loop
is 0xc9
, or -55. So my version is under half the size of your version. (With 12 bytes of code in ResetWriteCounter
outside the loop.)
push ebx
push ebp ; we're NOT using it as a frame pointer
push edi
push esi
;;; Important to keep in registers
; edi = inCompare = trial divisor
; ebx = outCompare = prime/composite candidate
;; nice to keep in registers
; ebp = WriteCounter (but done as a down-counter with dec instead of inc/cmp)
; esi = outer loop counter
;; these variable names are kinda bad, IMO.
;; candidate / divisor would be better names.
; ecx = inner loop counter
; eax,edx = scratch for DIV
mov ebx, outCompare ; first composite candidate
mov ebp, 10 ; down-counter to trigger newlines
push ' ' ; two spaces followed by 2 zero bytes (i.e. null-terminated string)
; Use mov edx, esp instead of mov edx, OFFSET spaces in the print block
;;; note that code before this isn't part of the size that LOOP has to jump over.
outerLoop:
test bl, 1
jz Print ;; multiple of 2
;; inner loop now only has to check the odd divisors
mov edi, 3 ; first trial divisor = first odd prime (other than 1)
mov ecx, innerLoopCount ; ideally set to sqrt(ebx), e.g. cvtsi2ss xmm0, ebx / sqrtss xmm0,xmm0 / cvtss2si xmm0, ecx might be correct :P
; could instead use ecx = ebx directly if you don't care about speed, so try divisors all the way up to the candidate.
;; inner loop
isComposite:
mov eax, ebx ; edx:eax = dividend = candidate
xor edx, edx
div edi ; trial division
add edi, 2 ; next trial divisor = next odd number
test edx, edx ; equivalent to cmp edx,0 to check remainder
jz Print ; you could maybe fold this into a LOOPNZ and check ZF after the loop
;If the remainder is not 0, we have not proven that the number in question is composite, so do not print
;;cmp edi, ebx ; divisor, candidate
;;jae exitInnerLoop ; what's the point of this? isn't ecx set so that loop falls through after this many iterations?
loop isComposite ;Retest with new divisor if there is a possibility
jmp NoPrint ; else it was prime
;;; I put this outside the inner loop in the first place instead of needing to multiple jumps out of this to exitInnerLoop
Print: ;Print out a composite value if its been proven
mov eax, ebx
call WriteDec
dec ebp ; new line when the down-counter reaches zero
jz ResetWriteCounter
; print spaces only if not printing a newline
mov edx, esp ; OFFSET spaces; Use stack memory we pushed earlier
call WriteString ;Write spaces between the composites
;; tail of outer loop. Prime numbers jump here without printing.
exitInnerLoop:
NoPrint:
inc ebx ;Next number to check for being composite
; TODO: increment by 2 and don't even check the even numbers, just print them after every odd prime/composite
mov ecx, esi ; ecx = outer loop counter for the LOOP insn
dec esi ; update outer counter instead of having mov esi, ecx at the top of the loop
loop outerLoop
add esp, 4 ; clear the ' ' we pushed earlier. pop esi is smaller and about as efficiently
pop esi
pop edi
pop ebp
pop ebx
ret
; after the end of the function is a good place for infrequently-use blocks that you jump to and then jump back.
ResetWriteCounter:
mov ebp, 10 ; reset counter
; start a new line after writing 10 composites
call CrLf
; could have done this branchlessly by selecting a string of ' ',0 or 13,10,0 based on ebp being zero
; but doing the ebp update branchlessly too is worse than this highly predictable branch. Maybe if the count was a power of 2, DEC/AND
jmp NoPrint ; back where we came from, skipping the write spaces
; alternative: duplicate inc / mov / dec / loop block here
There's more you could do in terms of branch structure. e.g. I'd like to eliminate the jmp NoPrint
after falling out of the loop. That seems clunky. Putting the Print
block somewhere else would do the trick, but then Print would have to jump instead of falling through.
TODO: use ecx
as the trial divisor. Probably much slower to count down than up, though, because small prime factors are more common. Also need to avoid 1
.
I was going to just apply some of the local optimizations, but by the time I was done understanding your branch structure, I was having fun golfing it for code size. golf as in code-golf = fewest bytes wins.
See some of my x86 machine code golf answers for code-size tricks (including at the expense of speed, sometimes by a lot):
Fib(10^9)
in 105 bytes of x86 32-bit code (with Linux system calls), doing extended precision adc
with power-of-10 decimal limbs for easy truncation to keep just the most significant digits. This one is neat because there's also a performance requirement, so the inner loop is optimized for speed. (Whole thing runs in about 70 sec on my i7-6700k.)rep
strings to replace strings of N chars with base^N filler chars; in 21 bytes.n
and n^3
have the same set of decimal digits, 40 bytes of machine code. (Using bts
to build a bitmap in a register).)