loopsassemblyx86masmcode-size

Is there anyway to increase the range of the loop command when a loop command must be used


outerLoop:
    skipCarry:
        mov     eax, 2
        mov     inCompare, eax   ;Set our inCompare to 2
        mov     eax, outCompare 
        push    ecx ;Save status of these registers before entering the innerLoop
        push    eax
        mov     ecx, innerLoopCount

    isComposite:
        mov     eax, outCompare ;Value to be checked for composite
        mov     edx, 0  ;Make sure edx does not have a left over value
        div     inCompare   ;Divide outCompare by inCompare to see if we get a remainder of 0 in edx
        cmp     edx, 0
        jne     skipPrint   ;If the remainder is not 0, we have not proven that the number in question is composite, so do not print
        ;Print out a composite value if its been proven
        mov     eax, outCompare
        call    WriteDec
        mov     edx, OFFSET spaces
        call    WriteString     ;Write spaces between the composites
        mov     ebx, writeCounter
        inc     ebx     ;This will help to make sure we start a new line after writing 10 composites
        mov     writeCounter, ebx
        cmp     ebx, 10     ;If counter is at 10, we start a new line
        jne     exitInnerLoop   ;exit the inner loop because we found that the number in question is composite
        call    CrLf
        mov     writeCounter, 0
        jmp     exitInnerLoop       ;exit the inner loop because we found that the number in question is composite

        skipPrint:      ;Jumped to if the number in question was not confirmed to be composite

        ;This is to continue testing for composite of the number in question by incrementing the divisor
        mov     ebx, inCompare
        inc     ebx
        mov     eax, outCompare
        cmp     eax, ebx
        je      exitInnerLoop
        mov     inCompare, ebx

        skipIncrement:  ;Will be jumped to if there are no new divisors to be tested

        loop    isComposite     ;Retest with new divisior if there is a possibility
        exitInnerLoop:

    pop     eax     ;restore outer loop status
    pop     ecx
    inc     eax     ;Next number to check for being composite
    mov     outCompare, eax
    loop    outerLoop

I get the error code saying that the jump is too long by 5 bytes referring to the outerLoop. I know that it would be simpler to just use the jmp command and keep track of the counter myself, but my homework assignment requires that the loop is used instead. Is there anyway to extend the range of the loop or is there a way to shrink the content in the loops by 5 bytes that I am not seeing?

The loops are supposed to be checking if a number is composite, just so that the functionality is known.


Solution

  • No, loop/loopne only have one encoding each, using a SHORT (signed 8-bit rel8) relative displacement.

    x86 since 386 has jcc rel32 aka NEAR conditional branches as well as jcc rel8. If you need to jump farther than rel8, loop is the wrong choice for small code size. dec ecx / jnz is mostly equivalent, except it clobbers some flags.

    Code-size is one of the only reasons you should ever use loop; it's slow on everything except recent AMD). (speed ~= code-size on 8086, so it's useful there. Later CPUs made it slow on purpose for compatibility with (bad?) delay loop in drivers.


    If you really want to jump farther with loop, you could do this:

        loop  .stay_in_loop
    
        jmp   .leave_loop
    .stay_in_loop:
        jmp .far_away
    .leave_loop:
    

    The .stay_in_loop block can be after the end of the function, or somewhere else that's never reached but within 127B, which lets you omit the jmp .leave_loop.


    You can also "cheat" and put some of the loop body out-of-line. e.g. you could move the Print block to after the end of the function (or before the beginning), and jcc to it and jmp back to inside the loop.


    Using two nested loop instructions is just braindead, but it seems you're required to use that. push/pop to save/restore the outer loop's ecx isn't essential; You could use another register and mov ecx, esi or something. Either way kinda sucks because the push or mov esi, ecx has to be at the top of the loop. Unless you fake it by doing dec esi yourself instead of copying back the value produced by loop. This makes the code cleaner, but may be considered cheating / defeating the purpose of the requirement. :P

    Perhaps the real goal of your assignment was to get you to optimize for code-size so a short jump can reach. That's definitely the right approach for code-quality overall in your case.

    It's normal for beginner code to be painfully inefficient, so there are many easy local optimizations here. e.g. xor edx,edx is 2 bytes (and faster) than 5 byte mov edx, 0. Use test edx,edx to compare against zero. It's smaller and better, and sets flags identically to cmp edx,0

    Also, keep your variables in registers. An instruction like mov writeCounter, 0 assembles to mov r/m32, imm32, using a disp32 addressing mode. So that's 8 bytes of address and data, plus 2 bytes of opcode + ModR/M. Every absolute address instead of a register operand costs you 4 bytes.

    If you need to save even more code-size inside the loop, you could copy things to the stack and access them with [ebp+disp8] addressing modes (1 byte beyond ModR/M instead of 4).

    To free up more registers for use in the loop, push/pop them at the start/end of your function.


    Cleaned-up version of the code in the question

    The displacement in the outer loop is 0xc9, or -55. So my version is under half the size of your version. (With 12 bytes of code in ResetWriteCounter outside the loop.)

        push   ebx
        push   ebp       ; we're NOT using it as a frame pointer
        push   edi
        push   esi
    
    
    ;;; Important to keep in registers
    ; edi = inCompare = trial divisor
    ; ebx = outCompare = prime/composite candidate
    
    ;; nice to keep in registers
    ; ebp = WriteCounter  (but done as a down-counter with dec instead of inc/cmp)
    ; esi = outer loop counter
    
    ;; these variable names are kinda bad, IMO.
    ;; candidate / divisor would be better names.
    
    ; ecx = inner loop counter
    ; eax,edx = scratch for DIV
    
    
        mov    ebx, outCompare    ; first composite candidate
    
        mov    ebp, 10            ; down-counter to trigger newlines
        push   '  '          ; two spaces followed by 2 zero bytes (i.e. null-terminated string)
       ; Use mov edx, esp instead of mov edx, OFFSET spaces in the print block
    
       ;;; note that code before this isn't part of the size that LOOP has to jump over.
    
    outerLoop:
            test    bl, 1
            jz      Print    ;; multiple of 2
            ;; inner loop now only has to check the odd divisors
    
            mov     edi, 3    ; first trial divisor = first odd prime (other than 1)
            mov     ecx, innerLoopCount     ; ideally set to sqrt(ebx), e.g. cvtsi2ss xmm0, ebx / sqrtss xmm0,xmm0 / cvtss2si xmm0, ecx might be correct :P
                       ; could instead use ecx = ebx directly if you don't care about speed, so try divisors all the way up to the candidate.
    
        ;; inner loop
        isComposite:
            mov     eax, ebx         ; edx:eax = dividend = candidate
            xor     edx, edx
            div     edi              ; trial division
    
            add     edi, 2           ; next trial divisor = next odd number
            test    edx, edx         ; equivalent to cmp edx,0 to check remainder
            jz      Print            ; you could maybe fold this into a LOOPNZ and check ZF after the loop
       ;If the remainder is not 0, we have not proven that the number in question is composite, so do not print
    
            ;;cmp     edi, ebx        ;  divisor, candidate
            ;;jae     exitInnerLoop   ; what's the point of this?  isn't ecx set so that loop falls through after this many iterations?
    
            loop   isComposite     ;Retest with new divisor if there is a possibility
    
            jmp     NoPrint        ; else it was prime
    
    ;;; I put this outside the inner loop in the first place instead of needing to multiple jumps out of this to exitInnerLoop
        Print:   ;Print out a composite value if its been proven
            mov     eax, ebx
            call    WriteDec
    
            dec     ebp              ; new line when the down-counter reaches zero
            jz      ResetWriteCounter
    
            ; print spaces only if not printing a newline
            mov     edx, esp        ; OFFSET spaces;  Use stack memory we pushed earlier
            call    WriteString     ;Write spaces between the composites
    
        ;; tail of outer loop.  Prime numbers jump here without printing.
        exitInnerLoop:
        NoPrint:
            inc     ebx         ;Next number to check for being composite
                                ; TODO: increment by 2 and don't even check the even numbers, just print them after every odd prime/composite
    
            mov     ecx, esi    ; ecx = outer loop counter for the LOOP insn
            dec     esi         ; update outer counter instead of having  mov esi, ecx at the top of the loop
            loop    outerLoop
    
        add    esp, 4    ; clear the '  ' we pushed earlier.  pop esi  is smaller and about as efficiently
    
        pop    esi
        pop    edi
        pop    ebp
        pop    ebx
        ret
    
      ; after the end of the function is a good place for infrequently-use blocks that you jump to and then jump back.
     ResetWriteCounter:
            mov     ebp, 10         ; reset counter
            ; start a new line after writing 10 composites
            call    CrLf
            ; could have done this branchlessly by selecting a string of '  ',0 or 13,10,0 based on ebp being zero
            ; but doing the ebp update branchlessly too is worse than this highly predictable branch.  Maybe if the count was a power of 2, DEC/AND
    
            jmp  NoPrint          ; back where we came from, skipping the write spaces
           ; alternative: duplicate inc / mov / dec / loop block here
    

    There's more you could do in terms of branch structure. e.g. I'd like to eliminate the jmp NoPrint after falling out of the loop. That seems clunky. Putting the Print block somewhere else would do the trick, but then Print would have to jump instead of falling through.

    TODO: use ecx as the trial divisor. Probably much slower to count down than up, though, because small prime factors are more common. Also need to avoid 1.


    I was going to just apply some of the local optimizations, but by the time I was done understanding your branch structure, I was having fun golfing it for code size. golf as in code-golf = fewest bytes wins.

    See some of my x86 machine code golf answers for code-size tricks (including at the expense of speed, sometimes by a lot):