performance assembly gcc optimization perf

How can a large difference in execution speed between two processors be explained?

I wrote a Fortran program to simulate a molecular system. I developed it on a desktop computer with an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz. Later, to run large-scale simulations, I used compute blades with AMD Opteron(tm) Processor 6176 CPUs @ 2.3 GHz. I was surprised to find that the execution time increased by a factor of about 3.

So I decided to learn how to use perf, assembly, and so on to optimize the program. After a lot of experimenting, I reduced it to this short program, and I still see a factor of about 3.

program simple_pgm

    integer :: i, res
    
    res = 0
    do i=1,1000000000
        res = res + i
    enddo
    write(*,*) res

end program simple_pgm

Compilation command: gfortran -g -Wall -O2 simple.f90 -o simple

When I annotate MAIN__ in perf report -n, the assembly code is more or less the same on both machines.

On the i7 processor :

     │    res = 0
     │    do i=1,1000000000
     │      mov  $0x1,%eax
     │      xchg %ax,%ax
     │    res = res + i
1095 │10:   add  %eax,%edx
     │    do i=1,1000000000
 1   │      add  $0x1,%eax
     │      cmp  $0x3b9aca01,%eax
     │    ↑ jne  10
     │    enddo

And on the Opteron one :

     │    res = 0
     │    do i=1,1000000000
     │      mov    $0x1,%eax
     │    program simple_pgm
     │      sub    $0x220,%rsp
     │      nop
     │    res = res + i
1972 │10:   add    %eax,%edx
     │    do i=1,1000000000
1524 │      add    $0x1,%eax
     │      cmp    $0x3b9aca01,%eax
     │    ↑ jne    10
     │    enddo

I wonder why the sample count for the instruction add $0x1,%eax is so large on the Opteron processor (1524), whereas it is only 1 on the i7. Could this explain the factor of about 3 in execution speed?

Thanks for any answers. I am still learning assembly, processor and computer architecture, perf, and so on (many things for a beginner), so any comments or suggestions would be appreciated. I am aware that I could be on the wrong track.

Solution

The AMD Opteron 6176 uses the K10 microarchitecture, whereas the Intel Core i7-6700 uses Skylake. K10 is very old, apparently too old to be listed on uops.info or to be supported by code analyzers.

Going by Agner Fog's microarchitecture information, it seems that K10 needs at least 2 cycles per iteration for small loops. Skylake does not have that limitation and, in this particular case, should be able to reach 1 cycle per iteration.
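One way to check the cycles-per-iteration estimate on each machine is to time the loop directly and convert the result to cycles. The following is only a minimal sketch under stated assumptions: the program name, the 64-bit accumulator (chosen so the sum does not overflow) and the `assumed_ghz` constant are illustrative and not part of the question's code, and `assumed_ghz` has to be set by hand to the clock the core actually runs at.

```fortran
program time_loop
    use iso_fortran_env, only: int64, real64
    implicit none

    integer(int64), parameter :: n = 1000000000_int64
    ! Hypothetical placeholder: the clock frequency (GHz) you believe the
    ! core runs at. It is an assumption, not something the program measures.
    real(real64), parameter :: assumed_ghz = 2.3_real64

    integer(int64) :: i, res, t0, t1, rate
    real(real64) :: ns_per_iter

    res = 0
    call system_clock(t0, rate)      ! rate = clock ticks per second
    do i = 1, n
        res = res + i
    end do
    call system_clock(t1)

    ! Printing res means the loop result is used, so it is not dead code.
    write(*,*) 'res          =', res

    ns_per_iter = real(t1 - t0, real64) / real(rate, real64) * 1.0e9_real64 / real(n, real64)
    write(*,*) 'ns per iter  =', ns_per_iter
    ! ns * (cycles per ns) = cycles
    write(*,*) 'cycles/iter ~=', ns_per_iter * assumed_ghz
end program time_loop
```

Compiled with the same flags as in the question (gfortran -O2), a result near 1 cycle per iteration on the i7 and near 2 on the Opteron would be consistent with the explanation above. If the compiler vectorizes the loop or replaces it with a closed-form sum, the numbers no longer describe the scalar loop, so it is worth checking the generated code with perf annotate first.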

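At 1 cycle per iteration, the i7-6700 would take 0.25 ns per iteration (assuming it runs at its 4 GHz turbo frequency). That is almost (but not quite) 3 times as fast as the 0.63 ns per iteration the Opteron would need at 2 cycles per iteration (assuming it runs at a 3.2 GHz turbo; that figure isn't very well supported, and perhaps the maximum turbo frequency was not used). At the 2.3 GHz listed in the question, 2 cycles would take about 0.87 ns, a ratio of about 3.5.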
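The arithmetic behind these figures is simply time per iteration = cycles per iteration / clock frequency in GHz, giving nanoseconds. A small sketch that tabulates it is shown here; the cycle counts are the estimates from this answer and the frequencies are assumptions, not measurements, and the program and constant names are illustrative only.

```fortran
program iteration_time
    implicit none

    ! cycles per iteration / frequency in GHz = nanoseconds per iteration.
    ! All inputs are assumptions taken from the discussion, not measured values.
    real, parameter :: skylake_ns   = 1.0 / 4.0   ! 1 cycle  at 4.0 GHz
    real, parameter :: k10_turbo_ns = 2.0 / 3.2   ! 2 cycles at 3.2 GHz
    real, parameter :: k10_base_ns  = 2.0 / 2.3   ! 2 cycles at 2.3 GHz

    write(*,'(a,f6.3,a)') 'Skylake, 4.0 GHz: ', skylake_ns, ' ns/iteration'
    write(*,'(a,f6.3,a,f5.2)') 'K10,     3.2 GHz: ', k10_turbo_ns, &
        ' ns/iteration, ratio ', k10_turbo_ns / skylake_ns
    write(*,'(a,f6.3,a,f5.2)') 'K10,     2.3 GHz: ', k10_base_ns, &
        ' ns/iteration, ratio ', k10_base_ns / skylake_ns
end program iteration_time
```

The two K10 rows give ratios of 2.50 and 3.48, which bracket the observed factor of about 3.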