pythonperformancedivisionmodulodivmod

Is divmod() faster than using the % and // operators?


I remember from assembly that integer division instructions yield both the quotient and remainder. So, in python will the built-in divmod() function be better performance-wise than using the % and // operators (suppose of course one needs both the quotient and the remainder)?

q, r = divmod(n, d)
q, r = (n // d, n % d)

Solution

  • To measure is to know (all timings on a Macbook Pro 2.8Ghz i7):

    >>> import sys, timeit
    >>> sys.version_info
    sys.version_info(major=2, minor=7, micro=12, releaselevel='final', serial=0)
    >>> timeit.timeit('divmod(n, d)', 'n, d = 42, 7')
    0.1473848819732666
    >>> timeit.timeit('n // d, n % d', 'n, d = 42, 7')
    0.10324406623840332
    

    The divmod() function is at a disadvantage here because you need to look up the global each time. Binding it to a local (all variables in a timeit time trial are local) improves performance a little:

    >>> timeit.timeit('dm(n, d)', 'n, d = 42, 7; dm = divmod')
    0.13460898399353027
    

    but the operators still win because they don't have to preserve the current frame while a function call to divmod() is executed:

    >>> import dis
    >>> dis.dis(compile('divmod(n, d)', '', 'exec'))
      1           0 LOAD_NAME                0 (divmod)
                  3 LOAD_NAME                1 (n)
                  6 LOAD_NAME                2 (d)
                  9 CALL_FUNCTION            2
                 12 POP_TOP             
                 13 LOAD_CONST               0 (None)
                 16 RETURN_VALUE        
    >>> dis.dis(compile('(n // d, n % d)', '', 'exec'))
      1           0 LOAD_NAME                0 (n)
                  3 LOAD_NAME                1 (d)
                  6 BINARY_FLOOR_DIVIDE 
                  7 LOAD_NAME                0 (n)
                 10 LOAD_NAME                1 (d)
                 13 BINARY_MODULO       
                 14 BUILD_TUPLE              2
                 17 POP_TOP             
                 18 LOAD_CONST               0 (None)
                 21 RETURN_VALUE        
    

    The // and % variant uses more opcodes, but the CALL_FUNCTION bytecode is a bear, performance wise.


    In PyPy, for small integers there isn't really much of a difference; the small speed advantage the opcodes have melts away under the sheer speed of C integer arithmetic:

    >>>> import platform, sys, timeit
    >>>> platform.python_implementation(), sys.version_info
    ('PyPy', (major=2, minor=7, micro=10, releaselevel='final', serial=42))
    >>>> timeit.timeit('divmod(n, d)', 'n, d = 42, 7', number=10**9)
    0.5659301280975342
    >>>> timeit.timeit('n // d, n % d', 'n, d = 42, 7', number=10**9)
    0.5471200942993164
    

    (I had to crank the number of repetitions up to 1 billion to show how small the difference really is, PyPy is blazingly fast here).

    However, when the numbers get large, divmod() wins by a country mile:

    >>>> timeit.timeit('divmod(n, d)', 'n, d = 2**74207281 - 1, 26', number=100)
    17.620037078857422
    >>>> timeit.timeit('n // d, n % d', 'n, d = 2**74207281 - 1, 26', number=100)
    34.44323515892029
    

    (I now had to tune down the number of repetitions by a factor of 10 compared to hobbs' numbers, just to get a result in a reasonable amount of time).

    This is because PyPy no longer can unbox those integers as C integers; you can see the striking difference in timings between using sys.maxint and sys.maxint + 1:

    >>>> timeit.timeit('divmod(n, d)', 'import sys; n, d = sys.maxint, 26', number=10**7)
    0.008622884750366211
    >>>> timeit.timeit('n // d, n % d', 'import sys; n, d = sys.maxint, 26', number=10**7)
    0.007693052291870117
    >>>> timeit.timeit('divmod(n, d)', 'import sys; n, d = sys.maxint + 1, 26', number=10**7)
    0.8396248817443848
    >>>> timeit.timeit('n // d, n % d', 'import sys; n, d = sys.maxint + 1, 26', number=10**7)
    1.0117690563201904