pythonperformanceoptimizationlanguage-implementation

Why is the fastest way to print a list of ints so unintuitive?


I have a list of ints (ints) and I want to print each one in their own line as fast as possible.

My first shot was this:

list(map(print,ints))

and running it on 107 ints together with a function to read them from standard in this takes an average of 1708 ms from 30 repetitions.

As this fairly well-received answer points out, using list() to force-expand a map-result is inefficient because it has to create a long list of irrelevant data, specifically in our case a list of None values, returned by print. That makes intuitive sense, right?

So I tried a number of other approaches:

# n print calls
list(map(print,ints))                         # map
for n in ints: print(n)                       # loop
for i in range(0,len(ints)): print(ints[i])   # range
[print(n) for n in ints]                      # compr
deque(map(print,ints),0)                      # deque

# one print call
print('\n'.join(map(str,ints)))               # cat
print(*ints, sep='\n')                        # sep

Measured in-situ together with ints = list(map(int,sys.stdin.readlines())) the following are the runtimes.

map    1707 ms
loop   1926 ms
range  2027 ms
compr  1828 ms
deque  1699 ms
--------------
cat    1084 ms
sep    1462 ms

This is not a fluke. I don't want to overload this question with too much statistics, but these times are not outliers.

This is entirely counter intuitive. What I would have guessed to be the fastest (from the n-print set) is significantly slower than my first version ('map') despite that one creating a [None,...,None] list with ten million entries. Does python optimise that list construction out? List comprehension is in between and not hugely surprising, but the most astonishing is that in 'cat' creating a huge str and calling print on that is by far the fastest. I would have assumed that the memory allocations on that one str would have destroyed the runtime entirely, even if having only one io operation looks better on paper (note that flush=False is the default).

Main question: Why is the 'map' variant so much faster than other variants with n print calls?
And why is 'cat' so much faster yet?


Example map.py:

#!/usr/bin/python3

import sys

if __name__ == '__main__':
  ints = list(map(int,sys.stdin.readlines()))
  list(map(print,ints))

Run as

/usr/bin/time -f '%e' -o map.run-time -a ./map.py < input.log | wc -l > /dev/null

Solution

  • list(map(print,ints)) does take extra time to build a large list, but all the hard work is done in C.

    [print(n) for n in ints] does the same but in Python, which is slower than C.

    for n in ints: print(n) doesn't build a list, but it's still interpreting Python, and evidently that's more detrimental than building the list. Also, n is a global variable. That's probably why it's slower than the list comprehension. You should see some speedup if you did it inside a function so it'd be a faster local variable.

    for i in range(0,len(ints)): print(ints[i]) is just silly, that's trying to be slow with all the index objects and list subscriptions.

    deque(map(print,ints),0) uses fastest possible consumption of the map iterator and doesn't build a large list, so it's fastest. I'm actually a bit disappointed its advantage is so small.


    You can see more details by looking at the bytecode. For example import dis; dis.dis('list(map(print,ints))') shows that it just loads and calls the relevant pieces:

      1           LOAD_NAME                0 (list)
                  PUSH_NULL
                  LOAD_NAME                1 (map)
                  PUSH_NULL
                  LOAD_NAME                2 (print)
                  LOAD_NAME                3 (ints)
                  CALL                     2
                  CALL                     1
    

    Whereas the "loop" code has a Python-level loop that requires lots of interpreter steps:

      1           LOAD_NAME                0 (ints)
                  GET_ITER
          L1:     FOR_ITER                11 (to L2)
                  STORE_NAME               1 (n)
                  LOAD_NAME                2 (print)
                  PUSH_NULL
                  LOAD_NAME                1 (n)
                  CALL                     1
                  POP_TOP
                  JUMP_BACKWARD           13 (to L1)
          L2:     END_FOR
                  POP_TOP