stringperformanceperl

How to prevent automatic handling of newlines killing performance in Perl text output?


TL;DR:

Automatic line end handling in perl5 seemingly degrades performance. Is there a possibility to mitigate this without sacrificing convenience by having to add newlines manually instead of using say or -l?


Long version:

I repeatedly wondered why my own (subjective) experience of string processing speed in comparison of perl5 (feels fast) vs python so grossly disagrees with the presumably objective performance comparisons on the benchmark game web page (e.g. in the fasta example perl is slow). What I found striking is: All world seems to use "say" for ordinary output, while the strange guy here sticks with "print" and manually adding "\n".

So I set up a minimal example:

  1. Count from 1 to an arbitrary number in the language itself.
  2. For every step output the number in a new line.
  3. Repeat the whole process three times
  4. Take the minimum time of these three runs while stdout is redirected to /dev/null

The whole thing was done for three different approaches:

  1. Use perl's ordinary "print" statement and append newline explicitly every single time
  2. alternatively call perl with "-l", which enables automatic newline handling (again with print)
  3. 3rd alternative: use "say" feature, which also handles this automatically

For every method I contrasted output buffering states (enabled vs disabled).

I was stunned, when I saw, that disabling the output buffer is far less detrimental for performance than enabling automatic line end handling -- the latter more than doubled the run time.

Records/sec  n_total Time Command
 7120052.46 10000000 1.40 perl -e '$|=0; for (my $i=1; $i<=10000000;$i++) {print $i . "\n";};'          
 4961279.63 10000000 2.02 perl -e '$|=1; for (my $i=1; $i<=10000000;$i++) {print $i . "\n";};'          
 2821234.16 10000000 3.54 perl -l -e '$|=0; for (my $i=1; $i<=10000000;$i++) {print $i;};'                
 2462226.44 10000000 4.06 perl -l -e '$|=1; for (my $i=1; $i<=10000000;$i++) {print $i;};' 
 2898328.96 10000000 3.45 perl -e 'use feature "say"; $|=0; for (my $i=1; $i<=10000000;$i++) {say $i;};'  
 2452486.23 10000000 4.08 perl -e 'use feature "say"; $|=1; for (my $i=1; $i<=10000000;$i++) {say $i;};'  

Question:

While for the time being it seems worthwile to add the line ends manually, this is still somewhat cumbersome: Is there any switch that I have missed here and that could prevent this degradation in output performance when using either "say" or the "-l" switch?


FWIW:

Couldn't help myself when I was already going down the rabbit hole. So I went on...

To get a more comprehensive picture I finally continued and compared the above to other tools I either use daily (mainly gawk, since recently also a bit of jq) - or did so in the past (python and even earlier tcl) or just dabbled around with at some time (lisp flavours). All on a windows 10 machine running cygwin.

Main criteria now was to choose the number so that it runs within 10 seconds on my mychine and still leaves the run numbers somewhat comparable (only powers of 10 considered). Also contrasted using line buffer vs flushing output after every single line for some of the other.

The complete story is: disabling the output buffer in perl is less painful than in other languages (in particular python). For some I wouldn't even think trying it (see below).

 Records/sec   n_total Time    Command
 46772945.30 100000000 2.14 -- seq 100000000
  7102358.15  10000000 1.41 -- perl -e '$|=0; for (my $i=1; $i<=10000000;$i++) {print $i . "\x0A";};'
  4851147.42  10000000 2.06 -- php -r 'for ($i=1; $i<=10000000;$i++) {echo $i . PHP_EOL;};'
  4524956.61  10000000 2.21 -- perl -e '$|=1; for (my $i=1; $i<=10000000;$i++) {print $i . "\x0A";};'
  2643538.99  10000000 3.78 -- perl -l -e '$|=0; for (my $i=1; $i<=10000000;$i++) {print $i;};'
  2542994.96  10000000 3.93 -- awk -e "BEGIN{for (i=1;i<=10000000;i++) print i}"
  2537951.85  10000000 3.94 -- perl -e '$|=0; use feature "say"; for (my $i=1; $i<=10000000;$i++) {say $i;};'
  2407226.07  10000000 4.15 -- luajit -e 'for i = 1,10000000 do print(i) end'
  2347240.89  10000000 4.26 -- perl -e '$|=1; use feature "say"; for (my $i=1; $i<=10000000;$i++) {say $i;};'
  2342141.92  10000000 4.27 -- perl -l -e '$|=1; for (my $i=1; $i<=10000000;$i++) {print $i;};'
  1869504.49  10000000 5.35 -- python -c 'for i in range(1,round(10000000)+1): print(i)'
  1641012.48  10000000 6.09 -- awk -e "BEGIN{for (i=1;i<=10000000;i++) {print i; fflush()}}"
  1591311.72  10000000 6.28 -- jq -n 'range(1;10000000+1)'
  1395488.79  10000000 7.17 -- lua -e 'for i = 1,10000000 do print(i) end'
   905889.95   1000000 1.10 -- echo 'for {set i 1} {$i <= 1000000} {incr i} {puts $i}' | tclsh
   516841.54   1000000 1.93 -- python -uc 'for i in range(1,round(1000000)+1): print(i)'
   263998.59   1000000 3.79 -- clisp -x '(loop for x from 1 to 1000000 do (print x))'
   139282.21   1000000 7.18 -- echo '(loop for x from 1 to 1000000 do (print x))' | sbcl

used versions: Perl 5.36.3; Python 3.9.16; PHP 7.3.7; GAWK 5.3.0; Lua 5.3.6; Luajit 2.0.4; jq 1.7.1; clisp 2.49.92; sbcl 2.4.9; tclsh 8.6.12; seq from GNU core utils 9.0


Solution

  • TL;DR:

    Simply quoting the numeric variable seems to resolve the issue with say and -l!


    Long version:

    After some comments above I carried out further tests. TLP and jhnc seem to be on to something:

    When using , instead of . to combine the variable value and the newline for the print statement, that slows down the print call like using "say" did in the original post.

    However, when the variable was quoted there was no slowdown. Likewise say and -l could be sped up again be quoting the variable with say being the fastest now. See tests below.

     Records/sec   n_total Time    Command
      6532779.83  10000000 1.53 -- perl -e '$|=0; for (my $i=1; $i<=10000000;$i++) {print $i . "\n";};'
      4812752.85  10000000 2.08 -- perl -e '$|=1; for (my $i=1; $i<=10000000;$i++) {print $i . "\n";};'
      2851817.48  10000000 3.51 -- perl -e '$|=0; for (my $i=1; $i<=10000000;$i++) {print $i, "\n";};'
      2292372.38  10000000 4.36 -- perl -e '$|=1; for (my $i=1; $i<=10000000;$i++) {print $i, "\n";};'
      7081385.63  10000000 1.41 -- perl -e '$|=0; for (my $i=1; $i<=10000000;$i++) {print "$i", "\n";};'
      5347508.00  10000000 1.87 -- perl -e '$|=1; for (my $i=1; $i<=10000000;$i++) {print "$i", "\n";};'
      2957947.16  10000000 3.38 -- perl -e '$|=0; use feature "say"; for (my $i=1; $i<=10000000;$i++) {say $i;};'
      2611186.85  10000000 3.83 -- perl -e '$|=1; use feature "say"; for (my $i=1; $i<=10000000;$i++) {say $i;};'
      7811974.89  10000000 1.28 -- perl -e '$|=0; use feature "say"; for (my $i=1; $i<=10000000;$i++) {say "$i";};'
      5600668.93  10000000 1.79 -- perl -e '$|=1; use feature "say"; for (my $i=1; $i<=10000000;$i++) {say "$i";};'
      2861258.14  10000000 3.49 -- perl -l -e '$|=0; for (my $i=1; $i<=10000000;$i++) {print $i;};'
      2341549.60  10000000 4.27 -- perl -l -e '$|=1; for (my $i=1; $i<=10000000;$i++) {print $i;};'
      7045289.66  10000000 1.42 -- perl -l -e '$|=0; for (my $i=1; $i<=10000000;$i++) {print "$i";};'
      4676765.18  10000000 2.14 -- perl -l -e '$|=1; for (my $i=1; $i<=10000000;$i++) {print "$i";};'