Tags: bash, random, pipe, brace-expansion

Why is generating a larger amount of random data so much slower?


I want to generate a large amount of random numbers. I wrote the following bash command (note that I am using cat here for demonstration purposes; in my real use case, I am piping the numbers into another process):

for i in {1..99999999}; do echo -e "$(cat /dev/urandom | tr -dc '0-9' | fold -w 5 | head -n 1)"; done | cat

The numbers are printed at a very low rate. However, if I generate a smaller amount, it is much faster:

for i in {1..9999}; do echo -e "$(cat /dev/urandom | tr -dc '0-9' | fold -w 5 | head -n 1)"; done | cat

Note that the only difference is 9999 instead of 99999999.

Why is this? Is the data buffered somewhere? Is there a way to optimize this, so that the random numbers are piped/streamed into cat immediately?


Solution

  • Why is this?

    Expanding {1..99999999} into 100000000 words and then iterating over them requires a lot of memory allocation from bash. This significantly stalls the whole system.

    Additionally, large chunks of data are read from /dev/urandom, and about 96% of that data is filtered out by tr -dc '0-9' (only 10 of the 256 possible byte values are digits). This wastes random data and CPU time and additionally stalls the whole system.
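    The filtering ratio is easy to check empirically. A quick sketch, assuming GNU coreutils; the exact count varies per run, but it hovers around 10/256 ≈ 3.9% of the input:

```shell
# count how many of 100000 random bytes survive tr -dc '0-9';
# only 10 of 256 byte values are digits, so roughly 3900 remain
kept=$(head -c 100000 /dev/urandom | tr -dc '0-9' | wc -c)
echo "$kept of 100000 bytes kept"
```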

  • Is the data buffered somewhere?

    Each process in the pipeline has its own buffer:

      • cat buffers its output,
      • tr buffers its output,
      • fold buffers its output,
      • head buffers its output,
      • head buffers its input,
      • the cat on the right side of the pipeline buffers its input.

    That's 6 buffering places. Even ignoring the input buffering from head -n1 and from the right side of the pipeline | cat, that's 4 output buffers.
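    If latency from stdio buffering is the concern, GNU coreutils provides stdbuf to turn it off per process. A sketch; this reduces the delay before output appears, but does not fix the per-iteration cost of the loop:

```shell
# force each stage to flush output immediately (GNU stdbuf);
# produces one 5-digit line without waiting for full buffers
stdbuf -o0 tr -dc '0-9' </dev/urandom | stdbuf -o0 fold -w 5 | head -n 1
```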

    Also, save the animals and stop cat abuse: use tr </dev/urandom instead of cat /dev/urandom | tr. Fun fact: tr can't take a filename as an argument.

  • Is there a way to optimize this, so that the random numbers are piped/streamed into cat immediately?

    Remove the whole code.

    Take only as few bytes from the random source as you need. To generate a 32-bit number you only need 32 bits, no more. To generate a 5-digit number, you only need 17 bits (2^17 = 131072 > 99999); rounded up to 8-bit bytes, that's only 3 bytes. The tr -dc '0-9' is a cool trick, but it definitely shouldn't be used in any real code.
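    As a sketch of that idea, assuming GNU od is available: read a few raw bytes, interpret them as one unsigned integer, and reduce to the range you need. Reading 4 bytes instead of 3 keeps the od invocation simple; note that the modulo reduction introduces a tiny bias, because 2^32 is not a multiple of 100000:

```shell
# read 4 random bytes as one unsigned 32-bit integer (GNU od),
# then reduce it to a 5-digit number; % introduces a slight bias
r=$(od -An -N4 -tu4 /dev/urandom | tr -d ' ')
printf '%05d\n' $(( r % 100000 ))
```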

    Coincidentally, I recently answered a similar question. Copying the code from there, you could:

    for ((i=0;i<100000000;++i)); do echo "$((0x$(dd if=/dev/urandom of=/dev/stdout bs=4 count=1 status=none | xxd -p)))"; done | cut -c-5
    # cut -c-5 keeps only the first 5 characters of each line
    

    But that would still be unacceptably slow, as it runs 2 processes for each random number (and I think just taking the first 5 digits will have a bad distribution).

    I suggest using $RANDOM, available in bash. Failing that, use $SRANDOM (bash 5.1+) if you really want kernel-backed randomness (and really know why you want it). Otherwise, I suggest writing the random number generation from /dev/urandom in a real programming language, like C, C++, Python, Perl, or Ruby. I believe one could even write it in awk.
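    For example, since $RANDOM only yields 15 bits (0..32767), two values can be combined before reducing to 5 digits. A sketch; as before, the modulo reduction has a small bias that is usually acceptable for casual use:

```shell
# combine two 15-bit $RANDOM values into one 30-bit value,
# then reduce mod 100000 to get a 5-digit number (slight bias)
n=$(( (RANDOM << 15 | RANDOM) % 100000 ))
printf '%05d\n' "$n"
```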

    The following looks nicer, but converting binary data to hex just to convert it to decimal later is still a workaround for the fact that the shell simply can't work with binary data:

    count=10;
    # take count*4 bytes from input
    dd if=/dev/urandom of=/dev/stdout bs=4 count=$count status=none |
    # Convert bytes to hex 4 bytes at a time
    xxd -p -c 4 |
    # Convert hex to decimal using GNU awk
    awk --non-decimal-data '{printf "%d\n", "0x"$0}'
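    If the goal is to stream many 5-digit numbers into another process, the whole job can be done in one pipeline with no per-number process spawning. A sketch assuming GNU dd, od, and awk; the modulo reduction again carries a slight bias:

```shell
# stream 100 five-digit numbers from a single pipeline:
# dd reads 100 4-byte chunks, od prints them as unsigned 32-bit
# integers, awk reduces each one mod 100000 and zero-pads it
dd if=/dev/urandom bs=4 count=100 status=none |
od -An -tu4 -v |
awk '{ for (i = 1; i <= NF; i++) printf "%05d\n", $i % 100000 }'
```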