
Why is np.random.default_rng().permutation(n) preferred over the original np.random.permutation(n)?


The NumPy documentation for np.random.permutation suggests that all new code use np.random.default_rng() from the Random Generator package instead. I see in the documentation that the Random Generator package has standardized the generation of a wide variety of random distributions around a BitGenerator, rather than using the Mersenne Twister directly, which I'm only vaguely familiar with.

I see one downside: what used to be a single line of code to do simple permutations:

np.random.permutation(10)

turns into two lines of code now, which feels a little awkward for such a simple task:

rng = np.random.default_rng()
rng.permutation(10)

Solution

  • Some context:

    To your questions, in a logical order:

    And why wouldn't existing methods like np.random.permutation just wrap this new preferred method?

    Probably because of backwards-compatibility concerns. Even if the "top-level" API did not change, its internals would change significantly enough to be deemed a break in compatibility.
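To illustrate the compatibility concern, here is a minimal sketch. It assumes the worry is mainly about seeded streams: code that seeds the legacy API expects the same output every time, and rewiring np.random.permutation onto a different bit generator would change that output. The seed value 0 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

# Legacy path: the module-level functions share one global RandomState.
np.random.seed(0)
legacy_global = np.random.permutation(10)

# A separately constructed RandomState with the same seed gives the same stream.
legacy_instance = np.random.RandomState(0).permutation(10)

# New path: a Generator seeded with the same integer is a different algorithm,
# so there is no reason to expect it to reproduce the legacy ordering.
new_style = np.random.default_rng(0).permutation(10)

same_legacy = np.array_equal(legacy_global, legacy_instance)
print(same_legacy)
print(legacy_global, new_style)
```

Existing code that relies on the first two results matching across runs would presumably be affected if the legacy function silently wrapped the new generator.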

    Why is this new approach an improvement over the previous approach?

    "By default, Generator uses bits provided by PCG64 which has better statistical properties than the legacy MT19937 used in RandomState." (source). The PCG64 docstring provides more technical detail.
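As a quick check of which bit generator sits behind each API, something like the following should work on a NumPy version that ships default_rng(). This is only an inspection sketch; the variable names are my own.

```python
import numpy as np

# New API: the Generator exposes its underlying bit generator.
rng = np.random.default_rng()
new_bits = type(rng.bit_generator).__name__

# Legacy API: the first element of the state tuple names the algorithm.
legacy = np.random.RandomState()
legacy_bits = legacy.get_state()[0]

print(new_bits)     # PCG64
print(legacy_bits)  # MT19937
```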

    Is there a good reason not to use this new method as a one-liner np.random.default_rng().permutation(10), assuming it's not being called at high volumes?

    I very much agree that it's a slightly awkward added line of code if it's done at the top of a module. I would only point out that the NumPy docs themselves use the one-liner form in docstring examples, such as:

    n = np.random.default_rng().standard_exponential((3, 8000))
    

    The slight difference is that the two-line form instantiates the class once, typically at module load/import time, whereas the one-liner instantiates it later, at each call site. But that should be a minuscule difference (again, assuming it's only used once or a handful of times). If you look at the default_rng(seed) source, when called with None, it just returns Generator(PCG64(seed)) after a few quick checks on seed.
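The two forms, and the equivalence with an explicitly constructed Generator, can be sketched as follows. The seed 12345 is an arbitrary value used only so the comparison is reproducible.

```python
import numpy as np

# One-liner: builds a fresh, OS-seeded Generator just for this call.
one_liner = np.random.default_rng().permutation(10)

# Two-line form: build the Generator once and reuse it for later calls.
rng = np.random.default_rng(12345)
a = rng.permutation(10)

# Spelling out what default_rng does for an integer (or None) seed.
explicit = np.random.Generator(np.random.PCG64(12345))
b = explicit.permutation(10)

same_stream = np.array_equal(a, b)
print(same_stream)
```

If the generator is needed in many places, keeping one rng object around also avoids reseeding from the operating system on every call.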

    Is there an argument for switching existing code to this method?

    I'm going to pass on this one, since I don't have anywhere near the depth of technical knowledge to give a good comparison of the algorithms. It also depends on other variables, such as whether you need to keep your downstream code compatible with older versions of NumPy, where default_rng() simply doesn't exist.
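If compatibility with older NumPy versions matters, one possible approach is to fall back to the legacy class when default_rng() is missing. This is only a sketch: the helper name permutation_compat is hypothetical, and the two branches produce different streams for the same seed.

```python
import numpy as np

def permutation_compat(n, seed=None):
    """Return a random permutation of range(n) on old and new NumPy alike.

    Hypothetical helper: uses the Generator API when available and
    the legacy RandomState otherwise.
    """
    if hasattr(np.random, "default_rng"):
        return np.random.default_rng(seed).permutation(n)
    # Older NumPy without default_rng(): use the legacy class instead.
    return np.random.RandomState(seed).permutation(n)

result = permutation_compat(10, seed=1)
print(result)
```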