python numpy indexing numpy-ndarray numpy-random

Incoherence in complementary indices extracted from a np.array


The problem is simple: I have a vector of indices from which I want to extract a randomly chosen subset and its complement. I wrote the following code:

import numpy as np    
vec = np.arange(0,25000)
idx = np.random.choice(vec,5000)
idx_r = np.delete(vec,idx)

However, when I print the lengths of vec, idx, and idx_r, they do not add up: len(idx) + len(idx_r) is larger than len(vec). For example, the following code:

print(len(idx))
print(len(idx_r))
print(len(idx_r)+len(idx))
print(len(vec))

returns:

5000
20462
25462
25000

Python version is 3.8.1 and GCC is 9.2.0.


Solution

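  • np.random.choice has a keyword argument replace whose default value is True, so idx is sampled with replacement and can contain duplicate indices. np.delete removes each distinct index only once, so fewer than 5000 elements are deleted and idx_r ends up longer than expected. Setting replace=False should give the desired result.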

    import numpy as np
    
    vec = np.arange(0, 25000)
    
    idx = np.random.choice(vec, 5000, replace=False)
    
    idx_r = np.delete(vec, idx)
    
    print([len(item) for item in (vec, idx, idx_r)])
    

    Out:

    [25000, 5000, 20000]
    

    However, numpy.random.choice with replace=False is very inefficient: for backward-compatibility reasons, the legacy implementation generates a permutation of the whole input just to take a small sample. Use the newer Generator API instead, which does not have this issue:

    rng = np.random.default_rng()
    
    idx = rng.choice(vec, 5000, replace=False)
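
To tie the pieces together, here is a minimal end-to-end sketch. It first reproduces the mismatch from the question (duplicates in the sample), then applies the Generator-based fix, and finally shows one possible alternative that splits a single permutation. The seed value and the names idx_dup, idx_r_dup, idx_alt, and idx_r_alt are illustrative assumptions, not part of the original code. Note also that np.delete(vec, idx) only yields the complement here because vec is np.arange starting at 0, so each value equals its own position.

```python
import numpy as np

# Seed chosen arbitrarily so the sketch is reproducible (an assumption).
rng = np.random.default_rng(0)
vec = np.arange(25000)

# Sampling WITH replacement (the default) can pick the same index twice.
idx_dup = rng.choice(vec, 5000)
n_unique = np.unique(idx_dup).size
idx_r_dup = np.delete(vec, idx_dup)
# np.delete drops each distinct position once, so only n_unique
# elements disappear, which explains the length mismatch.
print(len(idx_r_dup) == len(vec) - n_unique)

# Sampling WITHOUT replacement gives 5000 distinct indices.
idx = rng.choice(vec, 5000, replace=False)
idx_r = np.delete(vec, idx)
print(len(idx) + len(idx_r) == len(vec))

# Alternative sketch: shuffle all positions once and split the result.
perm = rng.permutation(len(vec))
idx_alt, idx_r_alt = perm[:5000], perm[5000:]
```

The permutation split does not rely on vec holding consecutive integers: index any array with idx_alt and idx_r_alt to get a random subset and its complement.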