The problem is very simple: I have a vector of indices from which I want to extract one randomly chosen subset and its complement. So I wrote the following code:
import numpy as np
vec = np.arange(0,25000)
idx = np.random.choice(vec,5000)
idx_r = np.delete(vec,idx)
However, when I print the lengths of vec, idx, and idx_r, they do not match: the sum of len(idx) and len(idx_r) is higher than len(vec). For example, the following code:
print(len(idx))
print(len(idx_r))
print(len(idx_r)+len(idx))
print(len(vec))
returns:
5000
20462
25462
25000
Python version is 3.8.1 and GCC is 9.2.0.
np.random.choice has a keyword argument, replace, whose default value is True, so the same index can be drawn more than once. If you set it to False, I think you will get the desired result.
import numpy as np
vec = np.arange(0, 25000)
idx = np.random.choice(vec, 5000, replace=False)
idx_r = np.delete(vec, idx)
print([len(item) for item in (vec, idx, idx_r)])
Out:
[25000, 5000, 20000]
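To see where the extra elements in the question come from, here is a minimal sketch that counts the repeated draws. It relies on np.delete removing a repeated index only once; the name n_unique is introduced here just for illustration.

```python
import numpy as np

vec = np.arange(0, 25000)
idx = np.random.choice(vec, 5000)   # default replace=True, so repeats are possible
idx_r = np.delete(vec, idx)         # a repeated index still removes only one element

n_unique = len(np.unique(idx))      # number of distinct indices actually drawn
print(n_unique, len(idx) - n_unique)        # distinct indices, repeated draws
print(n_unique + len(idx_r) == len(vec))    # True: only distinct indices are removed
```

In the question's output, 25000 - 20462 = 4538 distinct indices were drawn, so 462 of the 5000 draws were repeats, which is exactly the excess in 25462.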
However, numpy.random.choice with replace=False is extremely inefficient due to poor implementation choices they're stuck with for backward compatibility: it generates a permutation of the whole input just to take a small sample. You should use the new Generator API instead, which doesn't have this issue:
import numpy as np
vec = np.arange(0, 25000)
rng = np.random.default_rng()
idx = rng.choice(vec, 5000, replace=False)
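One caveat: np.delete(vec, idx) treats idx as positions, which works here only because every value in vec equals its own position. For an array with arbitrary contents, a rough sketch is to sample positions and split with a boolean mask. The array data and the names pos, mask, chosen and rest are hypothetical, used only for this example.

```python
import numpy as np

rng = np.random.default_rng()
data = np.array([10.5, 3.2, 7.7, 1.1, 9.9, 4.4])   # hypothetical values, not 0..n-1

# Sample positions (not values) without replacement.
pos = rng.choice(len(data), size=2, replace=False)

# Mark the sampled positions, then split the array with the mask.
mask = np.zeros(len(data), dtype=bool)
mask[pos] = True

chosen = data[mask]    # the sampled subset, in original order
rest = data[~mask]     # its complement

print(len(chosen), len(rest))
```

Because the mask and its negation partition the positions, the two lengths always add up to len(data).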