[SOLVED] Incoherence in complementary indices extracted from a np.array


The problem is very simple: I have a vector of indices from which I want to extract a randomly chosen subset and its complement. So I wrote the following code:

import numpy as np
vec = np.arange(0,25000)
idx = np.random.choice(vec,5000)
idx_r = np.delete(vec,idx)

However, when I print the lengths of vec, idx, and idx_r, they do not match: the lengths of idx and idx_r add up to more than len(vec). For example, the following code:

print(len(idx))
print(len(idx_r))
print(len(idx_r)+len(idx))
print(len(vec))

returns:

5000
20462
25462
25000

Python version is 3.8.1 and GCC is 9.2.0.

Solution

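np.random.choice has a keyword argument replace, and its default value is True. With that default the sample is drawn with replacement, so idx can contain the same index several times. np.delete removes each distinct index only once, which is why only 25000 - 20462 = 4538 elements were removed rather than 5000. If you set replace=False, you should get the desired result: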

import numpy as np

vec = np.arange(0, 25000)

idx = np.random.choice(vec, 5000, replace=False)

idx_r = np.delete(vec, idx)

print([len(item) for item in (vec, idx, idx_r)])

Out:

[25000, 5000, 20000]
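
As a quick check that duplicates explain the mismatch in the question, the sketch below redraws the sample with the default replace=True and compares the number of distinct indices with the number of elements np.delete actually removed. The names n_unique and n_removed are introduced here for illustration only, and the exact counts vary from run to run.

```python
import numpy as np

vec = np.arange(0, 25000)

# Default replace=True: the same index can be drawn more than once.
idx = np.random.choice(vec, 5000)
idx_r = np.delete(vec, idx)

# np.delete drops each distinct position only once.
n_unique = len(np.unique(idx))
n_removed = len(vec) - len(idx_r)

print(n_unique, n_removed)              # the two counts agree
print(len(idx_r) + n_unique, len(vec))  # and now the totals match
```

With 5000 draws from 25000 values, roughly 4,530 distinct indices are expected on average, which is in line with the 4538 removed in the question.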

Note that the legacy numpy.random.choice with replace=False is inefficient: for backward compatibility it keeps an implementation that generates a permutation of the whole input just to take a small sample. The newer Generator API doesn't have this issue, so it is the better choice here:

rng = np.random.default_rng()

idx = rng.choice(vec, 5000, replace=False)
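
One more thing worth keeping in mind: np.delete(vec, idx) treats idx as positions, which only coincides with the values here because vec is np.arange(0, 25000). A minimal sketch of a split that does not rely on that coincidence is shown below; it samples positions with the Generator API and uses a boolean mask. The seed value and the names mask, sample, and rest are assumptions made for this example, not part of the original code.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed is arbitrary, only for reproducibility
vec = np.arange(0, 25000)

# Sample positions (not values) without replacement.
idx = rng.choice(len(vec), size=5000, replace=False)

# Boolean mask marking the sampled positions.
mask = np.zeros(len(vec), dtype=bool)
mask[idx] = True

sample = vec[mask]   # the chosen subset, in original order
rest = vec[~mask]    # its complement

print([len(item) for item in (vec, sample, rest)])
```

Because the mask is applied in order, sample comes out in the original order of vec; use vec[idx] instead if the random draw order matters.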