I have a numpy array such as this one:
arr = np.array([
[1, 2, 3],
[4, -5, 6],
[-1, -1, -1]
])
And I would like to argsort it, but with a arr <= 0
mask. The output should be:
array([[0, 1, 2],
[0, 2], # (Note that the indices are still relative to original un-masked array)
[]])
However, the output I get using np.ma.argsort()
is:
array([[0, 1, 2],
[0, 2, 1],
[0, 1, 2]])
The approach needs to be very efficient because the real array has millions of columns. I am thinking this needs to be a synthesis of a few operations, but I don't know which ones.
The np.where
approach:
Input array
arr = np.array([
[1, 2, 3],
[4, -5, 6],
[-1, -1, -1]
])
Mask of valid elements
mask = arr > 0
Preallocate result as an object array to hold variable-length indices
result = np.empty(arr.shape[0], dtype=object)
Efficient masked argsort
for each row
for i in range(arr.shape[0]):
valid_indices = np.where(mask[i])[0] # Get indices of valid (masked) elements
result[i] = valid_indices[np.argsort(arr[i, valid_indices])] # Sort valid indices by their values
Output:
[array([0, 1, 2]) array([0, 2]) array([], dtype=int64)]
The np.flatnonzero
approach:
A more optimised approach using vectorised operations:
def optimized_masked_argsort(arr, mask):
result = np.empty(arr.shape[0], dtype=object)
for i in range(arr.shape[0]):
row = arr[i]
valid_indices = np.flatnonzero(mask[i]) # Faster than np.where(mask[i])[0]
valid_values = row[valid_indices]
sorted_order = np.argsort(valid_values)
result[i] = valid_indices[sorted_order]
return result
Comparison:
Timings for given example:
np.where Time: 0.000034 seconds
np.flatnonzero Time: 0.000017 seconds
Timings for larger array (1000 rows):
np.where Time: 0.001856 seconds
np.flatnonzero Time: 0.001754 seconds
I tried a few other methods but they fell short in efficiency.