How to extract overlapping sub-arrays with a given window size and flatten them

I am trying to get better at using NumPy functions and methods so that my Python programs run faster.

I want to do the following:

I create an array 'a' as:

import numpy as np
a = np.random.randint(-10, 11, 10000).reshape(-1, 10)

a.shape: (1000,10)

I create another array which takes only the first two columns of array 'a':

b = a[:, 0:2]

b.shape: (1000,2)

Now I want to create an array 'c' which has 990 rows, each containing a flattened slice of 10 consecutive rows of array 'b'. So the first row of array 'c' will have 20 columns, which is the flattened slice b[0:10] (rows 0 to 9 of array 'b'). The next row of array 'c' will have the 20 values of the flattened slice b[1:11] (rows 1 to 10 of array 'b'), and so on.

I can do this with a for loop, but I want to know if there is a much faster way to do this using NumPy functions and methods, such as strides or something else.

Thanks for your time and your help.

Solution

This loops over shifts rather than rows (loop of size 10):

N = 10
c = np.hstack([b[i:i-N] for i in range(N)])

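Explanation: b[i:i-N] selects b's rows from i up to m-(N-i) (excluding row m-(N-i) itself), where m is the number of rows in b. The stop index i-N is negative, so it counts from the end of the array. np.hstack then stacks the selected sub-arrays horizontally (b[0:m-10], b[1:m-9], b[2:m-8], ..., b[9:m-1]), so row k of c is rows k to k+9 of b, flattened, as the question describes.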

c.shape: (990, 20)
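
To make the slicing easier to follow, here is a small sketch on a toy array. The 5x2 array b_small and the window size N = 3 are made up for illustration and are not from the question:

```python
import numpy as np

# Toy stand-in for b: 5 rows, 2 columns (illustrative values only)
b_small = np.arange(10).reshape(5, 2)
N = 3  # window size for this toy example

# Slice i is b_small shifted down by i rows, with N - i rows trimmed from the end
pieces = [b_small[i:i - N] for i in range(N)]
c_small = np.hstack(pieces)

print(c_small)
# [[0 1 2 3 4 5]
#  [2 3 4 5 6 7]]
```

Note that the window starting at row 2 (values 4 to 9) is not produced, which is why the result has m - N rows rather than m - N + 1.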

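Also, I think you may be looking for a shape of (991, 20) if you want to include all windows: 1000 rows with a window of 10 give 1000 - 10 + 1 = 991 windows, and the code above leaves out the last one (rows 990 to 999).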
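
If all windows are wanted, one way to adapt the same idea is to slice with an explicit stop index instead of a negative one. This is a sketch under the question's setup; the names n_windows, c_all and c_loop are mine, and the row loop is only there as a cross-check:

```python
import numpy as np

a = np.random.randint(-10, 11, 10000).reshape(-1, 10)
b = a[:, 0:2]

N = 10
m = len(b)               # 1000 rows
n_windows = m - N + 1    # 991 windows

# Slice i contributes row k + i of b to window k, for k = 0 .. n_windows - 1
c_all = np.hstack([b[i:i + n_windows] for i in range(N)])

# Cross-check against the straightforward row loop
c_loop = np.array([b[k:k + N].ravel() for k in range(n_windows)])
assert c_all.shape == (991, 20)
assert np.array_equal(c_all, c_loop)
```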

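You can also use strides, but if you want to do operations on the result, I would advise against that: a strided view shares memory with b, so overlapping windows refer to the same elements and the memory behaviour is tricky. Here is a strides solution (using scikit-image) if you insist: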

from skimage.util.shape import view_as_windows
c = view_as_windows(b, (10,2)).reshape(-1, 20)

c.shape: (991, 20)

If you don't want the last row, simply remove it by slicing: c[:-1].
A similar solution applies with numpy's np.lib.stride_tricks.as_strided function (the two work in basically the same way, though I am not sure of their internals).
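
For a NumPy-only route, here is a minimal sketch using sliding_window_view from numpy.lib.stride_tricks, which builds on the same striding idea. It assumes NumPy 1.20 or newer, where that function was added:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.random.randint(-10, 11, 10000).reshape(-1, 10)
b = a[:, 0:2]

# Windows of 10 rows x 2 columns; the result is a read-only view of shape (991, 1, 10, 2)
windows = sliding_window_view(b, (10, 2))

# reshape has to copy here, because the overlapping windows are not contiguous in memory
c = windows.reshape(-1, 20)
print(c.shape)  # (991, 20)
```

Because the reshape produces a copy, c no longer shares memory with b, at the cost of storing every window explicitly.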

UPDATE: If you want to find the unique values and their frequencies in each row of c, you can do:

unique_values = []  # one array of unique values per row of c
unique_counts = []  # matching array of counts per row of c
for row in c:
    unique, unique_c = np.unique(row, return_counts=True)
    unique_values.append(unique)
    unique_counts.append(unique_c)

Note that numpy arrays have to be rectangular, meaning every row must have the same number of elements. Since different rows of c can have different numbers of unique values, you cannot store the unique values of all rows in one regular numpy array (short of using an object-dtype array). Therefore, a solution is to make a list of arrays, one per row of c. unique_values is a list of arrays holding the unique values of each row, and unique_counts holds their frequencies in the same order.
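
If a rectangular result is preferred, one possible workaround is to count every value that could occur, not just those present in each row. The sketch below assumes the values are integers in the range -10 to 10, as produced by randint(-10, 11) in the question; the names values and counts are mine:

```python
import numpy as np

a = np.random.randint(-10, 11, 10000).reshape(-1, 10)
b = a[:, 0:2]
c = np.hstack([b[i:i - 10] for i in range(10)])  # shape (990, 20)

# All values that can occur, given randint(-10, 11) in the question
values = np.arange(-10, 11)                       # 21 possible values

# counts[r, j] = how many times values[j] occurs in row r of c
counts = (c[:, :, None] == values).sum(axis=1)    # shape (990, 21)

print(counts.shape)  # (990, 21)
```

A zero in counts means that value does not appear in the row. The comparison builds a temporary boolean array of shape (990, 20, 21), so this trades memory for avoiding the Python loop.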