python, numpy

What is the fastest way to stack NumPy arrays in a loop?


I have code that generates two NumPy arrays (data_transform) in a for loop. In the first iteration it produces an array of shape (40, 2), and in the second one of shape (175, 2). I want to concatenate these two arrays into a single array of shape (215, 2). I tried np.concatenate() and np.append(), but I get an error saying the arrays must be the same size. Here is an example of how my code looks:

result_arr = np.array([])

for label in labels_set:
    data = [index for index, value in enumerate(labels_list) if value == label]

    sub_corpus = []
    for i in data:
        sub_corpus.append(corpus[i])

    data_sub_tfidf = vec.fit_transform(sub_corpus)
    data_transform = pca.fit_transform(data_sub_tfidf)

    # Append data_transform to result_arr (this is the step that fails)

I have also tried np.row_stack(), but it only gives me an array of shape (175, 2), i.e. the second of the two arrays I want to concatenate.
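
To make the failure easier to reproduce, here is a minimal sketch with random data standing in for data_transform (I suspect the empty 1-D result_arr is what triggers the error):

import numpy as np

a = np.array([])            # shape (0,), 1-D  -- like result_arr above
b = np.random.rand(40, 2)   # shape (40, 2), 2-D -- like the first data_transform

# np.concatenate((a, b)) raises a ValueError, because all inputs must
# have the same number of dimensions (here 1-D vs. 2-D).
# np.append(a, b) does not raise, but it flattens everything to 1-D:
np.append(a, b).shape       # (80,)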


Solution

  • What @hpaulj was trying to say with

    Stick with list append when doing loops.

    is

    # use a normal Python list
    result_arr = []
    
    for label in labels_set:
    
        data_transform = pca.fit_transform(data_sub_tfidf) 
    
        # append the data_transform array to that list
        # (this is the list method append(), not np.append(), which is slow here)
        result_arr.append(data_transform)
    
    # and concatenate it after the loop.
    # This avoids repeatedly reallocating and copying memory inside
    # the loop: only one large chunk of memory is allocated, since
    # the final size of the concatenated array is known.
    
    result_arr = np.concatenate(result_arr)
    
    # or, equivalently for these 2-D arrays
    result_arr = np.vstack(result_arr)
    
    # Note: np.stack(result_arr, axis=0) would NOT work here, because
    # it requires all input arrays to have exactly the same shape,
    # and (40, 2) differs from (175, 2).
    

    Your arrays don't really have incompatible shapes: they differ in only one dimension, and the other one is identical. In that case you can always concatenate along the "different" dimension.
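
    As a minimal sketch, using random arrays with the shapes from the question:

    import numpy as np

    a = np.random.rand(40, 2)
    b = np.random.rand(175, 2)

    # The shapes differ only along axis 0, so concatenate along axis 0:
    combined = np.concatenate([a, b], axis=0)
    combined.shape              # (215, 2)

    # np.vstack([a, b]) gives the same result for these 2-D arrays.
    # Concatenating along axis 1 would fail, because the sizes along
    # axis 0 (40 vs. 175) do not match.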