pythonarrayspandasnumpyperformance

Efficient way to insert similar rows (with only one column changing) right after each row in Numpy or Pandas


Say I have a Pandas dataframe with 4 rows and 5 columns. For simplicity I will convert it into a Numpy array, which looks like this:

import numpy as np

A = np.array([[23, 43, 23, 110, 5],
              [83, 32, 12, 123, 4],
              [58, 41, 59, 189, 1],
              [93, 77, 22, 170, 3]])

For each row, I want to insert some similar rows right after the row, with only the column 4 decreasing by 1 each time until 0. The expected output should look like this:

np.array([[23, 43, 23, 110, 5],
          [23, 43, 23, 110, 4],
          [23, 43, 23, 110, 3],
          [23, 43, 23, 110, 2],
          [23, 43, 23, 110, 1],
          [23, 43, 23, 110, 0],

          [83, 32, 12, 123, 4],
          [83, 32, 12, 123, 3],
          [83, 32, 12, 123, 2],
          [83, 32, 12, 123, 1],
          [83, 32, 12, 123, 0],

          [58, 41, 59, 189, 1],
          [58, 41, 59, 189, 0],

          [93, 77, 22, 170, 3],
          [93, 77, 22, 170, 2],
          [93, 77, 22, 170, 1],
          [93, 77, 22, 170, 0]])

Below is the code that I have come up with:

new_rows = []
for i, row in enumerate(A):
    new = A[i, 4] - 1
    while new >= 0:
        new_row = row.copy()
        new_row[4] = new
        new_rows.append(new_row)
        new -= 1
new_A = np.vstack([A, np.array(new_rows)])
print(new_A)

Output

[[ 23  43  23 110   5]
 [ 83  32  12 123   4]
 [ 58  41  59 189   1]
 [ 93  77  22 170   3]
 [ 23  43  23 110   4]
 [ 23  43  23 110   3]
 [ 23  43  23 110   2]
 [ 23  43  23 110   1]
 [ 23  43  23 110   0]
 [ 83  32  12 123   3]
 [ 83  32  12 123   2]
 [ 83  32  12 123   1]
 [ 83  32  12 123   0]
 [ 58  41  59 189   0]
 [ 93  77  22 170   2]
 [ 93  77  22 170   1]
 [ 93  77  22 170   0]]

Obviously, the code is not efficient since it doesn't use any Numpy vectorization. In reality, I have more than 4,000 original rows so a speed-up is definitely needed. Moreover, I cannot insert new rows right after each row. Is there any efficient way to do this in Numpy or Pandas?


Solution

  • arr = A[:, -1] + 1
    
    temp = np.repeat(A, arr, axis = 0)
    
    # depending on your array size
    # you can build the range here with a much faster implementation
    from this link : # https://stackoverflow.com/a/47126435/7175713
    arr = np.concatenate([np.arange(ent) for ent in arr])
    
    temp[:, -1] = temp[:, -1] - arr
    
    temp
    
    array([[ 23,  43,  23, 110,   5],
           [ 23,  43,  23, 110,   4],
           [ 23,  43,  23, 110,   3],
           [ 23,  43,  23, 110,   2],
           [ 23,  43,  23, 110,   1],
           [ 23,  43,  23, 110,   0],
           [ 83,  32,  12, 123,   4],
           [ 83,  32,  12, 123,   3],
           [ 83,  32,  12, 123,   2],
           [ 83,  32,  12, 123,   1],
           [ 83,  32,  12, 123,   0],
           [ 58,  41,  59, 189,   1],
           [ 58,  41,  59, 189,   0],
           [ 93,  77,  22, 170,   3],
           [ 93,  77,  22, 170,   2],
           [ 93,  77,  22, 170,   1],
           [ 93,  77,  22, 170,   0]])