
NumPy: padding a matrix with rows of different sizes


I have a NumPy array whose rows have different lengths:

a = np.array([[1,2,3,4,5],[1,2,3],[1]], dtype=object)  # dtype=object is required for ragged input in recent NumPy

and I would like to convert it into a dense matrix (fixed n x m size, no variable-length rows). So far I have tried something like this:

size = (len(a),5)    
result = np.zeros(size)
result[[0],[len(a[0])]]=a[0]

But I receive an error telling me

shape mismatch: value array of shape (5,) could not be broadcast to indexing result of shape (1,)

I also tried padding with np.pad, but according to the numpy.pad documentation it seems I need to specify the existing size of the rows in pad_width. That size is variable here, and I got errors when trying -1, 0, and the largest row size.

I know I can do it by padding lists per row, as shown here, but I need to do that with a much bigger array of data.

If someone can help me with the answer to this question, I would be glad to know of it.


Solution

  • There's really no way to pad a jagged array so that it loses its jaggedness without iterating over its rows. You even have to iterate over the array twice: once to find the maximum length you need to pad to, and once more to actually do the padding.

    The code proposal you've linked to will get the job done, but it's not very efficient, because it appends zeroes one at a time in a Python for-loop that iterates over the elements of each row. The padding can instead be allocated up front, which pushes more of the work into C.

    The code below preallocates a zero-filled array of the minimal required dimensions and then simply writes each row of the jagged array M into place, which is far more efficient.

    import random
    import numpy as np
    M = [[random.random() for n in range(random.randint(0,m))] for m in range(10000)] # play-data
    
    def pad_to_dense(M):
        """Appends the minimal required amount of zeroes at the end of each
        array in the jagged array `M`, such that `M` loses its jaggedness."""
    
        # First pass: find the length of the longest row.
        maxlen = max(len(r) for r in M)
    
        # Second pass: write each row into the left part of a zero-filled row.
        Z = np.zeros((len(M), maxlen))
        for enu, row in enumerate(M):
            Z[enu, :len(row)] = row
        return Z
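
    As a quick check, the function can be applied to the three rows from the question. This is only a sketch: it repeats `pad_to_dense` so that it runs on its own, and it passes the rows as a plain list of lists (an assumption; any sequence of rows that supports `len` and iteration should behave the same way).

```python
import numpy as np

def pad_to_dense(M):
    """Pad each row of the jagged sequence `M` with trailing zeroes."""
    maxlen = max(len(r) for r in M)
    Z = np.zeros((len(M), maxlen))
    for enu, row in enumerate(M):
        Z[enu, :len(row)] = row
    return Z

# The jagged rows from the question, as a plain list of lists.
a = [[1, 2, 3, 4, 5], [1, 2, 3], [1]]
dense = pad_to_dense(a)
print(dense)
# [[1. 2. 3. 4. 5.]
#  [1. 2. 3. 0. 0.]
#  [1. 0. 0. 0. 0.]]
```

    Note that the result has dtype float64, because `np.zeros` defaults to floats.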
    

    To give you some idea for speed:

    from timeit import timeit
    n = [10, 100, 1000, 10000]
    setup = ('from __main__ import pad_to_dense; import numpy as np; '
             'from random import random, randint; '
             'M = [[random() for n in range(randint(0,m))] for m in range({})]')
    s = [timeit(stmt='Z = pad_to_dense(M)', setup=setup.format(ni), number=1) for ni in n]
    print('\n'.join(map(str,s)))
    # 7.838103920221329e-05
    # 0.0005027339793741703
    # 0.01208890089765191
    # 0.8269036808051169
    

    If you want to prepend zeroes to the arrays, rather than append, that's a simple enough change to the code, which I'll leave to you.
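
    For reference, a minimal sketch of the prepending variant might look like the following. The name `pad_to_dense_left` is hypothetical; the only change from `pad_to_dense` is the slice, which now targets the right-hand end of each row of `Z`. The start index is computed as `maxlen - len(row)` rather than written as `-len(row)`, since a negative-zero slice would select the whole row for an empty input row.

```python
import numpy as np

def pad_to_dense_left(M):
    """Pad each row of the jagged sequence `M` with leading zeroes.

    Hypothetical variant of pad_to_dense, shown for illustration only.
    """
    maxlen = max(len(r) for r in M)
    Z = np.zeros((len(M), maxlen))
    for enu, row in enumerate(M):
        # Start index chosen so that the row ends at the last column;
        # for an empty row this yields an empty slice, which is a no-op.
        Z[enu, maxlen - len(row):] = row
    return Z

left = pad_to_dense_left([[1, 2, 3, 4, 5], [1, 2, 3], [1], []])
print(left)
# [[1. 2. 3. 4. 5.]
#  [0. 0. 1. 2. 3.]
#  [0. 0. 0. 0. 1.]
#  [0. 0. 0. 0. 0.]]
```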