Tags: python, numpy, vectorization, one-hot-encoding

Convert a 2D matrix to a 3D one-hot matrix in NumPy


I have a NumPy matrix and I want to convert it to a 3D array with the one-hot encoding of the elements as the third dimension. Is there a way to do this without looping over each row? E.g.

a=[[1,3],
   [2,4]]

should be made into

b=[[[1,0,0,0], [0,0,1,0]],
   [[0,1,0,0], [0,0,0,1]]]

Solution

    Approach #1

    Here's a cheeky one-liner that abuses broadcasted comparison -

    (np.arange(a.max()) == a[...,None]-1).astype(int)
    

    Sample run -

    In [120]: a
    Out[120]: 
    array([[1, 7, 5, 3],
           [2, 4, 1, 4]])
    
    In [121]: (np.arange(a.max()) == a[...,None]-1).astype(int)
    Out[121]: 
    array([[[1, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 0, 1],
            [0, 0, 0, 0, 1, 0, 0],
            [0, 0, 1, 0, 0, 0, 0]],
    
           [[0, 1, 0, 0, 0, 0, 0],
            [0, 0, 0, 1, 0, 0, 0],
            [1, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 1, 0, 0, 0]]])
    

    For 0-based indexing, it would be -

    In [122]: (np.arange(a.max()+1) == a[...,None]).astype(int)
    Out[122]: 
    array([[[0, 1, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 0, 0, 1],
            [0, 0, 0, 0, 0, 1, 0, 0],
            [0, 0, 0, 1, 0, 0, 0, 0]],
    
           [[0, 0, 1, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 1, 0, 0, 0],
            [0, 1, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 1, 0, 0, 0]]])
    

    If the one-hot encoding should cover the range from the minimum to the maximum value, offset the input by its minimum and feed the result to the 0-based method above. The same applies to the rest of the approaches discussed later in this post.

    Here's a sample run on the same -

    In [223]: a
    Out[223]: 
    array([[ 6, 12, 10,  8],
           [ 7,  9,  6,  9]])
    
    In [224]: a_off = a - a.min() # feed a_off to proposed approaches
    
    In [225]: (np.arange(a_off.max()+1) == a_off[...,None]).astype(int)
    Out[225]: 
    array([[[1, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 0, 1],
            [0, 0, 0, 0, 1, 0, 0],
            [0, 0, 1, 0, 0, 0, 0]],
    
           [[0, 1, 0, 0, 0, 0, 0],
            [0, 0, 0, 1, 0, 0, 0],
            [1, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 1, 0, 0, 0]]])
    

    If you are okay with a boolean array with True for 1's and False for 0's, you can skip the .astype(int) conversion.
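To make the broadcasting in Approach #1 easier to follow, here is a minimal sketch that applies it step by step to the 2x2 array from the question. The intermediate names `labels`, `expanded` and `mask` are introduced here for illustration only.

```python
import numpy as np

# The 2x2 array from the question, with 1-based values.
a = np.array([[1, 3],
              [2, 4]])

labels = np.arange(a.max())    # shape (4,): candidate positions 0..3
expanded = a[..., None] - 1    # shape (2, 2, 1): values shifted to 0-based

# Broadcasting (2, 2, 1) against (4,) compares every element of `a`
# with every candidate position, giving a boolean array of shape (2, 2, 4).
mask = labels == expanded
b = mask.astype(int)           # optional: convert True/False to 1/0

print(mask.shape)   # (2, 2, 4)
print(b.tolist())
# [[[1, 0, 0, 0], [0, 0, 1, 0]], [[0, 1, 0, 0], [0, 0, 0, 1]]]
```

The trailing `None` adds the axis that becomes the one-hot dimension, so the same expression should work unchanged for inputs with any number of dimensions.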

    Approach #2

    We can also initialize an array of zeros and assign into the output with advanced indexing. Thus, for 0-based indexing, we would have -

    import numpy as np

    def onehot_initialization(a):
        ncols = a.max()+1                    # one slot per value in 0..a.max()
        out = np.zeros(a.shape + (ncols,), dtype=int)
        out[all_idx(a, axis=2)] = 1          # axis=2 assumes a is 2D
        return out
    

    Helper func -

    # https://stackoverflow.com/a/46103129/ @Divakar
    def all_idx(idx, axis):
        # list(...) keeps this working if np.ogrid returns a tuple instead of a list
        grid = list(np.ogrid[tuple(map(slice, idx.shape))])
        grid.insert(axis, idx)
        return tuple(grid)
    

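    This should be noticeably faster than Approach #1 when the range of values is large, since it assigns only one element per input entry instead of comparing every entry against every possible value.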

    For 1-based indexing, simply feed in a-1 as the input.
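As a variation on the same zeros-then-assign idea, the sketch below swaps the `all_idx` helper for NumPy's `np.put_along_axis` (available since NumPy 1.15). This is not the method from the answer, just an alternative way to express it. The function name `onehot_put_along_axis` is made up for this example, and it assumes non-negative integer input with 0-based values.

```python
import numpy as np

def onehot_put_along_axis(a):
    """One-hot encode non-negative integers along a new last axis."""
    a = np.asarray(a)
    ncols = a.max() + 1
    out = np.zeros(a.shape + (ncols,), dtype=int)
    # a[..., None] has the same number of dimensions as `out`, so each entry
    # selects one position along the last axis, which is then set to 1.
    np.put_along_axis(out, a[..., None], 1, axis=-1)
    return out

a = np.array([[1, 7, 5, 3],
              [2, 4, 1, 4]])
out = onehot_put_along_axis(a)
print(out.shape)   # (2, 4, 8)
```

Because the index array is built with `a[..., None]`, no axis number is hard-coded, so the function is not limited to 2D input.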

    Approach #3 : Sparse matrix solution

    If you are looking for a sparse array as output, note that, AFAIK, scipy's built-in sparse matrices support only 2D formats. You can therefore get a sparse output that is a reshaped version of the output shown earlier, with the first two axes merged and the third axis kept intact. The implementation for 0-based indexing would look something like this -

    from scipy.sparse import coo_matrix
    def onehot_sparse(a):
        N = a.size
        L = a.max()+1
        data = np.ones(N,dtype=int)
        return coo_matrix((data,(np.arange(N),a.ravel())), shape=(N,L))
    

    Again, for 1-based indexing, simply feed in a-1 as the input.

    Sample run -

    In [157]: a
    Out[157]: 
    array([[1, 7, 5, 3],
           [2, 4, 1, 4]])
    
    In [158]: onehot_sparse(a).toarray()
    Out[158]: 
    array([[0, 1, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 1],
           [0, 0, 0, 0, 0, 1, 0, 0],
           [0, 0, 0, 1, 0, 0, 0, 0],
           [0, 0, 1, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 1, 0, 0, 0],
           [0, 1, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 1, 0, 0, 0]])
    
    In [159]: onehot_sparse(a-1).toarray()
    Out[159]: 
    array([[1, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 1],
           [0, 0, 0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0, 0, 0],
           [0, 1, 0, 0, 0, 0, 0],
           [0, 0, 0, 1, 0, 0, 0],
           [1, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 1, 0, 0, 0]])
    

    If you are okay with sparse output, this is much faster than the previous two approaches, since only the positions of the ones are stored.
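Since the sparse result merges the first two axes, it can help to see how its rows map back to positions in `a`. The following sketch repeats `onehot_sparse` so that it runs on its own, then recovers the 3D dense array and the `(i, j)` position for one row. The row number `r = 5` is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.sparse import coo_matrix

def onehot_sparse(a):
    N = a.size
    L = a.max() + 1
    data = np.ones(N, dtype=int)
    return coo_matrix((data, (np.arange(N), a.ravel())), shape=(N, L))

a = np.array([[1, 7, 5, 3],
              [2, 4, 1, 4]])

sp = onehot_sparse(a)            # shape (a.size, a.max()+1) == (8, 8)

# Splitting the merged first axis back into a.shape gives the 3D array.
dense3d = sp.toarray().reshape(a.shape + (-1,))
print(dense3d.shape)             # (2, 4, 8)

# Row r of the sparse matrix encodes a[i, j], with (i, j) in row-major order.
r = 5
i, j = np.unravel_index(r, a.shape)
print(i, j, a[i, j])             # 1 1 4
```

Converting to dense gives up the memory savings, so this is mainly useful for checking results on small inputs.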

    Runtime comparison for 0-based indexing

    Case #1 :

    In [160]: a = np.random.randint(0,100,(100,100))
    
    In [161]: %timeit (np.arange(a.max()+1) == a[...,None]).astype(int)
    1000 loops, best of 3: 1.51 ms per loop
    
    In [162]: %timeit onehot_initialization(a)
    1000 loops, best of 3: 478 µs per loop
    
    In [163]: %timeit onehot_sparse(a)
    10000 loops, best of 3: 87.5 µs per loop
    
    In [164]: %timeit onehot_sparse(a).toarray()
    1000 loops, best of 3: 530 µs per loop
    

    Case #2 :

    In [166]: a = np.random.randint(0,500,(100,100))
    
    In [167]: %timeit (np.arange(a.max()+1) == a[...,None]).astype(int)
    100 loops, best of 3: 8.51 ms per loop
    
    In [168]: %timeit onehot_initialization(a)
    100 loops, best of 3: 2.52 ms per loop
    
    In [169]: %timeit onehot_sparse(a)
    10000 loops, best of 3: 87.1 µs per loop
    
    In [170]: %timeit onehot_sparse(a).toarray()
    100 loops, best of 3: 2.67 ms per loop
    

    Squeezing out best performance

    To squeeze out the best performance, we could modify Approach #2 to index into a 2D-shaped output array and use the uint8 dtype. The smaller dtype is more memory-efficient, which in turn leads to much faster assignments, like so -

    def onehot_initialization_v2(a):
        ncols = a.max()+1
        out = np.zeros( (a.size,ncols), dtype=np.uint8)
        out[np.arange(a.size),a.ravel()] = 1
        out.shape = a.shape + (ncols,)
        return out
    

    Timings -

    In [178]: a = np.random.randint(0,100,(100,100))
    
    In [179]: %timeit onehot_initialization(a)
         ...: %timeit onehot_initialization_v2(a)
         ...: 
    1000 loops, best of 3: 474 µs per loop
    10000 loops, best of 3: 128 µs per loop
    
    In [180]: a = np.random.randint(0,500,(100,100))
    
    In [181]: %timeit onehot_initialization(a)
         ...: %timeit onehot_initialization_v2(a)
         ...: 
    100 loops, best of 3: 2.38 ms per loop
    1000 loops, best of 3: 213 µs per loop
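As a sanity check on the uint8 variant, the sketch below compares its output against the broadcasted comparison from Approach #1 and prints the memory footprint next to an int64 copy. It restates `onehot_initialization_v2` so the block runs on its own, with one small deviation: it returns `out.reshape(...)` instead of assigning to `out.shape`. The seed and array size are arbitrary choices for this example.

```python
import numpy as np

def onehot_initialization_v2(a):
    ncols = a.max() + 1
    out = np.zeros((a.size, ncols), dtype=np.uint8)
    out[np.arange(a.size), a.ravel()] = 1
    # reshape returns a view here; used in place of assigning to out.shape
    return out.reshape(a.shape + (ncols,))

rng = np.random.default_rng(0)
a = rng.integers(0, 100, size=(100, 100))

out = onehot_initialization_v2(a)
reference = np.arange(a.max() + 1) == a[..., None]   # Approach #1, boolean

print(np.array_equal(out, reference))     # True if both approaches agree
print((out.sum(axis=-1) == 1).all())      # True if there is one 1 per element
print(out.nbytes, out.astype(np.int64).nbytes)
```

The int64 copy takes eight times as many bytes as the uint8 array, which is the memory saving the answer refers to.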