
How can I slice each element of a numpy array of strings?


Numpy has some very useful string operations, which vectorize the usual Python string operations.

Compared to these operations and to pandas.str, the numpy strings module seems to be missing a very important one: the ability to slice each string in the array. For example,

a = numpy.array(['hello', 'how', 'are', 'you'])
numpy.char.sliceStr(a, slice(1, 3))
>>> numpy.array(['el', 'ow', 're', 'ou'])

Am I missing some obvious method in the module with this functionality? Otherwise, is there a fast vectorized way to achieve this?


Solution

  • Here's a vectorized approach -

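    import numpy as np

    def slicer_vectorized(a,start,end):
        # View the array as single characters, one row per string, and keep columns start:end
        b = a.view((str,1)).reshape(len(a),-1)[:,start:end]
        # Serialize the kept characters and reinterpret them as strings of the new width
        return np.frombuffer(b.tobytes(),dtype=(str,end-start))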
    

    Sample run -

    In [68]: a = np.array(['hello', 'how', 'are', 'you'])
    
    In [69]: slicer_vectorized(a,1,3)
    Out[69]: 
    array(['el', 'ow', 're', 'ou'], 
          dtype='|S2')
    
    In [70]: slicer_vectorized(a,0,3)
    Out[70]: 
    array(['hel', 'how', 'are', 'you'], 
          dtype='|S3')
    

    Runtime test -

    I timed all the approaches posted by other authors that I could run on my end, along with the vectorized approach from earlier in this post.

    Here are the timings -

    In [53]: # Setup input array
        ...: a = np.array(['hello', 'how', 'are', 'you'])
        ...: a = np.repeat(a,10000)
        ...: 
    
    # @Alberto Garcia-Raboso's answer
    In [54]: %timeit slicer(1, 3)(a)
    10 loops, best of 3: 23.5 ms per loop
    
    # @hpaulj's answer
    In [55]: %timeit np.frompyfunc(lambda x:x[1:3],1,1)(a)
    100 loops, best of 3: 11.6 ms per loop
    
    # Using a list comprehension
    In [56]: %timeit np.array([i[1:3] for i in a])
    100 loops, best of 3: 12.1 ms per loop
    
    # From this post
    In [57]: %timeit slicer_vectorized(a,1,3)
    1000 loops, best of 3: 787 µs per loop
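
For anyone who wants to see the view trick spelled out step by step, here is a minimal sketch of the same idea. The name `slice_strings` is hypothetical and not part of NumPy or of the answer above. It assumes a non-empty 1-D array of fixed-width strings (dtype kind 'U' or 'S') in native byte order, and a non-negative `start`. Instead of going through bytes, it views the selected character columns back as strings, and it checks the result against a plain list comprehension.

```python
import numpy as np

def slice_strings(a, start, end):
    """Hypothetical helper: return a[i][start:end] for every string in a.

    Assumes a non-empty 1-D fixed-width string array and start >= 0.
    """
    a = np.ascontiguousarray(a)
    kind = a.dtype.kind                      # 'U' for unicode, 'S' for bytes
    if a.ndim != 1 or kind not in "US":
        raise TypeError("expected a 1-D array of fixed-width strings")

    # One row per string, one column per (zero-padded) character.
    chars = a.view(kind + "1").reshape(len(a), -1)

    # Clamp end to the padded width so the final view has a valid size.
    end = min(end, chars.shape[1])
    if not 0 <= start < end:
        raise ValueError("need 0 <= start < end within the string width")

    # Copy the selected columns so each row's characters are adjacent in memory,
    # then reinterpret every row as a single string of width end - start.
    piece = np.ascontiguousarray(chars[:, start:end])
    return piece.view(f"{kind}{end - start}").ravel()


a = np.array(['hello', 'how', 'are', 'you'])
big = np.repeat(a, 1000)

# The list comprehension serves as the reference result.
expected = np.array([s[1:3] for s in big])
assert np.array_equal(slice_strings(big, 1, 3), expected)
```

Shorter strings are padded with null characters up to the array's fixed width, so the columns line up across rows. That padding is also why negative indices are left out of this sketch: they would count from the padded width rather than from each string's own length.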