python numpy numba numba-pro

How to append lists containing string, int, and float values using Numba


I am using Numba to speed up the loop below. Without Numba it takes 135 seconds to execute; with Numba it takes 0.30 seconds :) which is very fast.

In the loop below I compare each element of the array against a threshold of 0.85. If the condition is True, I insert the data into a list that the function returns.

The data inserted into the list looks like this:

['Source ID', 'Source TEXT', 'Similar ID', 'Similar TEXT', 'Score']

import numpy as np
import numba
from numba.typed import List

idd = df['ID'].to_numpy()
txt = df['TEXT'].to_numpy()

Column = 'TEXT'
df = preprocessing(dataresult, Column) # remove special characters from the 'TEXT' column
message_embeddings = model_url(np.array(df['DescriptionNew']))  # pass the text to the Universal Sentence Encoder model to create sentence embeddings
cos_sim = cosine_similarity(message_embeddings) # len(cos_sim) > 8000

# The function below finds duplicates among rows.
@numba.jit(nopython=True)
def similarity(nid, txxt, cos_sim, threshold):

  numba_list = List()
  for i in range(cos_sim.shape[0]):
    for index in range(i, cos_sim.shape[1]):
      if (cos_sim[i][index] > threshold) & (i!=index):
        numba_list.append([nid[i], nid[index], cos_sim[i][index]]) # this works
        # numba_list.append([txxt[i], txxt[index]]) # this also works
        # numba_list.append([nid[i], txxt[i], nid[index], txxt[index], cos_sim[i][index]]) # this is what I want to work
              
  return numba_list

print(similarity(idd, txt, cos_sim, 0.85))

In the code above, the append works either for the numeric columns or for the text columns, but not for both together. I want all the columns, numbers and text, to be inserted into numba_list.

I am getting the error below:


1 frames
/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py in error_rewrite(e, issue_type)
    359                 raise e
    360             else:
--> 361                 raise e.with_traceback(None)
    362 
    363         argtypes = []

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Poison type used in arguments; got Poison<LiteralList((int64, [unichr x 12], int64, [unichr x 12], float32))>
During: resolving callee type: BoundFunction((<class 'numba.core.types.containers.ListType'>, 'append') for ListType[undefined])
During: typing of call at <ipython-input-179-6ee851edb6b1> (14)


File "<ipython-input-179-6ee851edb6b1>", line 14:
def zero(nid, txxt, cos_sim, threshold):
    <source elided>
        # print(i+1)
        numba_list.append([nid[i], txxt[i], nid[index], txxt[index], cos_sim[i][index]])
        ^

Solution

  • The problem comes from typing: Numba cannot infer the type of the list. The root cause is that the list you append mixes item types (int64, string and float32), which AFAIK Numba does not support yet and which would not be efficient anyway. Tuples, however, are made for this. Here is an untested example:

    import numba
    from numba.typed import List

    @numba.njit
    def similarity(nid, txxt, cos_sim, threshold):
      numba_list = List()
      for i in range(cos_sim.shape[0]):
        for index in range(i, cos_sim.shape[1]):
          if (cos_sim[i][index] > threshold) & (i!=index):
            # Append a tuple, not a list, so the items may have different types
            numba_list.append((nid[i], nid[index], cos_sim[i][index]))
      return numba_list
    

    Since the condition is often true, you can use pre-allocated NumPy arrays with direct indexing instead of slow list appends to strongly speed up the computation. However, the return type is different with this solution: the idea is to return a tuple of 3 arrays rather than a list of 3-item tuples. This solution also benefits from taking significantly less memory.
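
The array-based idea could look roughly like the sketch below. It is an illustration under several assumptions, not a tested drop-in replacement: the function name `similar_pairs` is made up, it returns row indices instead of IDs, and it counts the matches in a first pass so the arrays can be allocated at their exact size (the answer only says "pre-allocated"). The `try`/`except` fallback is only there so the sketch still runs, slowly, where Numba is not installed, and the sample `idd`, `txt` and `cos_sim` values are hypothetical.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: fall back to plain Python so the sketch runs without Numba.
    def njit(func):
        return func


@njit
def similar_pairs(cos_sim, threshold):
    n_rows, n_cols = cos_sim.shape

    # Pass 1: count matches above the diagonal so the arrays can be sized exactly.
    count = 0
    for i in range(n_rows):
        for j in range(i + 1, n_cols):
            if cos_sim[i, j] > threshold:
                count += 1

    src_idx = np.empty(count, dtype=np.int64)
    sim_idx = np.empty(count, dtype=np.int64)
    scores = np.empty(count, dtype=np.float64)

    # Pass 2: fill the arrays by direct indexing instead of appending to a list.
    k = 0
    for i in range(n_rows):
        for j in range(i + 1, n_cols):
            if cos_sim[i, j] > threshold:
                src_idx[k] = i
                sim_idx[k] = j
                scores[k] = cos_sim[i, j]
                k += 1

    return src_idx, sim_idx, scores


# Hypothetical sample data standing in for idd, txt and cos_sim.
idd = np.array([101, 102, 103])
txt = np.array(["foo bar", "foo baz", "foo bay"])
cos_sim = np.array([[1.0, 0.9, 0.2],
                    [0.9, 1.0, 0.95],
                    [0.2, 0.95, 1.0]])

src_idx, sim_idx, scores = similar_pairs(cos_sim, 0.85)

# The mixed-type rows are assembled outside the compiled function,
# where ordinary Python lists and tuples have no typing restrictions.
rows = list(zip(idd[src_idx].tolist(), txt[src_idx].tolist(),
                idd[sim_idx].tolist(), txt[sim_idx].tolist(),
                scores.tolist()))
```

Because the compiled function only handles numbers, the text column never enters nopython mode. The IDs and texts are looked up afterwards with NumPy fancy indexing, which gives rows in the `['Source ID', 'Source TEXT', 'Similar ID', 'Similar TEXT', 'Score']` layout from the question.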