python, nlp, n-gram

How can I make my Python code more efficient?


I am constructing word n-gram training vectors that will later be used by an SVM. I ran my code, but it took far too long, over 10 hours. Do you have any suggestions for making it faster?

def wordNgrams(s,n):
    """Calculate word n-grams and return a dictionary"""
    input = s.split(' ')
    output = {}
    for i in xrange(len(input)-n+1):
        g = " ".join(input[i:i+n]).lower()
        output.setdefault(g,0)
        output[g] += 1
    return output
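
For reference, here is what the function produces on a short example string (the example input is made up just to illustrate the output format):

wordNgrams("The cat sat on the cat mat", 2)
# returns {'the cat': 2, 'cat sat': 1, 'sat on': 1, 'on the': 1, 'cat mat': 1}
# (a plain dict, so the key order may vary)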

res is a list of 10000 strings, where each string contains 300 words on average. global_vector is a sorted list of 200000 2-grams.

from collections import OrderedDict

X = []
for i in xrange(10000):
    word_dic = wordNgrams(res[i], 2)
    vector1 = {key: 1 if key in word_dic else 0 for key in global_vector}
    od = OrderedDict(sorted(vector1.items(), key=lambda t: t[0]))
    X.append(od.values())

To be honest, I would not mind if it took 2 or 3 hours, but it took over 10 hours. I really do not know what I should do. Please help me.


Solution

  • I think the sorted call you are running in your loop is the only thing that can be optimized out. Everything else appears to be asymptotically optimal (it may be possible to improve things by small factors in various places, but not by huge amounts).

    If your global_vector is already sorted, you can avoid sorting the results by creating the results list directly, rather than going through an unordered dictionary first. Here's a quick version that uses a list comprehension:

    X = []
    for text in res:
        word_dic = wordNgrams(text, 2)
        X.append([1 if bigram in word_dic else 0 for bigram in global_vector])
    

    This should be O(M*N) + O(M*P) where M is the length of res, N is the length of global_vector and P is the average length of the texts. Your original code included an extra log(N) factor in the first term, due to the extra sorting.
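
    As a quick sanity check, here is the same loop run end to end on a tiny made-up dataset (the texts and the global_vector entries below are invented purely for illustration; wordNgrams is copied from the question):

    # wordNgrams as defined in the question
    def wordNgrams(s, n):
        """Calculate word n-grams and return a dictionary"""
        input = s.split(' ')
        output = {}
        for i in xrange(len(input) - n + 1):
            g = " ".join(input[i:i+n]).lower()
            output.setdefault(g, 0)
            output[g] += 1
        return output

    res = ["The cat sat on the mat", "the dog sat on the cat"]
    global_vector = ["cat sat", "dog sat", "on the", "sat on",
                     "the cat", "the dog", "the mat"]   # already sorted

    X = []
    for text in res:
        word_dic = wordNgrams(text, 2)
        X.append([1 if bigram in word_dic else 0 for bigram in global_vector])

    # X == [[1, 0, 1, 1, 1, 0, 1],
    #       [0, 1, 1, 1, 1, 1, 0]]

    Looking up a bigram in word_dic is O(1) on average, so the dominant cost per document is simply walking over the 200000 entries of global_vector; there is no need to convert the dictionary to a set first.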