I am constructing my word n-gram training vectors, to be used later by an SVM. I ran my code, but it took far too long, over 10 hours. Do you have any suggestions for making it faster?
def wordNgrams(s, n):
    """Calculate word n-grams and return a dictionary mapping each n-gram to its count"""
    words = s.split(' ')
    output = {}
    # Slide a window of n words over the text, lowercase each n-gram and count it
    for i in xrange(len(words) - n + 1):
        g = " ".join(words[i:i + n]).lower()
        output.setdefault(g, 0)
        output[g] += 1
    return output
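For example, for n = 2 the function counts each bigram of a sentence:

wordNgrams("The cat sat on the mat", 2)
# {'the cat': 1, 'cat sat': 1, 'sat on': 1, 'on the': 1, 'the mat': 1}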
res is a list of 10000 strings, where each string contains 300 words on average. global_vector is a sorted list of 200000 2-grams.
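For concreteness, a toy version of those inputs might look like this (the contents here are made up; only the shapes match the description above):

res = ["the cat sat on the mat", "a dog sat on the rug"]    # really 10000 texts of ~300 words each
global_vector = ["a dog", "cat sat", "dog sat", "on the", "sat on", "the cat", "the mat", "the rug"]    # really 200000 sorted 2-grams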
from collections import OrderedDict

X = []
for i in xrange(10000):
    word_dic = wordNgrams(res[i], 2)
    # 1 if the global 2-gram occurs in this text, else 0
    vector1 = {key: 1 if key in word_dic else 0 for key in global_vector}
    # Sort by the 2-gram so every row uses the same feature order
    od = OrderedDict(sorted(vector1.items(), key=lambda t: t[0]))
    X.append(od.values())
To be honest, it would be understandable if it took 2 or 3 hours, but it took me over 10 hours. I really do not know what I should do. Please help me.
I think the sorted call you are running in your loop is the only thing that can be optimized out. Everything else appears to be asymptotically optimal (it may be possible to improve things by small factors in various places, but not by huge amounts).

If your global_vector is already sorted, you can avoid sorting the results by creating the results list directly, rather than going through an unordered dictionary first. Here's a quick version that uses a list comprehension:
X = []
for text in res:
    word_dic = wordNgrams(text, 2)
    # global_vector is already sorted, so each row comes out in the right feature order
    X.append([1 if bigram in word_dic else 0 for bigram in global_vector])
This should be O(M*N) + O(M*P), where M is the length of res, N is the length of global_vector, and P is the average length of the texts. Your original code included an extra log(N) factor in the first term, due to the extra sorting.
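As a rough back-of-envelope check with the numbers you give: the dominant cost is the membership test against global_vector, about M*N = 10000 * 200000 = 2*10^9 lookups, while building the n-gram dictionaries is only about M*P = 3*10^6 operations. Dropping the per-text sort removes roughly a log2(200000) ≈ 18 factor from that dominant term, plus the overhead of building a 200000-entry dictionary and an OrderedDict for every text.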