I have looked at several sources on the web and tried various methods, but I could only find how to count the frequency of individual words, not phrases. The code I have so far is as follows:
import collections
import re
wanted = set(['inflation', 'gold', 'bank'])
cnt = collections.Counter()
# Raw string so that \w is passed to the regex engine unchanged
words = re.findall(r'\w+', open('02.2003.BenBernanke.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)
If possible, I would also like to count the number of times the phrases 'central bank' and 'high inflation' are used in this text. I appreciate any suggestion or guidance you can give.
First of all, this is how I would generate a cnt like yours, reading the file line by line to reduce memory overhead. Note that it counts every word, not just the ones in wanted:
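import collections
import re
def findWords(filepath):
    # Yield lowercase words one line at a time instead of reading the whole file
    with open(filepath) as infile:
        for line in infile:
            words = re.findall(r'\w+', line.lower())
            yield from words
cnt = collections.Counter(findWords('02.2003.BenBernanke.txt'))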
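The Counter built this way tallies every word in the file. If you only want the words in your wanted set, a filter inside a generator expression should do it. This is a minimal sketch: iter_words and sample_lines are names I made up so the snippet runs without the speech file, and the sample text is invented.

```python
import collections
import re

wanted = {'inflation', 'gold', 'bank'}

def iter_words(lines):
    """Yield lowercase words from any iterable of lines (hypothetical helper)."""
    for line in lines:
        yield from re.findall(r'\w+', line.lower())

# Made-up sample text standing in for the real file
sample_lines = [
    "The central bank watches inflation.\n",
    "Gold is not the bank's only reserve; inflation matters too.\n",
]

# Keep only the words of interest while counting
cnt = collections.Counter(w for w in iter_words(sample_lines) if w in wanted)
print(cnt)
```

With a real file you would pass the open file object in place of sample_lines, since iterating over a file yields its lines.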
Now, on to your question about phrases:
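from itertools import tee
phrases = {'central bank', 'high inflation'}
# Two independent iterators over the same word stream
fw1, fw2 = tee(findWords('02.2003.BenBernanke.txt'))
# Advance the second iterator by one word so zip pairs each word with its successor
next(fw2, None)
for w1, w2 in zip(fw1, fw2):
    phrase = ' '.join([w1, w2])
    if phrase in phrases:
        cnt[phrase] += 1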
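The pairwise approach only covers two-word phrases. If you later need phrases of other lengths, one option is a sliding window over the word stream. This is a rough sketch, not the only way to do it: count_phrases, iter_words and the sample text are all my own inventions for illustration.

```python
import collections
import re

def iter_words(lines):
    """Yield lowercase words from any iterable of lines (hypothetical helper)."""
    for line in lines:
        yield from re.findall(r'\w+', line.lower())

def count_phrases(words, phrases):
    """Count how often each phrase (of any word length) occurs in a word stream."""
    # Group the target phrases by their length in words
    by_len = {}
    for phrase in phrases:
        parts = tuple(phrase.lower().split())
        if parts:  # ignore empty phrases
            by_len.setdefault(len(parts), set()).add(parts)
    counts = collections.Counter()
    if not by_len:
        return counts
    # The window holds only as many recent words as the longest phrase needs
    window = collections.deque(maxlen=max(by_len))
    for word in words:
        window.append(word)
        recent = tuple(window)
        for n, targets in by_len.items():
            tail = recent[-n:]  # the last n words seen so far
            if len(tail) == n and tail in targets:
                counts[' '.join(tail)] += 1
    return counts

# Made-up sample; note that 'central bank' is split across a line break once
sample_lines = [
    "The central bank fears high inflation. A central\n",
    "bank cannot ignore high inflation for long, said the Bank of England.\n",
]
phrase_counts = count_phrases(
    iter_words(sample_lines),
    {'central bank', 'high inflation', 'bank of england'},
)
print(phrase_counts)
```

Because the words arrive as one flat stream, a phrase that straddles a line break is still counted, just as in the tee/zip version.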
Hope this helps