I'm trying to convert a Python implementation of the Viterbi algorithm found in this Stack Overflow answer into Ruby. The full script can be found at the bottom of this question with my comments.
Unfortunately I know very little about Python so the translation is is proving more difficult than I'd like. Still, I have made some progress. Right now, the only line which is totally melting my brain is this one:
prob_k, k = max((probs[j] * word_prob(text[j:i]), j) for j in range(max(0, i - max_word_length), i))
Could someone please explain what it is doing?
Here is the full Python script:
import re
from itertools import groupby
# text will be a compound word such as 'wickedweather'.
def viterbi_segment(text):
probs, lasts = [1.0], [0]
# Iterate over the letters in the compound.
# eg. [w, ickedweather], [wi, ckedweather], and so on.
for i in range(1, len(text) + 1):
# I've no idea what this line is doing and I can't figure out how to split it up?
prob_k, k = max((probs[j] * word_prob(text[j:i]), j) for j in range(max(0, i - max_word_length), i))
# Append values to arrays.
probs.append(prob_k)
lasts.append(k)
words = []
i = len(text)
while 0 < i:
words.append(text[lasts[i]:i])
i = lasts[i]
words.reverse()
return words, probs[-1]
# Calc the probability of a word based on occurrences in the dictionary.
def word_prob(word):
# dictionary.get(key) will return the value for the specified key.
# In this case, thats the number of occurances of thw word in the
# dictionary. The second argument is a default value to return if
# the word is not found.
return dictionary.get(word, 0) / total
# This ensures we ony deal with full words rather than each
# individual letter. Normalize the words basically.
def words(text):
return re.findall('[a-z]+', text.lower())
# This gets us a hash where the keys are words and the values are the
# number of ocurrances in the dictionary.
dictionary = dict((w, len(list(ws)))
# /usr/share/dixt/words is a file of newline delimitated words.
for w, ws in groupby(sorted(words(open('/usr/share/dict/words').read()))))
# Assign the length of the longest word in the dictionary.
max_word_length = max(map(len, dictionary))
# Assign the total number of words in the dictionary. It's a float
# because we're going to divide by it later on.
total = float(sum(dictionary.values()))
# Run the algo over a file of newline delimited compound words.
compounds = words(open('compounds.txt').read())
for comp in compounds:
print comp, ": ", viterbi_segment(comp)
You are looking at a list comprehension.
The expanded version looks something like this:
all_probs = []
for j in range(max(0, i - max_word_length), i):
all_probs.append((probs[j] * word_prob(text[j:i]), j))
prob_k, k = max(all_probs)
I hope that helps explain it. If it doesn't, feel free to edit your question and point to statements that you don't understand.