I don't understand the point of this function returning two variables, which are the same:
def construct_shingles(doc,k,h):
    #print 'antes -> ',doc,len(doc)
    doc = doc.lower()
    doc = ''.join(doc.split(' '))
    #print 'depois -> ',doc,len(doc)
    shingles = {}
    for i in xrange(len(doc)):
        substr = ''.join(doc[i:i+k])
        if len(substr) == k and substr not in shingles:
            shingles[substr] = 1
    if not h:
        return doc,shingles.keys()
    ret = tuple(shingles_hashed(shingles))
    return ret,ret
It seems redundant, but there must be a good reason for it; I just don't see what it is. Perhaps it is because there are two return statements? If 'h' is true, does the function execute both return statements? The related functions look like:
def construct_set_shingles(docs,k,h=False):
    shingles = []
    for i in xrange(len(docs)):
        doc = docs[i]
        doc,sh = construct_shingles(doc,k,h)
        docs[i] = doc
        shingles.append(sh)
    return docs,shingles
and
def shingles_hashed(shingles):
    global len_buckets
    global hash_table
    shingles_hashed = []
    for substr in shingles:
        key = hash(substr)
        shingles_hashed.append(key)
        hash_table[key].append(substr)
    return shingles_hashed
The data set and function call look like:
k = 3 # shingle length (characters per shingle)
d0 = "i know you"
d1 = "i think i met you"
d2 = "i did that"
d3 = "i did it"
d4 = "she says she knows you"
d5 = "know you personally"
d6 = "i think i know you"
d7 = "i know you personally"
docs = [d0,d1,d2,d3,d4,d5,d6,d7]
docsChange,shingles = construct_set_shingles(docs[:],k)
The github location: lsh/LHS
Your guess is correct. As for why the function ends with return ret,ret: that statement returns a pair of equal values rather than a single value, so it matches the first return statement, which also returns a pair.
This is a matter of coding style rather than of the algorithm, since the same result can be written in other ways. It does have an advantage in some cases. For example, if we write
def func(x, y, z):
    ...
    return ret
a = func(x, y, z)
b = func(x, y, z)
then func is executed twice. But with:
def func(x, y, z):
    ...
    return ret, ret
a, b = func(x, y, z)
func is executed only once, yet its result is assigned to both a and b.
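The difference between the two call patterns can be checked with a call counter. This is a minimal sketch in Python 3; func_single, func_pair, and the calls dictionary are hypothetical names introduced only for illustration and are not part of the code in the question.

```python
# Hypothetical counter that records how often each function body runs.
calls = {"single": 0, "pair": 0}


def func_single(x, y, z):
    calls["single"] += 1
    ret = x + y + z
    return ret


def func_pair(x, y, z):
    calls["pair"] += 1
    ret = x + y + z
    # "return ret, ret" builds one tuple whose two items are the same object.
    return ret, ret


# Two separate calls: the function body runs twice.
a1 = func_single(1, 2, 3)
b1 = func_single(1, 2, 3)

# One call, unpacked into two names: the function body runs once.
a2, b2 = func_pair(1, 2, 3)

print(calls)  # {'single': 2, 'pair': 1}
```

Writing a = b = func_single(1, 2, 3) would also run the body only once, which is why returning the pair is mostly a stylistic choice.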
Also, in your particular case, only one of the two return statements is ever executed:
If h is false, the function runs until the line return doc,shingles.keys() and exits there, so the variables doc and sh in construct_set_shingles take the values of doc and shingles.keys() respectively.
Otherwise, the first return statement is skipped and the second one is executed, so doc and sh both take the same value, namely tuple(shingles_hashed(shingles)).
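To see both exit paths side by side, here is a simplified Python 3 sketch of the function from the question. It is not the repository's code: the name construct_shingles_sketch is hypothetical, range replaces xrange, a set replaces the dictionary, the shingles are sorted only to make the printed output reproducible, and the hashed branch calls hash directly instead of going through shingles_hashed and its global hash_table.

```python
def construct_shingles_sketch(doc, k, h=False):
    # Normalise the text as in the question: lower-case, spaces removed.
    doc = "".join(doc.lower().split(" "))
    # Collect every distinct substring of length k.
    shingles = {doc[i:i + k] for i in range(len(doc) - k + 1)}
    if not h:
        # First exit: the cleaned text plus the list of shingles.
        return doc, sorted(shingles)
    # Second exit, reached only when h is true: the same tuple twice,
    # so the caller can still unpack two values.
    ret = tuple(hash(s) for s in shingles)
    return ret, ret


doc, sh = construct_shingles_sketch("i know you", 3)
print(doc)  # iknowyou
print(sh)   # ['ikn', 'kno', 'now', 'owy', 'wyo', 'you']

doc_h, sh_h = construct_shingles_sketch("i know you", 3, h=True)
print(doc_h is sh_h)  # True
```

In the hashed case, construct_set_shingles therefore stores the tuple of hashes in docs[i] as well, replacing the original text of the document.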