Question: To get my required output, I have to index the tokens like this in my difflib call:

    difflib.get_close_matches(tokens[0], jobList, n=1, cutoff=0.85)

If I use what I would expect, which is tokens[j], the output is wrong: the token Asst still appears before the address Wyndrum. Why?

    # Short test: removes job descriptions from the front of address strings
    testList = ['21 Sharp Crescent _Wainuiomata Shop Asst','Shop Asst Wyndrum Avenue _Lower_Hutt Housewife','Housewife']
    jobList = ['Asst','Housewife','Shop']
    import difflib
    newList = []
    for i in range(len(testList)):
        tokens = testList[i].split()
        for j in range(len(tokens)):
            print("tokens[j]",tokens[j],"tokens[0]",tokens[0])
            result = difflib.get_close_matches(tokens[0], jobList, n=1, cutoff=0.85)
            if result:
                while tokens and tokens[0] == result[0]:
                    tokens.pop(0)
            else:
                newString = ' '.join(tokens)
                newList.append(newString)
                break
    for i in range(len(newList)):
        print(newList[i])

Expected/Correct Output:

    21 Sharp Crescent _Wainuiomata Shop Asst
    Wyndrum Avenue _Lower_Hutt Housewife

Debug print lines:

    tokens[j] 21 tokens[0] 21
    tokens[j] Shop tokens[0] Shop
    tokens[j] Wyndrum tokens[0] Asst
    tokens[j] _Lower_Hutt tokens[0] Wyndrum
    tokens[j] Housewife tokens[0] Housewife

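Answer: There is a rule of thumb in Python: if you use a for-loop to iterate over a list, don't remove elements from that list inside the loop, so don't call remove() or pop() on it. Work on a copy of the original list, or build a new list with only the elements you want to keep.

When you remove an element from a list, the elements after it move down and their indexes change. The for-loop doesn't know that the indexes changed, so it skips some elements. In your code, range(len(tokens)) is computed once, before anything is popped. After Shop is popped, tokens[1] is already Wyndrum, so tokens[j] never visits Asst, while tokens[0] still does.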
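
The index shift can be seen in isolation with a small sketch. The list contents below are borrowed from the second test string, and the hard-coded job names and the `visited` list are only there for illustration; they are not part of the original code.

```python
# Demonstration only: shows which elements tokens[j] visits when the
# list shrinks during the loop.
tokens = ['Shop', 'Asst', 'Wyndrum', 'Avenue']
visited = []

for j in range(len(tokens)):    # range(4) is fixed before the loop starts
    if j >= len(tokens):        # guard: the list may now be shorter than j
        break
    visited.append(tokens[j])
    if tokens[0] in ('Shop', 'Asst'):
        tokens.pop(0)           # every remaining element shifts left by one

print(visited)  # ['Shop', 'Wyndrum']
print(tokens)   # ['Wyndrum', 'Avenue']
```

Asst is removed through tokens[0], but tokens[j] never lands on it, which is the same pattern as in the debug output from the question.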

You should work on a copy of tokens, created with tokens.copy():

    tokens = text.split()
    copy = tokens.copy()  # <-- create copy
    for j in range(len(copy)):  # <-- use copy
        print("tokens[j]", copy[j], "tokens[0]", tokens[0])  # <-- use copy

Full working code with other changes:

    import difflib

    # PEP8: `lower_case_names` for variables
    test_list = [
        '21 Sharp Crescent _Wainuiomata Shop Asst',
        'Shop Asst Wyndrum Avenue _Lower_Hutt Housewife',
        'Housewife'
    ]
    job_list = ['Asst', 'Housewife', 'Shop']
    new_list = []  # PEP8: `lower_case_names` for variables

    for text in test_list:
        print(f'\n>>> text: {text} <<<\n')
        tokens = text.split()
        copy = tokens.copy()

        # loop over the copy of tokens, pop from the original
        for j in range(len(copy)):
        #for j, tok in enumerate(tokens.copy()):
            print(f"tokens[j]: {copy[j]:10} | tokens[0]: {tokens[0]}")
            #result = difflib.get_close_matches(tokens[0], job_list, n=1, cutoff=0.85)
            result = difflib.get_close_matches(copy[j], job_list, n=1, cutoff=0.85)
            if result:
                if tokens[0] == result[0]:
                    print(' remove:', tokens.pop(0))
            else:
                break

        new_list.append(' '.join(tokens))

    print('\n--- results ---\n')
    for old, new in zip(test_list, new_list):
        print(old, '--->', new)

Result:

    >>> text: 21 Sharp Crescent _Wainuiomata Shop Asst <<<

    tokens[j]: 21         | tokens[0]: 21

    >>> text: Shop Asst Wyndrum Avenue _Lower_Hutt Housewife <<<

    tokens[j]: Shop       | tokens[0]: Shop
     remove: Shop
    tokens[j]: Asst       | tokens[0]: Asst
     remove: Asst
    tokens[j]: Wyndrum    | tokens[0]: Wyndrum

    >>> text: Housewife <<<

    tokens[j]: Housewife  | tokens[0]: Housewife
     remove: Housewife

    --- results ---

    21 Sharp Crescent _Wainuiomata Shop Asst ---> 21 Sharp Crescent _Wainuiomata Shop Asst
    Shop Asst Wyndrum Avenue _Lower_Hutt Housewife ---> Wyndrum Avenue _Lower_Hutt Housewife
    Housewife --->
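
The other option mentioned above, building a new list instead of popping from the one being looped over, might look like the sketch below. The helper name `strip_leading_jobs` is made up for this example. It is also slightly looser than the code above: it drops any leading token that difflib reports as a close match, without also requiring it to equal the matched job name exactly.

```python
import difflib


def strip_leading_jobs(text, job_list, cutoff=0.85):
    """Return text without the leading tokens that closely match a job name.

    Hypothetical helper, not part of the original code.
    """
    tokens = text.split()
    start = 0
    # Advance past leading job tokens; the list itself is never modified,
    # so no indexes shift while we scan.
    while start < len(tokens) and difflib.get_close_matches(
            tokens[start], job_list, n=1, cutoff=cutoff):
        start += 1
    # Slicing creates a new list holding only the tokens to keep.
    return ' '.join(tokens[start:])


job_list = ['Asst', 'Housewife', 'Shop']
test_list = [
    '21 Sharp Crescent _Wainuiomata Shop Asst',
    'Shop Asst Wyndrum Avenue _Lower_Hutt Housewife',
    'Housewife',
]
new_list = [strip_leading_jobs(text, job_list) for text in test_list]

for old, new in zip(test_list, new_list):
    print(old, '--->', new)
```

For the three test strings this should give the same results as the full code above, including the empty string for Housewife. If empty results are not wanted, they can be filtered out of new_list afterwards.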