I have Python lists like the ones below:
x1 = ['lock-service',
'jenkins-service',
'xyz-reporting-service',
'ansible-service',
'harbor-service',
'version-service',
'jira-service',
'kubernetes-service',
'capo-service',
'permission-service',
'artifactory-service',
'vault-service',
'harbor-service-prod',
'rundeck-service',
'cruise-control-service',
'artifactory-service.xyz.abc.cloud',
'helm-service',
'Capo Service',
'rocket-chat-service',
'reporting-service',
'bitbucket-service',
'rocketchat-service']
or
x2 = ['journal-service',
'lock-service',
'jenkins-service',
'xyz-reporting-service',
'ansible-service',
'harbor-service',
'version-service',
'jira-service',
'kubernetes-service',
'capo-service',
'permission-service',
'artifactory-service',
'vault-service',
'rundeck-service',
'cruise-control-service',
'helm-service',
'database-ticket-service',
'rocket-chat-service',
'ansible-dpservice',
'reporting-service',
'bitbucket-service',
'rocketchat-service']
As you can see in both lists, duplicate values appear in different forms, for example:
in the list 1:
in the list 2:
I need a general solution that works not only on these sample lists.
How can I do that in Python 3.11?
My solution adds cleanup steps before and after the fuzzy matching. Shoutout to @Scott Boston: I learned about assigning variables inside a comprehension (the := operator) from his answer.
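As a minimal illustration of that trick on its own (a toy example, not part of the solution): the score is computed once in the filter and reused as the dict value. Here shared_prefix_len is a made-up stand-in for a real similarity metric, and the names and threshold are assumptions chosen for the demo.

```python
def shared_prefix_len(a, b):
    """Toy similarity score (hypothetical): length of the common prefix."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

names = ['vault-service', 'vault-svc', 'helm-service']
target = 'vault-service'

# `score` is bound by := inside the condition and reused as the value,
# so the scoring function runs only once per candidate.
close = {name: score for name in names
         if name != target and (score := shared_prefix_len(target, name)) >= 5}

print(close)
```

The comparison with target comes first so that a name is never scored against itself, the same ordering used in the function below.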
!pip install RapidFuzz
import re
from rapidfuzz import fuzz, utils
def dedup(lst):
    # Clean up values with extra characters after "-service", then drop exact duplicates
    lst = list(set(re.sub(r'-service.*$', '-service', x) for x in lst))
    # For each value in name-service format, collect the other values that fuzzy-match it
    vals = {val1: {val2: ratio for val2 in lst
                   if val1 != val2  # avoid matching to self
                   and (ratio := fuzz.WRatio(val1, val2, processor=utils.default_process)) >= 90}  # fuzzy match
            for val1 in lst
            if len(subs := val1.split('-')) == 2  # name-service format requested by OP
            and subs[-1] == 'service'}  # check if ends in "-service"
    # Values from the cleaned list that are neither match dict keys nor matched values
    captured = list(vals.keys()) + sum([list(m.keys()) for m in vals.values()], [])
    not_captured = [x for x in lst if x not in captured]
    # Deduplicated list, forcing name-service format for longer values with extra "-"
    new_x = list(vals.keys()) + [''.join(x.replace('-service', '').split('-')) + '-service' for x in not_captured]
    return new_x  # returns only the deduplicated list
dedup(x1)
['jenkins-service',
'artifactory-service',
'rocketchat-service',
'rundeck-service',
'harbor-service',
'bitbucket-service',
'lock-service',
'reporting-service',
'permission-service',
'capo-service',
'jira-service',
'version-service',
'ansible-service',
'vault-service',
'helm-service',
'kubernetes-service',
'cruisecontrol-service']
dedup(x2)
['jenkins-service',
'artifactory-service',
'rocketchat-service',
'rundeck-service',
'harbor-service',
'bitbucket-service',
'lock-service',
'reporting-service',
'permission-service',
'capo-service',
'jira-service',
'version-service',
'journal-service',
'ansible-service',
'vault-service',
'helm-service',
'kubernetes-service',
'cruisecontrol-service',
'databaseticket-service']
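For variants that differ only in case, spacing, hyphenation, or a trailing suffix, plain normalization may already be enough, without any fuzzy matching. Below is a rough stdlib-only sketch of that cleanup idea. The helper names normalize and group_variants are hypothetical, and the rules are assumptions based on the sample lists. It will not catch typos such as 'ansible-dpservice', which is where the fuzzy step above still helps.

```python
import re
from collections import defaultdict

SUFFIX = '-service'

def normalize(name):
    """Hypothetical helper: reduce a name to a canonical 'name-service' key."""
    s = name.strip().lower()
    s = re.sub(r'[\s_]+', '-', s)            # 'Capo Service' -> 'capo-service'
    s = re.sub(r'-service.*$', SUFFIX, s)    # drop anything after '-service'
    if not s.endswith(SUFFIX):
        return s                             # not in name-service form; leave as is
    stem = s[:-len(SUFFIX)].replace('-', '') # 'rocket-chat' -> 'rocketchat'
    return stem + SUFFIX

def group_variants(names):
    """Map each canonical key to the original spellings that collapse onto it."""
    groups = defaultdict(list)
    for n in names:
        groups[normalize(n)].append(n)
    return dict(groups)

sample = ['harbor-service', 'harbor-service-prod', 'Capo Service', 'capo-service',
          'rocket-chat-service', 'rocketchat-service',
          'artifactory-service.xyz.abc.cloud']
groups = group_variants(sample)
for key, variants in groups.items():
    print(key, variants)
```

Because the keys come from a dict filled in input order, the result order is stable here, whereas the set() call in dedup means the order of its output can differ between runs.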