
How to remove duplicates and unify values in lists where values are very close to each other in Python?


I have Python lists like the ones below:

x1 = ['lock-service',
 'jenkins-service',
 'xyz-reporting-service',
 'ansible-service',
 'harbor-service',
 'version-service',
 'jira-service',
 'kubernetes-service',
 'capo-service',
 'permission-service',
 'artifactory-service',
 'vault-service',
 'harbor-service-prod',
 'rundeck-service',
 'cruise-control-service',
 'artifactory-service.xyz.abc.cloud',
 'helm-service',
 'Capo Service',
 'rocket-chat-service',
 'reporting-service',
 'bitbucket-service',
 'rocketchat-service']

or

x2 = ['journal-service',
 'lock-service',
 'jenkins-service',
 'xyz-reporting-service',
 'ansible-service',
 'harbor-service',
 'version-service',
 'jira-service',
 'kubernetes-service',
 'capo-service',
 'permission-service',
 'artifactory-service',
 'vault-service',
 'rundeck-service',
 'cruise-control-service',
 'helm-service',
 'database-ticket-service',
 'rocket-chat-service',
 'ansible-dpservice',
 'reporting-service',
 'bitbucket-service',
 'rocketchat-service']

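As you can see, both lists contain the same value in several slightly different forms, for example:

in list 1: 'harbor-service' and 'harbor-service-prod', or 'capo-service' and 'Capo Service'

in list 2: 'rocket-chat-service' and 'rocketchat-service'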

I need a general solution that works on more than just these sample lists.

How can I do that in Python 3.11?


Solution

  • My solution adds cleanup steps before and after the fuzzy matching. Shoutout to @Scott Boston: I learned from his answer how to assign a variable inside a comprehension (the := assignment expression).

    !pip install RapidFuzz
    
    
    import re
    from rapidfuzz import fuzz, utils
    
    def dedup(lst):
        # clean up values with extra characters after "-service", then drop exact duplicates
        lst = list({re.sub(r'-service.*$', '-service', x) for x in lst})
        # for every "name-service" value, collect the other values that fuzzy match it
        vals = {val1: {val2: ratio for val2 in lst
                       if val1 != val2  # avoid matching to self
                       and (ratio := fuzz.WRatio(val1, val2, processor=utils.default_process)) >= 90}  # fuzzy match
                for val1 in lst
                if len(subs := val1.split('-')) == 2  # name-service format requested by OP
                and subs[-1] == 'service'}  # check if ends in "-service"
        # everything that is either a match dict key or one of its matched values
        captured = set(vals) | {match for matches in vals.values() for match in matches}
        # vals from the cleaned list not in match dict keys or values
        not_captured = [x for x in lst if x not in captured]
        # deduplicated list, forcing name-service format for longer values with extra "-"
        new_x = list(vals) + [''.join(x.replace('-service', '').split('-')) + '-service' for x in not_captured]
        return new_x  # returns only the deduplicated list
    
    
    dedup(x1)
    
    ['jenkins-service',
     'artifactory-service',
     'rocketchat-service',
     'rundeck-service',
     'harbor-service',
     'bitbucket-service',
     'lock-service',
     'reporting-service',
     'permission-service',
     'capo-service',
     'jira-service',
     'version-service',
     'ansible-service',
     'vault-service',
     'helm-service',
     'kubernetes-service',
     'cruisecontrol-service']
    
    dedup(x2)
    
    ['jenkins-service',
     'artifactory-service',
     'rocketchat-service',
     'rundeck-service',
     'harbor-service',
     'bitbucket-service',
     'lock-service',
     'reporting-service',
     'permission-service',
     'capo-service',
     'jira-service',
     'version-service',
     'journal-service',
     'ansible-service',
     'vault-service',
     'helm-service',
     'kubernetes-service',
     'cruisecontrol-service',
     'databaseticket-service']
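
To see what the two cleanup steps do on their own, here is a minimal sketch that leaves out the fuzzy matching and needs only the standard library. The helper names strip_suffix and force_name_service are made up for this illustration; the regex and the join expression are the ones used in dedup above, and the sample values come from the question's lists.

```python
import re

def strip_suffix(name):
    """Pre-match cleanup: drop anything that follows '-service' (hypothetical helper)."""
    return re.sub(r'-service.*$', '-service', name)

def force_name_service(name):
    """Post-match cleanup: join extra '-' separated parts into one name (hypothetical helper)."""
    return ''.join(name.replace('-service', '').split('-')) + '-service'

samples = ['harbor-service-prod',
           'artifactory-service.xyz.abc.cloud',
           'cruise-control-service',
           'database-ticket-service']

cleaned = [strip_suffix(s) for s in samples]

# Assignment expression inside a comprehension: split once, then reuse `parts`
canonical = [s for s in cleaned
             if len(parts := s.split('-')) == 2 and parts[-1] == 'service']

# Values that are not already in name-service form get their hyphens collapsed
leftovers = [force_name_service(s) for s in cleaned if s not in canonical]

print(cleaned)
print(canonical)
print(leftovers)
```

In dedup itself the leftovers are only the values that were neither a name-service key nor a fuzzy match of one, so a value like 'Capo Service' is absorbed by 'capo-service' and never reaches the final join.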