python case-insensitive set-comprehension

Python: How to remove/discard a string from a set of strings using case-insensitive match?


I have a case from Wikidata where the string "Articles containing video clips" shows up in a set of categories and needs to be removed. The trouble is that it also shows up in other sets as "articles containing video clips" (lowercase "a").

The simple/safe way to remove it seems to be

   setA.discard("Articles containing video clips")
   setA.discard("articles containing video clips")

Perfectly adequate, but doesn't scale in complex cases. Is there any way to do this differently, other than the obvious loop or list/set comprehension using, say, casefold for the comparison?

  unwantedString = 'Articles containing video clip'
  setA = {'tsunami', 'articles containing video clip'}

  reducedSetA = {nonmatch for nonmatch in setA
                 if nonmatch.casefold() != unwantedString.casefold()}

  print(reducedSetA)
  # {'tsunami'}

Note that this is not a string replacement situation - it is removal of a string from a set of strings.


Solution

  • You can also use regex.

    import re
    
    unwantedStrings = {"Articles containing video clip", "asdf"}
    setA = {"tsunami", "articles containing video clip", "asdf", "asdfasdf", "asdfasddf"}
    
    # remove the unwanted strings from the set
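    # anchor each unwanted string so only whole-string matches count;
    # re.escape keeps any regex metacharacters in the strings literal
    regex = re.compile("|".join("^" + re.escape(s) + "$" for s in unwantedStrings), re.IGNORECASE)
    reducedSetA = {s for s in setA if not regex.search(s)}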
    
    print(reducedSetA)
    # {'tsunami', 'asdfasddf', 'asdfasdf'}
    

    The code above removes only exact matches. If you also want to remove "asdfasdf" because it contains the unwanted string "asdf", change the regex line to this:

    ...
    regex = re.compile("|".join(map(re.escape, unwantedStrings)), re.IGNORECASE)
    ...
    # {'tsunami'}
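
If exact, whole-string matching is all you need, a regex is not strictly required. Below is a minimal sketch of the casefold approach from the question, extended to any number of unwanted strings. The helper name `remove_casefold` is made up for illustration; it is not a built-in or library function.

```python
def remove_casefold(strings, unwanted):
    """Return a new set without members that equal any unwanted string, ignoring case."""
    # Fold the unwanted strings once so each check is a plain set lookup.
    unwanted_folded = {u.casefold() for u in unwanted}
    return {s for s in strings if s.casefold() not in unwanted_folded}


unwantedStrings = {"Articles containing video clip", "asdf"}
setA = {"tsunami", "articles containing video clip", "ASDF", "asdfasdf"}

reducedSetA = remove_casefold(setA, unwantedStrings)
print(sorted(reducedSetA))
# ['asdfasdf', 'tsunami']
```

One difference from the regex version: `str.casefold()` applies full Unicode case folding (for example, "ß" folds to "ss"), while `re.IGNORECASE` does simpler character-level case matching, so the two can disagree on some non-ASCII strings.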