pythonlistnestedsubstring

Python function to remove strings from a list if a substring already exists


I'm wondering if they're is a "pythonic" way of removing elements from a list, if that element contains a substring of another element.

For example say we have a list like this:

["/var/html/site1", "/var/html/site1/nested_web_root1", "/var/html/site1/nested_web_root2"]

/var/html/site1 is contained within both: /var/html/site1/nested_web_root1 & /var/html/site1/nested_web_root2 As such I'd like them removed from the list.

I've already written a function to do this, and it does mostly work, but the code is just horrible and overly-complex. There are also edge-cases where it just fails.

Here is what I've written so far:

def substringSieve(string_list):
    string_list.sort(key=lambda s: len(s), reverse=False)
    out = []
    bad_list = []
    for outer_string in string_list:
        for inner_string in string_list:
            if outer_string != inner_string:
                if outer_string in inner_string:
                    out.append(outer_string)
                    bad_list.append(inner_string)
        if outer_string not in out and outer_string not in bad_list:
            out.append(outer_string)
    return out

Can anyone offer some insight?

Thanks.


Solution

  • You appear to be comparing file paths and want to compare the start of the strings rather than comparing sub-strings at any location.

    Sort the list and then, for each value in the list, do not include it in the output if it starts with any previous value:

    def substring_sieve(data):
        output = []
        for value in sorted(data):
            if not any(value.startswith(prev) for prev in output):
                output.append(value)
        return output
    
    data = ["/var/html/site1", "/var/html/site1/nested_web_root1", "/var/html/site1/nested_web_root2"]
    
    print(substring_sieve(data))
    

    Which outputs:

    ['/var/html/site1']
    

    You can make it more efficient by just checking that the latest value does not start with the most recently output value:

    def substring_sieve(data):
        if not data:
            return data
        prev, *remaining = sorted(data)
        output = [prev]
        for value in remaining:
            if not value.startswith(prev):
                output.append(value)
                prev = value
        return output
    

    Which outputs the same.

    @nocomment gave an alternate version using list comprehension:

    def substring_sieve(data):
        prev = ()
        return [
            prev := value
            for value in sorted(data)
            if not value.startswith(prev)
        ]
    

    If you do want to match sub-strings rather than the start of strings then use:

    def substring_sieve(data):
        output = []
        for value in sorted(data, key=len):
            if not any(prev in value for prev in output):
                output.append(value)
        return output