I'm wondering if they're is a "pythonic" way of removing elements from a list, if that element contains a substring of another element.
For example say we have a list like this:
["/var/html/site1", "/var/html/site1/nested_web_root1", "/var/html/site1/nested_web_root2"]
/var/html/site1
is contained within both: /var/html/site1/nested_web_root1
& /var/html/site1/nested_web_root2
As such I'd like them removed from the list.
I've already written a function to do this, and it does mostly work, but the code is just horrible and overly-complex. There are also edge-cases where it just fails.
Here is what I've written so far:
def substringSieve(string_list):
string_list.sort(key=lambda s: len(s), reverse=False)
out = []
bad_list = []
for outer_string in string_list:
for inner_string in string_list:
if outer_string != inner_string:
if outer_string in inner_string:
out.append(outer_string)
bad_list.append(inner_string)
if outer_string not in out and outer_string not in bad_list:
out.append(outer_string)
return out
Can anyone offer some insight?
Thanks.
You appear to be comparing file paths and want to compare the start of the strings rather than comparing sub-strings at any location.
Sort the list and then, for each value in the list, do not include it in the output if it starts with any previous value:
def substring_sieve(data):
output = []
for value in sorted(data):
if not any(value.startswith(prev) for prev in output):
output.append(value)
return output
data = ["/var/html/site1", "/var/html/site1/nested_web_root1", "/var/html/site1/nested_web_root2"]
print(substring_sieve(data))
Which outputs:
['/var/html/site1']
You can make it more efficient by just checking that the latest value does not start with the most recently output value:
def substring_sieve(data):
if not data:
return data
prev, *remaining = sorted(data)
output = [prev]
for value in remaining:
if not value.startswith(prev):
output.append(value)
prev = value
return output
Which outputs the same.
@nocomment gave an alternate version using list comprehension:
def substring_sieve(data): prev = () return [ prev := value for value in sorted(data) if not value.startswith(prev) ]
If you do want to match sub-strings rather than the start of strings then use:
def substring_sieve(data):
output = []
for value in sorted(data, key=len):
if not any(prev in value for prev in output):
output.append(value)
return output