Tags: python, url, trie, longest-prefix

Longest Prefix Matches for URLs


I need information about any standard Python package which can be used for "longest prefix match" on URLs. I have gone through two packages, PyTrie (http://packages.python.org/PyTrie/#pytrie.StringTrie) and trie 0.1.1 (http://pypi.python.org/pypi/trie/0.1.1), but neither seems suited to the longest-prefix-match task on URLs.

For example, suppose my set has these URLs: 1 -> http://www.google.com/mail, 2 -> http://www.google.com/document, 3 -> http://www.facebook.com, etc.

Now if I search for 'http://www.google.com/doc', it should return 2, and a search for 'http://www.face' should return 3.

I wanted to confirm whether there is any standard Python package that can help me do this, or whether I should implement a trie for prefix matching myself.

I am not looking for a regular-expression kind of solution, since that does not scale as the number of URLs increases.

Thanks a lot.


Solution

  • This example is good for small URL lists but does not scale well.

    def longest_prefix_match(search, urllist):
        # Linear scan: keep every URL that starts with the search string,
        # then return the longest of them.
        matches = [url for url in urllist if url.startswith(search)]
        if matches:
            return max(matches, key=len)
        else:
            raise Exception("Not found")
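
    For instance (a quick check added here, not part of the original answer), with the example URLs from the question:

    url_list = [
        'http://www.google.com/mail',
        'http://www.google.com/document',
        'http://www.facebook.com',
    ]

    print(longest_prefix_match('http://www.google.com/doc', url_list))
    # -> http://www.google.com/document
    print(longest_prefix_match('http://www.face', url_list))
    # -> http://www.facebook.com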
    

    An implementation using the trie module.

    import trie
    
    
    def longest_prefix_match(prefix_trie, search):
        # Find the trie node for `search` and collect the value of every
        # node in the subtree below it, i.e. all stored URLs extending the
        # search string.  There may well be a more elegant way to do this
        # without using the "hidden" method _getnode.
        try:
            return [node.value for node in prefix_trie._getnode(search).walk()]
        except KeyError:
            return []
    
    url_list = [ 
        'http://www.google.com/mail',
        'http://www.google.com/document',
        'http://www.facebook.com',
    ]
    
    url_trie = trie.Trie()
    
    for url in url_list:
        url_trie[url] = url 
    
    searches = ("http", "http://www.go", "http://www.fa", "http://fail")
    
    for search in searches:
        print("'%s' ->" % search, longest_prefix_match(url_trie, search))
    

    Result:

    'http' -> ['http://www.facebook.com', 'http://www.google.com/document', 'http://www.google.com/mail']
    'http://www.go' -> ['http://www.google.com/document', 'http://www.google.com/mail']
    'http://www.fa' -> ['http://www.facebook.com']
    'http://fail' -> []
    

    Or using PyTrie, which gives the same result but with the lists ordered differently:

    from pytrie import StringTrie
    
    
    url_list = [ 
        'http://www.google.com/mail',
        'http://www.google.com/document',
        'http://www.facebook.com',
    ]
    
    url_trie = StringTrie()
    
    for url in url_list:
        url_trie[url] = url 
    
    searches = ("http", "http://www.go", "http://www.fa", "http://fail")
    
    for search in searches:
        print("'%s' ->" % search, url_trie.values(prefix=search))
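
    To get the single answer the question asks for (URL 2 for 'http://www.google.com/doc', URL 3 for 'http://www.face'), one option is a small wrapper over values(prefix=...). This is a sketch added on top of the snippet above (it reuses its url_trie), and preferring the shortest matching URL is an assumption: the question does not say which match should win when several URLs share the prefix.

    def best_match(url_trie, search):
        # All stored URLs extending `search`; pick one of them.
        matches = url_trie.values(prefix=search)
        if not matches:
            raise KeyError(search)
        # Assumption: prefer the shortest matching URL.
        return min(matches, key=len)

    print(best_match(url_trie, 'http://www.google.com/doc'))
    # -> http://www.google.com/document
    print(best_match(url_trie, 'http://www.face'))
    # -> http://www.facebook.com

    PyTrie also provides longest_prefix() and longest_prefix_value() for the opposite direction, i.e. finding the longest stored key that is a prefix of a given string.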
    

    I'm beginning to think a radix tree / Patricia trie would be better from a memory usage point of view. This is what the radix tree would look like:

    [Figure: radix tree of the example URLs]

    Whereas the trie looks more like:

    [Figure: trie of the example URLs]
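
    Neither package above implements a radix tree, so here is a minimal sketch of one (my own illustration, not taken from any library) supporting insert and prefix search. The memory saving comes from storing a shared run of characters once per edge instead of one trie node per character:

    class RadixNode(object):
        def __init__(self):
            self.children = {}   # first char -> (edge label, child node)
            self.value = None    # stored URL, if a key ends here

    def insert(node, key, value):
        if not key:
            node.value = value
            return
        first = key[0]
        if first not in node.children:
            # No existing edge shares the first character: one new edge
            # holds the whole remaining key.
            leaf = RadixNode()
            leaf.value = value
            node.children[first] = (key, leaf)
            return
        label, child = node.children[first]
        # Length of the common prefix of the key and the edge label.
        i = 0
        while i < len(label) and i < len(key) and label[i] == key[i]:
            i += 1
        if i < len(label):
            # Key and label diverge mid-edge: split the edge at position i.
            mid = RadixNode()
            mid.children[label[i]] = (label[i:], child)
            node.children[first] = (label[:i], mid)
            insert(mid, key[i:], value)
        else:
            insert(child, key[i:], value)

    def search_prefix(node, prefix):
        # Return all stored values whose key starts with `prefix`.
        if prefix:
            first = prefix[0]
            if first not in node.children:
                return []
            label, child = node.children[first]
            n = min(len(label), len(prefix))
            if label[:n] != prefix[:n]:
                return []
            # If the prefix ends mid-edge, prefix[len(label):] is ''.
            return search_prefix(child, prefix[len(label):])
        # Prefix exhausted: collect everything below this node.
        results = [] if node.value is None else [node.value]
        for _, child in node.children.values():
            results.extend(search_prefix(child, ''))
        return results

    root = RadixNode()
    for url in ['http://www.google.com/mail',
                'http://www.google.com/document',
                'http://www.facebook.com']:
        insert(root, url, url)

    print(search_prefix(root, 'http://www.go'))
    # -> ['http://www.google.com/mail', 'http://www.google.com/document']
    #    (order may vary)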