web-scraping url similarity sentence-similarity

How to check the similarity score between two web URLs?


I'm working on a project that frequently needs to check the similarity score between two web URLs. Initially I did this by scraping all the text from each web page and then calculating the document similarity. However, this is really time-consuming. What I'm looking for instead is a way to estimate the similarity using only the URL strings themselves, without going through all the page text.

e.g.:
url1:  https://en.wikipedia.org/wiki/Tic-tac-toe
url2:  https://en.wikipedia.org/wiki/Chess
A rough similarity estimate: 67% (since both are from Wikipedia and both are related to games)

Solution

  • You are probably better off comparing the individual pieces of each URL rather than the whole string: foo.com/a/b/c and boo.com/a/b/c would get a high whole-string sequence score, but the pages would probably have very different contents.

    For this you can use:

    from difflib import SequenceMatcher
    from w3lib.url import canonicalize_url
    from urllib.parse import urlparse
    
    
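    def compare_urls(url1, url2):
        """Return SequenceMatcher ratios (0.0 to 1.0) for each URL component."""
        # Normalize both URLs first (e.g. sorted query arguments).
        url1 = canonicalize_url(url1)
        url2 = canonicalize_url(url2)
        url1_parsed = urlparse(url1)
        url2_parsed = urlparse(url2)
        # Compare each component separately; two empty strings score 1.0.
        domain = SequenceMatcher(None, url1_parsed.netloc, url2_parsed.netloc).ratio()
        path = SequenceMatcher(None, url1_parsed.path, url2_parsed.path).ratio()
        query = SequenceMatcher(None, url1_parsed.query, url2_parsed.query).ratio()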
        return {
            "domain": domain,
            "path": path,
            "query": query,
        }
    
    if __name__ == "__main__":
        print(compare_urls(
            "https://en.wikipedia.org/wiki/Tic-tac-toe",
            "https://en.wikipedia.org/wiki/Chess"
        ))
    # prints: {'domain': 1.0, 'path': 0.5, 'query': 1.0}
    

    By splitting the sequence comparison into netloc (domain), path and query, you can assign a weight to each component's score and design a more successful comparison algorithm.
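
One way the weighting idea could look is sketched below. The weights are hypothetical values picked only for illustration, and `component_scores` and `weighted_similarity` are made-up helper names, not part of any library. To keep the sketch standard-library only it skips `canonicalize_url`, so you would probably want to add that step back for real URLs.

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Hypothetical weights, chosen only for illustration; tune them for your data.
WEIGHTS = {"domain": 0.5, "path": 0.4, "query": 0.1}


def component_scores(url1, url2):
    """Return SequenceMatcher ratios for the netloc, path and query of two URLs."""
    a, b = urlparse(url1), urlparse(url2)
    return {
        "domain": SequenceMatcher(None, a.netloc, b.netloc).ratio(),
        "path": SequenceMatcher(None, a.path, b.path).ratio(),
        "query": SequenceMatcher(None, a.query, b.query).ratio(),
    }


def weighted_similarity(url1, url2, weights=WEIGHTS):
    """Combine the per-component ratios into a single score between 0 and 1."""
    scores = component_scores(url1, url2)
    total = sum(weights.values())
    # Weighted average, so the weights do not have to sum to exactly 1.
    return sum(weights[name] * scores[name] for name in weights) / total


if __name__ == "__main__":
    score = weighted_similarity(
        "https://en.wikipedia.org/wiki/Tic-tac-toe",
        "https://en.wikipedia.org/wiki/Chess",
    )
    print(round(score, 2))  # 0.8 with the weights above
```

With these particular weights a shared domain dominates the result, so two unrelated pages on the same site still score fairly high. Whether that is what you want depends on your data.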