
Extract domain name from URL in Python


I am trying to extract the domain names from a list of URLs, just like in https://stackoverflow.com/questions/18331948/extract-domain-name-from-the-url
My problem is that the URLs can be almost anything. A few examples:
m.google.com => google
m.docs.google.com => google
www.someisotericdomain.innersite.mall.co.uk => mall
www.ouruniversity.department.mit.ac.us => mit
www.somestrangeurl.shops.relevantdomain.net => relevantdomain
www.example.info => example
And so on.
The diversity of the domains doesn't allow me to use a regex as shown in "how to get domain name from URL": my script will run on an enormous number of URLs from real network traffic, so the regex would have to be enormous to catch all the kinds of domains mentioned above.
Unfortunately, my web research didn't turn up any efficient solution.
Does anyone have an idea of how to do this?
Any help will be appreciated!
Thank you


Solution

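  • Use tldextract, which goes beyond what urlparse offers: urlparse only splits a URL into its components, while tldextract also uses the Public Suffix List to tell where the suffix ends and the registered domain begins.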

    tldextract accurately separates the gTLD or ccTLD (generic or country code top-level domain) from the registered domain and subdomains of a URL.

    >>> import tldextract
    >>> ext = tldextract.extract('http://forums.news.cnn.com/')
    >>> ext
    ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
    >>> ext.domain
    'cnn'
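
To make it clearer why a suffix list handles cases like co.uk where a regex struggles, here is a minimal standard-library sketch of the longest-suffix matching idea. It is not tldextract's actual implementation: the KNOWN_SUFFIXES set and the registered_domain helper are hypothetical names invented for this illustration, and the set is a tiny hand-picked subset, whereas tldextract works from the full Public Suffix List. For real traffic, use tldextract itself.

```python
from urllib.parse import urlsplit

# Hypothetical, hand-picked subset of suffixes, for illustration only.
# tldextract relies on the full Public Suffix List instead.
KNOWN_SUFFIXES = {"com", "net", "info", "uk", "co.uk"}


def registered_domain(url):
    """Return the label just left of the longest known suffix, or None."""
    # urlsplit only fills in the hostname when a scheme or '//' is present,
    # so bare hosts such as 'm.google.com' get a '//' prefix first.
    if "//" not in url:
        url = "//" + url
    host = urlsplit(url).hostname
    if not host:
        return None
    labels = host.lower().rstrip(".").split(".")
    # Candidates are tried from longest to shortest, so 'co.uk' wins over 'uk'.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in KNOWN_SUFFIXES:
            return labels[i - 1]
    return None


urls = [
    "m.google.com",
    "m.docs.google.com",
    "www.someisotericdomain.innersite.mall.co.uk",
    "www.somestrangeurl.shops.relevantdomain.net",
    "http://www.example.info/path?x=1",
]
domains = [registered_domain(u) for u in urls]
print(domains)
```

Hosts whose suffix is missing from the set come back as None in this sketch, which is exactly the gap the complete list closes.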