python parsing url urllib

How can I remove 'www.' from a URL using urllib.parse in Python?


Original URL ▶ https://www.exeam.org/index.html

I want to extract exeam.org/ or exeam.org from the original URL.

To do this, I used urllib, the most powerful URL parser in Python that I know of. Unfortunately, none of the parsed fields (url.scheme, url.netloc, ...) gave me the format I wanted.


Solution

  • To extract the domain name from a URL using `urllib`:

    from urllib.parse import urlparse
    surl = "https://www.exam.org/index.html"
    urlparsed = urlparse(surl)
    # network location from parsed url
    print(urlparsed.netloc)
    # ParseResult Object
    print(urlparsed)
    

    The netloc here is www.exam.org. If you only want the exam.org part, you need to reduce it further to the registered domain. A simple string split may be enough, but you can also use a library such as tldextract, which knows how to separate subdomains, domains and public suffixes:

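    from tldextract import extract
    
    # same URL as in the urllib example
    surl = "https://www.exam.org/index.html"
    ext = extract(surl)
    # domain plus public suffix, without any subdomain
    print(ext.registered_domain)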
    

    This will produce:

    exam.org
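
The "simple split" option mentioned above can be sketched with the standard library alone. This is a minimal sketch, not a general solution: the helper name `strip_www` is hypothetical, and it assumes the URL includes a scheme (otherwise `urlparse` does not fill in the host) and that the only subdomain you need to drop is a leading `www.`.

```python
from urllib.parse import urlparse


def strip_www(url):
    """Return the host part of `url` without a leading 'www.' label.

    Hypothetical helper for illustration. Returns an empty string
    when urlparse finds no host (e.g. the URL has no scheme).
    """
    # .hostname is the lower-cased host without port or credentials
    host = urlparse(url).hostname or ""
    # Drop only an exact leading "www." and leave other subdomains alone
    if host.startswith("www."):
        return host[len("www."):]
    return host


print(strip_www("https://www.exam.org/index.html"))  # exam.org
```

Unlike tldextract, this does not consult the public suffix list, so a host such as blog.exam.co.uk is returned unchanged rather than reduced to exam.co.uk.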