I have a couple of websites that are subdomains (e.g., Wordpress, Altervista, Blogpress, ...).
I'm currently using urlparse to split URLs into their elements. However, it does not seem to allow distinguishing subdomains, only the TLD.
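To illustrate (assuming urllib.parse.urlparse is what is meant), a minimal example: the whole hostname comes back as a single netloc string, so the subdomain is never separated out:

from urllib.parse import urlparse

parsed = urlparse("https://artgateblog.altervista.org/some/post")
print(parsed.scheme)  # https
print(parsed.netloc)  # artgateblog.altervista.org  (subdomain and domain are not split)
print(parsed.path)    # /some/post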
Alternatively, I could use a vocabulary containing all the subdomain suffixes and, based on that, assign 1 or 0. But since I don't know all the blogs, I'm wondering if there is a way to do the detection automatically.
For example, I was thinking of looking at the dots, but many websites have a dot in the hostname without being subdomains, so this approach is not good.
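To make the two ideas concrete, here is a rough sketch; the suffix set is a hypothetical, deliberately incomplete example, and the last line shows why counting dots alone misfires:

# Hypothetical, incomplete vocabulary of blog-hosting suffixes
KNOWN_BLOG_SUFFIXES = {"altervista.org", "blogspot.com", "wordpress.com"}

def is_blog_subdomain(hostname):
    # 1 if the hostname ends with a known blog-hosting suffix, 0 otherwise
    return int(any(hostname.endswith("." + s) for s in KNOWN_BLOG_SUFFIXES))

print(is_blog_subdomain("artgateblog.altervista.org"))  # 1
print(is_blog_subdomain("www.nytimes.com"))             # 0, even though it contains dots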
I think the tld library should do the trick: https://pypi.org/project/tld/.
Here's an example:
from tld import get_tld

url = "https://artgateblog.altervista.org/"
res = get_tld(url, as_object=True)  # parse the URL against the public suffix list
blogname, blog_domain = res.domain, res.tld  # blog name and the suffix it lives under
print(blogname, blog_domain)
Out:
artgateblog altervista.org
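Note that the split depends on the public suffix list the library ships with: altervista.org is listed as a (private) suffix, so the blog name lands in res.domain. For an ordinary domain the leading label shows up in res.subdomain instead; a quick check, assuming the standard suffix list:

from tld import get_tld

res = get_tld("https://blog.example.com/", as_object=True)
print(res.subdomain, res.domain, res.tld)  # blog example com
print(res.fld)                             # example.com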
EDIT after comments:
For URLs that don't include the protocol, I think you need to add it; the fix_protocol=True option handles that, something like the below:
from tld import get_tld

urls = ["12story.altervista.org", "fantasy_story.blogspot.com"]
for url in urls:
    res = get_tld(url, as_object=True, fix_protocol=True)  # fix_protocol prepends a scheme for bare hostnames
    blogname, blog_domain = res.domain, res.tld
    print(blogname, blog_domain)
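If some entries are not parseable at all, get_tld raises an exception by default; passing fail_silently=True makes it return None instead, so you can skip the bad ones. A small sketch:

from tld import get_tld

for url in ["12story.altervista.org", "not a url"]:
    res = get_tld(url, as_object=True, fix_protocol=True, fail_silently=True)
    if res is None:
        print("could not parse:", url)
    else:
        print(res.domain, res.tld)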