
Extract URL information from a pandas column


I need to keep some parts of a link:

Link             
www.xxx.co.uk/path1
www.asx_win.com/path2
www.asdfe.aer.com
...

Desired output:

Link2
xxx.co.uk
asx_win.com
asdfe.aer.com
...

I tried urlparse and tldextract, but I get either

Netloc
www.xxx.co.uk
www.asx_win.com
www.asdfe.aer.com
...

or

TLDEXTRACT

xxx
asx_win
asdfe.aer
...

If I use plain string operations instead, URLs like the following cause problems:

9     https://www.facebook.com/login/?next=https%3A%...
10    https://pt-br.facebook.com/114546123419/pos...
11    https://www.facebook.com/login/?next=https%3A%...
20    http://fsareq.media/?pg=article&id=s...
22    https://www.wq-wq.com/lrq-rqwrq-...
24    https://faseqrq.it/2020/05/28/...

My idea is to compare what I get from urlparse (Netloc) with what I get from tldextract (the name without its ending part). For example, Netloc gives www.xxx.co.uk and tldextract gives xxx. If I remove the tldextract value from Netloc, I am left with www and co.uk. I would use the common part (xxx) as the cut-off point and keep what comes after it (i.e., .co.uk), which is what I am looking for.

The difference can be computed with something like df['Link2'] = [a.replace(b, '').strip() for a, b in zip(df['Netloc'], df['TLDEXTRACT'])]. This gives me the whole difference, but I only need its ending part (the suffix). How can I keep only the ending part to get the expected output? You can use the columns Netloc and TLDEXTRACT from the sample above.
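
The cut-off idea described above could be sketched roughly as follows. This is only an illustration on plain lists holding the sample Netloc and TLDEXTRACT values; keep_from_domain is a hypothetical helper name, and the sketch assumes the tldextract value occurs in the netloc only once.

```python
def keep_from_domain(netloc: str, domain: str) -> str:
    """Drop everything in `netloc` that precedes the first occurrence of `domain`."""
    start = netloc.find(domain)
    # If the domain string is not found, return the netloc unchanged.
    return netloc[start:] if start != -1 else netloc


# Sample values copied from the Netloc and TLDEXTRACT columns in the question.
netlocs = ["www.xxx.co.uk", "www.asx_win.com", "www.asdfe.aer.com"]
domains = ["xxx", "asx_win", "asdfe.aer"]

link2 = [keep_from_domain(a, b) for a, b in zip(netlocs, domains)]
print(link2)  # ['xxx.co.uk', 'asx_win.com', 'asdfe.aer.com']
```

The same list comprehension should work with zip(df['Netloc'], df['TLDEXTRACT']). A value that also appears inside the subdomain would be cut at the wrong place, since find() returns the first match.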


Solution

  • tldextract.extract() returns a named tuple of (subdomain, domain, suffix):

    tldextract.extract('www.xxx.co.uk')
    
    # ExtractResult(subdomain='www', domain='xxx', suffix='co.uk')
    

    So you can join the elements at indexes [1:] (domain and suffix) with a dot:

    import tldextract
    df['Extracted'] = df.Link.apply(lambda x: '.'.join(tldextract.extract(x)[1:]))
    
    #                                                 Link     Extracted
    # 0                                www.xxx.co.uk/path1     xxx.co.uk
    # 1                              www.asx_win.com/path2   asx_win.com
    # 2                                  www.asdfe.aer.com       aer.com
    # 3  https://www.facebook.com/login/?next=https%3A%...  facebook.com
    # 4     https://pt-br.facebook.com/114546123419/pos...  facebook.com
    # 5  https://www.facebook.com/login/?next=https%3A%...  facebook.com
    # 6            http://fsareq.media/?pg=article&id=s...  fsareq.media
    # 7                https://www.wq-wq.com/lrq-rqwrq-...     wq-wq.com
    # 8                  https://faseqrq.it/2020/05/28/...    faseqrq.it
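
Note that row 2 above comes out as aer.com rather than the asdfe.aer.com asked for in the question, because tldextract treats asdfe as part of the subdomain. If the goal is only to drop a leading www., a minimal standard-library sketch could look like the one below. It assumes that links either start with http://, https:// or // or have no scheme at all, and that the only unwanted prefix is literally www.; strip_www is a hypothetical helper name.

```python
from urllib.parse import urlparse


def strip_www(url: str) -> str:
    """Return the host of `url` without a leading 'www.' (hypothetical helper)."""
    # urlparse only fills in the host when the URL has a scheme or starts
    # with '//', so scheme-less inputs such as 'www.xxx.co.uk/path1' get
    # '//' prepended first.
    if not url.startswith(("http://", "https://", "//")):
        url = "//" + url
    # hostname is lower-cased and excludes any port; it is None if missing.
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


print(strip_www("www.asdfe.aer.com"))  # asdfe.aer.com
```

On the DataFrame this could be applied with df['Link2'] = df['Link'].map(strip_www). Unlike the tldextract version, other subdomains are kept, so https://pt-br.facebook.com/... gives pt-br.facebook.com rather than facebook.com.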