I need to keep some parts of a link:
Link
www.xxx.co.uk/path1
www.asx_win.com/path2
www.asdfe.aer.com
...
Desired output:
Link2
xxx.co.uk
asx_win.com
asdfe.aer.com
...
I used urlparse and tldextract, but I get either
Netloc
www.xxx.co.uk
www.asx_win.com
www.asdfe.aer.com
...
or
TLDEXTRACT
xxx
asx_win
asdfe.aer
...
A purely string-based approach could run into problems with links such as the following:
9 https://www.facebook.com/login/?next=https%3A%...
10 https://pt-br.facebook.com/114546123419/pos...
11 https://www.facebook.com/login/?next=https%3A%...
20 http://fsareq.media/?pg=article&id=s...
22 https://www.wq-wq.com/lrq-rqwrq-...
24 https://faseqrq.it/2020/05/28/...
My idea is to compare what I get from urlparse (Netloc) with what I get from tldextract, in order to recover the ending part.
For example, from Netloc I get www.xxx.co.uk and from tldextract I get xxx. This means that if I subtract the tldextract value from Netloc, I am left with www and co.uk. I would use the part in common as a cut-off point and keep the part after it (i.e., .co.uk), which is the piece I am looking for.
The difference would be given by something like df['Link2'] = [a.replace(b, '').strip() for a, b in zip(df['Netloc'], df['TLDEXTRACT'])]. This works only because of the ending part (suffix) that I need to consider.
Now I need to understand how to consider only the ending part to get the expected output. You can use the columns Netloc and TLDEXTRACT in the sample above.
tldextract.extract() returns a named tuple of (subdomain, domain, suffix):
tldextract.extract('www.xxx.co.uk')
# ExtractResult(subdomain='www', domain='xxx', suffix='co.uk')
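So you can just join the domain and suffix fields (the same two values as indexes [1:] of the tuple above, but accessed by name so the code does not depend on the result being sliceable):
import tldextract
def domain_and_suffix(link):
    ext = tldextract.extract(link)  # fields: subdomain, domain, suffix
    return '.'.join((ext.domain, ext.suffix))  # drop the subdomain, e.g. 'www'
df['Extracted'] = df.Link.apply(domain_and_suffix)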
# Link Extracted
# 0 www.xxx.co.uk/path1 xxx.co.uk
# 1 www.asx_win.com/path2 asx_win.com
# 2 www.asdfe.aer.com aer.com
# 3 https://www.facebook.com/login/?next=https%3A%... facebook.com
# 4 https://pt-br.facebook.com/114546123419/pos... facebook.com
# 5 https://www.facebook.com/login/?next=https%3A%... facebook.com
# 6 http://fsareq.media/?pg=article&id=s... fsareq.media
# 7 https://www.wq-wq.com/lrq-rqwrq-... wq-wq.com
# 8 https://faseqrq.it/2020/05/28/... faseqrq.it
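If you would rather build Link2 from the Netloc and TLDEXTRACT columns you already have, as the question suggests, the "cut at the common part" idea can be sketched with the standard library only. This also reproduces asdfe.aer.com for row 2, which the output above reduces to aer.com. The helper names netloc_of and keep_from_domain are made up for this illustration, and the sketch assumes links are either http(s) URLs or schemeless hosts like the ones in the sample.

```python
from urllib.parse import urlsplit


def netloc_of(link: str) -> str:
    """Return the host part of a link (hypothetical helper).

    Assumption: links are http(s) URLs or schemeless ("www.xxx.co.uk/path1").
    Schemeless links get "//" prepended so urlsplit treats the host as netloc.
    """
    if not link.startswith(("http://", "https://", "//")):
        link = "//" + link
    return urlsplit(link).netloc


def keep_from_domain(netloc: str, domain: str) -> str:
    """Cut netloc at the point where `domain` starts and keep the rest.

    Matching is done on whole dot-separated labels, so "xxx" will not match
    inside a longer label. If `domain` is not found, netloc is returned as-is.
    """
    dotted = "." + netloc
    idx = dotted.find("." + domain + ".")
    if idx == -1:
        # The domain may be the last label(s), with nothing after it.
        return domain if domain and dotted.endswith("." + domain) else netloc
    return dotted[idx + 1:]


links = ["www.xxx.co.uk/path1", "www.asx_win.com/path2", "www.asdfe.aer.com"]
domains = ["xxx", "asx_win", "asdfe.aer"]  # the TLDEXTRACT column from the question

link2 = [keep_from_domain(netloc_of(a), b) for a, b in zip(links, domains)]
print(link2)  # ['xxx.co.uk', 'asx_win.com', 'asdfe.aer.com']
```

With the DataFrame from the question, the same comprehension would be written as df['Link2'] = [keep_from_domain(a, b) for a, b in zip(df['Netloc'], df['TLDEXTRACT'])].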