I'm trying to extract the domain length and the suffix (TLD) from a list of websites in a pandas DataFrame.
Website. Label
18egh.com 1
fish.co.uk 0
www.description.com 1
http://world.com 1
My desired output is:
Website Label Length Tld
18egh.com 1 5 com
fish.co.uk 0 4 co.uk
www.description.com 1 11 com
http://world.com 1 5 com
I first tried computing the length as follows:
from urllib.parse import urlparse

def get_domain(df):
    my_list = []
    for x in df['Website'].tolist():
        domain = urlparse(x).netloc
        my_list.append(domain)
    df['Domain'] = my_list
    df['Length'] = df['Domain'].str.len()
    return df
But when I check the result, the list is empty. I think urlparse is probably enough to extract the domain and the TLD, but if I am wrong, I'd appreciate it if you could point me in the right direction.
Update:
To extract the domain, suffix, etc., try tldextract.
Example:
import pandas as pd
import tldextract # pip install tldextract | # conda install -c conda-forge tldextract
df = pd.DataFrame({'Website.': {0: '18egh.com',
                                1: 'fish.co.uk',
                                2: 'www.description.com',
                                3: 'http://world.com',
                                4: 'http://forums.news.cnn.com/'},
                   'Label': {0: 1, 1: 0, 2: 1, 3: 1, 4: 0}})
def split_url(url):
    ext = tldextract.extract(url)
    # Attribute access avoids relying on the result being tuple-like
    return pd.Series([ext.subdomain, ext.domain, ext.suffix])

df[['subdomain', 'domain', 'suffix']] = df['Website.'].apply(split_url)
print(df)
                      Website.  Label    subdomain       domain  suffix
0                    18egh.com      1                     18egh     com
1                   fish.co.uk      0                      fish   co.uk
2          www.description.com      1          www  description     com
3             http://world.com      1                     world     com
4  http://forums.news.cnn.com/      0  forums.news          cnn     com
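To get from there to the Length and Tld columns asked for in the question, something like the following should work. This is only a sketch: it rebuilds the first four rows by hand from the domain and suffix values printed above so that it runs without tldextract, and the names Length, Tld and result are just taken from the question or chosen for illustration.

```python
import pandas as pd

# Rows copied from the tldextract output above (first four sites only),
# so this snippet runs on its own without tldextract installed.
df = pd.DataFrame({
    'Website.': ['18egh.com', 'fish.co.uk', 'www.description.com', 'http://world.com'],
    'Label': [1, 0, 1, 1],
    'domain': ['18egh', 'fish', 'description', 'world'],
    'suffix': ['com', 'co.uk', 'com', 'com'],
})

# Length of the registered domain name, without subdomain or suffix
df['Length'] = df['domain'].str.len()
# The suffix under the column name used in the question
df['Tld'] = df['suffix']

result = df[['Website.', 'Label', 'Length', 'Tld']]
print(result)
```

In practice you would skip the hand-built frame and run the last few lines directly on the df produced by the tldextract step.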
Original answer below
Try:
import pandas as pd
df = pd.DataFrame({'Website.': {0: '18egh.com',
1: 'fish.co.uk',
2: 'www.description.com',
3: 'http://world.com'},
'Label': {0: 1, 1: 0, 2: 1, 3: 1}})
# Optional scheme, then optional "www.", then capture everything up to the first dot
pattern = r'^(?:https?://)?(?:www\.)?(.*?)\.'
df['Domain'] = df['Website.'].str.extract(pattern)
df['Domain_Len'] = df['Domain'].str.len()
print(df)
              Website.  Label       Domain  Domain_Len
0            18egh.com      1        18egh           5
1           fish.co.uk      0         fish           4
2  www.description.com      1  description          11
3     http://world.com      1        world           5
Alternatively:
# Same prefix handling; the second group takes everything after the first dot as the TLD
pattern = r'^(?:https?://)?(?:www\.)?(.*?)\.(.*?)$'
df[['Domain', 'TLD']] = df['Website.'].str.extract(pattern, expand=True)
df['TLD_Len'] = df['TLD'].str.len()
df['Domain_Len'] = df['Domain'].str.len()
print(df)
              Website.  Label    TLD  TLD_Len       Domain  Domain_Len
0            18egh.com      1    com        3        18egh           5
1           fish.co.uk      0  co.uk        5         fish           4
2  www.description.com      1    com        3  description          11
3     http://world.com      1    com        3        world           5
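As for why the urlparse attempt in the question came back empty: urlparse only fills netloc when the string contains "//", so for entries without a scheme the whole string ends up in path instead. A minimal sketch of a workaround follows; netloc_of is a hypothetical helper name, not something from urllib.

```python
from urllib.parse import urlparse

def netloc_of(url):
    """Return the host part of url even when the scheme is missing (hypothetical helper)."""
    # Without '//' urlparse treats the whole string as a path and leaves netloc empty,
    # so prepend '//' when there is neither a scheme nor a leading '//'.
    if '://' not in url and not url.startswith('//'):
        url = '//' + url
    return urlparse(url).netloc

sites = ['18egh.com', 'fish.co.uk', 'www.description.com', 'http://world.com']

raw = [urlparse(s).netloc for s in sites]    # what the code in the question collects
fixed = [netloc_of(s) for s in sites]        # host for every entry

print(raw)
print(fixed)
```

Note that this only recovers the host (for example www.description.com). Splitting it into subdomain, domain and suffix still needs something like tldextract, because urlparse cannot tell that co.uk is a single suffix.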