python, python-3.x, pandas, dataframe

How to pass an entire column as a parameter to tldextract function?


tldextract is used to extract domain names from URLs. Here, 'url' is one of the column names in the data frame 'df'. Passing a single value of 'url' as a parameter works, but I am not able to pass the entire column as a parameter. The URL being passed here is 'https://www.google.com/search?source=hp&ei=7iE'.

listed = tldextract.extract(df['url'][0])
dom_name = listed.domain
print(dom_name)

Output: google

What I want is to create a new column in the data frame named 'Domain' containing the domain name extracted from each URL.

Something like:

df['Domain'] = tldextract.extract(df['url'])

But this isn't working.

Here is the code:

# IMPORTING PANDAS
import pandas as pd
from IPython.display import display

import tldextract

# Read data sample
df = pd.read_csv("bookcsv.csv")

df['Domain'] = df['url'].apply(lambda url: tldextract.extract(url).domain)

Here is the input data:

The dataframe looks like this. I am not able to put the data directly here, so I am posting a snapshot.


Solution

  • Using apply will call the function on every element in the column, one URL at a time, and keep the results neatly lined up with the original rows.

    df['Domain'] = df['url'].apply(lambda url: tldextract.extract(url).domain)
    

    Here's the full code I used for testing:

    import pandas as pd, tldextract
    
    df = pd.DataFrame([{'url':'https://google.com'}]*12)
    df['Domain'] = df['url'].apply(lambda url: tldextract.extract(url).domain)
    print(df)
    

    Output:

                       url  Domain
    0   https://google.com  google
    1   https://google.com  google
    2   https://google.com  google
    3   https://google.com  google
    4   https://google.com  google
    5   https://google.com  google
    6   https://google.com  google
    7   https://google.com  google
    8   https://google.com  google
    9   https://google.com  google
    10  https://google.com  google
    11  https://google.com  google