python pandas urlparse

How to use str.contains() in a conditional statement to apply a function to some elements of a dataframe column?


I have a column in a pandas dataframe that holds various URLs to websites:

df:
    ID   URL
0   1    https://www.Facebook.com/fr
1   2    https://Twitter.com/de
2   3    https://www.Youtube.com
3   4    www.Microsoft.com
4   5    https://www.Stackovervlow.com

I am using urlparse().netloc to reduce the URLs to just their domain names (e.g., from https://www.Facebook.com/fr to www.Facebook.com). Some of the URLs are already in a clean format (www.Microsoft.com above), and applying urlparse().netloc to these clean URLs results in an empty cell, because without a scheme the whole string is parsed as a path rather than a network location. Therefore, I am trying to apply urlparse().netloc only to elements of the URL column that contain the string 'http', and keep the original URL otherwise. Here is the code I have been trying to use:

df['URL'] = df['URL'].apply(
    lambda x: urlparse(x).netloc if x.str.contains("http", na=False) else x
)

However, I get this error message: AttributeError: 'str' object has no attribute 'str'. Any help on how I can overcome this to complete the task would be much appreciated!


Solution

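  • You are using pandas.Series.apply, so your function (the lambda) receives each element itself, which is a plain str rather than a Series. The .str accessor only exists on a Series, which is why x.str.contains raises the AttributeError. You can simply use the in operator instead, as follows: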

    from urllib.parse import urlparse

    # x is a plain str here, so use "in" rather than x.str.contains
    df['URL'] = df['URL'].apply(
        lambda x: urlparse(x).netloc if "http" in x else x
    )
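
For anyone who wants to try this end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question and also shows one way to keep str.contains in the picture: call it on the whole column to get a boolean mask, then update only the matching rows. The names by_apply, by_mask and mask are made up for illustration, and the mask variant is just an alternative, not something the original answer proposed.

```python
from urllib.parse import urlparse

import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "URL": [
        "https://www.Facebook.com/fr",
        "https://Twitter.com/de",
        "https://www.Youtube.com",
        "www.Microsoft.com",
        "https://www.Stackovervlow.com",
    ],
})

# Option 1: element-wise. Inside apply, x is a plain str, so use "in".
by_apply = df["URL"].apply(
    lambda x: urlparse(x).netloc if "http" in x else x
)

# Option 2: .str.contains works on the Series itself and returns a
# boolean mask; only the rows where the mask is True are rewritten.
mask = df["URL"].str.contains("http", na=False)
by_mask = df["URL"].copy()
by_mask.loc[mask] = df.loc[mask, "URL"].apply(lambda x: urlparse(x).netloc)

print(by_apply.tolist())
print(by_mask.tolist())
```

Both options should leave www.Microsoft.com untouched and strip the scheme and path from the other rows. Note that "http" in x matches the substring anywhere in the URL; if that is too loose for your data, x.startswith("http") is a stricter check.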