I have a column in a pandas dataframe that holds various URLs to websites:
df:
   ID  URL
0  1   https://www.Facebook.com/fr
1  2   https://Twitter.com/de
2  3   https://www.Youtube.com
3  4   www.Microsoft.com
4  5   https://www.Stackovervlow.com
I am using urlparse().netloc to reduce each URL to its domain name (e.g., from https://www.Facebook.com/fr to www.Facebook.com). Some of the URLs are already in this clean format (www.Microsoft.com above), and applying urlparse().netloc to them results in an empty cell. Therefore, I am trying to apply urlparse().netloc only to elements of the URL column that contain the string 'http', and otherwise keep the original URL. Here is the code I have been trying to use:
df['URL'] = df['URL'].apply(
    lambda x: urlparse(x).netloc if x.str.contains("http", na=False) else x
)
However, I get this error message: AttributeError: 'str' object has no attribute 'str'. Any help on how I can overcome this to complete the task would be much appreciated!
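You are using pandas.Series.apply, so your function (the lambda) receives each element (a plain str) rather than a Series. A str has no .str accessor, which is what the error is telling you, so you can simply use the in operator as follows: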
from urllib.parse import urlparse

# x is a plain str here, so a substring test with `in` is enough
df['URL'] = df['URL'].apply(
    lambda x: urlparse(x).netloc if "http" in x else x
)
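If the column may also contain missing values, note that "http" in x raises a TypeError when x is NaN (a float). Below is a slightly more defensive sketch of the same idea, run on the sample data from the question. The helper name extract_domain is my own (hypothetical), and checking for an http:// or https:// prefix instead of any occurrence of "http" is an assumption about what you want:

```python
from urllib.parse import urlparse

import pandas as pd


def extract_domain(url):
    """Return the netloc for URLs that start with a scheme, else the input unchanged.

    `extract_domain` is a hypothetical helper name, not part of pandas or urllib.
    """
    # Non-string cells (e.g. NaN or None) are passed through untouched.
    if isinstance(url, str) and url.startswith(("http://", "https://")):
        return urlparse(url).netloc
    return url


# Sample data copied from the question.
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "URL": [
            "https://www.Facebook.com/fr",
            "https://Twitter.com/de",
            "https://www.Youtube.com",
            "www.Microsoft.com",
            "https://www.Stackovervlow.com",
        ],
    }
)

df["URL"] = df["URL"].apply(extract_domain)
print(df)
```

With this, www.Microsoft.com is left as it is, while the other rows are reduced to their netloc (for example www.Facebook.com and Twitter.com).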