I need to write a utility method that takes a URL and checks whether it is valid.
The URL can be anything: with or without a protocol such as http or https, or a relative URL (for example, if the domain is example and the URL is "abc.com", it is a relative URL). It can also be invalid, or just a plain string.
I also have a list of whitelisted domains, which can change at runtime, such as youtube.com, facebook.com, etc.
How can I check whether a given URL is valid in my case? The basic check I am doing so far is below:
String url = "http://youtube.com";
if (!StringUtils.isEmpty(url)) {
    // Only absolute http/https URLs are checked against the whitelist
    if (url.startsWith("http://") || url.startsWith("https://")) {
        // check if url is from whitelisted domains
    } else {
        // do nothing, url is not an internal domain
    }
}
Now my question is: how do I properly extract the domain name from the URL, i.e. the part that comes after http or https?
Note: I am using Apache StringUtils, and it is quite possible that the URL looks like https://absdsbsb or https://anmds.txt. Also, let me know whether this is a good case for regex matching.
The proper way to do this is to use the URI class.
You can treat them as strings and look for particular patterns or substrings, but there are various "tricky" ways to write URLs that could be used to sneak through URLs that shouldn't get through. (Though, if you are using a whitelist rather than a blacklist, that makes it more difficult to be tricky.)
Anyhow, the approach should be to use the URI class to parse the URL string, then get and match the protocol and host components.
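As a rough illustration of the parsing step, here is a minimal sketch using java.net.URI. The class name UrlHostExtractor and the method extractHost are made up for this example, and treating anything without an http/https scheme as "no host" is an assumption about what the question wants, not the only reasonable policy.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, named for illustration only.
class UrlHostExtractor {

    /**
     * Returns the lower-cased host of an absolute http/https URL.
     * Returns an empty Optional for relative URLs, plain strings,
     * other schemes, or input that URI cannot parse.
     */
    static Optional<String> extractHost(String url) {
        if (url == null || url.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            URI uri = new URI(url.trim());
            String scheme = uri.getScheme(); // null for relative references
            String host = uri.getHost();     // null when no host could be parsed
            if (scheme == null || host == null) {
                return Optional.empty();
            }
            if (!scheme.equalsIgnoreCase("http") && !scheme.equalsIgnoreCase("https")) {
                return Optional.empty();
            }
            return Optional.of(host.toLowerCase(Locale.ROOT));
        } catch (URISyntaxException e) {
            // Illegal characters (a space, for instance) end up here.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(extractHost("http://youtube.com"));  // Optional[youtube.com]
        System.out.println(extractHost("https://anmds.txt"));   // Optional[anmds.txt]
        System.out.println(extractHost("abc.com"));             // Optional.empty
        System.out.println(extractHost("just a string"));       // Optional.empty
    }
}
```

Note that a string like "abc.com" parses without error as a relative reference, so a successful parse alone does not tell you the URL is absolute; that is why the sketch checks the scheme and host explicitly.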
Once you have the domain name, it is a bit of a toss-up how you efficiently match it against a whitelist, but I would look at using a TreeSet, and consider using its floor and ceiling methods to accelerate domain prefix matching.
(I would be surprised if regex matching would give you good performance.)
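For the whitelist lookup, one possible sketch is below. It keeps the domains in a TreeSet as suggested, but it is a simplification: rather than floor and ceiling, it checks the host and then each parent domain with contains, since floor on its own can land on a neighbouring entry instead of a parent domain. The class DomainWhitelist and its methods are hypothetical names, and allowing subdomains of a whitelisted domain is an assumption about the desired policy. The TreeSet is not thread-safe, so runtime updates from other threads would need synchronization.

```java
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical whitelist holder, named for illustration only.
class DomainWhitelist {

    private final TreeSet<String> domains = new TreeSet<>();

    DomainWhitelist(String... initial) {
        for (String domain : initial) {
            add(domain);
        }
    }

    // The whitelist can change at runtime.
    void add(String domain) {
        domains.add(normalize(domain));
    }

    void remove(String domain) {
        domains.remove(normalize(domain));
    }

    /**
     * True if the host itself or one of its parent domains is whitelisted,
     * e.g. "www.youtube.com" matches the entry "youtube.com".
     */
    boolean isAllowed(String host) {
        if (host == null || host.isEmpty()) {
            return false;
        }
        String candidate = normalize(host);
        while (true) {
            if (domains.contains(candidate)) {
                return true;
            }
            int dot = candidate.indexOf('.');
            if (dot < 0) {
                return false; // no parent domain left to try
            }
            candidate = candidate.substring(dot + 1); // drop the leftmost label
        }
    }

    private static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        DomainWhitelist whitelist = new DomainWhitelist("youtube.com", "facebook.com");
        System.out.println(whitelist.isAllowed("www.youtube.com"));      // true
        System.out.println(whitelist.isAllowed("notyoutube.com"));       // false
        System.out.println(whitelist.isAllowed("youtube.com.evil.org")); // false
    }
}
```

Matching on whole labels is what keeps "notyoutube.com" and "youtube.com.evil.org" out, which a plain endsWith or contains check on the string would not do.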