java, string, url, apache-stringutils

How to efficiently check whether a given URL (String) contains a whitelisted domain (String) in Java


I need to write a utility method that takes a URL and checks whether it is valid.

The input can be anything: it may or may not start with a protocol such as http or https, and it may be a relative URL (for example, if the domain is example and the URL is "abc.com", then it is a relative URL). It may also be an invalid URL or just a plain string.

I also have a list of whitelisted domains, such as youtube.com and facebook.com, which can change at runtime.

How can I check whether a given URL is valid in my case? The basic check I am doing so far is below:

    // StringUtils here is Apache Commons Lang (e.g. org.apache.commons.lang3.StringUtils)
    String url = "http://youtube.com";
    if (!StringUtils.isEmpty(url))
    {
        if (url.startsWith("http://") || url.startsWith("https://")) {
            // check if url is from whitelist domains

        } else {
            // do nothing, url is not internal domain.
        }
    }

Now my question is: how do I properly extract the domain name that follows http:// or https:// in the URL?

Note: I am using Apache StringUtils, and it is quite possible that the URL looks like https://absdsbsb or https://anmds.txt. Also, please let me know whether this is a good case for regex matching.


Solution

  • The proper way to do this is to use the URI class.

    You could treat URLs as strings and look for particular patterns or substrings, but there are various "tricky" ways to write a URL that could let through URLs that shouldn't pass. (Though using a whitelist rather than a blacklist makes it more difficult to be tricky.)

    Anyhow, the approach should be to use the URI class (java.net.URI) to parse the URL string, then get and match the protocol (scheme) and host components.

    Once you have the domain name, it is a bit of a toss-up how to match it efficiently against a whitelist, but I would look at using a TreeSet, and consider using its floor and ceiling methods to accelerate domain prefix matching.

    (I would be surprised if regex matching would give you good performance.)
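
As a rough illustration of the approach described above, here is a minimal sketch that parses the input with java.net.URI, accepts only http and https, and then checks the host and each of its parent domains against a whitelist. The class name DomainWhitelist and the methods allow, extractHost and isAllowed are made up for this example. Note that it does plain set lookups on each parent domain rather than the floor/ceiling trick mentioned above, and it uses a ConcurrentSkipListSet (a sorted set like TreeSet, but thread-safe) on the assumption that the whitelist may be modified at runtime by other threads.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;

// Hypothetical helper class, not part of any library.
final class DomainWhitelist {

    // Sorted, thread-safe set so entries can be added or removed at runtime.
    private final Set<String> allowed = new ConcurrentSkipListSet<>();

    void allow(String domain) {
        allowed.add(domain.toLowerCase(Locale.ROOT));
    }

    void disallow(String domain) {
        allowed.remove(domain.toLowerCase(Locale.ROOT));
    }

    // Returns the lower-cased host of an absolute http(s) URL,
    // or null for relative URLs, other schemes, and unparsable input.
    static String extractHost(String url) {
        if (url == null || url.trim().isEmpty()) {
            return null;
        }
        try {
            URI uri = new URI(url.trim());
            String scheme = uri.getScheme();
            if (scheme == null) {
                return null; // no scheme: relative URL such as "abc.com"
            }
            if (!scheme.equalsIgnoreCase("http") && !scheme.equalsIgnoreCase("https")) {
                return null;
            }
            String host = uri.getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null; // plain strings with spaces etc. end up here
        }
    }

    // True if the host itself or one of its parent domains is whitelisted,
    // e.g. www.youtube.com matches a whitelist entry youtube.com.
    boolean isAllowed(String url) {
        String host = extractHost(url);
        while (host != null) {
            if (allowed.contains(host)) {
                return true;
            }
            int dot = host.indexOf('.');
            host = dot < 0 ? null : host.substring(dot + 1); // strip leftmost label
        }
        return false;
    }
}
```

With youtube.com whitelisted, isAllowed accepts http://youtube.com and https://www.youtube.com/watch, and rejects https://youtube.com.evil.example/, the scheme-less abc.com, and https://absdsbsb (which parses fine but is not on the list). Matching whole labels from the right is what prevents a look-alike host such as youtube.com.evil.example from slipping through, which a simple substring or contains check would allow.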