I have a URL which can be in any of the following formats:
http://example.com
https://example.com
http://example.com/foo
http://example.com/foo/bar
www.example.com
example.com
foo.example.com
www.foo.example.com
foo.bar.example.com
http://foo.bar.example.com/foo/bar
example.net/foo/bar
Essentially, I need to be able to match any normal URL. How can I extract example.com (or .net, whatever the TLD happens to be; I need this to work with any TLD) from all of these via a single regex?
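Well, you can use parse_url to get the host. Note that parse_url only fills in the host when the URL has a scheme, so add one first for inputs like www.example.com:
$info = parse_url(preg_match('#^[a-z][a-z0-9+.-]*://#i', $url) ? $url : 'http://' . $url);
$host = $info['host'];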
Then you can split the host on the periods and keep only the last two parts (the domain name and the TLD):
$host_names = explode(".", $host);
$bottom_host_name = $host_names[count($host_names)-2] . "." . $host_names[count($host_names)-1];
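Not very elegant, but it should work for single-part TLDs such as .com or .net. A two-part suffix such as .co.uk would come back as co.uk, so handling those needs a list of known suffixes.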
If you want an explanation, here it goes:
First we grab the host, i.e. everything between the scheme (http://, etc.) and the path, by using parse_url's capabilities to... well... parse URLs. :)
Then we take the host name and split it into an array at the periods, so test.world.hello.myname would become:
array("test", "world", "hello", "myname");
After that, we take the number of elements in the array (4).
Then we subtract 2 from that count to get the index of the second-to-last string (the domain name, or example, in your example).
Then we subtract 1 from the count to get the index of the last string (because array keys start at 0), also known as the TLD.
Then we combine those two parts with a period, and you have your base host name.
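Putting the steps above together, here is a minimal sketch of a reusable function. The name base_host_name is hypothetical (it is not from the answer above), and the sketch assumes a single-part TLD; it also prepends a placeholder scheme, since parse_url does not report a host for scheme-less input such as www.example.com.

```php
<?php
// Hypothetical helper: returns the last two parts of the host,
// e.g. "example.com". Assumes a single-part TLD (.com, .net, ...).
function base_host_name(string $url): ?string
{
    // parse_url() only populates the host when a scheme is present,
    // so prepend a placeholder scheme to inputs like "www.example.com".
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = 'http://' . $url;
    }

    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null; // no usable host found
    }

    // Split on the periods and keep the last two parts.
    $parts = explode('.', $host);
    if (count($parts) < 2) {
        return $host; // single-part host such as "localhost"
    }
    return implode('.', array_slice($parts, -2));
}

// A few of the inputs from the question.
foreach (['http://example.com', 'www.foo.example.com', 'example.net/foo/bar'] as $url) {
    echo $url, ' => ', base_host_name($url), PHP_EOL;
}
```

For these three inputs the loop prints example.com, example.com and example.net. Using array_slice with a negative offset is just a shorter way of writing the count-minus-2 and count-minus-1 indexing described above.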