php var-dump

var_dump $_SERVER['HTTP_HOST'] shows expected string, but unexpected result when comparing or parsing


I have a strange problem. Someone has created a URL similar to my website's (www.greatwebsite.com) and has been scraping the content in real time to show on their own website (www.bestwebsite.com). (I have changed the URLs for my client's privacy.) I have been trying to redirect requests made through this bad actor's URL, but have not been able to, because the $_SERVER['HTTP_HOST'] variable appears to have the wrong value in it.

<?php
$host = $_SERVER['HTTP_HOST'];
var_dump($host);
?>

When I visit the official website at www.greatwebsite.com, the HTTP_HOST variable outputs string(20) "www.greatwebsite.com", as it should, and when I compare the string value to "www.greatwebsite.com", everything works properly.

However, when I visit the bad actor website at www.bestwebsite.com, the var_dump outputs string(20) "www.bestwebsite.com", but the character count is 20 instead of 19. If I compare the string to "www.bestwebsite.com", the comparison returns false. So I printed out each character in the string, and even though var_dump shows www.bestwebsite.com, the string actually contains www.greatwebsite.com. If I echo the contents of $_SERVER['HTTP_HOST'], it shows www.bestwebsite.com, so I tried to capture it via output buffering like this:

ob_start();
echo $host;
$output = ob_get_clean(); // returns the buffer contents and closes the buffer

echo $output; // outputs www.bestwebsite.com
echo substr($output, 4, 5); // outputs great

echo still shows www.bestwebsite.com, but when I compare what is in $output, it still behaves as if the value is www.greatwebsite.com, so I am unable to write logic to detect when the request is coming from the bad actor website.
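One way to check what the string really contains is to look at it in a form that cannot be mistaken for the domain name as text. The sketch below is only an illustration: describeHost is a hypothetical helper name, and it assumes you can read the server's error log. It records the byte length and a hex dump of the value on the server side, where nothing between the server and the browser can alter it.

```php
<?php
// Hypothetical helper: describe a string in a form that a search-and-replace
// on the domain name cannot alter.
function describeHost(string $host): array
{
    return [
        'length' => strlen($host),  // byte count, the same number var_dump reports
        'hex'    => bin2hex($host), // each byte as two hex digits
    ];
}

$host = $_SERVER['HTTP_HOST'] ?? '';
$info = describeHost($host);

// Written to the server's error log rather than to the page output.
error_log(sprintf('HTTP_HOST length=%d hex=%s', $info['length'], $info['hex']));
```

If the log shows a length of 20 when the page is loaded through www.bestwebsite.com, one explanation consistent with the symptoms is that the scraper requests pages using the genuine host name and then replaces the domain in the returned HTML, which would also rewrite the var_dump and echo output on its way to the browser.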

Does anyone know why the HTTP_HOST value is behaving this way, and how I can compare it reliably to determine whether a request is coming from this bad actor website? I want to redirect those requests somewhere else so they will stop stealing my client's content.


Solution

  • This is a classic problem in informatics: the boundary between "always zero" and "one or more". Once there can be one bad actor, there can easily be many more. If you try to protect yourself by blacklisting each bad actor, you will exhaust your resources. A better strategy is to use a whitelist, i.e. if the HTTP_HOST is not www.greatwebsite.com, do not serve any content at all. This can be achieved with a simple header('HTTP/1.1 444 Go home', TRUE, 444) followed by exit, so that nothing after the header is sent.
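
The whitelist idea can be sketched roughly as follows. The names ALLOWED_HOSTS, isAllowedHost and rejectUnknownHost are hypothetical, and the lower-casing and port stripping are assumptions about how the host name may arrive, not something the answer specifies. Note that 444 is not a registered HTTP status code (nginx uses it internally); it is kept here only because the answer uses it.

```php
<?php
// Hypothetical whitelist; the single entry is the host name from the question.
const ALLOWED_HOSTS = ['www.greatwebsite.com'];

// Returns true only when the given Host value is on the whitelist.
function isAllowedHost(string $host, array $allowed): bool
{
    // Host names are case-insensitive; also drop an optional ":port" suffix.
    $name = strtolower(preg_replace('/:\d+$/', '', trim($host)));
    return in_array($name, $allowed, true);
}

// Sends the status line from the answer and stops the script for unknown hosts.
function rejectUnknownHost(string $host, array $allowed): void
{
    if (!isAllowedHost($host, $allowed)) {
        header('HTTP/1.1 444 Go home', true, 444);
        exit; // header() alone does not stop the rest of the page from running
    }
}
```

Calling rejectUnknownHost($_SERVER['HTTP_HOST'] ?? '', ALLOWED_HOSTS) at the top of the entry script, before any output, would apply the check to every request. This only helps when the scraper's request carries a host name other than the genuine one; if it sends www.greatwebsite.com as the Host, the request passes the check.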