php string url truncate sanitization

Truncate string before ampersand


I have a site crawler that displays a list of URLs, but I cannot for the life of me get the last regex quite right. All URLs end up listed as:

http://www.website.org/page1.html&--EFTTIUGJ4ITCyh0Frzb_LFXe_eHw
http://website.net/page2/&--EyqBLeFeCkSfmvA7p0cLrsy1Zm1g
http://foobar.website.com/page3.php&--E5WRBxuTOQikDIyBczaVXveOdRFg

The URLs can all be different, and the only thing that seems constant is the & symbol. How would I go about getting rid of the & symbol and everything to the right of it?

Here is what I have tried with the above results:

function getresults($sterm) {
    $html = file_get_html($sterm);
    $result = "";
    // find all h3 tags with class="r"
    foreach ($html->find('h3[class="r"]') as $ef) {
        $result .= $ef->outertext . '<br>';
    }
    return $result;
}

function geturl($url) {
    $var = $url;
    $result = "";

    preg_match_all("/a[\s]+[^>]*?href[\s]?=[\s\"\/url?q=\']+" .
                   "(.*?)[\"\']+.*?>" . "([^<]+|.*?)?<\/a>/",
                   $var, $matches);

    $matches = $matches[1];

    foreach ($matches as $var) {
        $result .= $var . "<br>";
    }

    echo preg_replace('/sa=U.*?usg=.*?AFQjCN/', "--", $result);
}

Solution

  • If the URLs are ALWAYS in the same format, use explode:

    <?php
    $tmp = explode("&", "http://foobar.website.com/page3.php&--E5WRBxuTOQikDIyBczaVXveOdRFg");
    ?>
    

    $tmp[0] should contain "http://foobar.website.com/page3.php" and $tmp[1] should contain "--E5WRBxuTOQikDIyBczaVXveOdRFg".
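
If some of the URLs might not contain an & at all, a small variation may be more convenient than indexing into the exploded array. The following is only a sketch: it uses strstr() with its third argument set to true, which returns the part of the string before the first match. The function name stripAfterAmpersand is made up for this example, and it assumes the real URL never contains an & of its own, which holds for the samples in the question but not for URLs with multi-parameter query strings.

```php
<?php
// Hypothetical helper; the name is an assumption, not part of the original code.
function stripAfterAmpersand(string $url): string
{
    // With the third argument set to true, strstr() returns everything
    // before the first "&", or false when the string has no "&".
    $head = strstr($url, '&', true);

    // No "&" found: return the URL unchanged.
    return $head === false ? $url : $head;
}

// Sample URLs taken from the question.
$urls = [
    "http://www.website.org/page1.html&--EFTTIUGJ4ITCyh0Frzb_LFXe_eHw",
    "http://website.net/page2/&--EyqBLeFeCkSfmvA7p0cLrsy1Zm1g",
    "http://foobar.website.com/page3.php&--E5WRBxuTOQikDIyBczaVXveOdRFg",
];

foreach ($urls as $url) {
    echo stripAfterAmpersand($url), "\n";
}
```

In geturl() this could be applied to each $var inside the foreach loop, before it is appended to $result.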