
Remove <img> tag from an HTML string if the tag's src value contains a nominated string


I have a small problem: how can I find the <img> tag whose src string ends with dealer.jpg and remove only that tag from my content? For example:

<?php
$content = '<b>this is a content</b><img src=http://adress.com/as5.jpg><br> this is a content <img src=http://www.another-adress.com/dealer.jpg>';
$inf = explode("/dealer.jpg", $content);
$string = str_replace("<img src=\"$inf[0]/dealer.jpg\">", "", $content);
?>

I use this because I don't know the full image link; the full link is unpredictable, but I know the unwanted img's src value ends with dealer.jpg.

My script is not working... Can someone help me to correct it? This will help me to remove ads from the page that I've scraped.


Solution

  • If I understood correctly, you are trying to remove the img tag whose src ends with "dealer.jpg" (no matter the domain), right? Try this:

    $content = '<b>this is a content</b><img src=http://adress.com/as5.jpg><br> this is a content <img src=http://www.another-adress.com/dealer.jpg>';
    // A-Za-z instead of A-z: the A-z range also matches [ \ ] ^ _ and the backtick
    $content = preg_replace('/<img src=[A-Za-z0-9_":.\/-]+\/dealer\.jpg>/', '', $content);
    var_dump($content);
    

    Edit

    This second example matches the img tag even if it has other attributes such as alt, width, etc. before src (but again, the src must end with "dealer.jpg" and be the last attribute in the tag):

    $content = '<b>this is a content</b><img src="http://adress.com/as5.jpg"><br> this is a content <img alt="dealer-image" width="120" height="40" src="http://www.another-adress.com/dealer.jpg">';
    $content = preg_replace('/<img[A-Za-z0-9_:=".\/ -]+src="[A-Za-z0-9_:.\/-]+\/dealer\.jpg">/', '', $content);
    var_dump($content);
    

    Note: I changed the initial $content because I noticed it was missing the double quotes around the src attribute. I'm not sure whether that was a typo or your string really looks like this.
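
    If you are not sure whether the src value is quoted, or where src sits among the other attributes, a single, more tolerant pattern might cover both cases. This is only a sketch: the function name removeImgBySrcSuffixRegex is made up for illustration, and it assumes the attribute values themselves never contain a ">" character.

    ```php
    <?php
    // Hypothetical helper, not part of the original answer.
    function removeImgBySrcSuffixRegex(string $html, string $suffix): string
    {
        // Escape the suffix so that "." in "dealer.jpg" is matched literally.
        $quoted = preg_quote($suffix, '/');

        // <img, any attributes, whitespace + src=, an optional opening quote,
        // a value ending in $suffix, the same (optional) closing quote,
        // any remaining attributes, then ">".
        $pattern = '/<img\b[^>]*?\ssrc\s*=\s*(["\']?)[^"\'\s>]*'
                 . $quoted
                 . '\1(?=[\s>\/])[^>]*>/i';

        return preg_replace($pattern, '', $html);
    }

    // Unquoted src, as in the question
    $a = '<b>this is a content</b><img src=http://adress.com/as5.jpg><br> this is a content <img src=http://www.another-adress.com/dealer.jpg>';
    var_dump(removeImgBySrcSuffixRegex($a, 'dealer.jpg'));

    // Quoted src with extra attributes around it
    $b = '<img src="http://adress.com/as5.jpg"><img alt="dealer-image" src="http://www.another-adress.com/dealer.jpg" width="120">';
    var_dump(removeImgBySrcSuffixRegex($b, 'dealer.jpg'));
    ```

    The lookahead after the closing quote is there so that an unquoted value such as dealer.jpgx is not treated as a match.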

    Edit 2

    Here is an example using DOM (I guess this is the best approach here, since the order of attributes could change):

    $content = '<b>this is a content</b><img src="http://adress.com/as5.jpg"><br> this is a content <img alt="dealer-image" width="120" height="40" src="http://www.another-adress.com/dealer.jpg">';
    
    // creates a DOMDocument based on your string, and wraps it in a div
    $dom = new DOMDocument();
    $dom->loadHTML("<div>{$content}</div>", LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
    
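    // get all img tags; copy the live node list into an array so that
    // removing nodes while looping does not skip any of them
    $imgs = iterator_to_array($dom->getElementsByTagName('img'));
    foreach ($imgs as $img) {
        // remove the tag if its src ends with dealer.jpg
        // (a bare strpos() check is falsy when the match is at position 0)
        if (substr($img->getAttribute('src'), -strlen('dealer.jpg')) === 'dealer.jpg') {
            $img->parentNode->removeChild($img);
        }
    }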
    
    // get all the content of my first div, and print it
    $newContent = $dom->getElementsByTagName('div')->item(0);
    foreach ($newContent->childNodes as $childNode) {
        var_dump($dom->saveHTML($childNode));
    }
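
    Since you want the cleaned HTML back as a string rather than dumped piece by piece, the DOM approach could be wrapped in a small function along these lines. This is a sketch, not a definitive implementation: the name removeImgBySrcSuffix is hypothetical, and it assumes the scraped fragment is plain ASCII or otherwise safe for loadHTML's default encoding handling.

    ```php
    <?php
    // Hypothetical helper, not part of the original answer.
    function removeImgBySrcSuffix(string $html, string $suffix): string
    {
        $dom = new DOMDocument();

        // Scraped markup is often imperfect, so collect libxml warnings
        // internally instead of printing them, then restore the old setting.
        $previous = libxml_use_internal_errors(true);
        $dom->loadHTML("<div>{$html}</div>", LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        // Select every img that has a src attribute, then filter in PHP.
        $xpath = new DOMXPath($dom);
        foreach ($xpath->query('//img[@src]') as $img) {
            $src = $img->getAttribute('src');
            // "ends with" check that also works before PHP 8's str_ends_with()
            if (substr($src, -strlen($suffix)) === $suffix) {
                $img->parentNode->removeChild($img);
            }
        }

        // Serialize the children of the wrapper div back into one string.
        $out = '';
        foreach ($dom->documentElement->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
        return $out;
    }

    $content = '<b>this is a content</b><img src="http://adress.com/as5.jpg"><br> this is a content <img alt="dealer-image" width="120" height="40" src="http://www.another-adress.com/dealer.jpg">';
    var_dump(removeImgBySrcSuffix($content, 'dealer.jpg'));
    ```

    Filtering on the attribute value in PHP keeps the XPath expression trivial, and the same function can be reused for other ad image names by changing the second argument.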