php html regex html-parsing html-content-extraction

How to extract img src, title and alt from HTML using PHP?


I would like to create a page that lists every image on my website together with its title and alternative text.

I have already written a small program to find and load all the HTML files, but now I am stuck on how to extract src, title and alt from this HTML:

<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />

I guess this should be done with a regex, but since the order of the attributes may vary, and I need all of them, I don't really know how to parse this in an elegant way (I could do it the hard way, char by char, but that's painful).


Solution

  • EDIT: now that I know better

    Using regexp to solve this kind of problem is a bad idea and will likely lead to unmaintainable and unreliable code. It is better to use an HTML parser.
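
    As a minimal sketch of the parser approach, the example below uses DOMDocument from PHP's bundled DOM extension. The helper name extractImages() and the sample $html string are assumptions made for illustration; attributes that are missing come back as empty strings.

    ```php
    <?php
    // Sketch only: extractImages() is a hypothetical helper, not part of the original answer.
    function extractImages(string $html): array
    {
        $doc = new DOMDocument();
        // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
        $previous = libxml_use_internal_errors(true);
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $images = [];
        foreach ($doc->getElementsByTagName('img') as $img) {
            // getAttribute() returns an empty string when the attribute is absent.
            $images[] = [
                'src'   => $img->getAttribute('src'),
                'title' => $img->getAttribute('title'),
                'alt'   => $img->getAttribute('alt'),
            ];
        }
        return $images;
    }

    $html = '<p><img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" /></p>';
    print_r(extractImages($html));
    ```

    Because the parser handles attribute order, quoting style and letter case itself, none of the special cases discussed for the regexp version need separate handling here.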

    Solution with regexp

    In that case it's better to split the process into two parts: first get all the img tags, then extract the attributes from each one.

    I will assume your document is not strict XHTML, so you can't use an XML parser. For example, with this web page's source code:

    /* preg_match_all matches the regexp against the whole $html string and stores
    every match as an array in $result. The "i" option makes it case insensitive. */
    
    preg_match_all('/<img[^>]+>/i', $html, $result);
    
    print_r($result);
    Array
    (
        [0] => Array
            (
                [0] => <img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />
                [1] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
                [2] => <img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />
                [3] => <img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />
                [4] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
    
    [...]
            )
    
    )
    

    Then we extract the attributes of each img tag with a loop:

    $img = array();
    // $result[0] holds the full matches, i.e. the img tags themselves
    foreach ($result[0] as $img_tag)
    {
        preg_match_all('/(alt|title|src)=("[^"]*")/i', $img_tag, $img[$img_tag]);
    }
    
    print_r($img);
    
    Array
    (
        [<img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />] => Array
            (
                [0] => Array
                    (
                        [0] => src="/Content/Img/stackoverflow-logo-250.png"
                        [1] => alt="logo link to homepage"
                    )
    
                [1] => Array
                    (
                        [0] => src
                        [1] => alt
                    )
    
                [2] => Array
                    (
                        [0] => "/Content/Img/stackoverflow-logo-250.png"
                        [1] => "logo link to homepage"
                    )
    
            )
    
        [<img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />] => Array
            (
                [0] => Array
                    (
                        [0] => src="/content/img/vote-arrow-up.png"
                        [1] => alt="vote up"
                        [2] => title="This was helpful (click again to undo)"
                    )
    
                [1] => Array
                    (
                        [0] => src
                        [1] => alt
                        [2] => title
                    )
    
                [2] => Array
                    (
                        [0] => "/content/img/vote-arrow-up.png"
                        [1] => "vote up"
                        [2] => "This was helpful (click again to undo)"
                    )
    
            )
    
        [<img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />] => Array
            (
                [0] => Array
                    (
                        [0] => src="/content/img/vote-arrow-down.png"
                        [1] => alt="vote down"
                        [2] => title="This was not helpful (click again to undo)"
                    )
    
                [1] => Array
                    (
                        [0] => src
                        [1] => alt
                        [2] => title
                    )
    
                [2] => Array
                    (
                        [0] => "/content/img/vote-arrow-down.png"
                        [1] => "vote down"
                        [2] => "This was not helpful (click again to undo)"
                    )
    
            )
    
        [<img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />] => Array
            (
                [0] => Array
                    (
                        [0] => src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
                        [1] => alt="gravatar image"
                    )
    
                [1] => Array
                    (
                        [0] => src
                        [1] => alt
                    )
    
                [2] => Array
                    (
                        [0] => "http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
                        [1] => "gravatar image"
                    )
    
            )
    
        [..]
    
    )
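
    The nested arrays above can be flattened into one record per image. The following is only a sketch of one way to do that: collectImageAttributes() is a hypothetical helper name, and it relies on the PREG_SET_ORDER flag so that each attribute match arrives as [full match, name, quoted value].

    ```php
    <?php
    // Sketch only: collectImageAttributes() is a hypothetical helper used for illustration.
    function collectImageAttributes(string $html): array
    {
        $images = [];
        // Step 1: grab every img tag.
        preg_match_all('/<img[^>]+>/i', $html, $tags);
        foreach ($tags[0] as $tag) {
            // Step 2: one entry per attribute: [full match, name, quoted value].
            preg_match_all('/(alt|title|src)=("[^"]*")/i', $tag, $attrs, PREG_SET_ORDER);
            $image = ['src' => null, 'title' => null, 'alt' => null];
            foreach ($attrs as $attr) {
                // Lower-case the name and strip the surrounding double quotes.
                $image[strtolower($attr[1])] = trim($attr[2], '"');
            }
            $images[] = $image;
        }
        return $images;
    }

    $html = '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />';
    print_r(collectImageAttributes($html));
    ```

    Attributes that are not found stay null, which makes images without a title or alt easy to spot in the listing. Like the pattern it reuses, this sketch only sees double-quoted values, and "src=" also matches inside a longer attribute name such as data-src.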
    

    Regexps can be CPU intensive, so you may want to cache this page. If you have no cache system, you can roll your own by using ob_start and loading from / saving to a text file.
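
    A home-made cache along those lines might look like the sketch below. The names renderCached(), $cacheFile and $maxAge are hypothetical, as is the one-hour lifetime and the cache file location; the closure stands in for whatever code prints the image list.

    ```php
    <?php
    // Sketch only: renderCached(), $cacheFile and $maxAge are illustrative names.
    function renderCached(string $cacheFile, int $maxAge, callable $render): string
    {
        // Serve the stored copy while it is still fresh.
        if (is_file($cacheFile) && time() - filemtime($cacheFile) < $maxAge) {
            $cached = file_get_contents($cacheFile);
            if ($cached !== false) {
                return $cached;
            }
        }

        ob_start();                      // start capturing everything that is echoed
        $render();                       // run the expensive page generation
        $page = (string) ob_get_clean(); // stop capturing and fetch the buffer

        // Save the generated page to a text file for the next request.
        file_put_contents($cacheFile, $page, LOCK_EX);
        return $page;
    }

    // Assumed location and lifetime: a file in the system temp directory, kept for one hour.
    echo renderCached(sys_get_temp_dir() . '/image-list.cache.html', 3600, function () {
        echo "<ul><li>image list goes here</li></ul>";
    });
    ```

    The regexps then only run when the cache file is missing or older than $maxAge seconds.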

    How does this stuff work?

    First, we use preg_match_all, a function that finds every string matching the pattern and outputs the matches in its third parameter.

    The regexps:

    <img[^>]+>
    

    We apply it to the HTML of every web page. It can be read as: every string that starts with "<img", contains one or more characters other than ">", and ends with a ">".

    (alt|title|src)=("[^"]*")
    

    We apply it to each img tag in turn. It can be read as: every string starting with "alt", "title" or "src", then a "=", then a double quote, any number of characters that are not a double quote, and a closing double quote. The parentheses capture the attribute name and the quoted value as separate sub-strings.

    Finally, every time you want to deal with regexps, it is handy to have good tools to quickly test them. Check this online regexp tester.

    EDIT : answer to the first comment.

    It's true that I did not think about the (hopefully few) people using single quotes.

    Well, if you use only ', just replace every " with '.

    If you mix both, first you should slap yourself :-), then match each quote style separately by using ("[^"]*"|'[^']*') instead of ("[^"]*").
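
    For mixed quoting, one option is an alternation that matches each quote style on its own, so a single quote inside a double-quoted value (or the reverse) does not end the match early. This is a sketch rather than the original author's code: extractQuotedAttributes() and the sample tag are made up for illustration.

    ```php
    <?php
    // Sketch only: extractQuotedAttributes() is a hypothetical helper used for illustration.
    function extractQuotedAttributes(string $tag): array
    {
        // Either "..." without inner double quotes, or '...' without inner single quotes.
        $pattern = '/(alt|title|src)=("[^"]*"|\'[^\']*\')/i';
        preg_match_all($pattern, $tag, $matches, PREG_SET_ORDER);

        $attributes = [];
        foreach ($matches as $m) {
            // Drop the first and last character, i.e. the opening and closing quote.
            $attributes[strtolower($m[1])] = (string) substr($m[2], 1, -1);
        }
        return $attributes;
    }

    $tag = '<img src=\'/image/fluffybunny.jpg\' title="Harvey\'s photo" alt=\'a "cute" bunny\' />';
    print_r(extractQuotedAttributes($tag));
    ```

    Unquoted values such as height=32 are still ignored by this pattern.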