php html html-parsing meta-tags text-extraction

Get content value and preceding attribute's value of all meta tags with a name or http-equiv attribute


I have the following string.

$data = "<meta charset='UTF-8'>
<meta name='keywords' content='your, tags'>
<meta name='description' content='150 words'>
<meta name='subject' content='your website's subject'>
<meta name='copyright' content='company name'>
<meta name='language' content='ES'>
<meta name='robots' content='index,follow'>
<meta name='revised' content='Sunday, July 18th, 2010, 5:15 pm'>
<meta name='abstract' content=''>
<meta name='topic' content=''>
<meta name='summary' content=''>
<meta name='Classification' content='Business'>
<meta name='author' content='name, email@hotmail.com'>
<meta name='designer' content=''>
<meta name='reply-to' content='email@hotmail.com'>
<meta name='owner' content=''>
<meta name='url' content='http://www.websiteaddrress.com'>
<meta name='identifier-URL' content='http://www.websiteaddress.com'>
<meta name='directory' content='submission'>
<meta name='pagename' content='jQuery Tools, Tutorials and Resources - O'Reilly Media'>
<meta name='category' content=''>
<meta name='coverage' content='Worldwide'>
<meta name='distribution' content='Global'>
<meta name='rating' content='General'>
<meta name='revisit-after' content='7 days'>
<meta name='subtitle' content='This is my subtitle'>
<meta name='target' content='all'>
<meta name='HandheldFriendly' content='True'>
<meta name='MobileOptimized' content='320'>
<meta name='date' content='Sep. 27, 2010'>
<meta name='search_date' content='2010-09-27'>
<meta name='DC.title' content='Unstoppable Robot Ninja'>
<meta name='ResourceLoaderDynamicStyles' content=''>
<meta name='medium' content='blog'>
<meta name='syndication-source' content='https://mashable.com/2008/12/24/free-brand-monitoring-tools/'>
<meta name='original-source' content='https://mashable.com/2008/12/24/free-brand-monitoring-tools/'>
<meta name='verify-v1' content='dV1r/ZJJdDEI++fKJ6iDEl6o+TMNtSu0kv18ONeqM0I='>
<meta name='y_key' content='1e39c508e0d87750'>
<meta name='pageKey' content='guest-home'>
<meta itemprop='name' content='jQTouch'>
<meta http-equiv='Expires' content='0'>
<meta http-equiv='Pragma' content='no-cache'>
<meta http-equiv='Cache-Control' content='no-cache'>
<meta http-equiv='imagetoolbar' content='no'>
<meta http-equiv='x-dns-prefetch-control' content='off'>";

I want to extract the values for the listed meta tags, including both the name meta tags and the http-equiv meta tags.

This is where I'm at with this:

// explode the string by newline
$parts = explode("\n" ,$data);

// loop through each meta tag line
foreach ($parts as $part) {

    // match inside the name attribute and the content attribute
    preg_match("/<meta name=\"(.*)\" content=\"(.*)\" \/>/i", $part, $matches);
  
    // returns "</pre><pre>Array()"
    print "<pre>" . print_r($matches, true) . "</pre>";
}

I think there's something wrong with my regular expression.


Solution

  • The attributes in your string use single quotes, not double quotes, and each tag ends with a plain > rather than " />" (no space or slash before it):

    preg_match("/<meta name='([^']*)' content='([^']*)'\s?\/?>/i", $part, $matches);
    

    Explanation:

    [^']* # match any characters up to the next '
    \s?   # an optional whitespace character
    \/?   # an optional slash
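
To see the single-quote pattern in context, here is a rough sketch that runs it over a shortened copy of $data. The (name|http-equiv) alternation is an assumed extension, not part of the pattern above, added so that the http-equiv tags are picked up as well; $found is just an illustrative variable name.

```php
<?php
// Shortened sample in the same single-quoted style as $data in the question.
$data = "<meta charset='UTF-8'>
<meta name='keywords' content='your, tags'>
<meta name='abstract' content=''>
<meta http-equiv='Pragma' content='no-cache'>";

// Assumption: (name|http-equiv) is added so both attribute kinds match.
$pattern = "/<meta (name|http-equiv)='([^']*)' content='([^']*)'\s?\/?>/i";

$found = [];
foreach (explode("\n", $data) as $part) {
    // preg_match returns 1 on a match, so unmatched lines are skipped.
    if (preg_match($pattern, $part, $matches)) {
        // $matches[1] = attribute name, [2] = its value, [3] = content value
        $found[] = [$matches[1], $matches[2], $matches[3]];
    }
}

print_r($found);
```

The charset line has neither attribute, so it is skipped. Note that [^']* stops at the first apostrophe, so a line such as content='your website's subject' will not match this pattern.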
    

    Here is a version that also accepts double quotes and multiple spaces:

    "/<meta\s+name=['\"]([^'\"]*)['\"]\s+content=['\"]([^'\"]*)['\"]\s*\/?>/i"
    

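
A quote-agnostic pattern like the one above can also be run once over the whole string with preg_match_all instead of line by line. The sketch below is one possible variant, not the pattern from this answer: it captures the opening quote and uses a backreference so the closing quote must be the same character, which lets a double-quoted value contain an apostrophe. The sample markup and the names $html and $pairs are made up for illustration.

```php
<?php
// Hypothetical sample mixing quote styles, spacing and self-closing tags.
$html = <<<'HTML'
<meta charset="UTF-8">
<meta name="pagename" content="O'Reilly Media" />
<meta name='subject' content='your website's subject'>
<meta   name='robots'   content='index,follow'>
<meta http-equiv='Expires' content='0'>
HTML;

// (["\']) captures the opening quote; \2 and \4 require the same quote to close.
$pattern = '/<meta\s+(name|http-equiv)=(["\'])(.*?)\2\s+content=(["\'])(.*?)\4\s*\/?>/i';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$pairs = [];
foreach ($matches as $m) {
    // $m[1] = attribute name, $m[3] = its value, $m[5] = content value
    $pairs[] = [$m[1], $m[3], $m[5]];
}

print_r($pairs);
```

Because the lazy (.*?) only stops at a quote that is followed by the end of the tag, this variant also happens to cope with the unescaped apostrophe in 'your website's subject'. That is still malformed HTML, so treat it as a convenience rather than something to rely on.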

    That said, a DOM parser is generally a more robust way to inspect HTML elements than a regular expression.
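
As a rough sketch of the DOM route, assuming PHP's bundled DOM extension is available, the following loads a shortened copy of $data into DOMDocument and keeps only the meta tags that carry a name or http-equiv attribute. The result layout (attribute, value, content) is just one possible shape, chosen for illustration.

```php
<?php
// Shortened sample; the full $data string from the question can be used instead.
$data = "<meta charset='UTF-8'>
<meta name='keywords' content='your, tags'>
<meta itemprop='name' content='jQTouch'>
<meta http-equiv='Pragma' content='no-cache'>";

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($data);
libxml_clear_errors();

$result = [];
foreach ($dom->getElementsByTagName('meta') as $meta) {
    // Keep only tags that have a name or an http-equiv attribute.
    foreach (['name', 'http-equiv'] as $attr) {
        if ($meta->hasAttribute($attr)) {
            $result[] = [
                'attribute' => $attr,
                'value'     => $meta->getAttribute($attr),
                'content'   => $meta->getAttribute('content'),
            ];
            break;
        }
    }
}

print_r($result);
```

Here the charset and itemprop tags are skipped because they have neither attribute, and quoting style or extra whitespace in the markup no longer matters.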