
How to make a dot in a regex pattern match newline characters?


I am having difficulty writing regular expressions when there is whitespace, including carriage returns and newlines, in the middle of the text I want to match.

For example, in the case below, how can I get the regular expression to capture "<div id="contentleft">"?

<div id="content"> 


<div id="contentleft">  <SCRIPT language=JavaScript>

I tried

id="content">(.*?)<SCRIPT

but it doesn't work.


Solution

  • $s = '<div id="content">
    
    <div id="contentleft">  <SCRIPT language=JavaScript>';
    
    if( preg_match('/id="content">(.*?)<SCRIPT/s', $s, $matches) )
        print $matches[1]."\n";
    

    By default, the dot matches any character except a newline. The /s modifier (PCRE_DOTALL) makes it match newlines as well.
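
To make the difference concrete, here is a minimal sketch comparing the same pattern with and without /s. The sample string and variable names are made up for illustration; the inline `(?s)` flag and the `[\s\S]` character class are two other common ways to get the same effect.

```php
<?php
// Hypothetical sample input: two lines separated by a newline.
$text = "first line\nsecond line";

// Without /s the dot stops at "\n", so the pattern cannot span both lines.
$withoutS = preg_match('/first(.*?)second/', $text, $m1);

// With /s (PCRE_DOTALL) the dot also matches "\n", so the match succeeds.
$withS = preg_match('/first(.*?)second/s', $text, $m2);

// The same effect using the inline (?s) flag inside the pattern.
$inline = preg_match('/(?s)first(.*?)second/', $text);

// A modifier-free alternative: [\s\S] matches any character at all.
$charClass = preg_match('/first([\s\S]*?)second/', $text);

var_dump($withoutS);  // int(0)
var_dump($withS);     // int(1)
var_dump($inline);    // int(1)
var_dump($charClass); // int(1)
```

With /s the captured group `$m2[1]` holds everything between the two words, including the newline.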

    But really, use a DOM parser. You can walk the tree or you can use an XPath query. Think of XPath as regexes for XML.

    $s = '<div id="content">
    
    <div id="contentleft">  <SCRIPT language=JavaScript>';
    
    // Load the HTML. The fragment is not well formed, so collect libxml
    // warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($s);
    
    // Use XPath to find the <div id="content"> tag's descendants.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query("//div[@id='content']/descendant::*");
    
    foreach( $nodes as $node ) {
        // Stop when we see <script ...> (loadHTML lowercases tag names)
        if( $node->nodeName === "script" )
            break;
    
        // do what you want with the content
    }
    

    XPath is extremely powerful. Here are some examples.
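
As a rough illustration of what XPath queries can look like here, the sketch below runs two of them against markup modelled on the question. The HTML string (with closing tags and a `<p>` added) and the variable names are assumptions made for the example, not part of the original page.

```php
<?php
// Hypothetical markup, modelled on the snippet in the question.
$html = '<div id="content">
<div id="contentleft"><script>var x = 1;</script><p>Hello</p></div>
</div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// The first <div> that is a direct child of <div id="content">.
$first = $xpath->query("//div[@id='content']/div[1]")->item(0);
echo $first->getAttribute('id'), "\n"; // contentleft

// Every element inside that div that is not a <script>.
$others = $xpath->query("//div[@id='contentleft']//*[not(self::script)]");
echo $others->length, "\n"; // 1
```

The second query does the filtering inside XPath itself, so no loop with a `break` is needed to skip the script element.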

    PS I'm sure (I hope) the above code can be tightened up some.