I am having difficulty writing regular expressions when there are whitespace and carriage returns between the pieces of text I want to match.
For example, in the case below, how can I write a regular expression that captures "<div id="contentleft">"?
<div id="content">
<div id="contentleft"> <SCRIPT language=JavaScript>
I tried
id="content">(.*?)<SCRIPT
but it doesn't work.
$s = '<div id="content">
<div id="contentleft"> <SCRIPT language=JavaScript>';
if( preg_match('/id="content">(.*?)<SCRIPT/s', $s, $matches) )
print $matches[1]."\n";
By default, the dot matches everything except newlines. The /s modifier makes it match newlines as well.
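To make the effect of the /s modifier concrete, here is a small sketch that runs the same pattern with and without it on the string from the question. The variable names ($without, $with, $m1, $m2) are only illustrative.

```php
<?php
// Same input as in the question: the two tags are separated by a newline.
$s = "<div id=\"content\">\n<div id=\"contentleft\"> <SCRIPT language=JavaScript>";

// Without /s the dot stops at the newline, so the pattern cannot span both lines.
$without = preg_match('/id="content">(.*?)<SCRIPT/', $s, $m1);

// With /s (PCRE "dotall" mode) the dot also matches "\n".
$with = preg_match('/id="content">(.*?)<SCRIPT/s', $s, $m2);

var_dump($without);      // int(0)
var_dump($with);         // int(1)

// The capture still carries the surrounding whitespace, so trim it.
echo trim($m2[1]), "\n"; // <div id="contentleft">
```

Note that the captured group includes the leading newline and the trailing space, which is why the sketch calls trim() before printing.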
But really, use a DOM parser. You can walk the tree or you can use an XPath query. Think of XPath as regexes for XML.
$s = '<div id="content">
<div id="contentleft"> <SCRIPT language=JavaScript>';
// Load the HTML
$doc = new DOMDocument();
$doc->loadHTML($s);
// Use XPath to find the <div id="content"> tag's descendants.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//div[@id='content']/descendant::*");
foreach( $nodes as $node ) {
    // Stop when we see <script ...>
    if( $node->nodeName == "script" )
        break;
    // do what you want with the content
}
XPath is extremely powerful. Here are some examples.
PS I'm sure (I hope) the above code can be tightened up some.
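One way the loop might be tightened, offered as a sketch rather than the definitive version: let the XPath expression do the filtering, so no break is needed. Two assumptions here are not in the original: closing tags have been added to the sample so that it is well-formed, and libxml_use_internal_errors(true) is called to keep the parser from printing warnings about imperfect HTML. The $ids array is only there to show what was selected.

```php
<?php
// Assumption: closing tags added to the question's snippet so it parses cleanly.
$s = '<div id="content">
<div id="contentleft"> <SCRIPT language=JavaScript></SCRIPT></div></div>';

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($s);

$xpath = new DOMXPath($doc);

// Element descendants of <div id="content"> that are not <script>
// and have no <script> before them in document order.
$nodes = $xpath->query(
    "//div[@id='content']//*[not(self::script) and not(preceding::script)]"
);

$ids = [];
foreach ($nodes as $node) {
    $ids[] = $node->getAttribute('id');
}

print implode(",", $ids) . "\n"; // contentleft
```

The HTML parser lowercases tag names, which is why the expression tests for script even though the source says SCRIPT.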