php regex recursive-regex

What regular expression pattern do I need to match everything between {{ and }}?



I'm trying to parse Wikipedia, but I end up with orphan }} after running the regex code. Here's my PHP script.

<?php

$articleName='england';

$url = "http://en.wikipedia.org/wiki/Special:Export/" . $articleName;
ini_set('user_agent','custom agent'); //required so that Wikipedia allows our request.

$feed = file_get_contents($url);
$xml = new SimpleXmlElement($feed);

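// Cast to string so the string functions below get plain text, not a SimpleXMLElement.
$wikicode = (string) $xml->page->revision->text;

// Strip link brackets, then try to remove {{ ... }} templates.
$wikicode = str_replace("[[", "", $wikicode);
$wikicode = str_replace("]]", "", $wikicode);
$wikicode = preg_replace('/\{\{([^}]*(?:\}[^}]+)*)\}\}/', '', $wikicode);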

print($wikicode);

?>

I think the problem is that I have nested {{ and }}, e.g.:

{{ something {{ something else {{ something new }}{{ something old }} something blue }} something green }}
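
A minimal reproduction of the problem, assuming the sample string above stands in for real wikitext (the variable names `$sample` and `$result` are only for illustration). The pattern is the same one used in the script:

```php
<?php
// The nested sample from above, run through the original (non-recursive) pattern.
$sample = '{{ something {{ something else {{ something new }}{{ something old }} something blue }} something green }}';

// Each match stops at the first "}}" it meets, so outer closers are never paired up.
$result = preg_replace('/\{\{([^}]*(?:\}[^}]+)*)\}\}/', '', $sample);

echo $result, PHP_EOL; // the output still contains unmatched "}}"
```

The first match ends at the innermost `}}`, so the closing delimiters of the outer templates are left behind.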


Solution

  • Your edit shows that you're trying to do a recursive match, which is very different from the original question. If you weren't just deleting the matched text I would advise you not to use regexes at all, but this should do what you want:

    $wikicode=preg_replace('~{{(?:(?:(?!{{|}}).)++|(?R))*+}}~s',
                           '', $wikicode);
    

    After the first {{ matches an opening delimiter, (?:(?!{{|}}).)++ gobbles up everything until the next delimiter. If it's another opening delimiter, the (?R) takes over and applies the whole regex again, recursively. If it's a closing delimiter, the loop ends and the final }} matches it, completing the current level.

    (?R) is about as non-standard as regex features get. It originated in the PCRE library, which is what powers PHP's regex flavor, and Perl later adopted the same syntax. Some other flavors have their own ways of matching recursive structures, all of them very different from each other.
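
As a quick check, the recursive pattern can be tried on a small nested string. This is only a sketch: the wrapper name `stripTemplates` is hypothetical, and the null check is an added precaution because `preg_replace` returns null when the regex engine hits an error (for example a backtracking or recursion limit on very large input).

```php
<?php
// Hypothetical helper wrapping the recursive pattern from the answer above.
function stripTemplates(string $text): string
{
    // {{ ... }} where the body is either non-delimiter text or a nested {{ ... }} via (?R).
    $result = preg_replace('~{{(?:(?:(?!{{|}}).)++|(?R))*+}}~s', '', $text);

    if ($result === null) {
        // preg_replace signals engine errors by returning null.
        throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
    }

    return $result;
}

$sample = 'keep {{ a {{ b }}{{ c }} d }} this';
echo stripTemplates($sample), PHP_EOL; // prints "keep  this" (two spaces remain)
```

The whole nested group is removed in one match, so no orphan }} is left behind as long as the delimiters in the input are balanced.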