I have this tag in a string:
<?xml:namespace prefix = o /?>
How do I remove that and similar tags from the string with PHP and regex?
I tried:
$clean = preg_replace('/<\?xml[^>]+\/>/im', '', $dirty);
What you have in that string is a Processing Instruction (PI, see XML 1.0).
If you want to remove those PIs from a string that you expect to be UTF-8 encoded, without using the PCRE UTF-8 modifier, you can use the following pattern:
~
<\?
(?: [A-Za-z_:] | [^\x00-\x7F] ) (?: [A-Za-z_:.-] | [^\x00-\x7F] )*
(?: \?> | \s (?: [^?]* \?+ ) (?: [^>?] [^?]* \?+ )* >)
~x
It is a translation from a REX expression for XML Processing Instructions to a PCRE expression as used in PHP.
A code example:
$str = "some string <?xml:namespace prefix = o /?> that is";
$pattern = '~
<\?
(?: [A-Za-z_:] | [^\x00-\x7F] ) (?: [A-Za-z_:.-] | [^\x00-\x7F] )*
(?: \?> | \s (?: [^?]* \?+ ) (?: [^>?] [^?]* \?+ )* >)
~x';
echo preg_replace($pattern, '', $str);
The output (two spaces remain where the instruction was removed):
some string  that is
Unlike the previous answer, this regular expression takes the end of a processing instruction ("?>") correctly into account. In particular, a ">" is allowed inside a processing instruction. It also does not match names starting with "xml" only.
Some notes on the limitations:
The pattern matches the XML declaration ("<?xml") as well. This can be changed by rejecting the XML reserved name right after the opening "<?" with a negative lookahead like "(?! [xX][mM][lL] (?: \?> | \s ) )".
Because of these limitations it is perhaps worth considering the alternatives below.
First of all, it can be much easier to just use PHP's strip_tags to strip the processing instructions. It removes other tags and comments, too. That is not always wanted, but it is really straightforward:
strip_tags($str)
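To make the side effect concrete, here is a minimal sketch (the sample string and variable names are only illustrative). It shows that strip_tags drops the processing instruction together with ordinary tags, and that its second parameter can be used to keep selected tags:

```php
<?php
// Illustrative input: one processing instruction and one ordinary tag.
$str = 'some string <?xml:namespace prefix = o /?> that is <b>bold</b>';

// Removes the processing instruction, but the <b> element's tags as well.
$stripped = strip_tags($str);

// The second parameter lists tags to keep; the processing instruction still goes.
$kept = strip_tags($str, '<b>');

echo $stripped, "\n";
echo $kept, "\n";
```

Whether losing the other tags is acceptable depends on what the string is used for afterwards.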
More explicit than both the regular expression and strip_tags is to use one of the XML parsers that ship with PHP to strip the processing instructions, for example PHP's DOM extension. It can be wrapped in a function so that it is easy to apply to a string:
dom_strip_pis($str)
Such an example function also works with the XML string you have, which uses the reserved name "xml" as a prefix; that is actually not correct in XML, but the parser won't choke on it:
/**
 * Remove processing instructions from an XML string.
 *
 * @author hakre <http://hakre.wordpress.com>
 *
 * @param string $str XML fragment
 * @return string the fragment without its top-level processing instructions
 */
function dom_strip_pis($str) {
    $doc = new DOMDocument;
    $fragment = $doc->createDocumentFragment();
    // Keep libxml warnings (e.g. about the "xml" prefix) from being emitted while parsing.
    $saved = libxml_use_internal_errors(true);
    $fragment->appendXML($str);
    libxml_use_internal_errors($saved);
    // Loop over a copy: removing nodes from the live childNodes list can skip siblings.
    foreach (iterator_to_array($fragment->childNodes) as $node) {
        if ($node instanceof DOMProcessingInstruction) {
            $fragment->removeChild($node);
        }
    }
    return $doc->saveXML($fragment);
}
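The function above only inspects the direct children of the fragment. If processing instructions can also sit inside elements, an XPath query is one way to reach them. The following is a hedged sketch, not part of the original answer: the function name dom_strip_pis_deep and the throwaway "wrapper" element are assumptions made for illustration, and input that cannot be parsed this way (for example because it carries its own XML declaration) is returned unchanged.

```php
<?php
// Hypothetical variant: also removes processing instructions nested inside elements.
function dom_strip_pis_deep($str) {
    $doc = new DOMDocument;
    $saved = libxml_use_internal_errors(true);
    // Wrap the input in a throwaway root so text and several top-level nodes parse.
    $ok = $doc->loadXML('<wrapper>' . $str . '</wrapper>');
    libxml_clear_errors();
    libxml_use_internal_errors($saved);
    if (!$ok) {
        return $str; // assumption: leave unparseable input untouched
    }
    $xpath = new DOMXPath($doc);
    // The XPath result is a snapshot, so nodes can be removed while looping over it.
    foreach ($xpath->query('//processing-instruction()') as $pi) {
        $pi->parentNode->removeChild($pi);
    }
    // Serialize the wrapper's children only, so the wrapper itself is not returned.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveXML($child);
    }
    return $out;
}

echo dom_strip_pis_deep('a <p>b<?foo bar?>c</p> d'), "\n";
```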
Using an XML parser as in the last example saves you from having to deal with shallow parsing.
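If you stay with the regular expression after all, the negative lookahead mentioned among the limitations can be placed right after the opening "<?". A minimal sketch follows; the function name strip_pis_keep_declaration and the sample document are made up for illustration:

```php
<?php
// Hypothetical helper: the pattern from above plus the negative lookahead.
function strip_pis_keep_declaration($str) {
    $pattern = '~
        <\?
        (?! [xX][mM][lL] (?: \?> | \s ) )
        (?: [A-Za-z_:] | [^\x00-\x7F] ) (?: [A-Za-z_:.-] | [^\x00-\x7F] )*
        (?: \?> | \s (?: [^?]* \?+ ) (?: [^>?] [^?]* \?+ )* >)
    ~x';
    return preg_replace($pattern, '', $str);
}

// The XML declaration stays; the other two instructions are removed,
// including the one that contains a ">" in its content.
$xml = '<?xml version="1.0"?><root><?xml:namespace prefix = o /?><?php if ($a > 1) echo 1; ?></root>';
echo strip_pis_keep_declaration($xml), "\n";
```

The lookahead only rejects a target that is exactly "xml" (in any letter case), so "xml:namespace" is still matched and removed.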