I have an XML file which is structured as below:
<tag1>
<tag2>This is<>a<AA>text</tag2>
<ABC>0123-</xyz>-89</ABC>
</tag1>
How can I change all the illegal < and > characters to &lt; and &gt;? The result should be as below:
<tag1>
<tag2>This is&lt;&gt;a&lt;AA&gt;text</tag2>
<ABC>0123-&lt;/xyz&gt;-89</ABC>
</tag1>
This shouldn't be fixed after the XML is generated. It is a bug in the code that generates the XML in the first place. Fix the generator that produces the invalid XML; don't patch the invalid XML afterwards.
For the encoding rules, check the XML specification at https://www.w3.org/TR/xml/#intern-replacement , but note that many programming languages already have functions or libraries for this. For example, to XML-encode a string in PHP, use htmlspecialchars($str, ENT_QUOTES | ENT_SUBSTITUTE | ENT_DISALLOWED | ENT_XML1, 'UTF-8', true);
For many other languages, there's libxml2; check http://xmlsoft.org/ (it has bindings for, among others, C, C++, C#, Python, Delphi/Pascal, Ruby, Perl, and PHP).
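As a rough illustration of fixing this at generation time, here is a minimal Python sketch that uses only the standard library. The function name build_document and the way the two text values are passed in are assumptions made for the example; the real generator will look different. The point is only that the text is escaped when the XML is written, either by an XML library or by an explicit escape call.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_document(tag2_text, abc_text):
    """Hypothetical generator: build the XML with a library so that
    text content is escaped automatically during serialization."""
    root = ET.Element("tag1")
    ET.SubElement(root, "tag2").text = tag2_text
    ET.SubElement(root, "ABC").text = abc_text
    return ET.tostring(root, encoding="unicode")


# Raw text values taken from the question, unescaped on purpose.
xml_out = build_document("This is<>a<AA>text", "0123-</xyz>-89")
print(xml_out)

# If the generator assembles XML by string concatenation instead,
# escape each text value before inserting it between the tags.
manual = "<tag2>" + escape("This is<>a<AA>text") + "</tag2>"
print(manual)
```

Parsing the generated string again returns the original text with its < and > characters intact, which is a quick way to check that the generator now emits well-formed XML.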