
Is it possible to edit the text or HTML contents of an XML node using xmlstarlet?


The answer to my earlier question, "How do I access an XML node that uses quote marks with xmlstarlet?", shows how to select a node using its namespace and, in that case, delete the entire node:

xmlstarlet edit -N ns="http://www.w3.org/2005/Atom" -d "//ns:content[@type='html']" input.xml > output.xml

But how would I edit the contents of the <content type='html'> node?

Let's say I want to delete all HTML tags in all the <content type='html'> nodes, but leave the text.

Is it possible to use xmlstarlet to edit a node?

input.xml:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:blogger="http://schemas.google.com/blogger/2018">
  <title>Testv1</title>
<entry>
    <author>
      <name>Author</name>
    </author>
    <title/>
    <content type='html'><p>Test Post 2</p><p></p><p>
Sed ut perspiciatis unde omnis iste natus error sit voluptatem,
eaque ipsa quae voluptas nulla pariatur?</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/kitten2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="360" data-original-width="360" height="320" src="https://blogger.googleusercontent.com/kitten2.png" width="320" 
/></a></div><br /><p></p><p></p><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor</content>
  </entry>
<entry>
    <author>
      <name>Author</name>
    </author>
    <title/>
<content type='html'>....</content>
  </entry>

Desired output.xml:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:blogger="http://schemas.google.com/blogger/2018">
  <title>Testv1</title>
<entry>
    <author>
      <name>Author</name>
    </author>
    <title/>
    <content type='html'>Test Post 2 Sed ut perspiciatis unde omnis iste natus error sit voluptatem, eaque ipsa quae voluptas nulla pariatur? Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor</content>
  </entry>
<entry>
    <author>
      <name>Author</name>
    </author>
    <title/>
<content type='html'>....</content>
  </entry>

Solution

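  • Yes. --update with --expr replaces the content of each matched node with the result of an XPath expression evaluated against that node. string(.) concatenates the text of all descendants, which drops the tags, and normalize-space() collapses runs of whitespace:

    xmlstarlet edit -N ns="http://www.w3.org/2005/Atom" \
                    --update "//ns:content" --expr "normalize-space(string(.))" input.xml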
    

    Output:

    <?xml version="1.0" encoding="utf-8"?>
    <feed xmlns="http://www.w3.org/2005/Atom" xmlns:blogger="http://schemas.google.com/blogger/2018">
      <title>Testv1</title>
      <entry>
        <author>
          <name>Author</name>
        </author>
        <title/>
        <content type="html">Test Post 2 Sed ut perspiciatis unde omnis iste natus error sit voluptatem, eaque ipsa quae voluptas nulla pariatur?Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor</content>
      </entry>
      <entry>
        <author>
          <name>Author</name>
        </author>
        <title/>
        <content type="html">....</content>
      </entry>
    </feed>
    

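    Unlike your desired output, there is no space before the word Lorem. string(.) joins the text nodes exactly as they are, and in the input only tags (no whitespace text) sit between "pariatur?" and "Lorem".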


    See: xmlstarlet edit