xml sed gpx

How to remove XML Tag Blocks from the command line with one or more occurrences


I have an xml file that looks something like this:

<xml>
  <trkseg>
  <note>
    <to>A</to>
    <from>B</from>
    <body>
      keep this
    </body>
  </trkseg>
  <trkseg>
  </note>
  ...
  </trkseg>
</xml>

I want to remove every occurrence of the following pair of tags. This combination can occur more than once in the file:

</trkseg>
<trkseg>

Any tips on how to fix this?

What I expected was this:

<xml>
  <trkseg>
  <note>
    <to>A</to>
    <from>B</from>
    <body>
      keep this
    </body>
  </note>
  ...
  </trkseg>
</xml>

I tried this sed command, but it doesn't work the way I want:

sed -i '' -e '/<\/trkseg>/,/<trkseg>/d' my-file.xml

I get this result:

<xml>
  <trkseg>
  <note>
    <to>A</to>
    <from>B</from>
    <body>
      keep this
    </body>
  </note>
  ...


Solution

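  • This can be done with GNU sed. The -z option makes sed read the whole file as a single record (assuming it contains no NUL bytes), so a substitution can match across line breaks.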

    sample file

    <xml>
      <trkseg>
        one
        two
      </trkseg>
      <trkseg>
        three
        four
      </trkseg>
    </xml>
    

    sed script

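    sed -znr '{
      # A: if a second </trkseg> exists go to B, otherwise go to C
      :-A s/<[\/]trkseg>/&/2;t-B;b-C
      # B: delete the first </trkseg> and the whitespace before it, then recheck
      :-B s/[[:space:]]*<[\/]trkseg>//1;t-A
      # C: delete every <trkseg> from the second one on, then print the result
      :-C s/[[:space:]]*<trkseg>//2g;p
    }' file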
    

    output:

    <xml>
      <trkseg>
        one
        two
        three
        four
      </trkseg>
    </xml>
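
If the goal is only to drop each closing tag that is directly followed by an opening tag, as in the question, a single substitution may be enough. The following is a minimal sketch rather than a tested replacement for the script above: it assumes GNU sed (for -z), assumes that only whitespace separates </trkseg> from the next <trkseg>, and uses a hypothetical helper name, join_trksegs, that does not come from the original post.

```shell
# Hypothetical helper (name is illustrative): join adjacent track segments
# by deleting every "</trkseg>" + optional whitespace + "<trkseg>" pair.
# Requires GNU sed: -z reads the input as one record, so the pattern can
# span line breaks. Reads the named files, or stdin when none are given.
join_trksegs() {
  sed -z 's/[[:space:]]*<\/trkseg>[[:space:]]*<trkseg>//g' "$@"
}

# Example run on a small inline sample.
printf '%s\n' \
  '<xml>' \
  '  <trkseg>' \
  '    one' \
  '  </trkseg>' \
  '  <trkseg>' \
  '    two' \
  '  </trkseg>' \
  '</xml>' | join_trksegs
```

For comparison, the range form in the question, /<\/trkseg>/,/<trkseg>/d, starts a new range at the last </trkseg>, never finds a following <trkseg>, and so deletes through the end of the file. That would explain the truncated result shown there.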