java regex parsing sgml

Remove the parent tag in SGML using Java if it contains only a specific tag


I want to remove a parent tag if the only tag it contains is a note tag.

Example:

Input:

<data>
<subdata>
<l1item>
    <note>
        <para>hello
        </para>
    </note>
</l1item>
</subdata>
<subdata>
<l2item>
    <para> dont delete 
    </para>
</l2item>
<l3item>
    <note>
        <para>hello
        </para>
    </note>
    <para> dont delete 
    </para>
</l3item>
</subdata>
</data>

Expected Output:

<data>
<subdata>
<note>
<para>hello
</para>
</note>
</subdata>
<subdata>
<l2item>
<para> dont delete 
</para>
</l2item>
<l3item>
    <note>
        <para>hello
        </para>
    </note>
    <para> dont delete 
    </para>
</l3item>
</subdata>
</data>

In the above example, the l1item tag is removed because it contains only a note tag (the note itself is kept). l2item is not removed because it contains a para tag, and l3item is not removed because it contains both a note tag and a para tag.

So my requirement is: remove l1item, l2item or l3item when it contains only a note tag. If it contains some other tag, or a note tag together with some other tag, it should not be removed.


Solution

  • You can use Jsoup here although it's not primarily an SGML parser.

    We are looking for note elements that are the only child element of their respective parent. This can be expressed with the CSS selector:

    note:only-child
    

    When we spot one of those notes, we can find its parent and replace this parent with the found note. We'll use the Node::replaceWith method to do this:

    foundNote.parent().replaceWith(foundNote);
    

    Let's put it all together in the sample code below:

    SAMPLE CODE

    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.parser.Parser;
    import org.jsoup.select.Elements;

    String sgml = "<data>\n<subdata>\n<l1item>\n    <note>\n        <para>hello\n        </para>\n    </note>\n</l1item>\n</subdata>\n<subdata>\n<l2item>\n    <para> dont delete \n    </para>\n</l2item>\n<l3item>\n    <note>\n        <para>hello\n        </para>\n    </note>\n    <para> dont delete \n    </para>\n</l3item>\n</subdata>\n</data>";

    // Parse with the XML parser so the custom tags are kept as they are
    Document doc = Parser.xmlParser().parseInput(sgml, "");

    System.out.println("BEFORE:\n" + doc.html());

    // Select every note that is the only child element of its parent
    Elements onlyChildNotes = doc.select("note:only-child");

    for (Element note : onlyChildNotes) {
        Element noteParent = note.parent();
        if (noteParent != null) {
            // Put the note in the place of its parent
            noteParent.replaceWith(note);
        }
    }

    System.out.println("AFTER:\n" + doc.html());
    

    OUTPUT

    BEFORE:
    <data> 
     <subdata> 
      <l1item> 
       <note> 
        <para>
         hello 
        </para> 
       </note> 
      </l1item>
     </subdata>
      (...)
    
    AFTER:
    <data> 
     <subdata> 
      <note> 
       <para>
        hello 
       </para> 
      </note> 
     </subdata> 
     (...)
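
If adding Jsoup is not an option, the same idea (find each note that is the only child element of its parent, then put the note in the parent's place) can be sketched with the DOM parser that ships with the JDK. This is a swapped-in technique, not the Jsoup approach above, and it assumes the SGML happens to be well-formed XML, as the sample input is. The class name NoteUnwrapper and its methods are hypothetical names chosen for illustration. The sketch also leaves the root element alone, which is an assumption about what is wanted.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; assumes the input is well-formed XML.
final class NoteUnwrapper {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the single child element of parent, or null if it has none or several.
    // Text nodes (including whitespace) are not counted.
    static Element onlyChildElement(Element parent) {
        Element found = null;
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                if (found != null) {
                    return null; // second child element found
                }
                found = (Element) n;
            }
        }
        return found;
    }

    // Replaces every parent whose only child element is a note with that note.
    // Returns the number of parents replaced.
    static int unwrapNoteOnlyParents(Document doc) {
        NodeList notes = doc.getElementsByTagName("note");
        // Collect matches first, because the NodeList is live and the tree is about to change.
        List<Element> matches = new ArrayList<>();
        for (int i = 0; i < notes.getLength(); i++) {
            Element note = (Element) notes.item(i);
            Node parent = note.getParentNode();
            // Assumption: skip a parent that is the root element.
            if (parent instanceof Element
                    && parent.getParentNode() instanceof Element
                    && onlyChildElement((Element) parent) == note) {
                matches.add(note);
            }
        }
        for (Element note : matches) {
            Node parent = note.getParentNode();
            parent.getParentNode().replaceChild(note, parent);
        }
        return matches.size();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<data>\n<subdata>\n<l1item>\n<note><para>hello</para></note>\n</l1item>\n</subdata>\n"
                + "<subdata>\n<l2item><para>dont delete</para></l2item>\n"
                + "<l3item><note><para>hello</para></note><para>dont delete</para></l3item>\n"
                + "</subdata>\n</data>";
        Document doc = parse(xml);
        System.out.println("Parents replaced: " + unwrapNoteOnlyParents(doc));
    }
}
```

On the sample input only l1item qualifies, so one parent is replaced. Like the selector-based version, this sketch looks only at child elements, so a parent holding a note plus loose text would also be replaced; tighten onlyChildElement if that matters for your data.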