java xml sax core-media

Is it possible to merge XML elements with SAX (CoreMedia CAE filter)?


Given is:

an XML structure like

<span class="abbreviation">AGB<span class="explanation">Allgemeine Geschäftsbedingungen</span></span>

and the result after the transformation should be:

<abbr title="Allgemeine Geschäftsbedingungen">AGB</abbr>

I know that SAX is an event-based XML parser. With methods like startElement(...) and endElement(...) I can capture events (such as a tag being opened or closed), and with characters(...) I can extract the text between the tags.
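
For reference, these callbacks can be sketched with the JDK's own DefaultHandler. This is only a minimal illustration and not CoreMedia code; the class name SaxEventLogger and its log helper are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records every SAX event it receives.
class SaxEventLogger extends DefaultHandler {
  final List<String> events = new ArrayList<>();

  @Override
  public void startElement(String uri, String localName, String qName, Attributes attributes) {
    events.add("start:" + qName); // an opening tag was read
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    events.add("end:" + qName); // a closing tag was read
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    // Text between tags; the parser may deliver one text node in several chunks.
    events.add("text:" + new String(ch, start, length));
  }

  // Parses the given XML string and returns the recorded events.
  static List<String> log(String xml) {
    try {
      SaxEventLogger handler = new SaxEventLogger();
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
      return handler.events;
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }
}
```

For the nested span structure above, the recorded list starts with a start:span event and ends with an end:span event, with the text events in between.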

My question is:

Can I create the transformation mentioned above (is it possible)?

My Problem is:


Solution

  • The answer is: yes, it's possible!

    The main hint comes from this StackOverflow link.

    Here is what has to be done:

    1. You have to remember the state, i.e. which span tag the SAX parser is currently inside (class="abbreviation" or class="explanation").
    2. You have to extract the content of the tags (this can be done in the characters(...) method).
    3. Once you know the state of the SAX parser and the content, you can create a new abbr tag.
    4. All other tags have to be passed through without any modification.

    For completeness here is the source code of the coremedia cae filter:

    import com.coremedia.blueprint.cae.richtext.filter.FilterFactory;
    import com.coremedia.xml.Filter;
    import org.apache.commons.lang3.StringUtils;
    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.AttributesImpl;
    
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    
    public class GlossaryFilter extends Filter implements FilterFactory {
      private static final String SPAN = "span";
      private static final String CLASS = "class";
    
      private boolean isAbbreviation = false;
      private boolean isExplanation = false;
      private String abbreviation;
      private String currentUri;
      private boolean spanExplanationClose = false;
      private boolean spanAbbreviationClose = false;
    
      @Override
      public Filter getInstance(final HttpServletRequest request, final HttpServletResponse response) {
        return new GlossaryFilter();
      }
    
      @Override
      public void startElement(final String uri, final String localName, final String qName,
          final Attributes attributes) throws SAXException {
        if (isSpanAbbreviationTag(qName, attributes)) {
          isAbbreviation = true;
        } else if (isSpanExplanationTag(qName, attributes)) {
          isExplanation = true;
          currentUri = uri;
        } else {
          super.startElement(uri, localName, qName, attributes);
        }
      }
    
      private boolean isSpanExplanationTag(final String qName, final Attributes attributes) {
        return hasSpanClass(qName, attributes, "explanation");
      }
    
      private boolean isSpanAbbreviationTag(final String qName, final Attributes attributes) {
        return hasSpanClass(qName, attributes, "abbreviation");
      }
    
      // Constant-first equals() avoids a NullPointerException when a span
      // carries attributes but no class attribute.
      private static boolean hasSpanClass(final String qName, final Attributes attributes,
          final String cssClass) {
        return SPAN.equalsIgnoreCase(qName) && cssClass.equals(attributes.getValue(CLASS));
      }
    
      @Override
      public void endElement(final String uri, final String localName, final String qName)
          throws SAXException {
        if (spanExplanationClose) {
          spanExplanationClose = false;
        } else if (spanAbbreviationClose) {
          spanAbbreviationClose = false;
        } else {
          super.endElement(uri, localName, qName);
        }
      }
    
      @Override
      public void characters(final char[] ch, final int start, final int length) throws SAXException {
        if (isAbbreviation && isExplanation) {
          final String explanation = new String(ch, start, length);
          final AttributesImpl newAttributes = createAttributes(explanation);
          writeAbbrTag(newAttributes);
          changeState();
        } else if (isAbbreviation && !isExplanation) {
          abbreviation = new String(ch, start, length);
        } else {
          super.characters(ch, start, length);
        }
      }
    
      private void changeState() {
        isExplanation = false;
        isAbbreviation = false;
        spanExplanationClose = true;
        spanAbbreviationClose = true;
      }
    
      @SuppressWarnings("TypeMayBeWeakened")
      private void writeAbbrTag(final AttributesImpl newAttributes) throws SAXException {
        super.startElement(currentUri, "abbr", "abbr", newAttributes);
        super.characters(abbreviation.toCharArray(), 0, abbreviation.length());
        super.endElement(currentUri, "abbr", "abbr");
      }
    
      private AttributesImpl createAttributes(final String explanation) {
        final AttributesImpl newAttributes = new AttributesImpl();
        // An unprefixed attribute has no namespace, and its qName is plain "title".
        newAttributes.addAttribute("", "title", "title", "CDATA", explanation);
        return newAttributes;
      }
    }
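
To try the same idea outside of CoreMedia, here is a minimal sketch that uses only the JDK (XMLFilterImpl plus an identity Transformer for serialization). The class name AbbrFilter and the transform helper are assumptions made for illustration. Unlike the filter above, it buffers text in StringBuilders and writes the abbr tag when the outer span closes, because a SAX parser may deliver one text node in several characters() calls. It assumes the two spans contain only text and no further nested elements.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical JDK-only filter; assumes the two spans contain nothing but text.
class AbbrFilter extends XMLFilterImpl {
  // 0 = outside, 1 = inside the abbreviation span, 2 = inside the explanation span
  private int depth = 0;
  private final StringBuilder abbreviation = new StringBuilder();
  private final StringBuilder explanation = new StringBuilder();

  AbbrFilter(XMLReader parent) {
    super(parent);
  }

  private static boolean isSpan(String qName, Attributes atts, String cssClass) {
    return "span".equals(qName) && cssClass.equals(atts.getValue("class"));
  }

  @Override
  public void startElement(String uri, String localName, String qName, Attributes atts)
      throws SAXException {
    if (depth == 0 && isSpan(qName, atts, "abbreviation")) {
      depth = 1; // swallow the outer span, only remember the state
    } else if (depth == 1 && isSpan(qName, atts, "explanation")) {
      depth = 2; // swallow the inner span as well
    } else {
      super.startElement(uri, localName, qName, atts);
    }
  }

  @Override
  public void endElement(String uri, String localName, String qName) throws SAXException {
    if (depth == 2) {
      depth = 1; // explanation span closed, nothing is written yet
    } else if (depth == 1) {
      depth = 0; // abbreviation span closed: both texts are complete now
      AttributesImpl atts = new AttributesImpl();
      atts.addAttribute("", "title", "title", "CDATA", explanation.toString());
      char[] text = abbreviation.toString().toCharArray();
      super.startElement("", "abbr", "abbr", atts);
      super.characters(text, 0, text.length);
      super.endElement("", "abbr", "abbr");
      abbreviation.setLength(0);
      explanation.setLength(0);
    } else {
      super.endElement(uri, localName, qName);
    }
  }

  @Override
  public void characters(char[] ch, int start, int length) throws SAXException {
    if (depth == 2) {
      explanation.append(ch, start, length);
    } else if (depth == 1) {
      abbreviation.append(ch, start, length);
    } else {
      super.characters(ch, start, length);
    }
  }

  // Runs the filter over an XML string and serializes the filtered events.
  static String transform(String xml) {
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      XMLReader reader = factory.newSAXParser().getXMLReader();
      Transformer identity = TransformerFactory.newInstance().newTransformer();
      identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
      StringWriter out = new StringWriter();
      identity.transform(
          new SAXSource(new AbbrFilter(reader), new InputSource(new StringReader(xml))),
          new StreamResult(out));
      return out.toString();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }
}
```

Under these assumptions, passing a paragraph that contains the nested span structure to AbbrFilter.transform(...) should return the same paragraph with the two spans replaced by a single abbr element, while all other elements are passed through untouched.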
    

    The interesting parts are in the following methods:

    startElement(...)

    Here you record which tag the SAX parser is currently inside; more precisely, which span tag (class="abbreviation" or class="explanation") was opened.

    You only store state here. The two span tags themselves are not forwarded, so they are dropped from the output. Every other tag is forwarded without modification (that's the else block).

    endElement(...)

    Here you forward every closing tag except those of the two span tags; all other closing tags are passed on without modification (the else block). When the SAX parser reaches the closing tag of one of the two spans (class="abbreviation" or class="explanation"), nothing is forwarded; only the corresponding flag is reset.

    characters(...)

    This method is where the magic (creating a new tag with the parser) happens. Depending on the state:

    1. The SAX parser is inside the span tag with class="explanation" (which means an opening span tag with class="abbreviation" was seen before) --> branch (isAbbreviation && isExplanation)
    2. The SAX parser is inside the first span tag (the one with class="abbreviation") --> branch (isAbbreviation && !isExplanation)
    3. Any other text found inside any other tag --> branch else

    for state 3.

    simply copy the text you find

    for state 2.

    extract the content of the span-tag with "class=abbreviation" for later use
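
One caveat when storing text "for later use": the SAX ContentHandler contract allows a parser to split a single text node into several characters() calls, so assigning the latest chunk to a field can lose part of the text. A minimal sketch of a safer pattern, using a StringBuilder, is shown here; the class name TextCollector and the method takeText are hypothetical and not part of the filter above.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: accumulates text across multiple characters() calls.
class TextCollector extends DefaultHandler {
  private final StringBuilder buffer = new StringBuilder();

  @Override
  public void characters(char[] ch, int start, int length) {
    // Append instead of overwrite, so chunked text is not truncated.
    buffer.append(ch, start, length);
  }

  // Returns the collected text and resets the buffer for the next element.
  String takeText() {
    String text = buffer.toString();
    buffer.setLength(0);
    return text;
  }
}
```

In a filter like the one above, takeText() would typically be called when the closing tag of the span is reached, since only then is the text known to be complete.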

    for state 1.