
Text Extraction from HTML using Java including source line number and code


The question of how to extract text from HTML using Java has been viewed and duplicated a zillion times: Text Extraction from HTML Java

Thanks to the answers found on Stack Overflow, my current state of affairs is that I am using JSoup

<!-- Jsoup maven dependency -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.3</version>
</dependency>

and this piece of code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

// parse the HTML from the given string
Document doc = Jsoup.parse(html);
// loop over the body tag and all of its descendant elements
for (Element el:doc.select("body").select("*")) {
  // loop over the text nodes that are direct children of this element
  for (TextNode textNode:el.textNodes()) {
    // make sure there is some text other than whitespace
    if (textNode.text().trim().length()>0) {
        // show:
        //    the name of the enclosing element
        //    the name of the subnode with the text
        //    the text
        System.out.println(el.nodeName()+"."+textNode.nodeName()+":"+textNode.text());
    }
  }
}

Now I'd also like to show the line number and the original HTML source code that the textNode at hand came from. I doubt JSoup can do this (e.g. see)

and trying a workaround like:

int pos = html.indexOf(textNode.outerHtml());

does not reliably find the original HTML. So I assume I might have to switch to another library or approach. The question "Jericho-html: is it possible to extract text with reference to positions in source file?" has an answer that says "Jericho can do it", as the link above also points out. But a pointer to real working code is missing.
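To illustrate why a plain indexOf lookup is fragile, here is a minimal sketch that uses only the JDK. The sample markup and the re-serialized string are made up for illustration; the second case assumes that the parser escapes a bare & as &amp; when it writes a text node back out, which is one way the serialized form can differ from the source.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexOfPitfalls {

    /** Returns every offset at which needle occurs in haystack. */
    static List<Integer> allOffsets(String haystack, String needle) {
        List<Integer> offsets = new ArrayList<>();
        for (int pos = haystack.indexOf(needle); pos >= 0;
                pos = haystack.indexOf(needle, pos + 1)) {
            offsets.add(pos);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // hypothetical sample input, not taken from a real page
        String html = "<div>Some Text</div>\n<p>Tom & Jerry</p>\n<div>Some Text</div>";

        // Pitfall 1: the same text occurs more than once, so a single
        // indexOf call cannot tell which occurrence a given node came from.
        System.out.println(allOffsets(html, "Some Text"));

        // Pitfall 2 (assumption): the re-serialized node text is escaped,
        // so it no longer matches the raw source at all.
        String reserialized = "Tom &amp; Jerry";
        System.out.println(html.indexOf(reserialized));
    }
}
```

Either case is enough to make the reported position wrong or missing, which is why a parser that keeps the source offsets itself is the more robust route.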

With Jericho I got as far as:

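import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.StartTagType;
import net.htmlparser.jericho.TagType;

Source htmlSource=new Source(html);
boolean bodyFound=false;
// loop over all elements
for (net.htmlparser.jericho.Element el:htmlSource.getAllElements()) {
    // only start reporting once the body element has been reached
    if (el.getName().equals("body")) {
        bodyFound=true;
    }
    if (bodyFound) {
        TagType tagType = el.getStartTag().getTagType();
        if (tagType==StartTagType.NORMAL) {
            // note: this is the text of the element including all of its subnodes
            String text=el.getTextExtractor().toString();
            if (!text.trim().equals("")) {
                // character offset of the element within the source
                int cpos = el.getBegin();
                System.out.println(el.getName()+"("+tagType.toString()+") line "+   htmlSource.getRow(cpos)+":"+text);
            }
        } // if
    } // if
} // for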

This is pretty good already, since it gives output like:

body(normal) line 91: Some Header. Some Text
div(normal) line 93: Some Header
div(normal) line 95: Some Text

but now the follow-up problem is that TextExtractor outputs the whole text of all subnodes recursively, so the same text shows up multiple times.

What would be a working solution that filters as well as the above JSoup solution (please note the correct order of text elements) but shows source lines as the above Jericho code snippet does?
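The duplication can be reproduced without any HTML library. The sketch below uses a hypothetical Node class (not part of JSoup or Jericho) and, as a simplification, places an element's own text before its children. It contrasts collecting the recursive text of every element with collecting only the text that sits directly inside each element.

```java
import java.util.ArrayList;
import java.util.List;

public class OwnTextDemo {

    /** Hypothetical minimal element: a name, its direct text and its child elements. */
    static class Node {
        final String name;
        final String ownText;
        final List<Node> children = new ArrayList<>();

        Node(String name, String ownText) {
            this.name = name;
            this.ownText = ownText;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Text of the node plus the text of all descendants, joined by single blanks. */
    static String recursiveText(Node node) {
        StringBuilder sb = new StringBuilder(node.ownText);
        for (Node child : node.children) {
            String childText = recursiveText(child);
            if (!childText.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(childText);
            }
        }
        return sb.toString();
    }

    /** Visits the tree in document order and records "name:text" for non-empty text. */
    static void collect(Node node, boolean recursive, List<String> out) {
        String text = recursive ? recursiveText(node) : node.ownText;
        if (!text.isEmpty()) {
            out.add(node.name + ":" + text);
        }
        for (Node child : node.children) {
            collect(child, recursive, out);
        }
    }

    public static void main(String[] args) {
        Node body = new Node("body", "")
                .add(new Node("div", "Some Header"))
                .add(new Node("div", "Some Text"));
        List<String> recursive = new ArrayList<>();
        List<String> own = new ArrayList<>();
        collect(body, true, recursive);
        collect(body, false, own);
        System.out.println(recursive); // the body entry repeats the text of both divs
        System.out.println(own);       // each text shows up exactly once
    }
}
```

The second variant corresponds to what the JSoup loop does with el.textNodes(): every piece of text is reported once, under the element that directly contains it.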


Solution

  • Here is a JUnit test that checks the expected output, together with a Jericho-based SourceTextExtractor that makes the test pass. The extractor is based on the original Jericho TextExtractor source code.

    @Test
    public void testTextExtract() {
        // https://github.com/paepcke/CorEx/blob/master/src/extraction/HTMLUtils.java
        String htmls[] = {
                "<!DOCTYPE html>\n" + "<html>\n" + "<body>\n" + "\n"
                        + "<h1>My First Heading</h1>\n" + "\n"
                        + "<p>My first paragraph.</p>\n" + "\n" + "</body>\n" + "</html>",
                "<html>\n"
                        + "<body>\n"
                        + "\n"
                        + "<div id=\"myDiv\" name=\"myDiv\" title=\"Example Div Element\">\n"
                        + "  <h5>Subtitle</h5>\n"
                        + "  <p>This paragraph would be your content paragraph...</p>\n"
                        + "  <p>Here's another content article right here.</p>\n"
                        + "</div>" + "\n" + "Text at end of body</body>\n" + "</html>" };
        int expectedSize[] = { 2, 4 };
        String expectedInfo[][]={
            { 
                "line 5 col 5 to  line 5 col 21: My First Heading",
                "line 7 col 4 to  line 7 col 23: My first paragraph."
            },
            { 
                "line 5 col 7 to  line 5 col 15: Subtitle",
                "line 6 col 6 to  line 6 col 55: This paragraph would be your content paragraph...",
                "line 7 col 6 to  line 7 col 48: Here's another content article right here.",
                "line 8 col 7 to  line 9 col 20: Text at end of body"
            }
        };
        int i = 0;
        for (String html : htmls) {
            SourceTextExtractor extractor=new SourceTextExtractor();
            List<TextResult> textParts = extractor.extractTextSegments(html);
            // List<String> textParts = HTMLCleanerTextExtractor.extractText(html);
            int j=0;
            for (TextResult textPart : textParts) {
                System.out.println(textPart.getInfo());
                assertTrue(textPart.getInfo().startsWith(expectedInfo[i][j]));
                j++;
            }
            assertEquals(expectedSize[i], textParts.size());
            i++;
        }
    }
    

    This is an adapted version of Jericho's TextExtractor; see http://grepcode.com/file_/repo1.maven.org/maven2/net.htmlparser.jericho/jericho-html/3.3/net/htmlparser/jericho/TextExtractor.java/?v=source

    /**
     * TextExtractor that makes source line and col references available
     * http://grepcode.com/file_/repo1.maven.org/maven2/net.htmlparser.jericho/jericho-html/3.3/net/htmlparser/jericho/TextExtractor.java/?v=source
     */
    public class SourceTextExtractor {
    
        public static class TextResult {
            private String text;
            private Source root;
            private Segment segment;
            private int line;
            private int col;
    
            /**
             * get a textResult
             * @param root
             * @param segment
             */
            public TextResult(Source root,Segment segment) {
                this.root=root;
                this.segment=segment;
                final StringBuilder sb=new StringBuilder(segment.length());
                sb.append(segment);
                setText(CharacterReference.decodeCollapseWhiteSpace(sb));
                int spos = segment.getBegin();  
                line=root.getRow(spos);
                col=root.getColumn(spos);
    
            }
    
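            /**
             * gets info about this TextResult
             * @return start and end position (line and column) followed by the text
             */
            public String getInfo() {
                int epos=segment.getEnd();
    
                // the format has to match the strings expected by the JUnit test:
                // no leading blank and ": " in front of the text
                String result=
                        "line "+   line+" col "+col+
                        " to "+
                        " line "+   root.getRow(epos)+" col "+root.getColumn(epos)+
                        ": "+getText();
                return result;
            }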
    
            /**
             * @return the text
             */
            public String getText() {
                return text;
            }
    
            /**
             * @param text the text to set
             */
            public void setText(String text) {
                this.text = text;
            }
    
            public int getLine() {
                return line;
            }
    
            public int getCol() {
                return col;
            }
    
        }
    
        /**
         * extract textSegments from the given html
         * @param html
         * @return
         */
        public List<TextResult> extractTextSegments(String html) {
            Source htmlSource=new Source(html);
            List<TextResult> result = extractTextSegments(htmlSource);
            return result;
        }
    
        /**
         * get the TextSegments from the given root segment
         * @param root
         * @return
         */
        public List<TextResult> extractTextSegments(Source root) {
            List<TextResult> result=new ArrayList<TextResult>();
            for (NodeIterator nodeIterator=new NodeIterator(root); nodeIterator.hasNext();) {
                Segment segment=nodeIterator.next();
                if (segment instanceof Tag) {
                    final Tag tag=(Tag)segment;
                    if (tag.getTagType().isServerTag()) {
                        // elementContainsMarkup should be made into a TagType property one day.
                        // for the time being assume all server element content is code, although this is not true for some Mason elements.
                        final boolean elementContainsMarkup=false;
                        if (!elementContainsMarkup) {
                            final net.htmlparser.jericho.Element element=tag.getElement();
                            if (element!=null && element.getEnd()>tag.getEnd()) nodeIterator.skipToPos(element.getEnd());
                        }
                        continue;
                    }
                    if (tag.getTagType()==StartTagType.NORMAL) {
                        final StartTag startTag=(StartTag)tag;
                        if (tag.name==HTMLElementName.SCRIPT || tag.name==HTMLElementName.STYLE ||  (!HTMLElements.getElementNames().contains(tag.name))) {
                            nodeIterator.skipToPos(startTag.getElement().getEnd());
                            continue;
                        }
    
                    }
                    // Treat both start and end tags not belonging to inline-level elements as whitespace:
                    if (tag.getName()==HTMLElementName.BR || !HTMLElements.getInlineLevelElementNames().contains(tag.getName())) {
                        // sb.append(' ');
                    }
                } else {
                    if (!segment.isWhiteSpace())
                        result.add(new TextResult(root,segment));
                }
            }
            return result;
        }
    
        /**
         * extract the text from the given segment
         * @param pSegment the segment to extract the text from
         * @return the decoded text with whitespace collapsed
         */
        public String extractText(net.htmlparser.jericho.Segment pSegment) {
    
            // http://grepcode.com/file_/repo1.maven.org/maven2/net.htmlparser.jericho/jericho-html/3.3/net/htmlparser/jericho/TextExtractor.java/?v=source
            // this would call the code above
            // String result=segment.getTextExtractor().toString();
            final StringBuilder sb=new StringBuilder(pSegment.length());
            for (NodeIterator nodeIterator=new NodeIterator(pSegment); nodeIterator.hasNext();) {
                Segment segment=nodeIterator.next();
                if (segment instanceof Tag) {
                    final Tag tag=(Tag)segment;
                    if (tag.getTagType().isServerTag()) {
                        // elementContainsMarkup should be made into a TagType property one day.
                        // for the time being assume all server element content is code, although this is not true for some Mason elements.
                        final boolean elementContainsMarkup=false;
                        if (!elementContainsMarkup) {
                            final net.htmlparser.jericho.Element element=tag.getElement();
                            if (element!=null && element.getEnd()>tag.getEnd()) nodeIterator.skipToPos(element.getEnd());
                        }
                        continue;
                    }
                    if (tag.getTagType()==StartTagType.NORMAL) {
                        final StartTag startTag=(StartTag)tag;
                        if (tag.name==HTMLElementName.SCRIPT || tag.name==HTMLElementName.STYLE ||  (!HTMLElements.getElementNames().contains(tag.name))) {
                            nodeIterator.skipToPos(startTag.getElement().getEnd());
                            continue;
                        }
    
                    }
                    // Treat both start and end tags not belonging to inline-level elements as whitespace:
                    if (tag.getName()==HTMLElementName.BR || !HTMLElements.getInlineLevelElementNames().contains(tag.getName())) {
                        sb.append(' ');
                    }
                } else {
                    sb.append(segment);
                }
            }
            final String result=net.htmlparser.jericho.CharacterReference.decodeCollapseWhiteSpace(sb);
            return result;
        }
    }
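The line and column figures expected by the test can be cross-checked by hand without Jericho. The sketch below is a plain-JDK helper, not Jericho's implementation: it maps a character offset to a 1-based row and column, under the assumption that only '\n' counts as a line break. The class and method names are made up for illustration.

```java
public class RowColumn {

    /** 1-based row of the character at offset pos, counting only '\n' as a line break. */
    static int row(String text, int pos) {
        int row = 1;
        for (int i = 0; i < pos; i++) {
            if (text.charAt(i) == '\n') {
                row++;
            }
        }
        return row;
    }

    /** 1-based column of the character at offset pos within its line. */
    static int column(String text, int pos) {
        // offset of the first character of the line that contains pos
        int lineStart = text.lastIndexOf('\n', pos - 1) + 1;
        return pos - lineStart + 1;
    }

    public static void main(String[] args) {
        // first sample document from the JUnit test
        String html = "<!DOCTYPE html>\n" + "<html>\n" + "<body>\n" + "\n"
                + "<h1>My First Heading</h1>\n" + "\n"
                + "<p>My first paragraph.</p>\n" + "\n" + "</body>\n" + "</html>";
        String needle = "My First Heading";
        int begin = html.indexOf(needle);
        int end = begin + needle.length();
        System.out.println("line " + row(html, begin) + " col " + column(html, begin)
                + " to  line " + row(html, end) + " col " + column(html, end));
    }
}
```

For the heading in the first sample document this gives line 5 col 5 as the start and line 5 col 21 as the end, which agrees with the first expected string in the test. The end position is the offset just past the last character of the text.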