java regex jsoup text-parsing tag-soup

Wrap a tag around plain html text


I have this structure in my html document:

<p>
"<em>You</em> began the evening well, Charlotte," said Mrs.&nbsp;Bennet with civil self–command to Miss Lucas. "<em>You</em> were Mr.&nbsp;Bingley's first choice."
</p>

But I need my "plain text" to be wrapped in tags so that I can process it :)

<p>
    <text>"</text>
    <em>You</em>
    <text> began the evening well, Charlotte," said Mrs.&nbsp;Bennet with civil self–command to Miss Lucas. "</text>
    <em>You</em>
    <text> were Mr.&nbsp;Bingley's first choice."</text>
</p>

Any ideas how to accomplish this? I've looked at TagSoup and jsoup, but I don't see a way to solve this easily. Maybe it could be done with some fancy regexp.

Thanks


Solution

  • Here's a suggestion using jsoup:

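    import java.util.ArrayList;
    import org.jsoup.nodes.Element;
    import org.jsoup.nodes.Node;
    import org.jsoup.nodes.TextNode;
    import org.jsoup.parser.Tag;

    // Builds a <text> element that contains the given string.
    public static Node toTextElement(String str) {
        Element e = new Element(Tag.valueOf("text"), "");
        e.appendText(str);
        return e;
    }

    // Recursively replaces every text node under root with a <text> element.
    public static void replaceTextNodes(Node root) {
        if (root instanceof TextNode) {
            root.replaceWith(toTextElement(((TextNode) root).text()));
        } else {
            // Iterate over a copy so that replacing a child cannot disturb the loop.
            for (Node child : new ArrayList<>(root.childNodes()))
                replaceTextNodes(child);
        }
    }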
    

    Test code:

    String html = "<p>\"<em>You</em> began the evening well, Charlotte,\" " +
             "said Mrs.&nbsp;Bennet with civil self–command to Miss Lucas." +
             " \"<em>You</em> were Mr.&nbsp;Bingley's first choice.\"</p>";
    
    Document doc = Jsoup.parse(html);
    
    for (Node n : doc.body().children())
        replaceTextNodes(n);
    
    System.out.println(doc);
    

    Output (note that the text inside each <em> is wrapped as well):

    <html>
     <head></head>
     <body>
      <p>
       <text>
        &quot;
       </text><em>
        <text>
         You
        </text></em>
       <text>
         began the evening well, Charlotte,&quot; said Mrs.&nbsp;Bennet with civil self–command to Miss Lucas. &quot;
       </text><em>
        <text>
         You
        </text></em>
       <text>
         were Mr.&nbsp;Bingley's first choice.&quot;
       </text></p>
     </body>
    </html>
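
Since the question also asks about a regexp, here is a minimal, dependency-free sketch that wraps only the text sitting directly inside the outermost element and leaves the content of nested tags such as <em> untouched, which is closer to the structure requested in the question. The class name `TextWrapSketch` and the method `wrapDirectText` are made up for this illustration. It assumes a small, well-formed fragment: every tag is closed, void elements are written self-closed (`<br/>`), and there are no comments, scripts, or `>` characters inside attribute values. For real-world HTML a parser such as jsoup remains the safer choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextWrapSketch {
    // Matches one opening, closing, or self-closing tag.
    // Assumption: attribute values never contain '>'.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>");

    // Wraps text found at nesting depth 1 (directly inside the outermost
    // element) in <text> tags; everything else is copied unchanged.
    static String wrapDirectText(String fragment) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(fragment);
        int depth = 0;
        int last = 0;
        while (m.find()) {
            appendText(out, fragment.substring(last, m.start()), depth);
            out.append(m.group());
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (closing) {
                depth--;
            } else if (!selfClosing) {
                depth++;
            }
            last = m.end();
        }
        appendText(out, fragment.substring(last), depth);
        return out.toString();
    }

    // Wraps non-blank text at depth 1; copies any other text as-is.
    private static void appendText(StringBuilder out, String text, int depth) {
        if (depth == 1 && !text.trim().isEmpty()) {
            out.append("<text>").append(text).append("</text>");
        } else {
            out.append(text);
        }
    }

    public static void main(String[] args) {
        String html = "<p>\"<em>You</em> began the evening well, Charlotte,\" "
                + "said Mrs.&nbsp;Bennet. \"<em>You</em> were "
                + "Mr.&nbsp;Bingley's first choice.\"</p>";
        System.out.println(wrapDirectText(html));
    }
}
```

The depth counter is what keeps the text inside <em> unwrapped: it is 1 directly inside <p> and 2 inside <em>, and only depth-1 text is wrapped. Whitespace-only runs are copied through unchanged so that indentation between tags does not produce empty <text> elements.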