html go token stringtokenizer

How to replace specific html tags using string tokenizer


I have a string containing HTML markup (differMarkup) and would like to run it through a tokenizer that identifies specific tags (such as ins, del, mov), replaces them with span tags, and adds data attributes to them as well.

So the input looks like this:

`<h1>No Changes Here</h1>
    <p>This has no changes</p>
    <p id="1"><del>Delete </del>the first word</p>
    <p id="2"><ins>insertion </ins>Insert a word at the start</p>`

And intended output would be this:

`<h1>No Changes Here</h1>
    <p>This has no changes</p>
    <p id="1"><span class="del" data-cid=1>Delete</span>the first word</p>
    <p id="2"><span class="ins" data-cid=2>insertion</span>Insert a word at the start</p>
`

This is what I currently have. For some reason, the HTML tags are not appended to the finalMarkup variable when I set the tag to span.

const (
    htmlTagStart = 60 // Unicode `<`
    htmlTagEnd   = 62 // Unicode `>`
    differMarkup = `<h1>No Changes Here</h1>
    <p>This has no changes</p>
    <p id="1"><del>Delete </del>the first word</p>
    <p id="2"><ins>insertion </ins>Insert a word at the start</p>`  // Differ Markup Output
)

func readDifferOutput(differMarkup string) string {

    finalMarkup := ""
    tokenizer := html.NewTokenizer(strings.NewReader(differMarkup))
    token := tokenizer.Token()
loopDomTest:
    for {
        tt := tokenizer.Next()
        switch {

        case tt == html.ErrorToken:
            break loopDomTest // End of the document,  done

        case tt == html.StartTagToken, tt == html.SelfClosingTagToken:
            token = tokenizer.Token()
            tag := token.Data

            if tag == "del" {
                tokenType := tokenizer.Next()

                if tokenType == html.TextToken {
                    tag = "span"
                    finalMarkup += tag
                }

                //And add data attributes
            }

        case tt == html.TextToken:
            if token.Data == "span" {
                continue
            }
            TxtContent := strings.TrimSpace(html.UnescapeString(string(tokenizer.Text())))
            finalMarkup += TxtContent
            if len(TxtContent) > 0 {
                fmt.Printf("%s\n", TxtContent)
            }
        }
    }

    fmt.Println("tokenizer text: ", finalMarkup)

    return finalMarkup

}

Solution

  • Basically, you want to replace some nodes in your HTML text. For such tasks it's much easier to work with the DOM (Document Object Model) than to handle the tokens yourself.

    The package you're using, golang.org/x/net/html, also supports modeling HTML documents using the html.Node type. To obtain the DOM of an HTML document, use the html.Parse() function.

    So what you should do is traverse the DOM and modify the nodes you want to replace. Once you're done with the modifications, you can get the HTML text back by rendering the DOM with html.Render().

    This is how it can be done:

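    package main

    import (
        "os"
        "strings"

        "golang.org/x/net/html"
    )

    const src = `<h1>No Changes Here</h1>
    <p>This has no changes</p>
    <p id="1"><del>Delete </del>the first word</p>
    <p id="2"><ins>insertion </ins>Insert a word at the start</p>`

    func main() {
        // Parse the markup into a DOM tree.
        root, err := html.Parse(strings.NewReader(src))
        if err != nil {
            panic(err)
        }

        replace(root)

        // Render the modified tree back to HTML.
        if err = html.Render(os.Stdout, root); err != nil {
            panic(err)
        }
    }

    // replace turns every <del> and <ins> element below n into a <span>
    // whose class is the original tag name. Existing attributes on those
    // elements are discarded.
    func replace(n *html.Node) {
        if n.Type == html.ElementNode {
            if n.Data == "del" || n.Data == "ins" {
                n.Attr = []html.Attribute{{Key: "class", Val: n.Data}}
                n.Data = "span"
            }
        }

        for child := n.FirstChild; child != nil; child = child.NextSibling {
            replace(child)
        }
    }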
    

    This will output:

    <html><head></head><body><h1>No Changes Here</h1>
    <p>This has no changes</p>
    <p id="1"><span class="del">Delete </span>the first word</p>
    <p id="2"><span class="ins">insertion </span>Insert a word at the start</p></body></html>
    

    This is almost what you want. The only "extra" is that the html package added wrapper <html> and <body> elements, along with an empty <head>.

    If you want to get rid of those, you may just render the content of the <body> element and not the entire DOM:

    // To navigate to the <body> node:
    body := root.FirstChild. // this is <html>
                    FirstChild. // this is <head>
                    NextSibling // this is <body>
    // Render everything in <body>
    for child := body.FirstChild; child != nil; child = child.NextSibling {
        if err = html.Render(os.Stdout, child); err != nil {
            panic(err)
        }
    }
    

    This will output:

    <h1>No Changes Here</h1>
    <p>This has no changes</p>
    <p id="1"><span class="del">Delete </span>the first word</p>
    <p id="2"><span class="ins">insertion </span>Insert a word at the start</p>
    

    And we're done. Try the examples on the Go Playground.

    If you want the result as a string (instead of printing it to the standard output), you may use a bytes.Buffer as the output for rendering, and call its Buffer.String() method at the end:

    // Render everything in <body>
    buf := &bytes.Buffer{}
    for child := body.FirstChild; child != nil; child = child.NextSibling {
        if err = html.Render(buf, child); err != nil {
            panic(err)
        }
    }
    
    fmt.Println(buf.String())
    

    This outputs the same. Try it on the Go Playground.