html go goquery

How to extract the text of a custom html tag with goquery?


I am trying to extract the text of a custom HTML tag (<prelogin-cookie>):

someHtml := `<html><body>Login Successful!</body><!-- <saml-auth-status>1</saml-auth-status><prelogin-cookie>4242424242424242</prelogin-cookie><saml-username>my-username</saml-username><saml-slo>no</saml-slo> --></html>`
query, _ := goquery.NewDocumentFromReader(strings.NewReader(someHtml))
sel := query.Find("prelogin-cookie")
println(sel.Text())

But it returns only an empty string. How can I get the actual text of that HTML tag, i.e. 4242424242424242?


Solution

  • <prelogin-cookie> is not found because it's inside an HTML comment.

    The body of your comment is itself a series of XML or HTML tags, so it can be parsed as HTML if you use the comment text as the input document.
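    To see why the selector finds nothing, it can help to look at the token stream. The sketch below uses the standard library's encoding/xml decoder as a stand-in tokenizer (an assumption: it only works because this input happens to be well-formed XML; goquery itself uses an HTML parser). tokenKinds is a hypothetical helper name. The comment arrives as a single opaque token, so no element named x ever appears.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// tokenKinds lists the kind of each token in s. It uses encoding/xml as a
// stand-in tokenizer, which assumes the input is well-formed XML.
func tokenKinds(s string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(s))
	var kinds []string
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return kinds, nil
		}
		if err != nil {
			return kinds, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			kinds = append(kinds, "start:"+t.Name.Local)
		case xml.EndElement:
			kinds = append(kinds, "end:"+t.Name.Local)
		case xml.CharData:
			kinds = append(kinds, "text")
		case xml.Comment:
			// The whole comment, tags and all, is one token.
			kinds = append(kinds, "comment")
		}
	}
}

func main() {
	kinds, err := tokenKinds("<html><body>ok</body><!-- <x>1</x> --></html>")
	if err != nil {
		panic(err)
	}
	fmt.Println(kinds)
}
```

    The tags inside the comment are never reported as elements, which is why Find("prelogin-cookie") on the original document matches nothing.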

    Warning: only the first solution below handles "all" HTML documents properly. The other solutions are simpler and will handle your case just fine, but they might miss some edge cases. Decide whether they are worth using for you.

    1. By searching the HTML node tree

    One way to find and extract the comment would be to traverse the HTML node tree and look for a node with type html.CommentNode.

    For this, we'll use a recursive helper function to traverse a node tree:

    func findComment(n *html.Node) *html.Node {
        if n == nil {
            return nil
        }
        if n.Type == html.CommentNode {
            return n
        }
        if res := findComment(n.FirstChild); res != nil {
            return res
        }
        if res := findComment(n.NextSibling); res != nil {
            return res
        }
        return nil
    }
    

    And using it:

    doc, err := goquery.NewDocumentFromReader(strings.NewReader(someHtml))
    if err != nil {
        panic(err)
    }
    
    var comment *html.Node
    for _, node := range doc.Nodes {
        if comment = findComment(node); comment != nil {
            break
        }
    }
    if comment == nil {
        fmt.Println("no comment")
        return
    }
    
    doc, err = goquery.NewDocumentFromReader(strings.NewReader(comment.Data))
    if err != nil {
        panic(err)
    }
    
    sel := doc.Find("prelogin-cookie")
    fmt.Println(sel.Text())
    

    This will print (try it on the Go Playground):

    4242424242424242
    

    2. With strings

    If you just have to handle the "document at hand", a simpler solution may be to use the strings package to find the start and end indices of the comment:

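    start := strings.Index(someHtml, "<!--")
    if start < 0 {
        panic("no comment")
    }
    // Index searches the sub-slice, so its result is relative to start;
    // add start back to get an absolute index into someHtml.
    end := strings.Index(someHtml[start:], "-->")
    if end < 0 {
        panic("no comment")
    }
    end += start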
    

    And using this as the input:

    doc, err := goquery.NewDocumentFromReader(strings.NewReader(someHtml[start+4 : end]))
    if err != nil {
        panic(err)
    }
    
    sel := doc.Find("prelogin-cookie")
    fmt.Println(sel.Text())
    

    This will output the same (try it on the Go Playground).
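    The same start/end search can be wrapped in a small helper so the index arithmetic lives in one place. This is a minimal sketch using only the standard library; firstComment is a hypothetical name, and strings.Cut requires Go 1.18 or newer. The returned string is what you would pass to goquery.NewDocumentFromReader.

```go
package main

import (
	"fmt"
	"strings"
)

// firstComment returns the body of the first HTML comment in s, without the
// "<!--" and "-->" delimiters. The boolean is false if no complete comment
// is found. Hypothetical helper, not part of goquery.
func firstComment(s string) (string, bool) {
	// Cut splits s around the first "<!--"; rest is everything after it.
	_, rest, found := strings.Cut(s, "<!--")
	if !found {
		return "", false
	}
	// Cut again at the first "-->" that follows the opening delimiter.
	body, _, found := strings.Cut(rest, "-->")
	if !found {
		return "", false
	}
	return body, true
}

func main() {
	someHtml := `<html><body>Login Successful!</body><!-- <prelogin-cookie>4242424242424242</prelogin-cookie> --></html>`
	body, ok := firstComment(someHtml)
	fmt.Printf("%q %v\n", body, ok)
}
```

    Because Cut returns the pieces themselves rather than indices, there is no relative-versus-absolute offset to keep track of.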

    3. Using regexp

    A simpler (but less efficient) alternative to the previous solution is to use the regexp package to pull the comment out of the original document:

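    // (?s) lets . match newlines, so multi-line comments are found too.
    m := regexp.MustCompile(`(?s)<!--(.*?)-->`).FindStringSubmatch(someHtml)
    if m == nil {
        fmt.Println("no comment")
        return
    }
    
    // m[1] is the capture group: the comment body without the delimiters.
    doc, err := goquery.NewDocumentFromReader(strings.NewReader(m[1]))
    if err != nil {
        panic(err)
    }
    
    sel := doc.Find("prelogin-cookie")
    fmt.Println(sel.Text())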
    

    Try this one on the Go Playground.
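    If the document may contain more than one comment, the regexp approach extends naturally to collecting all of them. A minimal standard-library sketch, assuming comments are not nested; commentRe and commentBodies are hypothetical names. Each returned body could then be parsed with goquery in turn.

```go
package main

import (
	"fmt"
	"regexp"
)

// commentRe matches one HTML comment. (?s) lets . span newlines, and the
// non-greedy .*? stops at the first "-->".
var commentRe = regexp.MustCompile(`(?s)<!--(.*?)-->`)

// commentBodies returns the body of every comment in s, in document order.
// Hypothetical helper for illustration.
func commentBodies(s string) []string {
	var bodies []string
	for _, m := range commentRe.FindAllStringSubmatch(s, -1) {
		bodies = append(bodies, m[1]) // m[1] is the capture group
	}
	return bodies
}

func main() {
	doc := "<p>a</p><!-- first\nline --><p>b</p><!--second-->"
	for i, b := range commentBodies(doc) {
		fmt.Printf("%d: %q\n", i, b)
	}
}
```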