html go goquery

Retrieving text from a website with goquery


I have HTML that looks roughly like this:

<h4>Movies</h4>
    <h5><a href="external_link" target="_blank"> A Song For Jenny</a> (2015)</h5>
    Rating: PG<br/>
    Running Time (minutes): 77<br/>
    Description: This Drama, based on real life events, tells the story of a family affected directly by the 7/7 London bombings.  It shows love, loss, heartache and  ...<br/>
    <a href="/bmm/shop/Movie_Detail?movieid=2713288">More about  A Song For Jenny</a><br/>
        <a href="/bmm/shop/Edit_Movie?movieid=2713288">Edit  A Song For Jenny</a><br/>
    <br/>
    <h5><a href="link" target="_blank">#RealityHigh</a> (2017)</h5>
    Rating: PG<br/>
    Running Time (minutes): 99<br/>
    Description: High-achieving high-school senior Dani Barnes dreams of getting into UC Davis, the world's top  veterinary school. Then a glamorous new friend draws  ...<br/>
    <a href="/bmm/shop/Movie_Detail?movieid=4089906">More about #RealityHigh</a><br/>
        <a href="/bmm/shop/Edit_Movie?movieid=4089906">Edit #RealityHigh</a><br/>
    <br/>
    <h5><a href="link" target="_blank">1 Night</a> (2016)</h5>
    Rating: PG<br/>
    Running Time (minutes): 80<br/>
    Description: Bea, a worrisome teenager, reconnects with her introverted childhood friend, Andy. The two  overcome their differences in social status one night aft ...<br/>
    <a href="/bmm/shop/Movie_Detail?movieid=3959071">More about 1 Night</a><br/>
        <a href="/bmm/shop/Edit_Movie?movieid=3959071">Edit 1 Night</a><br/>
    <br/>
    <h5><a href="link" target="_blank">10 Cloverfield Lane</a> (2016)</h5>
    Rating: PG<br/>
    Running Time (minutes): 104<br/>
    Description: Soon after leaving her fiancé Michelle is involved in a car accident. She awakens
to find herself sharing an underground bunker with Howard and Emme ...<br/>
    <a href="/bmm/shop/Movie_Detail?movieid=3052189">More about 10 Cloverfield Lane</a><br/>
        <a href="/bmm/shop/Edit_Movie?movieid=3052189">Edit 10 Cloverfield Lane</a><br/>
    <br/>

I need to use goquery to get as much information out of this page as possible. I know how to extract the external links (replaced by the word "link" in this fragment) and the links to the detail pages, but I also want to extract the information that exists only as plain text: the year (in the headings), the running time, the shortened description, and the PG rating. I couldn't figure out how to do this in goquery because this text isn't wrapped in a div or any other tag. I tried finding the h5 tags and calling .Next() on them, but that only gave me the <br> tags, not the text in between. How can I do that? If there's a better way to do it than using goquery, I'm fine with that. My code looks like this:

    // Fetch and parse the page.
    res, err := http.Get("myUrlAddress")
    if err != nil {
        fmt.Println(err)
        os.Exit(-1)
    }
    defer res.Body.Close()

    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
        fmt.Println(err)
        os.Exit(-1)
    }

    // Retrieve the page count.
    links := doc.Find(`a[href*="pageIndex"]`)
    fmt.Println(links.Length()) // Output page count

    s := doc.Find("h5").First().Next() // I expect this to be the text after the heading,
    fmt.Println(s.Text())              // but it's empty, and the node type says it is a br.

Solution

  • I don't like the idea of using regex to parse HTML. It feels too fragile against minor changes such as the order of tags.

    I think it is best to fall back on html.Node (golang.org/x/net/html), which goquery is built on. The idea is to iterate over the siblings of each h5 until they run out or the next h5 is encountered. Dealing with links or other element tags may be a little awkward, since html.Node provides a rather unfriendly API for attributes, but switching back to goquery from it is even more trouble.

    package main
    
    import (
        "fmt"
        "os"
        "strings"

        "github.com/PuerkitoBio/goquery"
        "golang.org/x/net/html"
        "golang.org/x/net/html/atom"
    )
    
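    // Movie is a placeholder; add fields for whatever data you want to keep.
    type Movie struct {
    }
    
    // addTitle receives the full heading text, which includes the year.
    func (m Movie) addTitle(s string) {
        fmt.Println("Title", s)
    }
    
    // addProperty receives one trimmed text node, e.g. "Rating: PG".
    // Whitespace-only text nodes arrive as "" and are skipped.
    func (m Movie) addProperty(s string) {
        if s == "" {
            return
        }
        fmt.Println("Property", s)
    }
    
    // M collects every parsed movie.
    var M []*Movie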
    
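    // parseMovie handles one h5 heading and the sibling nodes that follow it.
    func parseMovie(i int, s *goquery.Selection) {
        m := &Movie{}
        m.addTitle(s.Text())
    
        // Walk the raw siblings. Unlike goquery's Next(), which only returns
        // elements, NextSibling also yields the text nodes.
    loop:
        for node := s.Nodes[0].NextSibling; node != nil; node = node.NextSibling {
            switch node.Type {
            case html.TextNode:
                m.addProperty(strings.TrimSpace(node.Data))
            case html.ElementNode:
                switch node.DataAtom {
                case atom.A:
                    // A link: node.Attr holds its attributes (href, target, ...).
                    // Do something with it; you may want to switch back to goquery here.
                    fmt.Println(node.Attr)
                case atom.Br:
                    continue
                case atom.H5:
                    // The next movie starts here.
                    break loop
                }
            }
        }
    
        M = append(M, m)
    }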
    
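    func main() {
        // movie.html holds the HTML fragment from the question.
        r, err := os.Open("movie.html")
        if err != nil {
            panic(err)
        }
        defer r.Close()
    
        doc, err := goquery.NewDocumentFromReader(r)
        if err != nil {
            panic(err)
        }
    
        // Each h5 marks the start of one movie.
        doc.Find("h5").Each(parseMovie)
    }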
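
The code above only prints the strings it finds. As a rough sketch of the next step, the heading text and the "Key: value" text nodes could be turned into structured fields with the standard library alone. MovieInfo, splitHeading and setProperty are hypothetical names that are not part of the answer above, and the sketch assumes the keys are spelled exactly as in the sample HTML ("Rating", "Running Time (minutes)", "Description") and that Go 1.18 or later is available for strings.Cut.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// MovieInfo is a hypothetical record for the text-only fields.
type MovieInfo struct {
	Title       string
	Year        int
	Rating      string
	RunningTime int // minutes
	Description string
}

// splitHeading splits heading text such as "A Song For Jenny (2015)"
// into the title and the year. If the text does not end in a
// parenthesized number, it returns the trimmed text and a year of 0.
func splitHeading(h string) (string, int) {
	h = strings.TrimSpace(h)
	open := strings.LastIndex(h, "(")
	if open < 0 || !strings.HasSuffix(h, ")") {
		return h, 0
	}
	year, err := strconv.Atoi(h[open+1 : len(h)-1])
	if err != nil {
		return h, 0
	}
	return strings.TrimSpace(h[:open]), year
}

// setProperty stores one "Key: value" string in the matching field.
// Strings without a colon and unknown keys are ignored.
func (m *MovieInfo) setProperty(s string) {
	// Cut splits at the first colon only, so colons inside the
	// description stay part of the value.
	key, value, found := strings.Cut(s, ":")
	if !found {
		return
	}
	value = strings.TrimSpace(value)
	switch strings.TrimSpace(key) {
	case "Rating":
		m.Rating = value
	case "Running Time (minutes)":
		if n, err := strconv.Atoi(value); err == nil {
			m.RunningTime = n
		}
	case "Description":
		m.Description = value
	}
}

func main() {
	var m MovieInfo
	// In the answer above these strings would come from s.Text()
	// and from the trimmed text nodes passed to addProperty.
	m.Title, m.Year = splitHeading(" A Song For Jenny (2015)")
	for _, text := range []string{
		"Rating: PG",
		"Running Time (minutes): 77",
		"Description: This Drama, based on real life events ...",
	} {
		m.setProperty(text)
	}
	fmt.Printf("%+v\n", m)
}
```

To wire this in, addTitle could call splitHeading and addProperty could call setProperty, with the fields moved onto the Movie struct. Both methods would then need pointer receivers so the assignments are kept.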