
How to use selectors properly


I'm writing a crawler to retrieve data from some pages. The logic of how to build it is clear to me, but I am confused about how to use selectors properly.

I would like to get the titles of some news items using colly. I went to the page https://g1.globo.com/economia, right-clicked the title I want to extract, chose Inspect, and then Copy selector.

The selector is:

body > div.glb-grid > main > div.row.content-head.non-featured > div.title > h1

How can I put it correctly in this line of code?

detailCollector.OnHTML("body > div.glb-grid > main > div.row.content-head.non-featured > div.title > h1", func(element *colly.HTMLElement) {
    fmt.Println(element.Text)
})

What is the correct way to write this selector so that colly can understand it? I couldn't find anything related to this in the colly documentation.


Solution

  • The selectors aren't specific to colly. It passes them to goquery's Find function:

    doc.Find(cc.Selector).Each(func(_ int, s *goquery.Selection)
    

    The example you provided is a CSS selector, so the definitive reference is the W3C Selectors Level 3 standard: https://www.w3.org/TR/selectors-3/#selectors

    However, that particular web page does not seem to contain any element matching the selector above.

    The example you provided is extremely specific, which is probably why it is not matching anything. Broken down, it reads as:

    body > div.glb-grid > main > div.row.content-head.non-featured > div.title > h1
    

    Find an "h1" element that is a child of a div element whose class list contains "title", which is itself a child of a div element whose class list contains ALL of "row", "content-head", and "non-featured", which is a child of a main element, which is a child of a div element whose class list contains "glb-grid", which is a child of the body element. Each ">" requires a direct parent-child relationship, so a single extra wrapper element anywhere in that chain is enough to make the selector match nothing.
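
    To make that reading concrete, here is a minimal sketch that splits the selector into its steps, from outermost to innermost. It uses only the standard library, and parseChildChain is a hypothetical helper written for this illustration: it handles only "tag.class" parts joined by ">" and is not how goquery or colly parse selectors.

    ```go
    package main

    import (
        "fmt"
        "strings"
    )

    // step is one "tag.class1.class2" part of the selector.
    type step struct {
        Tag     string
        Classes []string
    }

    // parseChildChain is a hypothetical helper for illustration only.
    // It assumes the selector consists solely of "tag.class" parts
    // joined by the child combinator ">" and ignores every other CSS feature.
    func parseChildChain(selector string) []step {
        var steps []step
        for _, part := range strings.Split(selector, ">") {
            fields := strings.Split(strings.TrimSpace(part), ".")
            steps = append(steps, step{Tag: fields[0], Classes: fields[1:]})
        }
        return steps
    }

    func main() {
        sel := "body > div.glb-grid > main > div.row.content-head.non-featured > div.title > h1"
        // Print one step per line, indented by nesting depth.
        for depth, s := range parseChildChain(sel) {
            fmt.Printf("%s%s", strings.Repeat("  ", depth), s.Tag)
            if len(s.Classes) > 0 {
                fmt.Printf(" (must have classes: %s)", strings.Join(s.Classes, ", "))
            }
            fmt.Println()
        }
    }
    ```

    Every line of the output is a condition the page has to satisfy, which is why a copied selector like this tends to break as soon as the markup changes.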

    Contrast this with the much simpler but more generic selector "h1", which yields only the page's section title, as it seems to be the only "h1" element in the document. This may explain your confusion.

    <h1 class="header-title"> 
    <div class="header-title-content">
    <a class="header-editoria--link" href="https://g1.globo.com/economia/">Economia</a>
    </div>
    </h1>
    

    On top of that, the page modifies the DOM using JavaScript, so what actually appears on the page is somewhat of a moving target.

    However, it's not all bad news: I suspect that the items you are looking for might simply require:

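    package main
    
    import (
        "fmt"
        "log"
    
        "github.com/gocolly/colly"
    )
    
    func main() {
        // Map each headline's text to its link.
        headlines := make(map[string]string)
        c := colly.NewCollector()
        // The callback runs once for every element with the class "feed-post-link".
        c.OnHTML(".feed-post-link", func(e *colly.HTMLElement) {
            headlines[e.Text] = e.Attr("href")
        })
    
        // Report request failures instead of silently printing nothing.
        if err := c.Visit("https://g1.globo.com/economia"); err != nil {
            log.Fatal(err)
        }
        for hl, url := range headlines {
            fmt.Printf("'%v' - (%v)\n", hl, url)
        }
    }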
    

    This uses a simple selector that matches all HTML elements with a class of "feed-post-link", which seems to include all of the headlines on that page. I've extracted the URLs as well as the corresponding titles in this example, but that was simply illustrative and you can ignore them if that is not what you require.
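
    If you want to check a selector without hitting the live site, one option is to run it through goquery directly against a saved or hand-written HTML fragment. The sketch below assumes the github.com/PuerkitoBio/goquery package. The sampleHTML fragment and the countMatches helper are made up for this illustration: only the h1 markup mirrors the snippet above, while the two feed-post-link anchors and their example.com URLs are hypothetical stand-ins, not the real page content.

    ```go
    package main

    import (
        "fmt"
        "log"
        "strings"

        "github.com/PuerkitoBio/goquery"
    )

    // sampleHTML is a hand-written stand-in for the page (an assumption,
    // not a copy of the real markup).
    const sampleHTML = `<html><body>
    <h1 class="header-title">
      <div class="header-title-content">
        <a class="header-editoria--link" href="https://g1.globo.com/economia/">Economia</a>
      </div>
    </h1>
    <a class="feed-post-link" href="https://example.com/a">Headline A</a>
    <a class="feed-post-link" href="https://example.com/b">Headline B</a>
    </body></html>`

    // countMatches reports how many elements in html match the CSS selector.
    func countMatches(html, selector string) (int, error) {
        doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
        if err != nil {
            return 0, err
        }
        return doc.Find(selector).Length(), nil
    }

    func main() {
        selectors := []string{
            "body > div.glb-grid > main > div.row.content-head.non-featured > div.title > h1",
            "h1",
            ".feed-post-link",
        }
        for _, sel := range selectors {
            n, err := countMatches(sampleHTML, sel)
            if err != nil {
                log.Fatal(err)
            }
            fmt.Printf("%d match(es) for %q\n", n, sel)
        }
    }
    ```

    Against this fragment the long copied selector finds nothing, while the two short selectors do match, which is the same pattern described above. Once a selector behaves as expected here, the same string can be passed to OnHTML.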