Colly Max Depth and encoding/json - null


I have gone through the Go tour and am now working through some of the Colly tutorials. I understand how max depth works and have been trying to use it in a Go program like so:

package main

import (
    "encoding/json"
    "log"
    "net/http"

    "github.com/gocolly/colly"
)

func ping(w http.ResponseWriter, r *http.Request) {
    log.Println("Ping")
    w.Write([]byte("ping"))
}

func getData(w http.ResponseWriter, r *http.Request) {
    //Verify the param "URL" exists
    URL := r.URL.Query().Get("url")
    if URL == "" {
        log.Println("missing URL argument")
        return
    }
    log.Println("visiting", URL)

    //Create a new collector which will be in charge of collecting the data from the HTML
    c := colly.NewCollector(
        // MaxDepth is 2, so only the links on the scraped page
        // and links on those pages are visited
        colly.MaxDepth(2),
        colly.Async(true),
    )

    // Limit the maximum parallelism to 2
    // This is necessary if the goroutines are dynamically
    // created to control the limit of simultaneous requests.
    //
    // Parallelism can be controlled also by spawning fixed
    // number of go routines.
    c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2})

    //Slices to store the data
    var response []string

    //OnHTML registers a callback that the collector runs whenever
    //an element matching the selector is found. Here, whenever the
    //collector finds an anchor tag with an href, it calls the anonymous
    //function below, which resolves the href and appends it to our slice
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Request.AbsoluteURL(e.Attr("href"))
        if link != "" {
            response = append(response, link)
        }
    })

    //Command to visit the website
    c.Visit(URL)

    // parse our response slice into JSON format
    b, err := json.Marshal(response)
    if err != nil {
        log.Println("failed to serialize response:", err)
        return
    }
    // Add some header and write the body for our endpoint
    w.Header().Add("Content-Type", "application/json")
    w.Write(b)
}

func main() {
    addr := ":7171"

    http.HandleFunc("/links", getData)
    http.HandleFunc("/ping", ping)

    log.Println("listening on", addr)
    log.Fatal(http.ListenAndServe(addr, nil))
}

When I do this, the response is null. Removing the MaxDepth and Async lines gives the expected response (with only the top-level links).

Any help is appreciated!


Solution

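  • When running in Async mode, c.Visit returns before the requests are actually made, so your handler marshals response while it is still a nil slice, and encoding/json encodes a nil slice as null. Call c.Wait() to block until all requests have finished, as demonstrated in the Parallel demo. In your case this means: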

    c.Visit(URL)
    c.Wait()
    

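    Async mode is not very useful when making just one request. Your OnHTML callback only records the links and never calls e.Request.Visit, so only the one page is fetched and MaxDepth never comes into play. Check out the reddit example to see how async can be used to visit multiple URLs in one operation.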

    Note: You should also check the error values returned by these functions (c.Limit and c.Visit both return an error); registering an error handler with c.OnError is good practice too.