goasynchronousgo-colly

What is the default mode in GoColly, sync or async?


What is the default mode in which network requests are executed in GoColly? Since we have the Async method in the collector I would assume that the default mode is synchronous. However, I see no particular difference when I execute these 8 requests in the program other than I need to use Wait for async mode. It seems as if the method only controls how the program is executed (the other code) and the requests are always asynchronous.

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {

    urls := []string{
        "http://webcode.me",
        "https://example.com",
        "http://httpbin.org",
        "https://www.perl.org",
        "https://www.php.net",
        "https://www.python.org",
        "https://code.visualstudio.com",
        "https://clojure.org",
    }

    c := colly.NewCollector(
        colly.Async(true),
    )

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println(e.Text)
    })

    for _, url := range urls {

        c.Visit(url)
    }

    c.Wait()
}

Solution

  • The default collection is synchronous.

    The confusing bit is probably the collector option colly.Async() which ignores the actual param. In fact the implementation at the time of writing is:

    func Async(a ...bool) CollectorOption {
        return func(c *Collector) {
            c.Async = true // uh-oh...!
        }
    }
    

    Based on this issue, it was done this way for backwards compatibility, so that (I believe) you can pass an option with no param at it'll still work, e.g.:

    colly.NewCollector(colly.Async()) // no param, async collection
    

    If you remove the async option altogether and instantiate with just colly.NewCollector(), the network requests will be clearly sequential — i.e. you can also remove c.Wait() and the program won't exit right away.