Tags: go, web-scraping, go-colly

gocolly: How to prevent duplicate crawling and restrict each URL to being crawled once


I was experimenting with go-colly using the code below, and it seems to crawl the same URL multiple times. How do I restrict it to crawling each URL only once?

I suspected Parallelism: 2 was causing the duplicates; however, some of the crawl-message URLs were repeated more than 10 times each.

This is reproducible across different websites.

gocolly is lean and great.


package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("www.coursera.org"),
        colly.Async(true),
    )

    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 2,
    })

    // Follow every link found on the page
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        e.Request.Visit(link)
    })

    pageCount := 0
    c.OnRequest(func(r *colly.Request) {
        r.Ctx.Put("url", r.URL.String())
    })

    // Set error handler
    c.OnError(func(r *colly.Response, err error) {
        log.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
    })

    // Print the response
    c.OnResponse(func(r *colly.Response) {
        pageCount++
        urlVisited := r.Ctx.Get("url")
        log.Println(fmt.Sprintf("%d  DONE Visiting : %s", pageCount, urlVisited))
    })

    baseUrl := "https://www.coursera.org"
    c.Visit(baseUrl)
    c.Wait()
}

Solution

  • The Ctx is shared between requests if you use e.Request.Visit(link), so other requests may overwrite the data. Use c.Visit() in these situations instead: it creates a new context for every request.

    Also, you don't need to store the URL in the context; it is always available in the OnResponse callback via r.Request.URL.

    Change your log message to the following to see the actual request URL:

    log.Println(fmt.Sprintf("%d  DONE Visiting : %s", pageCount, r.Request.URL))
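
    Putting the pieces together, a minimal corrected sketch could look like the one below. The atomic counter is my own addition, not part of the original answer: with colly.Async(true) the callbacks may run concurrently, so a plain pageCount++ would be a data race. e.Request.AbsoluteURL resolves relative hrefs before handing them to c.Visit().

    package main

    import (
        "fmt"
        "log"
        "sync/atomic"

        "github.com/gocolly/colly"
    )

    func main() {
        c := colly.NewCollector(
            colly.AllowedDomains("www.coursera.org"),
            colly.Async(true),
        )

        c.Limit(&colly.LimitRule{
            DomainGlob:  "*",
            Parallelism: 2,
        })

        // Queue links through the collector itself so every request
        // gets a fresh context instead of sharing the parent's Ctx.
        c.OnHTML("a[href]", func(e *colly.HTMLElement) {
            c.Visit(e.Request.AbsoluteURL(e.Attr("href")))
        })

        var pageCount int64
        c.OnResponse(func(r *colly.Response) {
            // Atomic increment: OnResponse may fire concurrently in async mode.
            n := atomic.AddInt64(&pageCount, 1)
            log.Println(fmt.Sprintf("%d  DONE Visiting : %s", n, r.Request.URL))
        })

        c.Visit("https://www.coursera.org")
        c.Wait()
    }

    Note that collectors already deduplicate visits on their own: unless AllowURLRevisit is enabled, visiting an already-seen URL just returns ErrAlreadyVisited. The repeated log lines came from the shared context being overwritten, not from pages actually being fetched more than once.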