go go-colly

How to make gocolly crawl slower


I am using gocolly to harvest data from my website. The challenge is that gocolly is too aggressive when crawling the URLs. I have added a RandomDelay.

Update

Based on the answer, I changed

c.Limit(&colly.LimitRule{
        RandomDelay: 10 * time.Second,
})

to

c.Limit(&colly.LimitRule{
        RandomDelay: 10 * time.Second,
        Parallelism: 2,
        DomainGlob: "*mysite*",
})

But when it crawls, it visits the pages less than a few seconds apart:

Original output

2021/02/04 08:17:33 Visiting https://www....
2021/02/04 08:17:33 Visiting https://www....
2021/02/04 08:17:34 Visiting https://www....
2021/02/04 08:17:34 Visiting https://www....
2021/02/04 08:17:34 Visiting https://www....
2021/02/04 08:17:34 Visiting https://www....

Output after the update

2021/02/04 09:37:00 Visiting https://www...
2021/02/04 09:37:07 Visiting https://www...
2021/02/04 09:37:16 Visiting https://www...

What I am looking for is a way to ensure that gocolly doesn't crawl these pages any faster than one page every 5-10 seconds. The reason is that I don't want to see a spike in load on my site each time gocolly runs.

Adding a time.Sleep could be an option, but I'd rather use gocolly's Limit() if possible.


Solution

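  • You forgot to set the DomainGlob parameter. Without a DomainGlob (or DomainRegexp), the rule does not match any domain, so the delay is not applied:

        c.Limit(&colly.LimitRule{
            DomainGlob:  "*", // apply the rule to every domain
            //Parallelism: 2,
            //Delay:      5 * time.Second,
        })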