go

Tour of Go exercise #10: Crawler


I'm going through the Go Tour and I feel like I have a pretty good understanding of the language except for concurrency.

Slide 10 is an exercise that asks the reader to parallelize a web crawler (and to make it avoid fetching the same URL twice, but I haven't gotten there yet).

Here is what I have so far:

func Crawl(url string, depth int, fetcher Fetcher, ch chan string) {
    if depth <= 0 {
        return
    }

    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        ch <- fmt.Sprintln(err)
        return
    }

    ch <- fmt.Sprintf("found: %s %q\n", url, body)
    for _, u := range urls {
        go Crawl(u, depth-1, fetcher, ch)
    }
}

func main() {
    ch := make(chan string, 100)
    go Crawl("http://golang.org/", 4, fetcher, ch)

    for i := range ch {
        fmt.Println(i)
    }
}

My question is: where do I put the close(ch) call?

If I put a defer close(ch) somewhere in the Crawl function, the program ends up writing to a closed channel from one of the spawned goroutines, because the call to Crawl returns before the goroutines it spawned do.

If I omit the call to close(ch), as in the code above, the program deadlocks in main while ranging over the channel, because the channel is never closed once all the goroutines have returned.
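
To see the two failure modes in isolation, here is a minimal standalone snippet (separate from the crawler) that reproduces both:

package main

import "fmt"

func main() {
    // Failure mode 1: sending on a closed channel panics at runtime.
    closed := make(chan string)
    close(closed)
    // closed <- "x" // would panic: "send on closed channel"

    // Failure mode 2: ranging over a channel that no sender ever closes.
    ch := make(chan string)
    go func() {
        ch <- "one value" // the sender returns without calling close(ch)
    }()
    for s := range ch {
        fmt.Println(s)
    }
    // Never reached: after the single value, main blocks in the range
    // and the runtime aborts with
    // "fatal error: all goroutines are asleep - deadlock!"
}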


Solution

  • A look at the Parallelization section of Effective Go leads to ideas for the solution. Essentially, you have to close the channel on every return path of the function. This is a nice use case for the defer statement:

    func Crawl(url string, depth int, fetcher Fetcher, ret chan string) {
        // Each invocation owns ret, so a single deferred close
        // covers every return path of the function.
        defer close(ret)
        if depth <= 0 {
            return
        }

        body, urls, err := fetcher.Fetch(url)
        if err != nil {
            ret <- err.Error()
            return
        }

        ret <- fmt.Sprintf("found: %s %q", url, body)

        // Give each child crawl a channel of its own...
        result := make([]chan string, len(urls))
        for i, u := range urls {
            result[i] = make(chan string)
            go Crawl(u, depth-1, fetcher, result[i])
        }

        // ...then funnel every child's results into ours. Each inner
        // loop terminates when that child closes its channel.
        for i := range result {
            for s := range result[i] {
                ret <- s
            }
        }
    }
    
    func main() {
        result := make(chan string)
        go Crawl("http://golang.org/", 4, fetcher, result)
    
        for s := range result {
            fmt.Println(s)
        }
    }
    

    The essential difference from your code is that every instance of Crawl gets its own return channel, and each caller drains its children's channels into its own.
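
    The solution above also assumes the Fetcher interface and the canned fetcher value that the exercise page provides. To run it outside the Tour, a stripped-down stand-in along the lines of the Tour's fake fetcher works; note that the link graph below is made up for illustration. It goes in the same file as Crawl and main and needs only fmt:

    type Fetcher interface {
        // Fetch returns the body of the page at url
        // and a slice of URLs found on that page.
        Fetch(url string) (body string, urls []string, err error)
    }

    // fakeFetcher serves canned results from memory
    // instead of doing real HTTP requests.
    type fakeFetcher map[string]fakeResult

    type fakeResult struct {
        body string
        urls []string
    }

    func (f fakeFetcher) Fetch(url string) (string, []string, error) {
        if res, ok := f[url]; ok {
            return res.body, res.urls, nil
        }
        return "", nil, fmt.Errorf("not found: %s", url)
    }

    // A tiny made-up link graph rooted at http://golang.org/.
    // Cycles are fine here because depth bounds the recursion.
    var fetcher = fakeFetcher{
        "http://golang.org/": {
            body: "The Go Programming Language",
            urls: []string{"http://golang.org/pkg/", "http://golang.org/cmd/"},
        },
        "http://golang.org/pkg/": {
            body: "Packages",
            urls: []string{"http://golang.org/"},
        },
        "http://golang.org/cmd/": {
            body: "Commands",
            urls: []string{"http://golang.org/"},
        },
    }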